Introduction

Diabetes mellitus (DM) is a chronic metabolic disease caused by deficiency or diminished effectiveness of endogenous insulin. Poor glucose control can lead to complications in multiple organs resulting in increased rates of morbidity and mortality1. According to International Diabetes Federation (IDF), in the world, there are about 415 million patients suffering from diabetes in 20152, and this number is growing.

The efficacy of treatment in preventing diabetes complications has been confirmed by Diabetes Control and Complications Trial3. Specifically, the success4,5 in Continuous Glucose Monitoring system (CGM), an invasive device is used to measure and record patients’ glucose concentration every 5 minutes. Recently, the CGM has been introduced in the prediction of glucose concentration6,7. We consider it is worth investigating the classification according to CGM signal as tools for the management of DM8,9.

According to the pathogeny of diabetes, there are 4 types of DM. Type 1 diabetes and Type 2 diabetes are the main categories. Clinically, the type is usually determined by tests, such as fasting plasma insulin (FINS), insulin releasing test (INS), C-Peptide test, insulin autoantibodies (IAA) and islet cell autoantibodies (ICA). Some of the tests are only temporary and incomplete to diagnose diabetes for the limitation of cognition. The scheme proposed in this paper tries to provide an effective and a supplementary indicator for diabetes classification, which would have benefits of perfecting the framework, raising the precision, and offering a convenient and intelligent method of classifying diabetes.

Classification is one of the hottest issues in data mining. Various classification algorithms have been introduced in many fields, such as Sound recognition10, Bitcoin fraud11, and Tomato plant disease12. In this study, a novel scheme calculating the indicator of classifying diabetes is proposed and it consists of two stages: the first is feature extraction, in which 17 features13 are automatically extracted from the curves of glucose concentration using statistics methods, and the second is diabetes parameter regression based on an ensemble learning algorithm named double-Class AdaBoost. The scheme can give an intelligent and precise method to diagnose the type of diabetes.

The Scheme

Based on Adaboost and its variant, a diabetes classification indicator is proposed, and its processing steps are described below:

  1. 1)

    Utilize CGM, collect curves of diabetic glucose concentration.

  2. 2)

    Employ feature extraction model to achieve 17 features from the training curves of glucose concentration.

  3. 3)

    Build and train a classifier using the 17 features based on variants of AdaBoost.

  4. 4)

    Verify the classifier using the testing curves of glucose concentration.

  5. 5)

    Evaluate the indicator of the scheme to classify diabetes.

Methods

CGM

CGM is used in examination of how the blood glucose concentration reacts to insulin, exercise, food, and others. And it needs calibrating with traditional finger-stick measurements. A CGM acquires glucose concentration of patients on a continuous basis (every five minutes).

Feature Extraction

Feature extraction14 is based on the morphological characteristics of signals to obtain the intrinsic features. Features usually possess some physical significance and could be extracted from complicated multi-component signals such as a time-series of glucose concentration. Hence, the feature extraction is taking a glucose concentration signal as input and gives the features as the output. The model of features extraction is illustrated in Figure 1.

Figure 1
figure 1

Feature extraction from CGM signal. The 17 features of blood glucose concentration and glucose fluctuation are extracted when CGM signal of one day is inputted to the Mathematical models.

The first feature is the average of blood glucose on the whole day; it can be calculated by Equation (1).

$$\overline{x}=\frac{1}{n}\sum _{i=1}^{n}{x}_{i}$$
(1)

where x i is a discrete value of blood glucose concentration, n is the number of x i in a day. The subsequent six features are also the average in different periods including pre-meal average and post-meal average of three meals. All those averages can be calculated by Equation (1).

SDBE this feature is Standard Deviation of Blood Glucose; it can be calculated by Equation (2).

$$SDBE=\sqrt{\frac{1}{n}\sum _{i=1}^{n}{({x}_{i}-\overline{x})}^{2}}$$
(2)

LAGE this feature is Large Amplitude of plasma Glucose Excursions; it can be calculated by Equation (3).

$$LAGE={x}_{\max }-{x}_{\min }$$
(3)

where x max , x min are the maximum and minimum values of blood glucose in a day.

MODD this feature is absolute Mean Of Daily Differences, and it can be calculated by Equation (4).

$$MODD=\frac{1}{287}{\sum }_{i=1}^{288}|{v}_{1}(i)-{v}_{2}(i)|$$
(4)

where v 1, v 2 are arrays of glucose concentration with 288 values of one day, and they are the glucose concentrations of same diabetic in different day respectively.

Area Under the Curve of glucose concentration (AUC) indicates the area parceled by the glucose concentration-time curve and the threshold (upper threshold or lower threshold). The AUC should be calculated with two areas including under the curve of glucose concentration and the upper threshold, over the curve of glucose concentration and the lower threshold.

Mean amplitude of plasma glucose excursions (MAGE). This feature has been studied by many papers15. The MAGE can be calculated as follows: Step 1. Get all extreme points in the signal; Step 2. Find the first valid extreme point whose absolute differences of both adjacent extreme points are greater than the Standard Deviation of the signal; Step 3. Accumulate all differences of valid extreme point according to the left direction of the first valid extreme point in step 2; Step 4. MAGE is the average of sum counted in step 3.

$$MAGE=\frac{1}{n}\sum _{i=1,3,5,\mathrm{...}}^{2n-1}{ep}_{i}-{ep}_{i+1}$$
(5)

where ep i is the left adjacent extreme point of valid extreme point ep i+1 ; n is the number of valid extreme point.

Blood glucose Percentage of Time (PT) includes two main aspects: times and the percentage of the time. The features relating to times, including Times of High excursion (TH), Times of Low excursion (TL), are the number of the extreme points of glucose concentration curve over threshold line in one day. The features relating to percentage of the time, including duration above High Limit (HL), duration below Low Limit (LL), duration Within Limit (WL), are the percentage of the time of glucose concentration curve over the threshold line one day.

In our research, a dataset was built to store the features, as shown in Table 1.

Table 1 The features dataset (mmol/L).

Usually, age is an important factor related to diabetes16, and places heavy weight on the classification of type of diabetes, thus would cause under-fitting. Some other factors were not involved, such as exercise, food, and insulin or oral medicines, which are difficult to quantify as these factors are from different manufacturers and are difficult to homogenize. Furthermore the main purpose of our research is to provide an easy and approach available to diabetes diagnose.

AdaBoost

Boosting methods are iterative algorithms17. AdaBoost is a boosting method which united some simple “weak” classifiers to generate generalized models. It was proposed by Freund and Schapire to distinguish a binary classification18, and later various AdaBoost variants such as Real Adaboost were proposed19. AdaBoost and its variants have contributed to various real-world applications, such as face detection20 and human detection21. In our research, its variants Real Adaboost19, Gentle AdaBoost22, and Modest AdaBoost23, were applied to the model of diabetes parameter regression in our scheme.

Diabetes parameter regression based on AdaBoost

Let s = {(g1,y1), (g2,y2), …, (gm,ym)} be a set of training samples with initial weights D 1(g i ) = 1/m, and m is the number of training data. Each g i is a vector with 17 features which were extracted from CGM curve of glucose concentration, and each y i is the label of g i . In our research, the DM classification is a binary classification, so assuming label y i equals 1 if the sample belongs to type 1 diabetes, and otherwise equals −1 when the sample belongs to type 2 diabetes.

Diabetes classification based on AdaBoost algorithm is described as follows:

Input: training dataset s = {(g1,y1), (g2,y2), …, (gm,ym)}, initialize data weights D 1(g i ) = 1/m, i = 1, …, m;

$${{\rm{g}}}_{{\rm{i}}}\in {{\rm{G}}}^{17},{y}_{i}\in \{-1,1\}$$
$${\rm{For}}\,t=1,2,\ldots {\rm{T}}:$$

Step 1: train weak classifier h t using distribution D t .

Step 2: Calculate the error of the weak classifier h t : G → {−1,1}.

$${\varepsilon }_{t}={\sum }_{i=1}^{m}{D}_{t}({g}_{i})[{y}_{i}\ne {h}_{j}({g}_{i})]$$

Step 3: Calculate the weight a t  = (1/2)ln((1 − ε t )/ε t ).

Step 4: Update data weight D t and get new weights D t+1 by error.

$${D}_{t+1}=\frac{{D}_{t}({g}_{i})\exp (-{a}_{t}{y}_{i}\,{h}_{t}({g}_{i}))}{{z}_{t}}$$

where zt is a normalization factor.

Output: final classifier: \({H}_{final}(g)=sign({\sum }_{t=1}^{T}{a}_{t}{h}_{t}(g))\)

Diabetes parameter regression based on Variants algorithm of AdaBoost

Three variants algorithms of AdaBoost were used for diabetes parameter regression. Firstly, Real AdaBoost is a generalization of AdaBoost algorithm proposed by Schapire and Singer19. Its output is not binary, but a real number between + 1 and −1. Real AdaBoost algorithm seems to Adaboost, except the steps 1–3 summarized as below:

For each weak classifier h t

  1. a.

    every value space of features is divided into several disjoint blocks G1, …, Gn

  2. b.

    under the distribution Dt calculate

    $${p}_{l}^{j}(g)=p({g}_{i}\in {G}_{j},{y}_{i}=l)={\sum }_{i=g={G}_{j},{y}_{i}=l}{D}_{t}(i)\,l\in 1,-1$$
  3. c.

    set the output of h on each Gj as

    $${h}_{t}^{j}(g)\leftarrow \frac{1}{2}\,{\rm{l}}{\rm{o}}{\rm{g}}\,{p}_{l}^{j}/(1-{p}_{l}^{j})\in R\quad {h}_{t}({g}_{i})={h}_{t}(g)$$
  4. d.

    calculate the normalization factor

$$z=2\sum _{j}\sqrt{{p}_{+1}^{j}{p}_{-1}^{j}}\quad {z}_{t}=\,{\rm{\min }}\,z\quad {h}_{t}={\rm{\arg }}\,{\rm{\min }}\,z$$

The second variants algorithm of AdaBoost named Modest AdaBoost which complete steps could be found in paper23. Gentle AdaBoost is the most efficient boosting algorithm and it has been used in Cascades object detection24. In each epoch, Gentle AdaBoost does a weighted regression based on least square. It means that the regression function h t (g) is fit by weighted least-squares of y i to g i .

Model Evaluation

In order to evaluate classification results, the present study applied two performance indicators: ACC (accuracy) and MCC (Matthews correlation coefficient). P and N represent the positive class and negative class respectively. T and F denote True and False respectively, as described in Table 2.

Table 2 Confusion matrix. In this table, TP is the number of true positives, TN the number of the true negatives, FP the number of false positives and FN the number of false negatives.

The ACC is as the formula

$$ACC=\frac{TP+TN}{P+N}$$
(6)

The MCC is as the formula

$$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
(7)

Results

Patient Database

The diabetics were screened at ages ranging from 40 to 60 and the glucose concentration were acquired from the Department of Endocrinology in People’s Hospital of Henan Province of China. There are 1050 samples of diabetes glucose concentration, and each sample is a curve with more than 864 values.

Experiment and analyses

To demonstrate the performance of proposed indicator, 300 of the 1050 samples were used as training set to construct a diabetes classifier while the other 750 were used as testing set to evaluate the classifier. Besides, 7 mmol/L, 8 mmol/L, 9 mmol/L, 10 mmol/L and 11 mmol/L were set to the upper threshold of glucose target range in the progress of feature extraction from the curves of glucose concentration monitored by CGM.

The Committee Report of diabetes expert of WHO diagnoses DM with fasting blood glucose concentration between 6.1 mmol/L and 6.9 mmol/L and plasma glucose of 11.1 mmol/L 2 hours post glucose-load (2 h PPG). This is the reason why 7 mmol/L, 8 mmol/L, 9 mmol/L, 10mmo/L and 11 mmol/L were selected as the upper threshold when we extract the 17 features from glucose signal.

The models of Real AdaBoost, Modest AdaBoost and Gentle AdaBoost were applied to calculating the indicator of classification diabetes and the error rate was presented in Table 3. The error rate of Modest AdaBoost is 0.0970 when the upper threshold was set at 7 mmol/L and 8 mmol/L, which means that the coincidence rate of our scheme and clinical diagnosis is 90.3%.

Table 3 Comparison of error rate.

After training 100 iterations, the three models of Real AdaBoost, Modest AdaBoost, and Gentle AdaBoost were to calculate the indicator of classifying diabetes. The test misjudging rate of indicator and clinical diagnosis illustrated in Figure 2. The upper thresholds of Figure 2 (a)–(e) were set at 7, 8, 9, 10 and 11 mmol/L respectively. It shows that when the upper limit was set at 7 mmol/L and 8 mmol/L the misjudging rate of three models were lower, and the misjudging rate of Model AdaBoost depicted by the line with the mark ‘|’ is 0.0970. Furthermore, when the upper threshold was set at 10 mmol/L, the three models perform worst in diabetes classification. But when the upper threshold was set at 9 mmol/L or 11 mmol/L, the misjudging rate of Real AdaBoost is changing, and its largest error is greater than 0.12, therefore 9 mmol/L and 11 mmol/L are not suitable for regarding as the upper threshold. The value of upper threshold affects results of diabetes classification.

Figure 2
figure 2

Comparison error of diabetes classification. Test error of the three ensemble methods, Real AdaBoost, Modest AdaBoost, Gentle AdaBoost, at 7, 8, 9, 10 and 11 mmol/L of the upper threshold of glucose target range. The y-coordinate of each point gives the test error rate, and the x-coordinate gives the times of iterations.

5-fold cross-validations were used to further demonstrate the accuracy of our scheme and seek out the best of upper threshold, after training 100 iterations, the indicator of classifying diabetes based on Real AdaBoost, Modest AdaBoost and Gentle AdaBoost were calculated and the test misjudging rate of indicator and clinical diagnosis illustrated in Figure 3. It shows that when threshold was set at 7 mmol/L or 8 mmol/L, the performance of our scheme is better, and only a few misjudging rates were above 0.1. It indicates that the coincidence rate of indicator calculated by our scheme and clinical diagnosis is better and the indicator is useful for doctors to diagnose diabetes.

Figure 3
figure 3

5-fold cross-validation. The method of 5 fold cross-validationto validates the test error of Real AdaBoost, Model AdaBoost and Gentle AdaBoost at 7, 8, 9, 10 and 11 mmol/L of the upper limit of glucose target range. The y-coordinate of each point gives the test error rate, and the x-coordinate gives the times of iterations.

The performance of our scheme was evaluated when the threshold was set at 7, 8, 9, 10 and 11mmol/L respectively. The results are shown in Table 4. It shows that when threshold was set at 7 mmol/L or 8 mmol/L, the performance of our scheme is better.

Table 4 Comparison of accuracy and Matthews correlation coefficient.

Discussion

Due to the difference of epidemiology, etiology, pathogenesis and treatment of type 1 and type 2 DM, a knotty problem is how to effectively treat diabetes in clinic25. For a doctor, the reasonable solution is to classify the type of diabetes and suit the remedy to the case, so the diabetes can be in control. In fact, there are many clinical indicators to classify diabetes, such as the test results of Oral Glucose Tolerance Test (OGTT), INS, C-Peptide, IAA, ICA. The tests would contribute to providing guideline in treating diabetes, but the tests are incomplete and can’t precisely reflect the heterogeneity of the Type 1 diabetes and Type 2 diabetes. Moreover, some of original symptoms about Type 2 diabetes have emerged on patients with Type 1 diabetes. At the moment, CGM can monitor the curve representing the fluctuation of glucose concentration in patients with type 1 and 2 diabetes9,10, which is one of the most successful cases for diabetes controlling. In addition, the 17 features would be extracted from the curve of glucose concentration13. Those features can’t directly diagnose the type of DM, but we attempt to build a novel scheme calculating the indicator of classifying DM by using those features.

We have constructed an effective scheme, which consists of feature extraction and classification. The experimental results show when the upper threshold μ is correctly set, the misjudging rate of classification is less than 0.097, which suggests that the scheme achieves the best performance and the coincidence rate of our scheme and clinical diagnosis is up to 90.3%.

This experiment indicates that an indicator can be extracted from the curve of glucose concentration based on CGM and it is helpful for doctors to classify diabetes. In addition, more works should be considered, such as how to improve the precision of classifying diabetes, how to set a novel penalty to rectify the weight of diabetes samples according to the sampling distribution (Dt) of diabetes in the process of iteration, and our scheme should be validated whether it suffers data imbalance problems26,27.