Abstract
Because the treatments differ, it is extremely important to classify the type of diabetes, especially for the diagnosis made by clinicians. In this study, we propose a novel scheme that calculates an indicator for classifying diabetes. It consists of two stages: the first is a feature-extraction model in which 17 features are automatically extracted from the glucose-concentration curve acquired by a continuous glucose monitoring (CGM) system; the second is a diabetes-parameter regression model based on an ensemble learning algorithm named double-Class AdaBoost. 1050 glucose-concentration curves of type 1 and type 2 diabetics were acquired at the Department of Endocrinology of the People’s Hospital of Zhengzhou University, China, and an upper threshold μ was set to 7 mmol/L, 8 mmol/L, 9 mmol/L, 10 mmol/L, and 11 mmol/L respectively according to the WHO guideline. The experiments show that the coincidence rate between our scheme and clinical diagnosis is 90.3%. The novel indicator extends the criteria for diagnosing the type of diabetes and provides doctors with a scalar for distinguishing type 1 from type 2 diabetes.
Introduction
Diabetes mellitus (DM) is a chronic metabolic disease caused by a deficiency or diminished effectiveness of endogenous insulin. Poor glucose control can lead to complications in multiple organs, resulting in increased rates of morbidity and mortality1. According to the International Diabetes Federation (IDF), about 415 million people worldwide suffered from diabetes in 20152, and this number is growing.
The efficacy of treatment in preventing diabetes complications has been confirmed by the Diabetes Control and Complications Trial3. A notable success4,5 is the Continuous Glucose Monitoring (CGM) system, an invasive device that measures and records a patient’s glucose concentration every 5 minutes. Recently, CGM has been introduced into the prediction of glucose concentration6,7. We consider it worth investigating classification based on the CGM signal as a tool for the management of DM8,9.
According to the pathogeny of diabetes, there are four types of DM, of which type 1 and type 2 diabetes are the main categories. Clinically, the type is usually determined by tests such as fasting plasma insulin (FINS), the insulin releasing test (INS), the C-peptide test, insulin autoantibodies (IAA), and islet cell autoantibodies (ICA). Owing to the limits of current knowledge, some of these tests are only temporary and incomplete for diagnosing diabetes. The scheme proposed in this paper tries to provide an effective and supplementary indicator for diabetes classification, with the benefits of completing the diagnostic framework, raising precision, and offering a convenient and intelligent method of classifying diabetes.
Classification is one of the most active topics in data mining, and classification algorithms have been applied in many fields, such as sound recognition10, Bitcoin fraud detection11, and tomato plant disease identification12. In this study, a novel scheme for calculating an indicator of diabetes class is proposed. It consists of two stages: the first is feature extraction, in which 17 features13 are automatically extracted from the glucose-concentration curves using statistical methods; the second is diabetes-parameter regression based on an ensemble learning algorithm named double-Class AdaBoost. The scheme provides an intelligent and precise method of diagnosing the type of diabetes.
The Scheme
Based on AdaBoost and its variants, a diabetes classification indicator is proposed; its processing steps are described below:

1) Using CGM, collect curves of diabetic glucose concentration.

2) Employ the feature-extraction model to obtain 17 features from the training curves of glucose concentration.

3) Build and train a classifier on the 17 features using variants of AdaBoost.

4) Verify the classifier on the testing curves of glucose concentration.

5) Evaluate the indicator of the scheme for classifying diabetes.
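As a rough illustration of how these steps fit together, the following Python sketch stubs out stages 1–2 on synthetic data; the curve generator, the three example features, and all names are ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 (stub): two synthetic one-day CGM curves, 288 samples each.
curves = [rng.normal(6.0, 1.0, 288), rng.normal(9.0, 2.0, 288)]
labels = [1, -1]  # 1 = type 1, -1 = type 2, as encoded later in the paper

# Step 2 (stub): extract a small feature vector per curve
# (the paper extracts 17 features; only three are shown here).
def extract_features(curve):
    return np.array([curve.mean(), curve.std(), curve.max() - curve.min()])

X = np.stack([extract_features(c) for c in curves])
print(X.shape)  # (2, 3)

# Steps 3-5 would train an AdaBoost variant on such feature vectors,
# verify it on held-out curves, and evaluate the resulting indicator.
```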
Methods
CGM
CGM is used to examine how the blood glucose concentration reacts to insulin, exercise, food, and other factors, and it needs to be calibrated against traditional finger-stick measurements. A CGM device acquires a patient’s glucose concentration on a continuous basis (every five minutes).
Feature Extraction
Feature extraction14 obtains intrinsic features based on the morphological characteristics of signals. Features usually possess some physical significance and can be extracted from complicated multi-component signals such as a time series of glucose concentration. Hence, feature extraction takes a glucose-concentration signal as input and gives the features as output. The feature-extraction model is illustrated in Figure 1.
The first feature is the average blood glucose over the whole day; it can be calculated by Equation (1):

$$\bar{x}=\frac{1}{n}{\sum }_{i=1}^{n}{x}_{i}\quad (1)$$

where x_i is a discrete value of the blood glucose concentration and n is the number of values x_i in a day. The subsequent six features are averages over different periods, namely the pre-meal and post-meal averages of the three meals; all of them can be calculated by Equation (1).
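For instance, the daily mean and the period means can be computed as below. The paper does not specify the exact meal windows, so the window times here are purely illustrative:

```python
import numpy as np

def window_mean(curve, start_min, end_min, period_min=5):
    """Mean glucose over [start_min, end_min) of a day sampled every 5 min."""
    return float(np.mean(curve[start_min // period_min:end_min // period_min]))

rng = np.random.default_rng(1)
curve = rng.normal(7.0, 1.5, 288)  # one day of CGM readings, in mmol/L

daily_mean = float(curve.mean())   # Equation (1) over the whole day
# Hypothetical meal windows (minutes after midnight), for illustration only:
pre_breakfast = window_mean(curve, 6 * 60, 7 * 60)    # 06:00-07:00
post_breakfast = window_mean(curve, 7 * 60, 9 * 60)   # 07:00-09:00
```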
SDBG: the Standard Deviation of Blood Glucose; it can be calculated by Equation (2):

$$SDBG=\sqrt{\frac{1}{n-1}{\sum }_{i=1}^{n}{({x}_{i}-\bar{x})}^{2}}\quad (2)$$
LAGE: the Large Amplitude of plasma Glucose Excursions; it can be calculated by Equation (3):

$$LAGE={x}_{max}-{x}_{min}\quad (3)$$

where x_max and x_min are the maximum and minimum values of blood glucose in a day.
MODD: the absolute Mean Of Daily Differences; it can be calculated by Equation (4):

$$MODD=\frac{1}{288}{\sum }_{i=1}^{288}|{v}_{1,i}-{v}_{2,i}|\quad (4)$$

where v_1 and v_2 are arrays of 288 glucose-concentration values from one day each, acquired from the same diabetic on two different days.
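A minimal sketch of these three dispersion features follows (our implementation; the paper does not state whether the sample or population standard deviation is used, so `ddof=1` is an assumption):

```python
import numpy as np

def sdbg(curve):
    # Equation (2): standard deviation of blood glucose
    # (sample SD, ddof=1, is assumed here).
    return float(np.std(curve, ddof=1))

def lage(curve):
    # Equation (3): largest amplitude of glucose excursion, x_max - x_min.
    return float(np.max(curve) - np.min(curve))

def modd(day1, day2):
    # Equation (4): mean absolute difference between the paired readings
    # of the same patient on two different days.
    return float(np.mean(np.abs(np.asarray(day1) - np.asarray(day2))))

day1 = [5.0, 6.0, 7.0, 8.0]
day2 = [5.5, 6.5, 6.0, 9.0]
print(lage(day1), modd(day1, day2))  # 3.0 0.75
```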
The Area Under the Curve of glucose concentration (AUC) denotes the area enclosed between the glucose concentration-time curve and a threshold (upper or lower). The AUC comprises two areas: the area of the curve lying above the upper threshold, and the area of the curve lying below the lower threshold.
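A sketch of this two-part AUC, using simple rectangle integration over the 5-minute samples (the paper’s exact integration rule is not stated, so this is an assumption):

```python
import numpy as np

def auc_outside_limits(curve, upper, lower, dt_min=5.0):
    """Area between the glucose curve and the upper threshold where the
    curve exceeds it, plus the area between the curve and the lower
    threshold where the curve falls below it (mmol/L * min)."""
    c = np.asarray(curve, dtype=float)
    above = float(np.clip(c - upper, 0.0, None).sum() * dt_min)
    below = float(np.clip(lower - c, 0.0, None).sum() * dt_min)
    return above, below

above, below = auc_outside_limits([6.0, 8.0, 12.0, 3.0], upper=10.0, lower=3.9)
```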
The Mean Amplitude of plasma Glucose Excursions (MAGE) has been studied in many papers15. It can be calculated as follows: Step 1, find all extreme points of the signal; Step 2, find the first valid extreme point, i.e., one whose absolute differences from both adjacent extreme points are greater than the standard deviation of the signal; Step 3, accumulate the differences of all valid extreme points, proceeding from the first valid extreme point found in Step 2; Step 4, MAGE is the average of the sum accumulated in Step 3:

$$MAGE=\frac{1}{n}{\sum }_{i=1}^{n}|e{p}_{i+1}-e{p}_{i}|$$

where ep_i is the left adjacent extreme point of the valid extreme point ep_{i+1}, and n is the number of valid extreme points.
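The four steps can be sketched as follows (this is our reading of the procedure; the handling of endpoints and plateaus is an assumption):

```python
import numpy as np

def mage(curve):
    x = np.asarray(curve, dtype=float)
    sd = float(np.std(x))
    d = np.diff(x)
    # Step 1: turning points (local maxima/minima) plus both endpoints.
    turns = [0] + [i for i in range(1, len(x) - 1) if d[i - 1] * d[i] < 0] \
            + [len(x) - 1]
    ext = x[turns]
    # Steps 2-3: keep excursions between adjacent turning points whose
    # amplitude exceeds one SD of the whole signal.
    exc = [abs(ext[i + 1] - ext[i]) for i in range(len(ext) - 1)
           if abs(ext[i + 1] - ext[i]) > sd]
    # Step 4: MAGE is the mean of the valid excursions (0 if none).
    return float(np.mean(exc)) if exc else 0.0

print(mage([5.0, 10.0, 5.0, 10.0, 5.0]))  # 5.0 (every swing exceeds 1 SD)
```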
The blood glucose Percentage of Time (PT) features cover two aspects: counts of excursions and percentages of time. The count features, Times of High excursion (TH) and Times of Low excursion (TL), are the numbers of extreme points of the glucose-concentration curve beyond the threshold lines in one day. The time-percentage features, duration above the High Limit (HL), duration below the Low Limit (LL), and duration Within the Limits (WL), are the percentages of time in one day during which the glucose-concentration curve lies above, below, or between the threshold lines, respectively.
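A sketch of the time-percentage part of PT (the paper’s exact counting rule for the TH/TL excursion “times” is not fully specified, so only the HL/LL/WL fractions are shown, as our interpretation):

```python
import numpy as np

def pt_features(curve, upper, lower):
    """Fractions of the day spent above `upper` (HL), below `lower` (LL),
    and within the limits (WL)."""
    c = np.asarray(curve, dtype=float)
    hl = float(np.mean(c > upper))
    ll = float(np.mean(c < lower))
    return hl, ll, 1.0 - hl - ll

hl, ll, wl = pt_features([3.0, 5.0, 8.0, 12.0], upper=10.0, lower=3.9)
print(hl, ll, wl)  # 0.25 0.25 0.5
```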
In our research, a dataset was built to store the features, as shown in Table 1.
Age is an important factor related to diabetes16, but it would place heavy weight on the classification of the type of diabetes and thus could cause under-fitting, so it was not used as a feature. Some other factors were also not involved, such as exercise, food, and insulin or oral medicines, which are difficult to quantify: they come from different sources and are difficult to homogenize. Furthermore, the main purpose of our research is to provide an easy and readily available approach to diabetes diagnosis.
AdaBoost
Boosting methods are iterative algorithms17. AdaBoost is a boosting method that combines simple “weak” classifiers to generate a generalized model. It was proposed by Freund and Schapire for binary classification18, and various variants such as Real AdaBoost were proposed later19. AdaBoost and its variants have contributed to various real-world applications, such as face detection20 and human detection21. In our research, the variants Real AdaBoost19, Gentle AdaBoost22, and Modest AdaBoost23 were applied in the diabetes-parameter regression model of our scheme.
Diabetes parameter regression based on AdaBoost
Let s = {(g1,y1), (g2,y2), …, (gm,ym)} be a set of training samples with initial weights D_1(g_i) = 1/m, where m is the number of training samples. Each g_i is a vector of the 17 features extracted from a CGM glucose-concentration curve, and each y_i is the label of g_i. In our research, DM classification is a binary classification, so the label y_i equals 1 if the sample belongs to type 1 diabetes and −1 if it belongs to type 2 diabetes.
Diabetes classification based on AdaBoost algorithm is described as follows:
Input: training dataset s = {(g1,y1), (g2,y2), …, (gm,ym)}; initialize the data weights D_1(g_i) = 1/m, i = 1, …, m.

Step 1: Train a weak classifier h_t: G → {−1,1} using the distribution D_t.

Step 2: Calculate the error of the weak classifier h_t:

$${\varepsilon }_{t}={\sum }_{i=1}^{m}{D}_{t}({g}_{i})I({h}_{t}({g}_{i})\ne {y}_{i})$$

Step 3: Calculate the weight a_t = (1/2)ln((1 − ε_t)/ε_t).

Step 4: Update the data weights D_t to obtain the new weights D_{t+1}:

$${D}_{t+1}({g}_{i})=\frac{{D}_{t}({g}_{i})\exp (-{a}_{t}{y}_{i}{h}_{t}({g}_{i}))}{{Z}_{t}}$$

where Z_t is a normalization factor.
Output: final classifier: \({H}_{final}(g)=sign({\sum }_{t=1}^{T}{a}_{t}{h}_{t}(g))\)
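Steps 1–4 can be made concrete; the minimal discrete-AdaBoost sketch below uses decision stumps as the weak classifiers (our implementation for illustration, not the authors’ code):

```python
import numpy as np

def stump_predict(X, feat, thresh, pol):
    # Weak classifier h_t: predicts 1 on one side of a threshold, else -1.
    return np.where(pol * X[:, feat] < pol * thresh, 1, -1)

def fit_adaboost(X, y, n_rounds=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                       # D_1(g_i) = 1/m
    ensemble = []
    for _ in range(n_rounds):
        # Step 1: pick the stump minimizing the weighted error under D_t.
        best = None
        for feat in range(X.shape[1]):
            for thresh in np.unique(X[:, feat]):
                for pol in (1, -1):
                    pred = stump_predict(X, feat, thresh, pol)
                    err = D[pred != y].sum()      # Step 2: epsilon_t
                    if best is None or err < best[0]:
                        best = (err, feat, thresh, pol, pred)
        err, feat, thresh, pol, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)     # guard against log(0)
        a = 0.5 * np.log((1 - err) / err)         # Step 3: a_t
        D = D * np.exp(-a * y * pred)             # Step 4: reweight
        D = D / D.sum()                           # divide by Z_t
        ensemble.append((a, feat, thresh, pol))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(a * stump_predict(X, f, t, p) for a, f, t, p in ensemble)
    return np.sign(score)
```

In practice, X would hold the 17-feature vectors and y the ±1 diabetes-type labels described above.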
Diabetes parameter regression based on Variants algorithm of AdaBoost
Three variant algorithms of AdaBoost were used for diabetes-parameter regression. First, Real AdaBoost is a generalization of the AdaBoost algorithm proposed by Schapire and Singer19. Its output is not binary but a real-valued confidence score. The Real AdaBoost algorithm is similar to AdaBoost, except for steps 1–3, which are summarized below:
For each weak classifier h_t:

a. the value space of the features is divided into several disjoint blocks G_1, …, G_n;

b. under the distribution D_t, calculate

$${p}_{l}^{j}=p({g}_{i}\in {G}_{j},{y}_{i}=l)={\sum }_{i:{g}_{i}\in {G}_{j},{y}_{i}=l}{D}_{t}(i),\quad l\in \{1,-1\}$$

c. set the output of h_t on each block G_j as

$${h}_{t}^{j}=\frac{1}{2}\,\mathrm{log}\,\frac{{p}_{+1}^{j}}{{p}_{-1}^{j}}\in R,\quad {h}_{t}({g}_{i})={h}_{t}^{j}\ {\rm{for}}\ {g}_{i}\in {G}_{j}$$

d. calculate the normalization factor Z_t.
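Steps a–d for a single feature can be sketched as below (our illustration; the bin edges and the `eps` smoothing that avoids log(0) are assumptions, and the half-log-ratio output follows Schapire and Singer’s confidence-rated formulation):

```python
import numpy as np

def real_weak_learner(x, y, D, edges, eps=1e-6):
    """Real AdaBoost weak learner on one feature: partition its range into
    disjoint blocks (step a), accumulate the weighted class probabilities
    p_{+1}, p_{-1} per block under D_t (step b), and output the real value
    0.5 * log(p_{+1} / p_{-1}) for each sample's block (step c)."""
    bins = np.digitize(x, edges)            # block index G_j per sample
    h = np.zeros(len(edges) + 1)
    for j in range(len(edges) + 1):
        in_j = bins == j
        p_pos = D[in_j & (y == 1)].sum()
        p_neg = D[in_j & (y == -1)].sum()
        h[j] = 0.5 * np.log((p_pos + eps) / (p_neg + eps))
    return h[bins]

x = np.array([1.0, 2.0, 8.0, 9.0])
y = np.array([1, 1, -1, -1])
D = np.full(4, 0.25)
out = real_weak_learner(x, y, D, edges=np.array([5.0]))
```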
The second variant of AdaBoost is Modest AdaBoost, whose complete steps can be found in the paper23. Gentle AdaBoost is among the most efficient boosting algorithms and has been used in cascade object detection24. In each epoch, Gentle AdaBoost performs a weighted least-squares regression; that is, the regression function h_t(g) is fit by weighted least squares of y_i on g_i.
Model Evaluation
To evaluate the classification results, the present study applied two performance indicators: ACC (accuracy) and MCC (Matthews correlation coefficient). P and N represent the positive and negative classes respectively; T and F denote True and False respectively, as described in Table 2.
The ACC is given by the formula

$$ACC=\frac{TP+TN}{TP+TN+FP+FN}$$
The MCC is given by the formula

$$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
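Both indicators follow directly from the confusion-matrix counts; a small sketch:

```python
import math

def acc_mcc(tp, tn, fp, fn):
    """ACC and MCC from the confusion-matrix counts of Table 2.
    MCC is defined as 0 here when a denominator factor vanishes
    (a common convention; the paper does not state its choice)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc

acc, mcc = acc_mcc(tp=45, tn=40, fp=10, fn=5)
print(acc)  # 0.85
```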
Results
Patient Database
The diabetics were screened at ages ranging from 40 to 60, and the glucose concentrations were acquired at the Department of Endocrinology of the People’s Hospital of Henan Province, China. There are 1050 samples of diabetic glucose concentration, and each sample is a curve with more than 864 values.
Experiment and analyses
To demonstrate the performance of the proposed indicator, 300 of the 1050 samples were used as the training set to construct a diabetes classifier, while the other 750 were used as the testing set to evaluate it. In addition, 7 mmol/L, 8 mmol/L, 9 mmol/L, 10 mmol/L, and 11 mmol/L were set as the upper threshold of the glucose target range during feature extraction from the CGM glucose-concentration curves.
The WHO diabetes expert committee report diagnoses DM by a fasting blood glucose concentration between 6.1 mmol/L and 6.9 mmol/L and a plasma glucose of 11.1 mmol/L 2 hours post glucose load (2-h PPG). This is why 7 mmol/L, 8 mmol/L, 9 mmol/L, 10 mmol/L, and 11 mmol/L were selected as the upper threshold when extracting the 17 features from the glucose signal.
The models of Real AdaBoost, Modest AdaBoost, and Gentle AdaBoost were applied to calculate the indicator of diabetes classification, and the error rates are presented in Table 3. The error rate of Modest AdaBoost is 0.0970 when the upper threshold is set at 7 mmol/L or 8 mmol/L, which means that the coincidence rate between our scheme and clinical diagnosis is 90.3%.
After 100 training iterations, the three models Real AdaBoost, Modest AdaBoost, and Gentle AdaBoost were used to calculate the indicator of classifying diabetes. The test misjudging rates of the indicator against clinical diagnosis are illustrated in Figure 2; the upper thresholds of Figure 2(a)–(e) were set at 7, 8, 9, 10, and 11 mmol/L respectively. When the upper limit was set at 7 mmol/L or 8 mmol/L, the misjudging rates of the three models were lower, and the misjudging rate of Modest AdaBoost, depicted by the line with the mark ‘|’, is 0.0970. When the upper threshold was set at 10 mmol/L, the three models perform worst in diabetes classification. When the upper threshold was set at 9 mmol/L or 11 mmol/L, the misjudging rate of Real AdaBoost fluctuates, and its largest error exceeds 0.12; therefore 9 mmol/L and 11 mmol/L are not suitable as the upper threshold. In short, the value of the upper threshold affects the results of diabetes classification.
To further demonstrate the accuracy of our scheme and seek the best upper threshold, 5-fold cross-validation was used. After 100 training iterations, the indicators of classifying diabetes based on Real AdaBoost, Modest AdaBoost, and Gentle AdaBoost were calculated, and the test misjudging rates of the indicator against clinical diagnosis are illustrated in Figure 3. When the threshold was set at 7 mmol/L or 8 mmol/L, the performance of our scheme is better, and only a few misjudging rates were above 0.1. This indicates that the coincidence rate between the indicator calculated by our scheme and clinical diagnosis is high and that the indicator is useful for doctors in diagnosing diabetes.
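A generic k-fold loop of the kind used here can be sketched as follows (our utility, shown with a toy nearest-centroid model rather than the AdaBoost variants, and all names are ours):

```python
import numpy as np

def kfold_error(X, y, train_fn, predict_fn, k=5, seed=0):
    """k-fold cross-validated misjudging (error) rate."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])
        errs.append(float(np.mean(predict_fn(model, X[test]) != y[test])))
    return float(np.mean(errs))

# Toy stand-in model: one centroid per class, predict the nearer one.
def train_centroids(X, y):
    return X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)

def predict_centroids(model, X):
    c_pos, c_neg = model
    d_pos = np.linalg.norm(X - c_pos, axis=1)
    d_neg = np.linalg.norm(X - c_neg, axis=1)
    return np.where(d_pos < d_neg, 1, -1)
```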
The performance of our scheme was evaluated with the threshold set at 7, 8, 9, 10, and 11 mmol/L respectively; the results are shown in Table 4. Again, the scheme performs better when the threshold is set at 7 mmol/L or 8 mmol/L.
Discussion
Owing to the differences in epidemiology, etiology, pathogenesis, and treatment between type 1 and type 2 DM, effectively treating diabetes in the clinic is a knotty problem25. For a doctor, the reasonable solution is to classify the type of diabetes and suit the remedy to the case, so that the diabetes can be controlled. In fact, there are many clinical indicators for classifying diabetes, such as the results of the Oral Glucose Tolerance Test (OGTT), INS, C-peptide, IAA, and ICA tests. These tests help provide guidelines for treating diabetes, but they are incomplete and cannot precisely reflect the heterogeneity of type 1 and type 2 diabetes. Moreover, some original symptoms of type 2 diabetes have emerged in patients with type 1 diabetes. At present, CGM can monitor the curve representing the fluctuation of glucose concentration in patients with type 1 and type 2 diabetes9,10, which is one of the most successful tools for diabetes control. In addition, 17 features can be extracted from the glucose-concentration curve13. These features cannot directly diagnose the type of DM, but we attempt to build a novel scheme that calculates an indicator for classifying DM by using them.
We have constructed an effective scheme consisting of feature extraction and classification. The experimental results show that when the upper threshold μ is correctly set, the misjudging rate of classification is as low as 0.097, which means the scheme achieves its best performance and the coincidence rate between our scheme and clinical diagnosis reaches 90.3%.
This experiment indicates that an indicator can be extracted from the CGM glucose-concentration curve and that it is helpful for doctors in classifying diabetes. Further work should be considered, such as improving the precision of diabetes classification, designing a novel penalty to rectify the weights of diabetes samples according to the sampling distribution D_t during iteration, and validating whether our scheme suffers from data-imbalance problems26,27.
References
Wilson, R. H., Foster, D. W., Kronenberg, H. N. & Larsen, P. R. Williams Textbook of Endocrinology, 9th ed. Philadelphia, PA: Saunders (1998).
Eyes on Diabetes. World Diabetes Day. International Diabetes Federation. November 14 (2016).
Wang, Y., Wei, F., Sun, C. & Li, Q. The Research of Improved Grey GM (1, 1) Model to Predict the Postprandial Glucose in Type 2 Diabetes. Journal BioMed Research International 2016, ISSN: 2314-6133 (2016).
Cobelli, C. et al. Diabetes: models, signals, and control. IEEE reviews in biomedical engineering 2, 54–96 (2009).
Tildesley, H. D. et al. A comparison of internet monitoring with continuous glucose monitoring in insulin-requiring type 2 diabetes mellitus. Canadian journal of diabetes 37.5, 305–308 (2013).
Malik, S. et al. Gargling effect on salivary electrochemical parameters to predict blood glucose. Computational Techniques in Information and Communication Technologies (ICCTICT). New Delhi, India. 2016 International Conference on. IEEE. 11th−13th March (2016).
Eren-Oruklu, M. et al. Adaptive system identification for estimating future glucose concentrations and hypoglycemia alarms. Automatica 48.8, 1892–1897 (2012).
Zhang, J. & Zhang, M. The Preliminary analysis of blood glucose fluctuation in type II diabetes. Chinese Journal of Integrative Medicine on Cardio-/Cerebrovascular Disease 11.4, 500–501 (2013).
Wentholt, I. M. et al. Glucose fluctuations and activation of oxidative stress in patients with type 1 diabetes. Diabetologia 51.1, 183–190 (2008).
McLoughlin, I. et al. Robust sound event classification using deep neural networks. IEEE/ACM Transactions on. Audio, Speech, and Language Processing 23.3, 540–552 (2015).
Monamo, P., Marivate, V. & Twala, B. Unsupervised learning for robust Bitcoin fraud detection[C]//Information Security for South Africa (ISSA), 2016. IEEE, 129–134 (2016).
Sabrol, H. & Satish, K. Tomato plant disease classification in digital images using classification tree. Communication and Signal Processing (ICCSP). Melmaruvathur, TN, India. 2016 International Conference on. IEEE, 5th, 06–08 April (2016).
Jia, W. P. Clinical application guideline of dynamic blood glucose monitoring in china(version of 2009)[J]. National Medical Journal of China 89.48, 3388–3392 (2009).
Hu, J. Summarization on feature dimensionally reduction. Application Research Of Computers 25.9, 2601–2606 (2008).
Mo, Y. F. et al. Assessment index of glucose fluctuation—clinical significance and research progress of Average glucose fluctuations. Chinese Journal of Diabetes Mellitus 3.3, 259–263 (2011).
Jiang, L. & Peng, L. Classification and feature extraction of type II diabetes using SVM. Science Technology and Engineering 7.5, 721–726 (2007).
Witten, I. H., Frank, E. & Hall, M. A. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann (2011).
Freund, Y., Schapire, R. & Abe, N. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence 14(1612), 771–780 (1999).
Schapire, R. E. & Singer, Y. Improved boosting algorithms using confidence-rated predictions. Machine learning 37.3, 297–336 (1999).
Guo, J.-M. et al. Complexity reduced face detection using probability-based face mask prefiltering and pixel-based hierarchical-feature adaboosting. IEEE Signal Processing Letters 18.8, 447–450 (2011).
Xu, J. et al. Fast and accurate human detection using a cascade of boosted MS-LBP features. IEEE Signal Processing Letters 19.10, 676–679 (2012).
Friedman, J., Hastie, T. & Tibshirani, R. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics 28.2, 337–407 (2000).
Vezhnevets, A. & Vezhnevets, V. Modest AdaBoost-teaching AdaBoost to generalize better. Graphicon. Vol. 12. No. 5 (2005).
Lienhart, R., Kuranov, A. & Pisarevsky, V. Empirical analysis of detection cascades of boosted classifiers for rapid object detection. Joint Pattern Recognition Symposium. Springer Berlin Heidelberg (2003).
Li, W., Zheng, H., Bukuru, J. & De Kimpe, N. Natural medicines used in the traditional Chinese medical system for therapy of diabetes mellitus. Journal of Ethnopharmacology 92, 1–21 (2004).
Provost, F. Machine learning from imbalanced data sets 101. Proceedings of the AAAI’2000 workshop on imbalanced data sets (2000).
Kotsiantis, S., Kanellopoulos, D. & Pintelas, P. Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering 30.1, 25–36 (2006).
Acknowledgements
This work is supported by Grants-in-Aid from the Henan Science and Technology Research Program (152102210250 and 162102310600), the Zhengzhou Science and Technology Research Program (131PPTGG409–8), and the Henan Medical Science and Technology Research Program (201403009).
Author information
Contributions
Conceived the idea: Y.N.W., S.S.L., Q.Z.L. Designed and performed the experiments: S.S.L. Analyzed the data: Y.N.W., S.S.L. Wrote the manuscript: Y.N.W., S.S.L., J.L.Y. Revised the paper: S.S.L., R.X.C., Z.N.C. All authors read and approved the manuscript.
Ethics declarations
Competing Interests
The authors declare that they have no competing interests.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, Y., Liu, S., Chen, R. et al. A Novel Classification Indicator of Type 1 and Type 2 Diabetes in China. Sci Rep 7, 17420 (2017). https://doi.org/10.1038/s41598-017-17433-8