Chinese diabetes datasets for data-driven machine learning

Data from patients with diabetes mellitus are essential to the study of diabetes management, especially when data-driven machine learning methods are applied to that management. To promote and facilitate research in diabetes management, we have developed the ShanghaiT1DM and ShanghaiT2DM datasets and made them publicly available for research purposes. This paper describes the datasets, which were acquired from Type 1 (n = 12) and Type 2 (n = 100) diabetic patients in Shanghai, China. The data were acquired under real-life conditions. The datasets contain the clinical characteristics, laboratory measurements and medications of the patients. In addition, continuous glucose monitoring readings covering periods of 3 to 14 days are provided together with daily dietary information. The datasets can contribute to the development of data-driven algorithms/models and diabetes monitoring/managing technologies.

www.nature.com/scientificdata

review for T1DM 12, 51.7% were with small samples, 29.3% with small-medium samples, 17.2% with simulated data and 1.7% with samples of over 50 patients. Another limitation pointed out by the reviews was the limited availability of freely accessible data. Most data are credentialed or inaccessible because of ethical restrictions and data privacy. We summarize recently studied and popular T1DM and T2DM datasets in Table 1.
In T1DM research, both real patient data and in silico simulated patient data have been well studied. Simulators can conveniently provide and customize detailed data for virtual diabetic patients based on their dietary and treatment strategies. The UVA/Padova T1DM simulator 14, which was approved by the Food and Drug Administration (FDA) and provides 30 different virtual patients freely, has been widely employed. Virtual diabetes simulators have been studied in tasks such as glycemic event identification, BG control 15 and prediction 14,16-18. The simulators can generate as many BG instances as needed for each patient 14.
As a public dataset, OhioT1DM 18-22 is a comprehensive dataset of real T1DM patients in the United States, publicly released by Ohio University and containing data from 12 patients. In contrast to OhioT1DM, the D1NAMO 23 dataset focused on diabetes management. It comprised 20 healthy people and nine T1DM patients, with additional patient information such as BG measurements, food pictures, breathing signals and accelerometer outputs. A clinical dataset 18,24 of 10 T1DM adults from the ABC4D project, collected with CGM sensors, was used in a deep learning framework for accurate glucose forecasting. Weinstock 25 collected diabetes-related data from older adults with type 1 diabetes (≥60 years of age, diabetes duration ≥20 years). This dataset consisted of 14 days of CGM data together with information on insulin, other medications and patient demographics from 201 patients. It was designed to analyze the risk factors for severe hypoglycemia in older patients. Fox et al. 26 collected CGM records from 40 T1DM patients over three years (≥1900 days of BG measurements, ≥550k distinct glucose measurements) and developed a deep multi-output forecasting algorithm.
T2DM datasets are less common than T1DM datasets 27,28. CGM data from both T1DM and T2DM patients, collected over periods ranging from 1.3 to 7 days, were employed to predict future BG levels in order to prevent hyperglycemia or hypoglycemia 29. The Maryland data 27 contained 56,000 SMBG data points collected in a 1-year prospective study, in which patients were treated with a variety of medications, including oral antihyperglycemic agents and insulin. The Maastricht Study 28,30, an observational, prospective, population-based cohort study, focused on the aetiology, pathophysiology, complications and comorbidities of T2DM and was characterized by an extensive phenotyping approach.
Existing diabetes data are used not only in BG prediction 31, but also in other diabetes-related fields, such as the generation of BG control strategies 15 and the study of how external factors influence the blood glucose level. However, many diabetes datasets are limited in the number of patients, the ethnic groups and regions in which they were collected, and the types of diabetes mellitus they cover, which has restricted diabetes-related research.
It is known that dietary intake, exercise and medication are the main factors affecting the BG level 32,33. Collecting this external information is therefore essential for the datasets, although it is a tedious task. In particular, eating habits are strongly influenced by ethnic group and region, e.g., Chinese dietary habits are very complicated 34. Therefore, two datasets from T1DM and T2DM patients in Shanghai, China, with dietary information, clinical characteristics, laboratory measurements and medications of the patients, were constructed. To the best of our knowledge, these are the first publicly available datasets to include rich information for people with T1DM and T2DM in China. The datasets could contribute to research in data-driven machine learning.

more than 18 years of age, willing to sign the informed consent form and with CGM recording for at least 3 days. Patients were excluded if they reported alcohol or drug abuse, were unable to comply with the study, or were judged by the investigators to be unsuitable for the study. Data were anonymized to protect the sensitive information of the patients.

Methods
Clinical and laboratory measurements. A standard questionnaire was administered by trained research staff to obtain demographic information. Information on the diagnosis and treatment of diabetes, duration of diabetes, laboratory measurements, comorbidities and pharmacologic treatments was collected from medical records. Each patient underwent a physical examination including measurement of height and weight. Body mass index (BMI) was calculated as weight divided by height squared (kg/m²). Each patient wore a flash glucose monitoring device (FreeStyle Libre H, Abbott Diabetes Care, Witney, UK) to measure interstitial glucose levels continuously for up to 14 days. CGM glucose data were automatically stored on the sensor every 15 minutes. The data were obtained by scanning the glucose sensor with the reader and uploaded using the device software. Available laboratory measurements (≤6 months before or after CGM), including glucose metabolism, lipid profile and renal function, were obtained from medical records. All dietary intake, including the exact time of consumption and a weighed food record, was reported by the patients. Hypoglycemic medications taken during CGM were also recorded.
This study was approved by the Ethics Committee of Shanghai Fourth People's Hospital and Shanghai East Hospital affiliated to Tongji University in accordance with the Declaration of Helsinki. Informed consent was obtained from all the patients.

CGM parameters. Time in range (TIR), one of the critical CGM-derived metrics, reflects the glucose variability and evaluates the quality of glycemic control 35. It is associated with microvascular complications and macrovascular outcomes of diabetes. TIR is defined as the percentage of time spent in the target glucose range of 70-180 mg/dL. Time below range (TBR) and time above range (TAR) are the percentages of time when blood glucose is below 70 mg/dL and above 180 mg/dL, respectively. For most patients with T1DM or T2DM, the CGM targets recommended by the Advanced Technologies & Treatments for Diabetes (ATTD) consensus are ≥70% for TIR, ≤25% for TAR and ≤4% for TBR 36.
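The definitions of TIR, TBR and TAR can be illustrated with a minimal sketch. It assumes equally spaced CGM readings in mg/dL (so that the share of readings equals the share of time); the function names and the example values are hypothetical and are not taken from the released analysis notebook.

```python
from typing import Dict, Optional, Sequence


def cgm_time_fractions(glucose_mg_dl: Sequence[Optional[float]],
                       low: float = 70.0,
                       high: float = 180.0) -> Dict[str, float]:
    """Return TBR, TIR and TAR as percentages of the valid readings.

    Assumes equally spaced readings, so counting readings is
    equivalent to measuring time.
    """
    readings = [g for g in glucose_mg_dl if g is not None]
    if not readings:
        raise ValueError("no valid CGM readings")
    n = len(readings)
    below = sum(1 for g in readings if g < low)    # below 70 mg/dL
    above = sum(1 for g in readings if g > high)   # above 180 mg/dL
    in_range = n - below - above                   # 70-180 mg/dL inclusive
    return {
        "TBR": 100 * below / n,
        "TIR": 100 * in_range / n,
        "TAR": 100 * above / n,
    }


def meets_attd_targets(fractions: Dict[str, float]) -> bool:
    """Check the ATTD targets quoted above: TIR >= 70%, TAR <= 25%, TBR <= 4%."""
    return (fractions["TIR"] >= 70
            and fractions["TAR"] <= 25
            and fractions["TBR"] <= 4)


# Synthetic example values (not patient data).
example = [65, 80, 120, 150, 179, 180, 181, 200, 95, 110]
fractions = cgm_time_fractions(example)
print(fractions, meets_attd_targets(fractions))
```

In this synthetic example one reading in ten is below 70 mg/dL, so the TBR target of ≤4% is not met even though the TIR target is.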
Analysis for CGM data. A clinically important task in diabetes management is the prevention of hypo/hyperglycemic events 37. Algorithms to prevent hypo/hyperglycemic events can be obtained by generating hypo/hyperglycemia alerts on the basis of ahead-of-time prediction of the glucose concentration, using past CGM data and suitable time-series models.
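As a rough illustration of prediction-based alerting, the sketch below replaces the time-series model with a naive linear extrapolation of the last two readings. This stand-in is an assumption made for brevity, not a method used in this work; the function names, the two-step (30-minute) horizon and the example values are hypothetical, and the 70 and 180 mg/dL thresholds are the range limits quoted in the CGM parameters paragraph.

```python
from typing import Optional, Sequence


def linear_forecast(history: Sequence[float], steps_ahead: int) -> float:
    """Extrapolate the last observed change `steps_ahead` samples forward."""
    if len(history) < 2:
        raise ValueError("need at least two readings")
    slope = history[-1] - history[-2]   # change per 15-minute sample
    return history[-1] + slope * steps_ahead


def glucose_alert(history: Sequence[float],
                  steps_ahead: int = 2,
                  low: float = 70.0,
                  high: float = 180.0) -> Optional[str]:
    """Return 'hypo', 'hyper' or None from the predicted glucose value."""
    predicted = linear_forecast(history, steps_ahead)
    if predicted < low:
        return "hypo"
    if predicted > high:
        return "hyper"
    return None


# Synthetic readings in mg/dL at 15-minute intervals (not patient data).
falling = glucose_alert([110, 95, 82])    # predicted 56 mg/dL
rising = glucose_alert([150, 165, 178])   # predicted 204 mg/dL
stable = glucose_alert([100, 102, 101])   # predicted 99 mg/dL
print(falling, rising, stable)
```

A realistic system would substitute a fitted time-series model for `linear_forecast` while keeping the same thresholding step.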
Auto-correlation 38 represents the degree of similarity between a given time series and a lagged version of itself over successive time intervals. It can help to uncover hidden patterns in data. Additionally, analyzing the autocorrelation function (ACF) and partial autocorrelation function (PACF) together is necessary for selecting appropriate time-series models, e.g., ARIMA 39. The auto-correlation coefficient at lag k is

\[
\rho_k = \frac{E\left[(x_t - \mu)(x_{t+k} - \mu)\right]}{\sigma^2}
\]

where x_t is the observation at time t, k is the lag, E is the expected value operator, μ is the mean and σ² is the variance of the time series. ρ_k measures the correlation between two observations separated by a lag k in the time series.
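The coefficient ρ_k can be estimated from a finite series as sketched below. The sketch assumes the common biased sample estimator (the lag-k covariance and the variance are both divided by the series length); the function name and the synthetic daily sinusoid are hypothetical and only illustrate how a 24-hour pattern appears in the ACF of 15-minute data.

```python
import math
from typing import List, Sequence


def autocorrelation(series: Sequence[float], max_lag: int) -> List[float]:
    """Sample auto-correlation coefficients for lags 0..max_lag."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must be in [0, len(series))")
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    variance = sum(d * d for d in deviations) / n
    if variance == 0:
        raise ValueError("series is constant")
    acf = []
    for k in range(max_lag + 1):
        # Biased estimate of E[(x_t - mu)(x_{t+k} - mu)].
        covariance = sum(deviations[t] * deviations[t + k]
                         for t in range(n - k)) / n
        acf.append(covariance / variance)
    return acf


# Synthetic signal: 96 samples per day (15-minute interval), 3 days.
samples_per_day = 96
signal = [120 + 30 * math.sin(2 * math.pi * t / samples_per_day)
          for t in range(3 * samples_per_day)]
acf = autocorrelation(signal, max_lag=samples_per_day)
print(round(acf[48], 3), round(acf[96], 3))
```

For this signal the coefficient is negative at the half-day lag (48 samples) and positive again at the one-day lag (96 samples), which is the signature of a daily periodic pattern.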

Data Records
The ShanghaiT1DM and ShanghaiT2DM datasets comprise two folders named "Shanghai_T1DM" and "Shanghai_T2DM" and two summary sheets named "Shanghai_T1DM_Summary" and "Shanghai_T2DM_Summary". The datasets can be downloaded from the Figshare repository 40.
The "Shanghai_T1DM" folder and "Shanghai_T2DM" folder contain 3 to 14 days of CGM data corresponding to 12 patients with T1DM and 100 patients with T2DM, respectively. Of note, one patient might have multiple periods of CGM recordings due to different visits to the hospital, and these were stored in different Excel tables. Collecting data from different periods in one patient can reflect changes in diabetes status during follow-up. Each Excel table is named by the patient ID, period number and the start date of the CGM recording. Thus, for 12 patients with T1DM, there are 8 patients with 1 period of CGM recording and 2 patients with 3 periods, giving a total of 16 Excel tables in the "Shanghai_T1DM" folder. As for 100 patients with T2DM, there are 94 patients with 1 period of CGM recording, 6 patients with 2 periods, and 1 patient with 3 periods, amounting to 109 Excel tables in the "Shanghai_T2DM" folder. Overall, the Excel tables include CGM BG values every 15 minutes, capillary blood glucose (CBG) values, blood ketone, self-reported dietary intake, insulin doses and non-insulin hypoglycemic agents. Blood ketone was measured when diabetic ketoacidosis was suspected because of a considerably high glucose level. Insulin administration includes continuous subcutaneous insulin infusion using an insulin pump, multiple daily injections with an insulin pen, and insulin given intravenously in case of an extremely high BG level.
Each Excel table in the "Shanghai_T1DM" folder and "Shanghai_T2DM" folder contains the following data fields:
<Date> Recording time of the CGM data.
<CGM> CGM data recorded every 15 minutes.
<CBG> CBG level measured by the glucose meter.
<Blood ketone> Plasma β-hydroxybutyrate measured with ketone test strips (Abbott Laboratories, Abbott Park, Illinois, USA).
<Dietary intake> Self-reported time and weighed food intake.
<Insulin dose-s.c.> Subcutaneous insulin injection with insulin pen.
<Insulin dose-i.v.> Dose of intravenous insulin infusion.
<Non-insulin hypoglycemic agents> Hypoglycemic agents other than insulin.
<CSII-bolus insulin> Dose of insulin delivered before a meal through insulin pump.
<CSII-basal insulin> The rate (IU per hour) at which basal insulin was continuously infused through insulin pump.
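To make the record layout concrete, the following sketch models a few of the listed fields as a row type and checks whether consecutive rows respect the 15-minute sampling interval. The class and attribute names are hypothetical (they are not the column headers of the Excel tables), the mg/dL unit is an assumption, and the rows are synthetic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Sequence, Tuple


@dataclass
class CgmRow:
    """One row of a recording; only a subset of the fields is modeled."""
    date: datetime                        # <Date>
    cgm_mg_dl: Optional[float]            # <CGM>, unit assumed
    cbg_mg_dl: Optional[float] = None     # <CBG>, sparse
    dietary_intake: Optional[str] = None  # <Dietary intake>, free text


def find_gaps(rows: Sequence[CgmRow],
              expected: timedelta = timedelta(minutes=15)
              ) -> List[Tuple[datetime, datetime]]:
    """Return (previous, current) timestamps separated by more than `expected`."""
    ordered = sorted(rows, key=lambda r: r.date)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.date - prev.date > expected:
            gaps.append((prev.date, cur.date))
    return gaps


# Synthetic rows (not patient data); the 08:30 sample is missing.
rows = [
    CgmRow(datetime(2021, 1, 1, 8, 0), 110.0),
    CgmRow(datetime(2021, 1, 1, 8, 15), 115.0),
    CgmRow(datetime(2021, 1, 1, 8, 45), 130.0, cbg_mg_dl=138.0),
    CgmRow(datetime(2021, 1, 1, 9, 0), 126.0),
]
gaps = find_gaps(rows)
print(gaps)
```

A check of this kind is useful before fitting time-series models, which generally assume a regular sampling grid.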
The summary sheets summarize the clinical characteristics, laboratory measurements and medications of the patients included in this study, with each row corresponding to one Excel table in the "Shanghai_T1DM" and "Shanghai_T2DM" folders. Clinical characteristics include patient ID, gender, age, height, weight, BMI, smoking and drinking history, type of diabetes, duration of diabetes, diabetic complications, comorbidities and occurrence of hypoglycemia. Laboratory measurements contain fasting and 2-hour postprandial plasma glucose/C-peptide/insulin, hemoglobin A1c (HbA1c), glycated albumin, total cholesterol, triglyceride, high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, creatinine, estimated glomerular filtration rate, uric acid and blood urea nitrogen. Both hypoglycemic agents and medications given for other diseases before the CGM reading were also recorded.

Technical Validation
The characteristics of the Chinese diabetes datasets. The detailed characteristics of the patients in the ShanghaiT1DM and ShanghaiT2DM datasets are summarized in Table 2. The age of the ShanghaiT1DM group and the ShanghaiT2DM group was 57.8 ± 11.1 and 60.2 ± 13.7 years, respectively. There was no statistically significant difference in age between the two groups. This is because most of the patients (10/12) in the ShanghaiT1DM group belonged to a subtype of T1DM called "latent autoimmune diabetes in adults", which is characterized by slow autoimmune β-cell destruction and an older mean age at onset of diabetes 1. Women accounted for 58.3% of the ShanghaiT1DM group and 44% of the ShanghaiT2DM group. In addition, fasting plasma glucose, 2-hour postprandial plasma glucose and HbA1c were comparable between the two groups. However, the ShanghaiT2DM group had higher BMI values than the ShanghaiT1DM group (p < 0.05).
To show the size of these two datasets more intuitively, Table 3 lists the diabetes type, the study period, the sampling interval of the CGM devices, the number of patients, the total number of recording files and the total number of CGM measurements of the ShanghaiT1DM and ShanghaiT2DM. A given patient may have more than one recording period. Fig. 1 shows the number of recording files by CGM data size in days in the ShanghaiT1DM and ShanghaiT2DM. The collected CGM data size varied from 3 days to 14 days.
We summarized the hypo/hyperglycemia events and calculated the auto-correlation coefficients of the BG time series of the two datasets. Hypoglycemia and hyperglycemia events are two potential risk factors for complications in diabetes. Hence, the percentages of time in hypoglycemia (TBR) and hyperglycemia (TAR) for each patient are shown in Fig. 2. The horizontal axis represents the recording files of the patients in order of increasing TBR, while the vertical axis represents the percentage of time (TAR, TIR and TBR) during the data collection period. Higher values of TAR and TBR indicate that the patient's condition was more serious. To give a clearer view of the TBR, TIR and TAR in the two datasets, we calculated the mean ± standard deviation of these values for each dataset. The mean ± standard deviation of the TIR was 54.7 ± 14.5% for the ShanghaiT1DM and 77.7 ± 18.1% for the ShanghaiT2DM. We noted that the average TIR was higher in T2DM patients than in T1DM patients (Fig. 2).
In addition, because the behavior information collected for individual patients differed within each dataset, we randomly chose three patients from each dataset for the auto-correlation graphs of the BG time series in Fig. 3. The auto-correlation coefficients identify seasonality and trend in time-series data. Patients in ShanghaiT2DM (Fig. 3b) showed a more noticeable 24-hour periodic pattern than those in ShanghaiT1DM (Fig. 3a).
Since BG levels might differ between blood glucose monitoring methods, we conducted a comparative analysis of the blood glucose measured by CGM and CBG in Figs. 4 and 5. Because CBG was collected more sparsely than CGM, we plotted only the time stamps at which both measurements were available. Two patients were randomly selected from each dataset. The results showed that the CBG values were usually greater than the CGM readings.
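Restricting the comparison to time stamps with both measurements can be sketched as follows. The function name, the dictionary-based representation and the synthetic values are assumptions made for illustration; they do not reproduce the analysis behind Figs. 4 and 5.

```python
from datetime import datetime
from statistics import mean
from typing import Dict, List, Tuple


def paired_differences(cgm: Dict[datetime, float],
                       cbg: Dict[datetime, float]
                       ) -> List[Tuple[datetime, float]]:
    """Return (timestamp, CBG - CGM) for time stamps present in both series."""
    shared = sorted(cgm.keys() & cbg.keys())
    return [(ts, cbg[ts] - cgm[ts]) for ts in shared]


# Synthetic values in mg/dL (not patient data).
cgm_series = {
    datetime(2021, 1, 1, 7, 0): 100.0,
    datetime(2021, 1, 1, 7, 15): 110.0,
    datetime(2021, 1, 1, 7, 30): 130.0,
    datetime(2021, 1, 1, 7, 45): 150.0,
}
cbg_series = {
    datetime(2021, 1, 1, 7, 15): 118.0,
    datetime(2021, 1, 1, 7, 45): 161.0,
    datetime(2021, 1, 1, 12, 0): 140.0,  # no matching CGM sample, dropped
}

pairs = paired_differences(cgm_series, cbg_series)
mean_difference = mean(diff for _, diff in pairs)
print(len(pairs), mean_difference)
```

A positive mean difference corresponds to CBG values that are on average greater than the matched CGM readings.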
Comparison to other datasets. Widely used datasets include the SimulatorT1DM and the OhioT1DM (see Table 3). To show more specifically how the newly constructed datasets differ from existing data, comparisons are presented in Table 3 and Figs. 3c,d and 6.
The auto-correlation coefficients of the ShanghaiT1DM (Fig. 3a) and OhioT1DM (Fig. 3d) indicated that the two real T1DM datasets shared similar trends and periodic patterns, which makes it possible to combine the two datasets in certain research. The SimulatorT1DM (Fig. 3c) had strong regularity because it was simulated.
Achieving higher TIR has been shown to reduce the percentages of time in the hypoglycemic and hyperglycemic ranges and the complications of diabetes. In Fig. 6, we found that the patients in the OhioT1DM had lower mean TBR values than those in the ShanghaiT1DM (Fig. 2), which means that they had better control of hypoglycemia. In addition, patients in the ShanghaiT2DM (Fig. 2) had the highest mean TIR values, which suggests that people with T2DM have better glycemic control overall than people with T1DM. The virtual patients from the UVA/Padova simulator (Fig. 6) had worse control of hypoglycemia, which may be because the glycemic control strategy of the virtual patients was based on a fixed formula and therefore could not respond to hypoglycemia in a timely manner. By comparing the ShanghaiT1DM and OhioT1DM (Fig. 6), we found that the standard deviations of TBR, TIR and TAR in the ShanghaiT1DM were higher than those in the OhioT1DM.

Code availability
The code for the analysis of the datasets and the generation of the figures and tables is provided as a Jupyter notebook named "data_analysis.ipynb", which can be accessed in the Figshare repository 40. The notebook can be executed with Python 3.6 and allows for reproducibility and code reuse.