DiaTrend: A dataset from advanced diabetes technology to enable development of novel analytic solutions

Objective digital data is scarce yet needed in many domains to enable research that can transform the standard of healthcare. While data from consumer-grade wearables and smartphones is more accessible, there is a critical need for similar data from clinical-grade devices used by patients with a diagnosed condition. The prevalence of wearable medical devices in the diabetes domain sets the stage for unique research and development within this field and beyond. However, the scarcity of open-source datasets presents a major barrier to progress. To facilitate broader research on diabetes-relevant problems and accelerate development of robust computational solutions, we provide the DiaTrend dataset. The DiaTrend dataset is composed of intensive longitudinal data from wearable medical devices, including a total of 27,561 days of continuous glucose monitor data and 8,220 days of insulin pump data from 54 patients with diabetes. This dataset is useful for developing novel analytic solutions that can reduce the disease burden for people living with diabetes and increase knowledge on chronic condition management in outpatient settings.


Background & Summary
Advanced technologies like continuous glucose monitors (CGMs) and insulin pumps are transforming the standard of care for diabetes management [1][2][3]. The ubiquitous nature of these devices enables real-time monitoring and treatment in daily living; this is a major advantage over single point-in-time alternatives like glucose meters and insulin pens. Research shows that many patients with diabetes achieve better outcomes with CGMs and insulin pumps 4,5. However, research also shows that digital data from these devices is significantly underutilized to optimize outcomes 6,7. Meanwhile, the next generation of solutions needed to advance diabetes care, such as the hybrid and fully closed-loop artificial pancreas 8,9, depends substantially on continuous data from CGMs and insulin pumps. A major barrier to progress in this field is access to rich datasets that facilitate the development of novel analytic solutions. In addition, many related but disconnected data streams are seldom reviewed or analyzed together, which further limits our understanding of diabetes management and even prevention 10,11. To advance research and development of robust analytic solutions for the growing population of people with diabetes, there is a critical need for open datasets to understand outpatient management, develop interventions, and build clinically-relevant decision-support solutions.
Despite the recognized need for open datasets to enable research 12, few datasets are available for data-driven research in the diabetes domain. One is the OhioT1DM dataset 13, which consists of eight weeks of CGM, insulin pump, physiological sensor, and self-reported event data from 12 people with type 1 diabetes, while another is an N-of-1 dataset, which consists of two weeks of blood glucose, insulin, and carbohydrate intake logs 14. To broaden the scope of research on diabetes and chronic conditions in general, and accelerate development of robust computational solutions, we provide the DiaTrend dataset. The DiaTrend dataset includes CGM and insulin pump data from 54 patients with type 1 diabetes. This dataset was created from a subset of two larger studies focused on: 1) developing computational tools for self-management of diabetes 6, and 2) evaluating a digital intervention for young adults with type 1 diabetes 15. The provided dataset includes time-aligned blood glucose samples recorded on average every 5 minutes with FDA-approved CGMs by Dexcom 16, Abbott 17, and Medtronic 18, and insulin pump data comprising basal and bolus insulin doses, carbohydrate intake logs, and other pump settings such as the insulin-carb ratio. Figure 1 presents an overview of the data collection process and data provided.

Participants
The DiaTrend dataset includes CGM and insulin pump data from a total of 54 patients with type 1 diabetes (age: 19-74 years, gender: 17 males, 37 females). Table 1 provides an overview of the demographic and clinical characteristics of patients in this dataset, including the distribution across age groups, gender, race, diabetes type, and hemoglobin A1C. Participants were recruited through two independent studies. Study 1 (also known as Digital SMD) recruited patients from Dartmouth Health in 2019, while study 2 (also known as SweetGoals 15) is an ongoing randomized controlled trial that recruits patients through social media and online platforms. Both studies were approved by the Committee for Protection of Human Subjects at Dartmouth College and all participants provided verbal and written consent prior to joining either study. In addition, participants provided consent to share their data openly with the broader research community.
Cohort 1 (n=17), from the Digital SMD study 6, includes persons with type 1 diabetes between 25 and 74 years old who use a CGM and insulin pump for daily management of their condition and consented to share their retrospective device data for research. Meanwhile, cohort 2 (n=37), from the SweetGoals study 15, includes persons between 19 and 29 years old who have had type 1 diabetes for longer than 18 months, use a Glooko-compatible glucometer or CGM, reported a clinical visit within the 6 months preceding the recruitment date, and self-reported their most recent Hemoglobin A1C (HA1C) value as >7.5%. It is important to note that all device data included in the DiaTrend dataset was collected at baseline (i.e., prior to any intervention). Additionally, each individual's dataset spans a different time period, depending on the retrospective data available at the time of recruitment. Given our focus on advanced diabetes technology for novel analytic solutions, only participants who use CGMs (with <30% missing data) and insulin pumps for daily management are included in the provided dataset.
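The <30% missing-data criterion for CGM data can be illustrated with a minimal sketch. The paper does not state how missingness was computed, so the definition below is an assumption: the expected number of readings is taken as the number of 5-minute slots between a subject's first and last CGM timestamp. The function names `missing_fraction` and `meets_inclusion` are hypothetical.

```python
from datetime import datetime, timedelta


def missing_fraction(timestamps, interval_minutes=5):
    """Estimate the fraction of expected CGM readings that are absent.

    Assumption (not from the paper): the expected count is the number of
    `interval_minutes` slots between the first and last reading, inclusive.
    """
    if len(timestamps) < 2:
        return 0.0
    ordered = sorted(timestamps)
    span = ordered[-1] - ordered[0]
    # Floor division of two timedeltas yields an integer slot count.
    expected = span // timedelta(minutes=interval_minutes) + 1
    # Cap at the expected count so extra or duplicate readings
    # cannot produce a negative missing fraction.
    observed = min(len(ordered), expected)
    return 1.0 - observed / expected


def meets_inclusion(timestamps, max_missing=0.30):
    """Apply the <30% missing-data threshold described in the text."""
    return missing_fraction(timestamps) < max_missing


# Illustrative data: 100 five-minute slots with a 20-reading gap.
start = datetime(2020, 1, 1)
sample = [start + timedelta(minutes=5 * i) for i in range(100)
          if not 40 <= i < 60]
print(round(missing_fraction(sample), 2), meets_inclusion(sample))
```

Other definitions (for example, counting expected readings per calendar day) would give somewhat different fractions for the same subject.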

Dataset Description
The DiaTrend dataset includes a total of 27,561 days of CGM data and 8,220 days of insulin pump data from 54 patients with type 1 diabetes. In addition, the DiaTrend dataset includes demographic and clinical characteristics for each subject, including metrics such as age, gender, race, diabetes type, and HA1C (see Table 1). There is an average of 510 days (range: 31-1885 days) of CGM data per subject, and an average of 152 days (range: 31-780 days) of insulin pump data per subject (see Fig. 2). Within the insulin pump data, there is an average of 993 total bolus doses per subject (range: 132-4939 doses) and an average of 438 total carb inputs per subject (range: 1-2310 inputs) (see Fig. 3). These data were collected as part of the Digital SMD 6 and SweetGoals 15 studies, during which each patient's retrospective CGM and insulin pump data was downloaded through a third-party application (i.e., Tidepool 50 and Glooko 51). It is important to note that since the SweetGoals study is a randomized controlled trial, only retrospective baseline data collected during the initial screening is included as part of the DiaTrend dataset (i.e., the provided data does not include sensor data from the intervention period of that study). In addition, HA1C, the primary clinically-validated metric for assessing glycemic control, was collected via the patient's electronic health record (i.e., the most recent HA1C) in the Digital SMD study and via a mail-in home test in the SweetGoals study at the time of the baseline assessment (approximately the endpoint of the device data).

Data Records
All data records in the DiaTrend dataset are stored and accessible via the Synapse platform 52. The deposited data consists of 54 Excel files, one file for each subject. Each file has a CGM sheet that provides blood glucose data collected by the CGM. The CGM sheet includes 2 columns, namely, date and mg/dL. In addition, each subject's file also has a Bolus sheet, which describes bolus insulin doses and meal announcements (i.e., user-entered estimates of carbohydrate content in meals logged to calculate bolus insulin needed to metabolize glucose from the meal consumed 53

Technical Validation
For each patient included in the DiaTrend dataset, we provide an overview of their blood glucose data using clinically-validated metrics for assessing glycemic control 54,55. This includes the percentage of all blood glucose readings in 5 clinically-relevant categories, namely, very low (< 54 mg/dL), low (54-69 mg/dL), target range (70-180 mg/dL), high (181-250 mg/dL), and very high (> 250 mg/dL). From Fig. 4a, we can observe that blood glucose is highly variable and only a minority of patients living with diabetes (less than 10% in our dataset) meet the clinical target of maintaining blood glucose within the target range of 70-180 mg/dL for more than 70% of the time 54. Fig. 4b presents histograms for daily mean blood glucose (mean = 187 mg/dL), daily glycemic variability (mean = 0.33), and daily time in range (mean = 47%). From this figure, we can observe a normal distribution for each clinically-relevant metric in the DiaTrend dataset. Similarly, we provide an overview of each patient's insulin pump data using box plots and histograms. Fig. 5a and 5b show box plots with descriptive statistics associated with bolus insulin doses and carb inputs, respectively, for each subject. Additionally, Fig. 5c shows the distributions of total daily bolus insulin doses (units) and total daily carb inputs (g). From this figure, we can observe a mean total daily bolus of 24 units and a mean total daily carb input of 115 g, both with a positively skewed distribution. In particular, we observe a high number of days (∼1400 days) with very low carb inputs (∼0 g); this could be indicative of missed mealtime boluses (i.e., no bolus insulin used during mealtimes), a common contributor to poor glycemic outcomes [56][57][58].
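The five glycemic categories and the time-in-range target described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' analysis code: the function names are hypothetical, readings are assumed to be mg/dL values, and glycemic variability is assumed here to be the coefficient of variation (standard deviation divided by mean), which the text does not state explicitly.

```python
from collections import Counter
from statistics import mean, pstdev

CATEGORIES = ("very_low", "low", "target", "high", "very_high")


def glycemic_category(mg_dl):
    """Map one reading (mg/dL) to the five categories listed in the text."""
    if mg_dl < 54:
        return "very_low"
    if mg_dl < 70:       # 54-69 mg/dL
        return "low"
    if mg_dl <= 180:     # 70-180 mg/dL
        return "target"
    if mg_dl <= 250:     # 181-250 mg/dL
        return "high"
    return "very_high"   # > 250 mg/dL


def category_percentages(readings):
    """Percentage of readings that fall in each category."""
    if not readings:
        raise ValueError("no readings supplied")
    counts = Counter(glycemic_category(v) for v in readings)
    return {c: 100.0 * counts[c] / len(readings) for c in CATEGORIES}


def meets_time_in_range_target(readings, target_pct=70.0):
    """True if more than `target_pct` percent of readings are in 70-180 mg/dL."""
    return category_percentages(readings)["target"] > target_pct


def coefficient_of_variation(readings):
    """Assumed definition of glycemic variability: population SD / mean."""
    return pstdev(readings) / mean(readings)


# Illustrative values only, not taken from the dataset.
example = [50, 60, 100, 150, 200, 300, 120, 130, 140, 160]
print(category_percentages(example))
print(meets_time_in_range_target(example))
```

Applying the same functions to one day of readings at a time would give per-day values comparable in spirit to the daily metrics in Fig. 4b.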

Limitations
There are some important considerations and limitations associated with the DiaTrend dataset provided in this paper. First, there is imbalance in the representation of subjects across the dimensions of race, gender, and age. More specifically, the majority of patients whose CGM and insulin pump data is provided (i.e., 48 out of 54 or 89%) are non-Hispanic White/Caucasian. Also, this dataset includes a lower representation of males (n=17 out of 54 or 32%) compared to females, and a lower representation of older adults (e.g., for age ≥ 45 years old, n=12 or 22%). The limitation with regard to race (i.e., low representation of participants from non-White/Caucasian races including Asian and Black/African Americans) is partly due to the geographical location (i.e., New Hampshire) from which some participants were recruited. However, the imbalance in representation also underscores racial disparities which have been identified in prior literature relating to access and use of advanced diabetes technologies 59, particularly CGMs and insulin pumps. Additionally, the limitation with regard to age (i.e., low representation of older adults and higher representation of young adults) is primarily due to the targeted focus on young adults with type 1 diabetes in the SweetGoals study 15. A second limitation of the DiaTrend dataset is that it lacks full temporal alignment in the CGM and insulin pump data for each participant. This difference is apparent from Fig. 2, which shows more CGM data than insulin pump data for a number of subjects. While the reason for this is unknown, we suspect that it is primarily due to lower data storage capacity on insulin pumps compared to CGMs, which in turn limits the amount of retrospective data available for download from insulin pumps. Third, basal insulin data is not available for subjects from cohort 2 (37 out of 54). This missing data stream might limit research efforts that require basal rate for analysis. However, despite the aforementioned limitations, the DiaTrend dataset represents one of the largest open-source datasets currently available in the diabetes domain. This critical resource provides a unique opportunity to advance development of novel data-driven solutions that can improve the lives of people living with diabetes. In addition, this dataset provides a necessary benchmark to evaluate the generalizability of numerous diabetes-relevant algorithms in literature [19][20][21][22][23][24][25][26][27][28][29][30][31][32][33][34][35][36].

Usage Notes
The DiaTrend dataset is provided for research and educational purposes that support the development of novel data-driven solutions for the diabetes community and beyond. Consistent with exemplar studies 13,60,61, we have set governance structures in place to balance the need for open datasets that advance research with the need to protect the privacy of participants.
Researchers interested in accessing the DiaTrend dataset should complete the following steps:
1. Register for a Synapse account (www.synapse.org).
2. Become a Synapse Certified User with a validated user profile.
3. Submit an Intended Data Use statement.
4. Agree to the Conditions of Use.
The conditions of use are as follows:
• You confirm that you will not attempt to re-identify research participants for any reason, including for re-identification theory research.
• You commit to keeping the DiaTrend dataset confidential and secure.
• You understand that these data may not be used for commercial advertisement or to re-contact research participants.
• You agree to acknowledge the research participants as data contributors, study investigators, and this paper on all publications or presentations that result from using the DiaTrend dataset.

Figure 1. Overview of the data collection process and data provided in the DiaTrend dataset.

Figure 4. Descriptive summary of CGM data in the DiaTrend dataset. (a) The percent of blood glucose samples in 5 clinically-relevant categories. (b) The distributions of daily mean blood glucose, daily glycemic variability, and daily time in [target] range.

Table 1. DiaTrend dataset: Demographic and clinical characteristics of patients with diabetes.

Figure 2. Overview of the number of days of sensor data per patient in the DiaTrend dataset.

Table 2. Overview of the data records, format, and description in the DiaTrend dataset.

). The Bolus sheet includes the following 7 columns: date, normal, carbInput, insulinCarbRatio, bgInput, recommended.carb, and recommended.net. It is important to note that 37 (out of 54) Bolus sheets include 4 more columns, namely, recommended.correction, insulinSensitivityFactor, targetBloodGlucose, and insulinOnBoard. Additionally, only 17 subject files have a Basal sheet, which describes the subject's basal infusions in 3 columns, namely, date, duration, and rate. Each row in all three of the Excel sheets refers to one record collected at the timestamp given in the column titled 'date'. Excluding the date column, the rest of the data can be read as floating point numbers. Table 2 provides a detailed breakdown of each data record, the format, and a description.

Figure 3. Overview of the total number of bolus and carb input data per patient in the DiaTrend dataset.
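The sheet layout described in the Data Records section can be captured in a small schema check. The sketch below uses only the column names listed in the text; the helper names `describe_bolus_header` and `parse_row`, the treatment of empty cells as `None`, and the example row are assumptions made for illustration. It operates on header lists and rows that have already been read from a subject's Excel file by whatever reader is preferred.

```python
# Column names as listed in the Data Records description.
CGM_COLUMNS = ["date", "mg/dL"]
BASAL_COLUMNS = ["date", "duration", "rate"]
BASE_BOLUS_COLUMNS = [
    "date", "normal", "carbInput", "insulinCarbRatio",
    "bgInput", "recommended.carb", "recommended.net",
]
EXTRA_BOLUS_COLUMNS = [
    "recommended.correction", "insulinSensitivityFactor",
    "targetBloodGlucose", "insulinOnBoard",
]


def describe_bolus_header(header):
    """Report which documented Bolus columns a sheet header contains."""
    present = set(header)
    return {
        "missing_base": [c for c in BASE_BOLUS_COLUMNS if c not in present],
        "has_extra_columns": all(c in present for c in EXTRA_BOLUS_COLUMNS),
    }


def parse_row(row):
    """Convert every non-date field to float, as the text suggests.

    Assumption: empty cells arrive as None or "" and are kept as None.
    """
    parsed = {}
    for key, value in row.items():
        if key == "date":
            parsed[key] = value          # timestamp left untouched
        elif value is None or value == "":
            parsed[key] = None
        else:
            parsed[key] = float(value)
    return parsed


# Illustrative header and row, not copied from a real subject file.
header = BASE_BOLUS_COLUMNS + EXTRA_BOLUS_COLUMNS
print(describe_bolus_header(header))
print(parse_row({"date": "2020-01-01 08:00:00", "normal": "2.5", "carbInput": ""}))
```

A check of this kind makes it easy to tell the 37 files with the four additional Bolus columns apart from the rest before running an analysis that depends on them.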