Background & Summary

Advanced technologies like continuous glucose monitors (CGMs) and insulin pumps are transforming the standard of care for diabetes management1,2,3. Because these devices are worn continuously, they enable real-time monitoring and treatment in daily living, a substantial advantage over single point-in-time alternatives like glucose meters and insulin pens. Research shows that many patients with diabetes achieve better outcomes with CGMs and insulin pumps4,5. However, research also shows that digital data from these devices is significantly underutilized to optimize outcomes6,7. Meanwhile, the next generation of solutions needed to advance diabetes care, such as the hybrid and fully closed-loop artificial pancreas8,9, depends substantially on continuous data from CGMs and insulin pumps. A major barrier to progress in this field is limited access to rich datasets that facilitate the development of novel analytic solutions. In addition, there are many related but disconnected data streams that are seldom reviewed or analyzed together, which further limits our understanding of diabetes management and even prevention10,11. To advance research and development of robust analytic solutions for the growing population of people with diabetes, there is a critical need for open datasets to understand outpatient management, develop interventions, and build clinically relevant decision-support solutions.

Despite the recognized need for open datasets to enable research12, few datasets are available for data-driven research in the diabetes domain. One is the OhioT1DM dataset13, which consists of eight weeks of CGM, insulin pump, physiological sensor, and self-reported event data from 12 people with type 1 diabetes. Another is an N-of-1 dataset, which consists of two weeks of blood glucose, insulin, and carbohydrate intake logs14. To broaden the scope of research on diabetes and chronic conditions in general, and to accelerate the development of robust computational solutions, we provide the DiaTrend dataset. The DiaTrend dataset includes CGM and insulin pump data from 54 patients with type 1 diabetes. This dataset is created from a subset of two larger studies focused on: (1) developing computational tools for self-management of diabetes6, and (2) evaluating a digital intervention for young adults with type 1 diabetes15. The provided dataset includes time-aligned blood glucose samples recorded on average every 5 minutes with FDA-approved CGMs by Dexcom16, Abbott17, and Medtronic18, and insulin pump data comprising basal and bolus insulin doses, carbohydrate intake logs, and pump settings such as the insulin-carb ratio. Figure 1 presents an overview of the data collection process and data provided.

Fig. 1 Overview of the data collection process and data provided in the DiaTrend dataset.

The DiaTrend dataset is useful for several research directions, including common tasks like blood glucose prediction19,20,21,22,23,24,25,26, prediction of adverse glycemic events (i.e., hypoglycemia and hyperglycemia)27,28,29,30, detection of unannounced meals31,32,33,34,35, and algorithm development for insulin delivery36,37. This dataset can also support research on less studied topics like discovering digital biomarkers of glycemic control7, mining patterns and trends in diabetes management6,38,39, understanding adherence to wearable medical devices and patterns of missing data40,41, developing novel visual analytic and data visualization solutions42, and designing decision-support tools through user-centered studies43,44,45,46. Additionally, prospective researchers can find more opportunities for artificial intelligence in the diabetes domain through recent reviews in the literature47,48,49.



The DiaTrend dataset includes CGM and insulin pump data from a total of 54 patients with type 1 diabetes (age: 19–74 years, gender: 17 males, 37 females). Table 1 provides an overview of the demographic and clinical characteristics of patients in this dataset, including the distribution across age groups, gender, race, diabetes type, and hemoglobin A1C. Participants were recruited through two independent studies. Study 1 (also known as Digital SMD) recruited patients from Dartmouth Health in 2019, while study 2 (also known as SweetGoals15) is an ongoing randomized controlled trial that recruits patients through social media and online platforms. Both studies were approved by the Committee for Protection of Human Subjects at Dartmouth College (STUDY00031632 and STUDY00023559, respectively), and all participants provided verbal and written consent prior to joining either study. In addition, participants provided consent to share their data openly with the broader research community. To protect the privacy of study participants and minimize the risk of patient re-identification, the DiaTrend dataset is provided via a controlled access mechanism50, similar to related datasets in the field13.

Table 1 DiaTrend dataset: Demographic and clinical characteristics of patients with diabetes.

Cohort 1 (n = 17), from the Digital SMD study6, includes persons with type 1 diabetes between the ages of 25 and 74 years who use a CGM and insulin pump for daily management of their condition and consented to share their retrospective device data for research. Meanwhile, cohort 2 (n = 37), from the SweetGoals study15, includes persons between the ages of 19 and 29 years who have had type 1 diabetes for longer than 18 months, use a Glooko-compatible glucometer or CGM, reported a clinical visit within the 6 months preceding the recruitment date, and self-reported their most recent Hemoglobin A1C (HA1C) value as >7.5%. It is important to note that all device data included in the DiaTrend dataset was collected at baseline (i.e., prior to any intervention). Additionally, each individual’s dataset spans a different time period, depending on the retrospective data available at the time of recruitment. Given our focus on advanced diabetes technology for novel analytic solutions, only participants who use CGMs (with <30% missing data) and insulin pumps for daily management are included in the provided dataset.
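The inclusion criterion of <30% missing CGM data can be illustrated with a minimal Python sketch. The text does not state how missingness was computed, so the definition used here (one reading expected every 5 minutes between the first and last timestamp) is an assumption, and the function name `cgm_missing_fraction` is hypothetical.

```python
from datetime import datetime, timedelta


def cgm_missing_fraction(timestamps, interval_minutes=5):
    """Estimate the fraction of expected CGM samples that are absent.

    Assumption (not the authors' stated method): one reading is expected
    every `interval_minutes` between the first and last timestamp.
    """
    if len(timestamps) < 2:
        return 0.0
    ordered = sorted(set(timestamps))  # drop duplicate timestamps
    span = ordered[-1] - ordered[0]
    expected = int(span / timedelta(minutes=interval_minutes)) + 1
    observed = len(ordered)
    # Clamp at zero in case a device samples faster than expected.
    return max(0.0, 1.0 - observed / expected)


# Hypothetical example: one hour of 5-minute readings with three samples dropped.
start = datetime(2020, 1, 1, 0, 0)
readings = [start + timedelta(minutes=5 * i) for i in range(13) if i not in (3, 4, 9)]
fraction = cgm_missing_fraction(readings)
print(f"missing: {fraction:.1%}, include subject: {fraction < 0.30}")
```

Other definitions are equally plausible, for example counting only gaps longer than a sensor warm-up period, so the threshold check should be read as illustrative.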

Dataset description

The DiaTrend dataset includes a total of 27,561 days of CGM data and 8,220 days of insulin pump data from 54 patients with type 1 diabetes. In addition, the DiaTrend dataset includes demographic and clinical characteristics for each subject, including age, gender, race, diabetes type, and HA1C (see Table 1). There is an average of 510 days (range: 31–1885 days) of CGM data per subject, and an average of 152 days (range: 31–780 days) of insulin pump data per subject (see Fig. 2). Within the insulin pump data, there is an average of 993 total bolus doses per subject (range: 132–4939 doses) and an average of 438 total carb inputs per subject (range: 1–2310 inputs) (see Fig. 3). These data were collected as part of the Digital SMD6 and SweetGoals15 studies, during which each patient’s retrospective CGM and insulin pump data was downloaded through a third-party application (i.e., Tidepool51 or Glooko52). It is important to note that since the SweetGoals study is a randomized controlled trial, only retrospective baseline data collected during the initial screening is included as part of the DiaTrend dataset (i.e., the provided data does not include sensor data from the intervention period of that study). In addition, HA1C, the primary clinically validated metric for assessing glycemic control, was collected via the patient’s electronic health record (i.e., the most recent HA1C) in the Digital SMD study and via a mail-in home test in the SweetGoals study at the time of the baseline assessment (approximately the endpoint of the device data).

Fig. 2 Overview of the number of days of sensor data per patient in the DiaTrend dataset.

Fig. 3 Overview of the total number of bolus and carb input data per patient in the DiaTrend dataset.

Data Records

All data records in the DiaTrend dataset are stored and accessible via the Synapse platform50. The deposited data consists of 54 Excel files, one for each subject. Each file has a CGM sheet that provides the blood glucose data collected by the CGM. The CGM sheet includes 2 columns, namely, date and mg/dL. In addition, each subject’s file has a Bolus sheet, which describes bolus insulin doses and meal announcements (i.e., user-entered estimates of the carbohydrate content of meals, logged to calculate the bolus insulin needed to metabolize glucose from the meal consumed53). The Bolus sheet includes the following 7 columns: date, normal, carbInput, insulinCarbRatio, bgInput, recommended.carb, and […]. It is important to note that only 17 subject files have a Basal sheet, which describes the subject’s basal infusions in 3 columns, namely, date, duration, and rate. The subject files that have basal data are as follows: S29-S31, S36-S39, S42, S45-S47, S49-S54. In addition, 37 (out of 54) Bolus sheets include 4 more columns, namely, recommended.correction, insulinSensitivityFactor, targetBloodGlucose, and insulinOnBoard. The subject files that have the 4 additional columns in the Bolus sheet are as follows: S1-S28, S32-S35, S40, S41, S43, S44, and S48. Each row in all three of the Excel sheets refers to one record collected at the timestamp given in the column titled ‘date’. All data records in each subject file are time-ordered according to the device log. More specifically, CGMs record a blood glucose sample approximately every 5 minutes, whereas insulin pumps produce irregularly sampled data records because they depend on user triggers for bolus insulin doses and user settings for basal insulin doses. Excluding the date column, the rest of the data can be read as floating point numbers. Table 2 provides a detailed breakdown of each data record, the format, and a description.
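Because sheet availability differs between subject files, it can help to encode the subject lists above before iterating over the files. The following Python sketch expands the ID ranges given in the text and checks that they are consistent with the stated counts. The helper names `expand_subject_ids` and `expected_sheets` are hypothetical, and the sheet names are taken from the description above rather than from the deposited files.

```python
import re


def expand_subject_ids(spec):
    """Expand a spec such as 'S29-S31, S42' into ['S29', 'S30', 'S31', 'S42']."""
    ids = []
    for part in spec.split(","):
        numbers = [int(n) for n in re.findall(r"S(\d+)", part)]
        if not numbers:
            continue
        # A single ID yields first == last; a range yields both endpoints.
        first, last = numbers[0], numbers[-1]
        ids.extend(f"S{n}" for n in range(first, last + 1))
    return ids


# Subject lists copied from the text.
BASAL_SPEC = "S29-S31, S36-S39, S42, S45-S47, S49-S54"
EXTRA_BOLUS_SPEC = "S1-S28, S32-S35, S40, S41, S43, S44, and S48"

with_basal = set(expand_subject_ids(BASAL_SPEC))
with_extra_bolus = set(expand_subject_ids(EXTRA_BOLUS_SPEC))


def expected_sheets(subject_id):
    """Sheets one would expect in a subject file, per the description above."""
    sheets = ["CGM", "Bolus"]
    if subject_id in with_basal:
        sheets.append("Basal")
    return sheets


print(len(with_basal), len(with_extra_bolus), expected_sheets("S42"))
```

The two groups are disjoint and together cover all 54 subjects, so a loader can branch on group membership instead of probing each workbook for optional sheets or columns.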

Table 2 Overview of the data records, format, and description in the DiaTrend dataset.

Technical Validation

For each patient included in the DiaTrend dataset, we provide an overview of their blood glucose data using clinically validated metrics for assessing glycemic control54,55. This includes the percentage of all blood glucose readings in 5 clinically relevant categories, namely, very low (<54 mg/dL), low (54–69 mg/dL), target range (70–180 mg/dL), high (181–250 mg/dL), and very high (>250 mg/dL). From Fig. 4, we can observe that blood glucose is highly variable and that only a minority of patients living with diabetes (less than 10% in our dataset) meet the clinical target of maintaining blood glucose within the target range of 70–180 mg/dL for more than 70% of the time54. Fig. 4b presents histograms of daily mean blood glucose (mean = 187 mg/dL), daily glycemic variability (mean = 0.33), and daily time in range (mean = 47%). From this figure, we can observe an approximately normal distribution for each clinically relevant metric in the DiaTrend dataset.
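The five glycemic categories and the 70% time-in-range target can be sketched in a few lines of Python. The readings below are made up for illustration. Treating glycemic variability as the coefficient of variation (standard deviation divided by mean) is an assumption, since the text does not define the metric, and the function names are hypothetical.

```python
from statistics import mean, stdev

CATEGORIES = ("very low", "low", "target range", "high", "very high")


def glucose_category(mg_dl):
    """Map a glucose value in mg/dL to one of the five categories in the text."""
    if mg_dl < 54:
        return "very low"
    if mg_dl < 70:  # 54-69 mg/dL; fractional values below 70 also count as low
        return "low"
    if mg_dl <= 180:
        return "target range"
    if mg_dl <= 250:
        return "high"
    return "very high"


def percent_by_category(readings):
    """Percentage of readings that fall in each category."""
    counts = {name: 0 for name in CATEGORIES}
    for value in readings:
        counts[glucose_category(value)] += 1
    return {name: 100.0 * n / len(readings) for name, n in counts.items()}


def coefficient_of_variation(readings):
    """Assumed definition of glycemic variability: sample SD divided by mean."""
    return stdev(readings) / mean(readings)


# Hypothetical readings in mg/dL, not taken from the dataset.
readings = [50, 60, 90, 100, 120, 150, 180, 200, 260, 300]
summary = percent_by_category(readings)
time_in_range = summary["target range"]
meets_target = time_in_range > 70.0  # clinical target: more than 70% in range
print(summary, round(coefficient_of_variation(readings), 2), meets_target)
```

Applying the same functions to one day of readings at a time would yield daily values comparable in kind to those summarized in Fig. 4b.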

Fig. 4 Descriptive summary of CGM data in the DiaTrend dataset. (a) The percent of blood glucose samples in 5 clinically relevant categories. (b) The distributions of daily mean blood glucose, daily glycemic variability, and daily time in [target] range.

Similarly, we provide an overview of each patient’s insulin pump data using box plots and histograms. Figure 5a,b show box plots with descriptive statistics associated with bolus insulin doses and carb inputs, respectively, for each subject. Additionally, Fig. 5c shows the distributions of total daily bolus insulin doses (units) and total daily carb inputs (g). From this figure, we can observe a mean total daily bolus of 24 units and a mean total daily carb input of 115 g, both with a positively skewed distribution. In particular, we observe a high number of days (~1400 days) with very low carb inputs (~0 g). This could be indicative of missed mealtime boluses (i.e., no bolus insulin used during mealtimes), a common contributor to poor glycemic outcomes56,57,58.
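Daily totals like those summarized above can be derived by grouping Bolus records by calendar day. The Python sketch below uses made-up records and assumes that the `normal` column holds the delivered bolus dose in units and that `carbInput` holds grams of carbohydrate, with `None` standing in for empty cells. The function name `daily_totals` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def daily_totals(records):
    """Sum bolus units and carb grams per calendar day.

    `records` is an iterable of (timestamp, normal, carbInput) tuples.
    Assumption: `normal` is the delivered bolus in units; None marks a
    missing value and is treated as zero.
    """
    totals = defaultdict(lambda: {"bolus": 0.0, "carbs": 0.0})
    for timestamp, normal, carb_input in records:
        day = totals[timestamp.date()]
        day["bolus"] += normal or 0.0
        day["carbs"] += carb_input or 0.0
    return dict(totals)


# Hypothetical Bolus records, not taken from the dataset.
records = [
    (datetime(2020, 1, 1, 8, 0), 4.0, 45.0),
    (datetime(2020, 1, 1, 12, 30), 5.5, 60.0),
    (datetime(2020, 1, 2, 9, 0), 2.0, None),
    (datetime(2020, 1, 2, 19, 0), 1.5, 0.0),
]
totals = daily_totals(records)

# Days with bolus activity but no logged carbohydrates.
zero_carb_days = sorted(day for day, t in totals.items() if t["carbs"] == 0)
print(totals, zero_carb_days)
```

Only days with at least one Bolus record appear in the result, so days with no pump activity at all would need to be added separately before counting low-carb days.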

Fig. 5 Descriptive summary of insulin pump data in the DiaTrend dataset. (a) A box plot of all bolus insulin doses per subject. (b) A box plot of all carb input entries per subject. (c) The distributions of total daily bolus insulin and total daily carb inputs across all subjects.


There are some important considerations and limitations associated with the DiaTrend dataset provided in this paper. First, there is an imbalance in the representation of subjects across the dimensions of race, gender, and age. More specifically, the majority of patients whose CGM and insulin pump data is provided (i.e., 48 out of 54 or 89%) are non-Hispanic White/Caucasian. Also, this dataset includes a lower representation of males (n = 17 out of 54 or 32%) compared to females, and a lower representation of older adults (e.g., for age ≥45 years old, n = 12 or 22%). The limitation with regard to race (i.e., low representation of participants from non-White/Caucasian races, including Hispanics, non-Hispanic Black/African Americans, and Asians) is partly due to the geographical location (i.e., New Hampshire) from which some participants (17 out of 54) were recruited. However, the imbalance in representation also underscores racial disparities that have been identified in prior literature relating to access to and use of advanced diabetes technologies, particularly CGMs and insulin pumps59,60. Additionally, the limitation with regard to age (i.e., low representation of older adults and higher representation of young adults) is primarily due to the targeted focus on young adults with type 1 diabetes in the SweetGoals study15. A second limitation of the DiaTrend dataset is that it lacks full temporal alignment between the CGM and insulin pump data for each participant. This difference is apparent from Fig. 2, which shows more CGM data than insulin pump data for a number of subjects. While the reason for this is unknown, we suspect that it is primarily due to either the lower data storage capacity of insulin pumps compared to CGMs, which limits the amount of retrospective data available for download from insulin pumps, or patients switching insulin delivery systems (e.g., to multiple daily injections or other devices that are not compatible with the third-party platform).
Third, there are various forms of missing data associated with the provided dataset. As previously mentioned, all data provided in this paper is retrospective data collected directly from the user’s devices (i.e., CGMs and insulin pumps) and downloaded through a third-party application (i.e., Tidepool51 or Glooko52). Given this, missing data in the data files is due to either missing data in the user’s device or technical issues with the third-party platform used for download. For example, basal insulin data is not available for subjects from cohort 2 (37 out of 54) because Glooko did not provide basal data from the insulin pumps at the time of data collection for this study. These forms of missing data might limit some research efforts with the provided dataset. However, despite the aforementioned limitations, the DiaTrend dataset represents one of the largest open-source datasets currently available in the diabetes domain. This critical resource provides a unique opportunity to advance the development of novel data-driven solutions that can improve the lives of people living with diabetes. In addition, this dataset provides a necessary benchmark to evaluate the generalizability of numerous diabetes-relevant algorithms in the literature19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36.
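Given the partial temporal alignment noted in the second limitation, analyses that need both data streams may first restrict each subject to the period covered by both devices. The Python sketch below shows one way to do that. The function names are hypothetical, and the timestamps are made up for illustration.

```python
from datetime import datetime


def overlap_window(cgm_times, pump_times):
    """Return (start, end) of the period covered by both devices, or None."""
    if not cgm_times or not pump_times:
        return None
    start = max(min(cgm_times), min(pump_times))
    end = min(max(cgm_times), max(pump_times))
    return (start, end) if start <= end else None


def restrict_to_window(times, window):
    """Keep only the timestamps that fall inside the (start, end) window."""
    start, end = window
    return [t for t in times if start <= t <= end]


# Hypothetical timestamps: CGM data spans longer than pump data.
cgm_times = [datetime(2020, 1, d) for d in range(1, 31)]
pump_times = [datetime(2020, 1, d) for d in range(10, 21)]

window = overlap_window(cgm_times, pump_times)
aligned_cgm = restrict_to_window(cgm_times, window)
print(window, len(aligned_cgm))
```

This handles only the outer boundaries of each stream. Gaps inside the shared window, such as sensor dropouts or periods without pump use, would still need separate treatment.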

Usage Notes

The DiaTrend dataset is provided for research and educational purposes that support the development of novel data-driven solutions for the diabetes community and beyond. Consistent with exemplar studies13,61,62, we have set governance structures in place to balance the need for open datasets that advance research with the need to protect the privacy of participants.

Researchers interested in accessing the DiaTrend dataset should complete the following steps:

  1. Register for a Synapse account.

  2. Become a Synapse Certified User with a validated user profile.

  3. Submit an Intended Data Use statement.

  4. Agree to the Conditions of Use.