Application of machine learning methods for the prediction of true fasting status in patients performing blood tests

The fasting blood glucose (FBG) values extracted from electronic medical records (EMR) are assumed valid in existing research, which may cause diagnostic bias due to misclassification of fasting status. We proposed a machine learning (ML) algorithm to predict the fasting status of blood samples. This cross-sectional study was conducted using the EMR of a medical center from 2003 to 2018, and a total of 2,196,833 ontological FBGs from the outpatient service were enrolled. The theoretical true fasting status was identified by comparing the values of ontological FBG with average glucose levels derived from concomitantly tested HbA1c based on multiple criteria. In addition to multiple logistic regression, we extracted 67 features to predict the fasting status by eXtreme Gradient Boosting (XGBoost). The discrimination and calibration of the prediction models were also assessed. Real-world performance was gauged by the prevalence of ineffective glucose measurement (IGM). Of the 784,340 ontologically labeled fasting samples, 77.1% were considered theoretical FBGs. The median (IQR) glucose and HbA1c levels of ontological and theoretical fasting samples in patients without diabetes mellitus (DM) were 94.0 (87.0, 102.0) mg/dL and 5.6 (5.4, 5.9)%, and 92.0 (86.0, 99.0) mg/dL and 5.6 (5.4, 5.9)%, respectively. In the parsimonious approach, XGBoost showed comparable calibration and a higher AUROC than multiple logistic regression (0.887 vs. 0.868) and identified glucose level, home-to-hospital distance, age, and concomitant serum creatinine and lipid testing as important predictors. The prevalence of IGM dropped from 27.8% based on ontological FBGs to 0.48% by using algorithm-verified FBGs. The proposed ML algorithm or multiple logistic regression model aids in verification of the fasting status.

With the universal implementation of electronic medical records (EMRs), researchers have actively leveraged real-world EMR data in diabetes research and management in clinical practice 1. The development of algorithms to identify patients with prediabetes and diabetes mellitus (DM) with high validity has become increasingly fundamental in improving patients' quality of care and preventing complications associated with DM. In current clinical practice, the phenotypes of DM are defined by various combinations of the different components of the EMR, such as diagnostic codes, medication data, and laboratory values related to glucose homeostasis 2. Consequently, current diagnostic algorithms vary significantly in their validity for identifying DM 2. In recent years, several studies have indicated that machine learning (ML) algorithms may better identify diabetic status in EMRs for cohort establishment 3,4. Other studies applied ML techniques to predict DM or undiagnosed DM based on clinical information [5][6][7][8]. Systematic reviews reported that most ML studies used the supervised learning approach, and a comparison of the approaches indicated that support vector machine (SVM) was the most widely used algorithm 9,10. Deep learning (DL) models such as artificial neural networks (ANNs) and deep neural networks (DNNs) have been applied in some studies and reported to outperform conventional ML approaches in predicting DM-related phenotypes 11,12. However, these studies usually assumed that fasting blood glucose (FBG) values are valid if labeled as such by the clinical laboratory, which may lead to potential overestimation of fasting status 13. As demonstrated in a survey conducted by Tseng et al., only approximately half of the patients reported having adequately fasted before phlebotomy at a large academically affiliated hospital 13.
Another study surveyed around 150 outpatients and stated that 40% did not fast before going to the hospital for laboratory blood work 14. Both studies pointed out that documentation of the fasting state before phlebotomy was often non-existent, as these data are not routinely collected by healthcare providers or the laboratory team or recorded in the EMR. Similarly, information regarding whether patients had been given instructions to fast before phlebotomy was also not recorded 13,14. Despite the importance of the fasting status in patients undergoing phlebotomy, relatively little research has been conducted to verify the fasting status of patients before blood work. The lack of knowledge of the fasting state of patients presents a challenge for healthcare providers in determining whether patients had truly fasted before laboratory blood testing and may prevent them from interpreting the results in accordance with diabetes screening guidelines, resulting in missed diagnoses of prediabetes and type 2 diabetes. Misclassification of fasting status negatively influences the clinical accuracy of conventional or ML models in screening for DM or predicting the risk of DM 15. Verification of fasting blood samples is therefore a significant challenge in analyzing real-world EMR data for epidemiological research, particularly when the disease diagnostic criteria are based on fasting blood samples. The current reference standard for confirmation of the fasting status relies on self-reported information from the patients during phlebotomy, which may be influenced by recall and awareness biases. To the best of our knowledge, no studies have used EMR data to investigate the discordance between prescribed and actual fasting status based on the distribution of BG and concomitant HbA1c values.
Using a large clinical data repository of more than 2.75 million patient records from a tertiary medical center in central Taiwan, we systematically evaluated the distribution of BG values. We used the HbA1c-estimated average glucose level to define fasting status, followed by the development of prediction models using ML.

Study data source and sample selection. The China Medical University Hospital (CMUH) Clinical Research Data Repository (CRDR) carefully validated the EMRs of 2,873,887 patients who had sought care at CMUH between January 1, 2003, and December 31, 2018. The methodologic details have been published elsewhere [16][17][18][19]. Of the 2,873,887 patients, 945,792 underwent glucose measurements using serum samples from inpatient and outpatient services. The sample selection flow is summarized in Fig. 1.
Sociodemographic and clinical variables. The covariables of interest were obtained from the CRDR, including patient demographics, specifically age and sex, and body mass index, which was calculated as the weight in kilograms divided by the square of the height in meters. The presence of hypertension or type 2 DM was captured based on associated ICD-9/-10 codes or the use of glucose-lowering medications or antihypertensive agents. A history of cardiovascular disease was also documented if the patients had a record of coronary artery disease, myocardial infarction, stroke, or congestive heart failure in EMRs based on International Classification of Diseases (ICD) 9th and 10th edition codes. All other coexisting comorbidities were also captured based on ICD-9/-10 codes from the repository or EMR data. Additional provider- or patient-level factors such as medication records, health care provider specialty, and biochemical measures were obtained from repository data or the EMRs within a 1-year window prior to enrollment into the study cohort.
Another patient-level factor that we included was the distance from the patients' home to the hospital, as we hypothesized that fasting status might be associated with the travel time to the healthcare facility. Currently, no studies have investigated the association between home-to-facility distance and fasting status. However, a few studies have provided evidence that increasing travel distance to the primary care provider may worsen glycemic control [20][21][22]. Therefore, we calculated the straight-line distance between the hospital and the home, as it is the most common method for this type of calculation 23. The home-to-hospital distance was calculated in two steps. First, a geocoding application programming interface developed by Google Maps was used to transform the home addresses of the entire study population into map coordinates. Second, the distance between the homes and the hospital was calculated using a geographic information system (ArcGIS version 10; ESRI, Redlands, CA, USA).
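As an illustration of the second step, the following minimal sketch computes a straight-line distance between two geocoded points. It uses the haversine great-circle formula as a stand-in; the study itself used ArcGIS, so this is not the authors' implementation. The function name, the Earth radius constant, and any coordinates passed in are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def straight_line_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in kilometres between two points.

    Inputs are latitude/longitude in decimal degrees, e.g. a geocoded
    home address and the hospital location.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

For the short distances typical of outpatient travel, a great-circle distance and a planar distance in a projected coordinate system should differ only slightly, but the two are not identical.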
Determination of glucose and HbA1c levels. From the CMUH-CRDR laboratory database, we selected the glucose measurements specified as fasting glucose (AC, ante cibum), postprandial glucose (PC, post cibum), and random glucose. We excluded data recorded as nonnumerical values, values higher than 1000 mg/dL, or zero values. All glucose measurements could also be classified as inpatient, outpatient clinic, and emergency department services. Only measurements obtained in the outpatient setting were included in the final analysis. The HbA1c-derived averaged glucose level (AC average ) was defined based on Nathan et al.'s formula as a theoretical upper limit of fasting glucose 24.
Data conditioning steps to determine ontological fasting glucose. To investigate the "true" ontological fasting status on blood glucose measurements, we filtered glucose measurements that were highly likely nonfasting in the outpatient setting to derive ontological fasting glucose (AC ontological ) as follows. Glucose measurements were reclassified as non-AC ontological if:
1. the data were labeled as post cibum glucose or random glucose,
2. the glucose measurement included additional descriptions/labels such as "one-touch", "bedside check", or "PC" or contained descriptions indicating active food intake before phlebotomy, regardless of the laboratory test prescribed (e.g., fasting glucose),
3. patients had multiple fasting glucose measurements on the same day; only the first measurement was considered as non-AC ontological .
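The HbA1c-derived average glucose and the value exclusions described above can be sketched as follows. The sketch assumes that the formula meant is the widely used ADAG linear regression published by Nathan et al. (estimated average glucose in mg/dL = 28.7 × HbA1c − 46.7); the function names are hypothetical and the code is an illustration, not the authors' pipeline.

```python
import math


def estimated_average_glucose(hba1c_percent):
    """Estimated average glucose (mg/dL) from HbA1c (%).

    Assumes the ADAG regression: eAG = 28.7 * HbA1c - 46.7.
    """
    return 28.7 * hba1c_percent - 46.7


def is_usable_glucose(raw_value):
    """Return True if a raw laboratory value survives the exclusions.

    Excluded: nonnumerical entries, zero values, and values > 1000 mg/dL.
    Treating NaN/inf as nonnumerical is an assumption of this sketch.
    """
    try:
        value = float(raw_value)
    except (TypeError, ValueError):
        return False
    if not math.isfinite(value):
        return False
    return value != 0 and value <= 1000
```

For example, under this assumed formula an HbA1c of 6.0% corresponds to an estimated average glucose of about 125.5 mg/dL.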
Definition of theoretical fasting status. Three criteria were used to define the theoretical fasting status (AC theoretical ) of patients who underwent concomitant AC ontological and HbA1c measurements on the same day: (1) an AC ontological < 100 mg/dL in patients without DM with HbA1c < 5.5%; (2) an AC ontological < AC average − 1 standard deviation of AC ontological glucose in patients without DM with an HbA1c between 5.5 and 6.4%; and (3) an AC ontological < AC average in patients with DM. Once the patients' glucose AC was defined as AC theoretical , the corresponding blood samples were defined as fasting samples. Otherwise, they were considered nonfasting samples. These criteria are based on the physiological profiling of glucose and insulin variation over 24 h in individuals with and without diabetes 25,26. The A1c-derived estimated average glucose (AC average ) summarizes the daily glucose variation over the past 90 days, depicting an averaged value between the lowest and the highest glucose levels in this time window among patients with a stable metabolic state. Therefore, if truly obtained in the fasting state, the glucose level should theoretically be lower than AC average 27. To verify the validity of our proposed criteria, we used the glucose AC from 4519 patients in the CRDR who provided morning fasting samples before a pan-endoscopy procedure as the true fasting glucose AC; only 314 measurements (6.95%) were misclassified as nonfasting based on our criteria.
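The three criteria can be expressed as a small decision function. This is a minimal sketch under stated assumptions: AC average is computed with the ADAG regression (28.7 × HbA1c − 46.7), the standard deviation of AC ontological glucose is supplied by the caller, the HbA1c bounds of 5.5 and 6.4% are treated as inclusive for criterion 2, and a non-DM sample with HbA1c above 6.4% returns None because the criteria as stated do not cover it. All names are hypothetical.

```python
def is_theoretical_fasting(glucose, hba1c, has_dm, glucose_sd):
    """Classify one same-day glucose/HbA1c pair as theoretical fasting.

    glucose    -- AC ontological glucose in mg/dL
    hba1c      -- HbA1c in percent
    has_dm     -- True if the patient has diabetes mellitus
    glucose_sd -- standard deviation of AC ontological glucose (mg/dL),
                  supplied by the caller (an assumption of this sketch)
    Returns True (fasting), False (nonfasting), or None (not covered).
    """
    ac_average = 28.7 * hba1c - 46.7  # assumed ADAG regression
    if has_dm:
        return glucose < ac_average               # criterion 3
    if hba1c < 5.5:
        return glucose < 100                      # criterion 1
    if hba1c <= 6.4:
        return glucose < ac_average - glucose_sd  # criterion 2
    return None  # non-DM with HbA1c > 6.4%: not defined by the criteria
```

With a hypothetical standard deviation of 10 mg/dL, a patient without DM, an HbA1c of 6.0%, and a glucose of 110 mg/dL would be classified as fasting, because 110 is below 125.5 − 10 = 115.5 mg/dL.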

Statistical analysis.
The clinical characteristics of patients with a theoretical fasting sample and those with a theoretical nonfasting sample were compared. The probability densities of glucose levels between fasting and nonfasting status were examined based on the diabetic status. We also assessed whether the levels of fasting glucose differed if the glucose measurements were taken at the same time as lipid profiles. Conventional logistic regression and ML were applied to develop a tool for predicting whether the glucose measurements were fasting measures. We tested model discrimination and calibration using the area under the receiver operating characteristic curve (AUROC) and calibration curves.
Machine learning approach and evaluation. To use ML for predicting whether the blood samples were obtained in the fasting state, a balanced dataset was curated to obtain a 1:1 ratio of AC ontological and AC theoretical , which was composed of 93,958 patients (Fig. 1). Patients within this balanced dataset were separated into training and testing sets at an 80/20 proportion while maintaining a 1:1 ratio of AC ontological and AC theoretical . The demographic, clinical, and biochemical information of the patients, such as age, ICD-9 or -10 codes, medication histories, and laboratory test results, was then extracted from the CMUH-CRDR. We applied logistic regression and the eXtreme Gradient Boosting model (XGBoost), a scalable end-to-end tree boosting model proposed by Chen and Guestrin 28, to evaluate the performance of predicting fasting status. We additionally experimented with two efficient algorithms, CatBoost and ensemble models with H2O AutoML, to better handle the categorical variables and explore the predictive performance using multiple learning algorithms 29,30. The objective of this binary classification problem was to minimize the binary cross-entropy loss; the hyperparameters of our XGBoost model were determined using the Tree-structured Parzen Estimator (TPE) method 31. Taking the implementation of XGBoost in Python as an example, the finalized hyperparameters were set as tree depth = 8, learning rate = 0.1, gamma = 0.5, minimum sum of instance weight = 7, and number of estimators = 300, and the remaining parameters were left at their default settings. Detailed parameter ranges for the search are summarized in Supplementary Table 1. To implement ensemble models with H2O AutoML in Python, we stacked various algorithms, such as XGBoost, Random Forest, and Gradient Boosting Machines. The model output of XGBoost, CatBoost, or the ensemble models was the probability of AC theoretical .
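To make the reported configuration concrete, the sketch below maps the finalized hyperparameters onto the parameter names of the XGBoost Python package and shows one way to produce a class-preserving 80/20 split. The mapping of "minimum sum of instance weight" to `min_child_weight` follows the XGBoost documentation; the helper names, the lazy import, the split routine, and the random seed are assumptions of this example rather than the authors' code.

```python
import random

# Finalized hyperparameters reported in the text, using XGBoost's names.
XGB_PARAMS = {
    "objective": "binary:logistic",  # binary cross-entropy (log loss)
    "max_depth": 8,                  # tree depth
    "learning_rate": 0.1,
    "gamma": 0.5,
    "min_child_weight": 7,           # minimum sum of instance weight
    "n_estimators": 300,
}


def build_model(params=XGB_PARAMS):
    """Create an XGBoost classifier; imported lazily so that the rest of
    this sketch runs without the xgboost package installed."""
    from xgboost import XGBClassifier
    return XGBClassifier(**params)


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices 80/20 while preserving the class ratio."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)
```

A model built this way would be fitted on the training indices, and its `predict_proba` output would play the role of the predicted probability of AC theoretical.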
The performance of each ML algorithm was evaluated in terms of AUROC, accuracy, precision, recall, and F1 score using a fivefold cross-validation scheme. We used the bootstrapping method with 2000 repetitions to statistically test the difference between the paired AUROCs 32. Finally, we compared the proportion of glucose AC ≥ 126 mg/dL calculated with or without the ML algorithm to classify the fasting status. We also classified glucose AC ≥ 126 mg/dL, regardless of ontological or predicted fasting samples, which did not lead to the diagnosis of diabetes over the study periods as ineffective glucose measurements (IGM). All statistical analyses were performed using SAS version 9.4 (SAS Institute Inc., Cary, NC, USA), R version 3.
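The evaluation metrics named above can be illustrated with a dependency-free sketch. It computes accuracy, precision, recall, and F1 from binary predictions, AUROC as the probability that a positive sample scores higher than a negative one (ties count one half), and a percentile bootstrap interval for the difference between two paired AUROCs. The percentile interval is one common bootstrap variant and is an assumption here; the exact procedure of the cited method may differ, and all function names are hypothetical.

```python
import random


def confusion_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def auroc(y_true, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_diff(y_true, scores_a, scores_b, n_boot=2000, seed=0):
    """95% percentile interval for AUROC(a) - AUROC(b) on paired data."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [y_true[i] for i in idx]
        if len(set(y)) < 2:  # a resample needs both classes
            continue
        a = auroc(y, [scores_a[i] for i in idx])
        b = auroc(y, [scores_b[i] for i in idx])
        diffs.append(a - b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]
```

An interval that excludes zero would suggest that the two models differ in discrimination on the same test samples. The pairwise AUROC is quadratic in the sample size, so a rank-based implementation would be preferable for datasets as large as the one described.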

Results
Distribution of glucose level by fasting and diabetic status. A total of 359,402 AC ontological data points were included in the final analysis, with a mean sample age of 59.7 ± 14.6 years. Slightly fewer than half of the sample population were female (46.1%). When restricting to only the patients' first sample in the CRDR (n = 93,958), the average age was 54.4 ± 15.5 years, and 45.9% of these patients were female. Of these 93,958 patients, 29.2% had been diagnosed with DM at the first AC ontological . Blood glucose measurements considered to be collected during the fasting state were observed in younger non-DM patients but not among younger patients with DM. Nonfasting samples were more likely to be provided by male patients, regardless of their diabetic status. Moreover, samples were more likely to be fasting measures if the lipid profiles of the patients were concomitantly examined. Statistical differences were observed for the majority of the biochemical measures between the fasting and nonfasting samples (Table 1). The peak of the density curves of AC ontological with and without same-day HbA1c measures was similar at approximately 100 mg/dL. However, the distribution of AC ontological with HbA1c measures was wider than that without the HbA1c measures (Fig. 2A). Peaks of the density curves of AC theoretical and nonfasting AC ontological were separated, with peak values slightly lower than 100 and slightly above 126 mg/dL, respectively (Fig. 2B). Among patients without DM, a peak shift to the left to < 126 mg/dL was noted in the nonfasting samples compared with the entire sample with concomitant HbA1c (Fig. 2C). By contrast, among patients with DM, the peak of the fasting samples shifted right to approximately 126 mg/dL (Fig. 2D). Figure 3 shows the scatter plots of AC ontological and HbA1c based on diabetic status and highlights the distribution of AC theoretical .
Factors associated with nonfasting status. The entire dataset consisted of 67 attributes (Supplementary Table 2), and details of relevant missingness are provided in Supplementary Table 3. In multiple logistic regression, age, male sex, distance from the home to the hospital, the timing of blood sampling, and the cumulative frequency of outpatient visits 1 year prior to the blood sampling were associated with a higher probability of being in a nonfasted state. A history of DM, hypertension, or coronary artery disease, statin medication, and concomitant lipid and glucose testing were significantly associated with the fasting status. Compared with patients who visited the Health Management Center, those whose glucose measurement was ordered in the departments of metabolism and endocrinology, general medicine, and nephrology were twice as likely to be in a nonfasted state (Table 2). In addition, patients who underwent concomitant glucose and lipid testing were more likely to follow the fasting instruction, with an odds ratio of being in a nonfasted state of 0.78 (95% CI 0.76-0.80).

Machine learning performance in fasting status identification.
We conducted experiments on feature selection by building XGBoost models with the top 10, 25, 35, and 45 features and found that using all 67 features generated the most accurate result. Compared with the predictive performance of multiple logistic regression for nonfasting status in the testing dataset, XGBoost with full features showed better sensitivity (77.8% vs. 76.1%), accuracy (80.9% vs. 78.5%), and F1 score (81.6% vs. 78.0%; Table 3). The top 45 scoring variables are summarized in Fig. 4. The level of the AC ontological , the distance from home to the hospital, age, height, and the level of serum creatinine were the most important features. When we used the 14 features of the parsimonious model (model 2 in Table 2) in the XGBoost algorithm, the predictive performance was statistically better than that of the predictive model derived from multiple logistic regression. By contrast, the precision of the conventional logistic regression model was marginally better than that of the ML-based models (Table 3). The AUROC and calibration performance of our proposed ML methods were generally better than those of the multiple logistic regression model (AUROC 0.887 vs. 0.868, p < 0.001; Fig. 5). In the sensitivity analysis of other ML algorithms, the predictive performance was consistent with that of the original XGBoost (Table 3). However, the overall difference in predictive performance between ML-based and conventional logistic regression models was not clinically relevant. The performance of different ML methods in the training dataset is provided in Supplementary Table 4.
Impact on the prevalence of ineffective glucose measurements. On average, the prevalence of glucose measurement ≥ 126 mg/dL dropped from 14.2 to 10.1% by applying algorithm-verified FBGs over the years, and this difference was constant throughout the study period (Table 4). The prevalence of IGM dropped from 27.8% based on AC ontological ≥ 126 mg/dL to 0.48% by using algorithm-verified FBGs ≥ 126 mg/dL. The difference consistently ranged between 25.9 and 28.5% from 2003 to 2018 (Table 4).

Discussion
Our findings suggest that fasting status can be well predicted in real-world settings by using parsimonious computation models based on ML or conventional statistical approaches in clinical practice. Using the ML model, we found that 78.0% of the 604,639 blood samples could be theoretically classified as fasting samples when we defined the fasting status as AC ontological less than AC average . The most important features to predict fasting status were the levels of AC ontological , distance from the home to the hospital, age, height, and concomitant testing of serum creatinine. XGBoost yielded statistically better performance in predicting fasting status than conventional logistic regression modeling did, with an AUC of 0.892 and an F1 score of 80.5%. The prevalence of IGM decreased from 6.44 to 0.06% among those without DM history. This change is noteworthy, as the prevalence of DM was 16.6% regardless of the fasting status and 11.8% when patients with nonfasting status verified by ML algorithms were excluded from the sample. ML algorithms, such as XGBoost, may be particularly useful because their robustness to missing data can address one of the most pervasive barriers of real-world data analysis.
In clinical practice and diabetes research, it is common to assume that AC ontological is from a fasting sample in EMR 33. Our results suggest that implementing fasting status verification algorithms based on an ML or conventional statistical approach is essential for an automated diabetes screening algorithm to better predict DM, which may help regional and national diabetes screening policy and improve care management. There is no standardized method to assess whether patients have truly fasted before phlebotomy is performed. When patients were asked about their fasting status prior to phlebotomy in a survey study, only 50% reported having actually fasted 13. As there is no objective biomarker to verify fasting status, the current reference standard relies merely on patients' self-reports, which are inevitably affected by recall bias. Thus, the self-report data pose persistent challenges to assessing the epidemiology of DM 13. From the perspective of point-of-care testing, it is likely that the current literature has overestimated the prevalence and incidence of prediabetes and DM based on the EMR data, particularly the so-called "undiagnosed prediabetes or diabetes." Information bias, specifically misclassification bias, caused by treating nonfasting glucose as fasting glucose, underestimates the effects of glucose on health outcomes. The findings of a recent study, in which six diabetes phenotyping methods in EMR were compared, suggested that solely using abnormal glucose values would overestimate the number of prevalent DM cases by approximately 1.5 times 34. This magnitude of overestimation cannot be entirely explained by analytical variation in glucose measurement; therefore, overestimation of actual fasting status should be considered and thoroughly investigated 35. Our results showed that lipid profiles, except triglyceride level, were not affected by fasting status.
Especially among patients without DM, levels of fasting TCHO, LDL-C, and HDL-C were counterintuitively higher than those from the nonfasting samples. This finding supports the trend of using nonfasting lipid profiles to facilitate risk assessment of atherosclerotic cardiovascular disease and supports the feasibility of our algorithm in classifying fasting status by comparing the difference between AC ontological and AC average . Our ML approach in identifying fasting status can serve as a complementary tool to the questionnaire-based survey and enable clinicians to provide personalized instructions for fasting to patients based on their prior fasting records, thereby increasing the true fasting rate and improving the precision in identifying DM and monitoring its control. We also observed some major contributing factors in predicting fasting status, such as distance from the home to the hospital, age, and serum creatinine level, which can provide another perspective in understanding the adherence behavior of staying in a fasting state. Furthermore, our proposed fasting status prediction algorithm helps enhance the validity of an automated diabetes phenotyping algorithm. In the entire population of CMUH-CRDR, we found that the prevalence of diabetes mellitus based on algorithm-verified FBGs was 11.8%, lower than that based on AC ontological (23.1%), and the corresponding prevalence of prediabetes based on algorithm-verified FBGs was 24.1%, also lower than that based on AC ontological (40.2%). Although the difference was not radical, the absolute number of patients reclassified from DM to nondiabetes can be significant, depending on the population size. Indeed, given the increasing interest in and use of digital health tools to detect abnormal blood glucose levels, misclassification of nonfasting glucose measures as fasting may lead to potential overdiagnosis and overtreatment of patients without DM.
The concept of IGM is worthy of broader discussion, as it stands for a measurement of FBG that did not change the clinical course of glucose metabolism even when the level was greater than 126 mg/dL among patients without a history of DM. Several reasons could help explain this observation, such as clinician knowledge of the nonfasting status or a missed interpretation of the result. Nonetheless, a potential consequence of IGM is missing the detection of diabetes, leading to complications and increased healthcare utilization in the long run. Failing to obtain a true FBG may be problematic for diabetes screening. Our proposed algorithms drastically reduced the proportion of IGM, supporting their use in the real-world care flow to trigger actionable screening of diabetes. These algorithms also help generate a warning upon detecting a discrepancy between AC ontological and algorithm-verified nonfasting glucose, which could serve as a checkpoint and reminder in the automated digital phenotyping process for DM screening. Future research on the clinical effectiveness and implementation of automatic fasting status prediction in digital diabetic phenotyping systems is necessary to strengthen the public health impact.
The present study has several limitations. First, the actual fasting status of the patients was not available. However, it is challenging, if not impossible, to obtain the actual fasting status. We assumed that AC ontological should be less than AC average in the fasting status among outpatients with stable dietary habits and a steady level of carbohydrate metabolism. In the crude analysis, we found that patients from the Health Management Center were more likely to be in the fasting state before phlebotomy. This observation corresponds to our clinical experience, where patients who were relatively healthy and willing to attend health checkups typically have a higher motivation to provide fasting samples. Specifically, patients who undergo health checkups usually receive detailed instructions for fasting 36. Furthermore, over 93% of FBGs obtained from patients prepared for a pan-endoscopy were accurately classified as AC theoretical . Second, the algorithm was developed in a tertiary hospital under universal health care coverage; thus, it may not be generalizable to other settings. Further research with additional data from different populations is required to train and solidify our proposed algorithm. More importantly, integrating our algorithm into the clinical workflow is critical to verify its performance in the real-world setting.

Conclusions
To the best of our knowledge, this is the first attempt at using an ML approach to evaluate the reliability of fasting samples in a large tertiary hospital. Only 65.3% of ontological AC samples could be classified as algorithm-verified fasting samples. Despite their moderate performance in predicting the fasting status among outpatients, our algorithms provide an innovative approach to clean medical data and facilitate true fasting BG detection. Notably, this study has introduced an essential step towards establishing automated phenotyping in EMR for effective diabetic screening and more accurate estimation of the global and local epidemiology of DM.