Abstract
Laboratory data from Electronic Health Records (EHR) are often used in prediction models, where missingness can bias estimates and degrade model performance; both effects can be mitigated with imputation methods. We demonstrate the utility of imputation in two real-world EHR-derived cohorts, one of ischemic stroke from Geisinger and one of heart failure from Sutter Health, to: (1) characterize the patterns of missingness in laboratory variables; (2) simulate two missingness mechanisms, arbitrary and monotone; (3) compare cross-sectional and multi-level multivariate imputation algorithms applied to laboratory data; and (4) assess whether incorporating latent information derived from comorbidity data can improve the performance of the algorithms. The latter was based on a case study of hemoglobin A1c under a univariate missing-data imputation framework. Overall, the pattern of missingness in EHR laboratory variables was not random and was highly associated with patients’ comorbidity data, and the multi-level imputation algorithm showed smaller imputation error than the cross-sectional method.
Introduction
Laboratory data are often used in machine-learning-enabled, EHR-based clinical decision support systems1,2,3,4 and significantly improve disease modeling and outcome prediction3,5,6,7,8. However, laboratory data are often missing for intentional (e.g., the patient does not need certain laboratory tests) or unintentional (e.g., lack of routine checkup or follow-up) reasons, and this missingness can result in loss of power, biased estimates9,10, and underperforming models. Notably, imputing missing values for EHR laboratory variables, which include irregular time-series data, is a persistent challenge. Missingness patterns and mechanisms for laboratory data have not been well characterized, and the imputation strategy that is optimal for a given missingness pattern has not been studied.
Within clinical trial frameworks or observational studies, various imputation models have been successfully applied, including mean substitution, regression, hot deck11, and tree-based methods12, as well as advanced statistical methods such as expectation maximization (EM)13, full information maximum likelihood (FIML)14, and multiple imputation (MI)15,16. In general, imputation algorithms that exploit inter-attribute correlations perform better. Such correlation can exist within a time point across all samples (cross-sectional) or between time points at the individual level (longitudinal), and within a single variable (univariate) or between variables (multivariate); in addition, missingness in one variable can be correlated with observations in other variables and vice versa. MI, a commonly used imputation method, assumes that each missing value has a distribution of plausible values reflecting the uncertainty of the missing value. MI is usually conducted using one of three procedures: fully conditional specification (FCS)17,18,19, joint modeling (JM)20, and monotone imputation21. Multivariate Imputation by Chained Equations (MICE)22, a widely used open-source imputation package with built-in cross-sectional and multi-level univariate and multivariate algorithms, was applied to the EHR laboratory variables in this study. Previous studies applying MICE or other methods to impute a single laboratory variable from common laboratory variables in cross-sectional designs have achieved promising results23,24,25,26.
The key questions when deciding on imputation techniques for laboratory variables are the following: (1) What are the patterns and mechanisms of missingness in these variables? (2) How should algorithms and procedures for imputation be chosen? (3) How well can laboratory data be imputed in a cross-sectional design compared to a longitudinal design? (4) Can auxiliary variables based on comorbidity information improve the imputation model? (5) How well do conclusions drawn from a single dataset transfer to an independent dataset with a different setup or missingness pattern, namely, how generalizable are they? In this study, we determine patterns and explore mechanisms of missingness in laboratory variables from the Geisinger Healthcare System in Pennsylvania and Sutter Health in California (Fig. 1) for two distinct conditions. We evaluate the performance of commonly used imputation algorithms, focusing on model-based MI frameworks that can accommodate high missingness rates (>50%). We simulate two mechanisms of missingness, arbitrary and monotone, by randomly holding out laboratory values (HV) and complete patient records (HC) to mimic different patterns of missingness observed in EHRs (Fig. 2a–d), and evaluate the performance of the algorithms. Finally, we use a case study to assess the value of applying latent information derived from comorbidities as auxiliary variables to predict hemoglobin A1c (HbA1c).
Results
Laboratory measures characteristics
Overall, 45 quantitative laboratory variables from GNSIS (n = 9037) and 38 from HF (n = 5192) with <75% missingness were analyzed in this study. Kernel density plots were used to illustrate the data distribution of each variable before the index date (Supplementary Fig. 1). The laboratory variables from the two EHR datasets are summarized in Table 1 (see Supplementary Table 1 for detailed information).
For variables collected as a panel (e.g., CBC, electrolytes, liver function, kidney function, lipid panel, and metabolic panel), missingness usually occurred concurrently (Fig. 2e, f). The selection of laboratory variables for imputation was guided by the correlation matrix and by the connection between missingness and observation among the variables, visualized with a fluxplot. The pairwise correlation between the two observations (before and after the index date) was moderate (|R| ≈ 0.5) across all variables (Supplementary Table 1). In contrast, the correlation coefficients between selected variables from each test panel were low (|R| < 0.2) yet still statistically significant (Supplementary Fig. 6). According to the fluxplot (Fig. 2e, f), electrolyte and glucose levels had the highest outflux (Oj), suggesting that their observed data were connected to the missing data of other variables, whereas HbA1c and coagulation-related variables had the highest influx (Ij), suggesting that their missingness was connected to the observed data of other variables. All these laboratory variables were included in the MI procedure.
Analyses of missingness patterns and mechanisms
Missingness before (Fig. 2a, c) or after (Fig. 2b, d) the index date was likely to be “monotone” with some degree of randomness. As summarized in Fig. 2, we noticed that (1) missingness was higher before the index date than after it in the GNSIS dataset; (2) the HF data had a higher percentage of missingness both before and after the index date compared with the GNSIS dataset; (3) only a small portion of patients had repeated measurements (see Fig. 2g, h for the percentage of subjects with more than one measurement); and (4) combining data from before and after the index date reduced the level of missingness only in the GNSIS dataset (Fig. 2g, h).
Further analysis of the pattern of missingness was performed using margin plots. We assessed the missingness pattern between “before the index date” and “after the index date” or between two different laboratory variables (Supplementary Fig. 2). We randomly selected four laboratory variables, one from each panel, with different levels of missingness. The pattern of laboratory measures was not inconsistent with MAR. Under the MAR assumption, the distributions shown in the side boxplots of one laboratory variable, conditioned on whether the other laboratory variable was observed (blue) or missing (red), could differ in both location (median) and spread (IQR). However, no clusters formed in the scatterplots, and no significant shift in the boxplots between missing (red) and observed (blue) values was detected (Supplementary Fig. 2).
The co-analysis of patient comorbidities and missingness of laboratory measurements revealed that missingness was related to disease burden: patients with a higher disease burden had less missingness in both the GNSIS and HF datasets. For each laboratory variable, we assessed the association between missingness and each main principal component (PC, labeled as Dim) extracted from the comorbidity matrix (Fig. 3a, b). Patients with observed laboratory values had significantly higher PC values (red dots) than patients with a missing value.
We studied two simulation policies, holding out 50 random laboratory values (HV) and 50 complete patient records (HC), to mimic different patterns of missingness. The GNSIS dataset included 393 complete cases and the HF dataset included 777 complete cases, from which the 50 HC were randomly drawn for each cohort. Analysis of the 50 HV per variable showed no significant association with any of the PCs across all variables (blue dots), suggesting a MAR pattern; however, analysis of the 50 HC showed a significantly higher PC value (green dots), at least for the first main PC (labeled as Dim.1), in GNSIS (Fig. 3a and Supplementary Fig. 7a) but not in the HF dataset (Fig. 3b and Supplementary Fig. 7b). These observations highlight that patterns of missingness can have unique attributes based on the originating centers and associated phenotypes.
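To illustrate how such an association can be checked, the following R sketch derives principal components from a binary patient-by-comorbidity matrix and compares PC scores between patients with observed and missing values of one laboratory variable. The object names (`comorb`, `labs`, `hba1c`) are hypothetical, and a Wilcoxon rank-sum test is used for illustration; it is not necessarily the test used to generate Fig. 3.

```r
# PCA of the binary patient-by-comorbidity matrix (rows aligned with `labs`);
# assumes no zero-variance columns (see the >=20% prevalence filter in Methods)
pca    <- prcomp(comorb, center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:5])          # first five PCs (Dim.1 ... Dim.5)

# Missingness indicator for one laboratory variable, e.g., HbA1c
miss <- is.na(labs$hba1c)

# Compare each PC between patients with missing vs. observed HbA1c
sapply(scores, function(pc) wilcox.test(pc[miss], pc[!miss])$p.value)
```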
Coverage rate comparison among different imputation models
The variability (95% CI) of the mean coverage rate (CR) was generally higher for PMM than for the 2l.pan algorithm. The mean CR for the 50 holdouts, representing the proportion of confidence intervals (CI) that contained the true value, was evaluated under both simulation policies for both datasets and all imputation algorithms included in this study (Fig. 4). For both policies (HV and HC), 2l.pan-FCS and 2l.pan-monotone showed better CR than the cross-sectional PMM-FCS and PMM-monotone imputation (see Fig. 4a for GNSIS and Fig. 4b for HF). Finally, the results obtained from the average width were consistent with the CR results (Supplementary Fig. 8).
Uncertainty propagation
The nRMSE after repeated MI was assessed dynamically to determine the level and speed of the uncertainty propagated after 5, 10, 20, 30, 40, and 50 repeated imputations. Our results (Fig. 5) showed that (1) the mean and standard error of the nRMSE for HC were generally larger than for HV; (2) for HV, the mean nRMSE stabilized after 30 repeats for most variables, and the standard error of the nRMSE stabilized after 20 or 30 repeats as well; (3) in general, FCS performed better than monotone, and 2l.pan performed better than PMM for the majority of variables in both datasets; and (4) for HC, the mean nRMSE did not converge for some variables even with 50 runs. This latter observation highlights that, when missingness follows an MNAR pattern, a higher number of runs is needed to ensure that the nRMSE is stabilized.
Performance evaluation of model
We assessed model performance for the different algorithms and simulation policies. Overall, for the HV simulation policy, FCS performed better than monotone in both datasets (Supplementary Fig. 5a, b). However, the advantage of FCS over the monotone procedure was unclear for the HC simulation policy, particularly in the HF dataset. The multi-level (2l.pan) imputation outperformed the cross-sectional PMM, as indicated by a significantly lower nRMSE (after correction for multiple testing) in Fig. 6. With 50 HVs in GNSIS, 21 of the 45 (46.7%) variables showed a significantly lower nRMSE for 2l.pan-FCS than for 2l.pan-monotone; this number was 8 of 45 (17.8%) for the 50 HCs. Similarly, we identified 10 of 45 (22.2%) variables with a significantly lower nRMSE for 2l.pan-FCS than for PMM-FCS, while 15 of 45 (33.3%) variables showed similar results for the 50 HCs. Analysis of the HF dataset corroborated these observations; in particular, 17 of the 38 (44.7%) variables showed a significantly lower nRMSE for 2l.pan-FCS than for 2l.pan-monotone, and 8 of the 38 variables had a significantly lower nRMSE for 2l.pan-FCS than for PMM-FCS. Laboratory variables in the same panel (e.g., electrolytes, lipid panel, CBC) showed similar patterns (Fig. 6).
Finally, our comprehensive analysis, including uncertainty assessment, showed that the standard error of the imputed values and their deviation from the regression line, estimated by the correlation coefficient (R), were higher under the HC simulation policy across all laboratory variables for both datasets, as shown by the over-imputation plots (Supplementary Fig. 4). This observation emphasizes the need for a more careful assessment of uncertainty when analyzing laboratory variables with MNAR patterns.
A case study for hemoglobin A1c
We designed a case study to assess the practical value of improved imputation for hemoglobin A1c (HbA1c, LOINC ID: 17856-6), which had the highest level of missingness. HbA1c also had the highest influx (Ij), suggesting that its missing data are connected to the observed data of other variables in a multivariate MI model.
The over-imputation plots (Fig. 7 for FCS and Supplementary Fig. 9 for monotone) show the correlation between the 50 holdouts and the imputed mean values after 50 repeated MI. The R value labeled in each panel represents the best correlation coefficient that could be reached by the different imputation algorithms under the different settings (multivariate or univariate missing).
Within the multivariate missing-data framework, 2l.pan outperformed PMM for this variable, with a larger average correlation coefficient between imputed and holdout values under the two simulated missingness patterns (Table 2). The average correlation coefficient (R) was higher when using multivariate 2l.pan (e.g., R = 0.536 for 50 HVs using FCS) than multivariate PMM (e.g., R = 0.401 for 50 HVs using FCS), regardless of the imputation procedure (FCS or monotone) or simulation policy (HV or HC). Imputation performance slightly improved, with a higher average R and a lower variance (standard error, SE) and coefficient of variation (CV), when using univariate 2l.pan that included PCs derived from comorbidity information as latent variables (e.g., R = 0.473, SE = 0.012, CV = 0.179, compared with R = 0.462, SE = 0.014, CV = 0.214 for 50 HVs; and R = 0.3, SE = 0.016, CV = 0.377, compared with R = 0.271, SE = 0.019, CV = 0.496, for 50 HCs). In all of our simulation experiments, HC consistently showed a lower correlation (average R) and a larger SE than HV, suggesting increased imputation variance.
Discussion
The laboratory values in this study were collected from two different disease cohorts, ischemic stroke and heart failure, and the data were acquired from the EHRs of two large healthcare systems in different geographical areas with distinct ethnic distributions. Using these datasets, our study (1) improved the understanding of missingness patterns in real-world EHRs, (2) assessed and compared the performance of commonly used imputation algorithms applied to a broad range of laboratory variables, and (3) identified strategies for enhancing imputation performance by leveraging auxiliary information from patients’ comorbidity data.
Our analysis of quantitative laboratory variables from the two datasets indicates an MNAR mechanism, which the margin plots could not reveal without in-depth knowledge of the cohort, such as comorbidities10. MNAR is a type of missingness in which the value of the variable that is missing is related to the reason it is missing; in other words, the missingness depends on the missing values themselves given the observed data. MNAR has been recognized in clinical trial data16,27 as well as in the EHR data from this study. Missingness in repeated measurements in clinical trial data is related to patients’ responsiveness to treatment, resulting in compliance and dropout issues. Similarly, in this study the missingness in all common laboratory variables was related to individual disease burden. This unintentional missingness can easily go unrecognized, and other known (insurance, socioeconomic status, educational background) or unknown factors might also contribute to MNAR. Our analysis showed that the probability of missingness for all laboratory variables was related to disease burden. Patients with missing values are more likely to have a laboratory value within the normal range than within the range of the observed data.
Our data also showed that when one test result was missing for a patient, other tests with a higher missingness rate were also likely to be missing for that patient, suggesting a “monotone” pattern of missingness. A “monotone” pattern may imply that missingness tends to occur in a group of patients who do not seek health care regularly. Both datasets had a combination of monotone missingness and varying degrees of random missingness. The missing rates before and after the index date were similar in the HF dataset; in the GNSIS dataset, however, there was less missingness after the index date than before it (Fig. 2g, h). The difference between GNSIS and HF could be due, in part, to the higher mortality in HF and to differences in socioeconomic status (e.g., insurance). Nonetheless, one should not assume that the rate of missingness will always be lower after a patient has an acute event.
Our simulation policy experiments (HC and HV) were designed to mimic different patterns of missingness in EHRs. Using these simulations, we were able to identify experimental design strategies to improve model performance and the stability of the nRMSE. To determine how many repeated imputations are necessary to reach an unbiased conclusion about the performance of commonly used MI algorithms, we evaluated the nRMSE and compared the level and speed of the uncertainty propagated after 5, 10, 20, 30, 40, and 50 repeated imputations. With the first 5 to 10 complete imputed sets, the mean nRMSE from 2l.pan-FCS may not differ significantly from that of 2l.pan-monotone; after 50 repeated imputations, however, the mean nRMSE reached a plateau for most of the laboratory variables in the HV design. In the HC design, by contrast, the nRMSE did not reach a plateau for some variables even after 50 repeats, irrespective of the imputation algorithm. The latter suggests that the uncertainty introduced by MI was larger, but propagated more slowly, for the most informative cases when missingness was monotone. This observation corroborates that monotone missingness in informative cases is the worst type of missingness, which translates to a lack of routine checkups or follow-up in at-risk patients.
Our simulation results indicate that cross-sectional PMM may not be the optimal algorithm for a small dataset with a high proportion of missing values when compared with multi-level imputation (e.g., 2l.pan). The 2l.pan algorithm leverages both level 1 and level 2 variables and allows regression imputation to switch between level 1 and level 2 data28. Indeed, in the HV simulation experiments, 2l.pan showed better CR than PMM. The PMM algorithm was developed as a semi-parametric approach to imputation for settings where the normal distribution is not an appropriate assumption. Thus, PMM, as a hybrid of donor-based and regression-based algorithms, was compared with the multi-level imputation. The uncertainty of missing values can be underestimated by PMM, resulting in poor coverage with increased variability of the CR, because only a few similar observed cases were available for some variables with a high level of missingness.
However, the advantage of multi-level over cross-sectional imputation was observed primarily in the GNSIS cohort. The lack of improvement in the HF cohort was likely because there was no substantial increase in the percentage of subjects with at least one measurement after the event (see Fig. 2h). Multi-level imputation has limited ability to leverage post-event information to better predict pre-event missing values.
When comparing monotone imputation with FCS imputation and its Monte Carlo iterative procedure, we consistently observed better performance with FCS. We also compared the cross-sectional imputation (PMM) with multi-level multivariate imputation such as 2l.pan (FCS-LMM) or 2l.norm (FCS-LMM-het), which are based on an assumption of homogeneous or heterogeneous within-group variances, respectively18,29. Our analysis showed that when the imputed data fell outside the normal range, the higher variation may have increased the within- and between-imputation variance without improving prediction accuracy. This points to an important aspect of the utility of laboratory measurements: in most realistic clinical settings, a diagnosis is based on values that are outside the normal range23.
We evaluated the performance gain from incorporating auxiliary information from patients’ comorbidity data by co-analyzing patient diagnosis patterns in conjunction with their laboratory measurements. We introduced PCs derived from the PCA of the comorbidity matrix into multi-level univariate imputation algorithms such as 2l.pan. Using this design strategy, we were able to add latent variables to the final prediction model30.
Including proper auxiliary variables mitigates the bias in maximum likelihood estimates caused by a MAR or MNAR mechanism, particularly when the imputed variable and the auxiliary variables are nonlinearly related31. Our results from univariate imputation including PCs as auxiliary (latent) variables also showed reduced bias in the estimates. However, including auxiliary variables in the imputation may substantially increase the standard errors of the estimates when the sample size is small and the proportion of missing data is not trivial. Such an adverse effect may also occur when auxiliary variables are included to make the MAR assumption more plausible, especially when the auxiliary variables are not normally distributed31. Finally, when a variable is the outcome of interest in the analytic model, it is highly recommended that this variable be included in the imputation model to improve performance in the analytic model32.
The strengths of this study are the following: (1) description of the missingness pattern and exploration of the missingness mechanism in laboratory variables from EHR databases; (2) simulation of the two missingness patterns recognized in this study, monotone and arbitrary; (3) comparative assessment of well-established, commonly used cross-sectional and multi-level imputation methods, integrated with two imputation procedures (monotone or FCS); (4) use of latent information extracted from the comorbidity matrix as auxiliary variables in the imputation model; and (5) evaluation of the generalizability of the findings by analyzing two datasets with distinct disease cohorts from two different healthcare systems. Furthermore, the phenotype definition for each of the conditions was carefully assessed and validated in previous publications33,34,35,36.
Since simulating the missingness pattern of laboratory variables in EHRs is challenging, the pattern of missingness and the imputation strategy used in this study may not apply directly to other diseases or datasets. Furthermore, the imputation of missingness was based on quantitative variables and may not translate to categorical variables or derived/modified variables (ratios, converted values). Given that our understanding of realistic missingness patterns is still limited, only two simulated missingness scenarios were evaluated in this study, and these may not occur in, or fully represent, the patterns of missingness encountered in realistic settings. Finally, multiple MI methods, including FCS, joint modeling (JM), EM-based algorithms, and their extended forms, have also been applied to longitudinal and clustered data29,32. As our goal was to align with current standard practices in EHR mining, our study did not include all regression-based algorithms, JM, or EM-based algorithms.
As future directions, we are exploring how the inclusion of auxiliary variables affects the bias and precision of the imputation models. In this analysis, we are assessing parameters such as cohort sample size, number of imputations, missing rate, number of iterations, and the correlation between variables. The EHR dataset could also be nested hierarchically by healthcare center. Treating the healthcare center as an additional level of data clustering will be considered in the multi-level imputation model, especially when data from different centers are pooled for analysis. Finally, our study is part of a larger effort to improve risk stratification for heart failure and ischemic stroke using machine learning applied to EHR data.
In conclusion, the pattern of missingness in EHR laboratory variables was not random and was highly associated with patients’ comorbidity data. Multi-level imputation (2l.pan) showed a smaller nRMSE for most variables compared with cross-sectional methods. MI with Markov chain iterations, such as FCS, performed better than the monotone procedure. In the HbA1c case study, univariate imputation using a multi-level model with FCS, which leveraged comorbidities as latent variables in the imputation, outperformed the same method without these auxiliary variables.
Finally, the missingness pattern and mechanism of a given dataset should be characterized first. Whether a particular method or procedure is favored has to be determined on “real-world” data with “real-world” missingness, by considering recognized and unrecognized missingness patterns and mechanisms as well as the plausible distribution of the missing data. Our study provides benchmarking and practical recommendations based on common algorithms for imputing laboratory variables, provided these variables follow similar missingness patterns.
Methods
This study was approved by both the Geisinger and Sutter Health Institutional Review Boards, and a waiver of consent was granted because de-identified EHR data were used. Ordered and resulted laboratory tests completed within the index date ± 2 years for the Sutter Health heart failure (HF) cohort or the index date ± 3 years for the Geisinger NeuroScience Ischemic Stroke (GNSIS) cohort were used for imputation, where the index date was defined as the first time the disease of interest (i.e., ischemic stroke or heart failure) met the diagnostic criteria33,34,35,36. Only quantitative laboratory values were considered for imputation. Similar to a moving time window and stepwise regression procedure37,38, the last valid observation before and the first observation after the index date were extracted from the corresponding time blocks. Imputation of missing values in each laboratory variable was based on the observed values of that variable and of the other laboratory variables. We first assessed the missingness pattern between variables, time blocks, and cohorts. We simulated two missingness patterns by randomly holding out 50 laboratory values (HV) and 50 complete patient records (HC): HV mimics missing-completely-at-random (MCAR) or missing-at-random (MAR) missingness, and HC mimics monotone missingness. We selected commonly used error metrics to assess the performance of the algorithms. In a case study, we imputed hemoglobin A1c with and without comorbidity-derived latent information to evaluate the utility of auxiliary variables in a univariate imputation framework.
Data sources
Two distinct datasets were used: the GNSIS cohort33,34,35 and the Sutter Health heart failure cohort (HF)36. The investigators in this study had no control over missingness in the EHR data collection.
The GNSIS database is composed of EHR data for patients with well-defined ischemic stroke from September 2003 to May 201933,34. The ICD-9-CM/ICD-10-CM diagnostic criteria for the phenotypes were previously published34,35. Comorbidity information based on ICD-9-CM or ICD-10-CM diagnoses was extracted within the index date ± 3 years. A comorbidity was defined as a qualifying diagnosis associated with either two outpatient visits or one inpatient visit. The entire laboratory dataset for the cohort, based on Logical Observation Identifiers Names and Codes (LOINC), was extracted and included in this study.
The Sutter Health HF database includes incident heart failure cases identified from the Sutter Health primary care population36. Longitudinal EHR data were extracted for incident cases diagnosed between January 1, 2010, and December 31, 2017. Encounter-based laboratory results with the corresponding LOINC identifiers within a 2-year window before or after the index date were extracted. For the diagnosis domain, all ICD-10 codes were first converted to ICD-9 codes. ICD-9 codes from outpatient office visits or phone visits were grouped using the Clinical Classifications Software (CCS) [https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp]. CCS level 3 was adopted to group 5379 ICD-9 codes into 363 unique CCS groups.
To minimize data sparsity, ICD and CCS codes were only used if they were observed in at least 20% of the patients.
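A minimal sketch of this filtering step (R; `code_mat` is a hypothetical binary patient-by-code indicator matrix, and the 20% threshold follows the text):

```r
# Keep only ICD/CCS codes observed in at least 20% of patients
prevalence <- colMeans(code_mat > 0)
code_mat   <- code_mat[, prevalence >= 0.20, drop = FALSE]
```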
Recognition of missing pattern and mechanism
Missing values were defined as either not tested, or tested but with values outside three interquartile ranges (IQR). The analysis of missingness was limited to laboratory variables (see Supplementary Table 1) with <75% missingness39.
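The sketch below (R; `labs` is a hypothetical data frame of quantitative laboratory variables) illustrates one reading of this rule, treating values farther than three IQRs from the quartiles as missing and then dropping variables with at least 75% missingness; the authors' exact outlier definition may differ.

```r
mask_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  w <- 3 * IQR(x, na.rm = TRUE)
  replace(x, !is.na(x) & (x < q[1] - w | x > q[2] + w), NA)   # extreme values become NA
}

labs[]   <- lapply(labs, mask_outliers)               # flag extreme values as missing
miss_pct <- colMeans(is.na(labs))
labs     <- labs[, miss_pct < 0.75, drop = FALSE]     # keep variables with <75% missingness
```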
We created a fluxplot30 to capture the relationship between variables. In particular, the fluxplot can facilitate the identification of the relationship of missing and observed data between variables using influx and outflux coefficients and the tradeoff between them. The influx coefficient (Ij) of a variable quantifies how well the missing data is connected to the observed data on other variables (see Eq. (1)); the outflux coefficient (Oj) of a variable quantifies how well the observed data is connected to the missing data on other variables17 (see Eq. (2)). In general, variables that are located closer to the sub-diagonal tend to be better connected than those farther away.
The influx coefficient Ij is defined as30

$$I_j = \frac{\sum_{k = 1}^{p}\sum_{i = 1}^{n}\left(1 - r_{ij}\right)r_{ik}}{\sum_{k = 1}^{p}\sum_{i = 1}^{n} r_{ik}}\qquad(1)$$

The coefficient is equal to the number of variable pairs (Yj, Yk) with Yj missing and Yk observed, divided by the total number of observed data cells. R is an n by p matrix filled with 0 or 1 as a response indicator. The elements of Y and R are denoted by yij and rij, respectively, where the subject index is i = 1, 2, …, n and the variable index is j = 1, 2, …, p. If yij is observed, then rij = 1, and if yij is missing, then rij = 0; rik is defined analogously.
The outflux coefficient Oj is defined in an analogous way as30

$$O_j = \frac{\sum_{k = 1}^{p}\sum_{i = 1}^{n} r_{ij}\left(1 - r_{ik}\right)}{\sum_{k = 1}^{p}\sum_{i = 1}^{n}\left(1 - r_{ik}\right)}\qquad(2)$$

The quantity Oj is the number of variable pairs with Yj observed and Yk missing, divided by the total number of incomplete data cells.
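Both coefficients and the fluxplot are available directly in the mice R package used in this study; a minimal sketch, with `labs` as above:

```r
library(mice)

# Influx (Ij) and outflux (Oj) for every laboratory variable
fx <- flux(labs)
fx[order(-fx$outflux), c("pobs", "influx", "outflux")]

# Graphical version used to select variables for the imputation model
fluxplot(labs)
```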
We explored the pattern of missingness using the Rubin40 classification: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). We used margin plots (Supplementary Fig. 2) to capture the missingness pattern between “before the index date” and “after the index date” or between two laboratory variables.
Simulation of missingness
For holdout values (HV), the 50 randomly selected holdout values per variable came solely from the observed data, so the resulting missingness was MAR. When the 50 holdouts were drawn at random from a variable without missing values, the probability of being missing was the same for all cases, and the variable was MCAR. Thus, HV represented MAR or MCAR missingness as defined by Rubin40 and others30.
For holdout cases (HC), we held out the entire set of laboratory values for 50 cases randomly selected from all complete cases. Under HC, we maintained the ordering of the missingness levels across all variables and kept the original connection between missingness in one variable and observation in another throughout the dataset, except for the holdout cases. The missingness simulated by this procedure reflected the theory of monotone missingness41, namely that ordering one laboratory test depends on other tests, or that missingness in other variables results in missingness in a given variable.
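A sketch of the two hold-out policies (R; `labs` is the patient-by-variable laboratory matrix and the hold-out counts follow the text):

```r
set.seed(2021)

# HV: hold out 50 observed values per variable (arbitrary, MAR/MCAR-like missingness)
labs_hv <- labs
hv_idx  <- lapply(labs, function(x) sample(which(!is.na(x)), 50))
for (j in seq_along(labs_hv)) labs_hv[hv_idx[[j]], j] <- NA

# HC: hold out all laboratory values of 50 randomly chosen complete cases
# (monotone-like missingness concentrated in informative patients)
labs_hc <- labs
hc_idx  <- sample(which(complete.cases(labs)), 50)
labs_hc[hc_idx, ] <- NA
```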
Imputation strategy
Monotone multiple imputation: Multiple imputation (MI) is characterized by imputing each missing measure multiple times. We utilized the latest implementation of the monotone MI algorithm in MICE30,41 to impute each missing value, where a missing pattern is said to be “monotone” if the variables Yj (j = 1, 2, …, p) can be ordered such that if Yj is missing, then all variables Yk with k > j are also missing.
The procedure for multivariate monotone imputation30
1. Create a short format of the GNSIS or HF dataset and choose a single-level (PMM) or multi-level (2l.pan) imputation model;

2. Sort the p incomplete variables (j = 1, 2, …, p) from low to high according to the frequency of missingness. Y denotes the n by p matrix containing the data values on p variables for all n units in the sample; \(Y_j^{{\mathrm{obs}}}\) represents the vector of observed values for variable j;

3. Draw a temporary parameter \(\dot \phi _1\) from a univariate conditional density function, P(\(Y_1^{{\mathrm{obs}}}\)|X), where X represents the completely observed covariates such as TIME and SEX;

4. Impute a temporary \(\dot Y_1\) based on P(\(Y_1^{{\mathrm{mis}}}\)|X, \(\dot \phi _1\));

5. Draw \(\dot \phi _2\) ∼ P(\(Y_2^{{\mathrm{obs}}}\)|X, \(\dot Y_1\));

6. Impute \(\dot Y_2\) ∼ P(\(Y_2^{{\mathrm{mis}}}\)|X, \(\dot Y_1\), \(\dot \phi _2\));

7. ⋮

8. Draw \(\dot \phi _p\) ∼ P(\(Y_p^{{\mathrm{obs}}}\)|X, \(\dot Y_1\), …, \(\dot Y_{p - 1}\));

9. Impute \(\dot Y_p\) ∼ P(\(Y_p^{{\mathrm{mis}}}\)|X, \(\dot Y_1\), …, \(\dot Y_{p - 1}\), \(\dot \phi _p\));

10. Repeat steps 3–9 m − 1 more times to obtain m complete sets;

11. (optional) Apply the analysis model (LMM) and calculate estimates (exponentiated) and variances;

12. (optional) Combine the results by Rubin’s rule to obtain mean estimates (exponentiated) and variance (including within-imputation variance and between-imputation variance).
Note: this algorithm not only incorporates the uncertainty due to deviations around the regression line (step 3) but also reflects the variation of the regression line itself due to finite sampling (step 8).
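In the mice package, a monotone (or near-monotone) pattern can be handled by visiting the variables in order of increasing missingness with a single pass per imputation; a hedged sketch, where the number of imputations follows the study design and the remaining settings are illustrative:

```r
library(mice)

# Monotone multiple imputation: variables are visited from least to most missing
imp_mono <- mice(labs,
                 method        = "pmm",       # cross-sectional variant; see the multi-level
                                              # 2l.pan specification later in the Methods
                 visitSequence = "monotone",  # sort columns by increasing missingness
                 maxit         = 1,           # single pass, as in monotone data imputation
                 m             = 50,          # number of imputed datasets
                 seed          = 2021,
                 printFlag     = FALSE)
```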
Fully conditional specification
Fully conditional specification (FCS), also known as chained equations or sequential regression, is an iterative Markov chain method that can be used when the pattern of missing data is arbitrary or a mixture of monotone and arbitrary. In contrast to monotone imputation with its fixed imputation sequence, FCS draws missing values iteratively from a specified set of conditional distributions, \(P(Y_j|X,Y_{ - j},R,\phi _j)\)30. Applying this iterative procedure to update the parameters (intercept, slope, and error) for a given number of iterations (for instance, n = 500) generates one imputed complete set; repeating the entire imputation process m − 1 more times yields m imputed complete sets. MI thus “fills in” the missing data with plausible values while adding variability to the analyses, facilitating parameter estimation for each incomplete variable.
The procedure for multivariate FCS imputation30
1. Create a short format of the GNSIS or HF dataset and choose a single-level (PMM) or multi-level (2l.pan) imputation model;

2. Specify an imputation model P(\(Y_j^{{\mathrm{mis}}}\)|\(Y_j^{{\mathrm{obs}}},Y_{ - j}\), R, ϕj) for each Yj with variable index j = 1, 2, …, p, without sorting the variables by frequency of missingness. Y−j represents all variables other than variable j; R is an n by p matrix filled with 0 or 1 as a response indicator. The elements of Y and R are denoted by yij and rij, respectively, where the subject index is i = 1, 2, …, n and the variable index is j = 1, 2, …, p. If yij is observed, then rij = 1, and if yij is missing, then rij = 0. ϕj denotes the unknown regression model parameters for variable j (see Schafer et al. for the PAN model parameters)42; t denotes the number of MCMC iterations;

3. For each j, fill in starting imputations \(Y_j^{ \cdot 0}\) by random draws from \(Y_j^{{\mathrm{obs}}}\), and fill in the starting value for ϕ0 by the Gibbs sampler in the MCMC procedure;

4. For t ← 1 to N, where N = 500 burn-in iterations;

5. Repeat

6. For j ← 1 to p, where p is 45 or 38 for GNSIS or HF, respectively;

7. Define \(\dot Y_{ - j}^t\) = (\(\dot Y_1^t\), …, \(\dot Y_{j - 1}^t\), \(\dot Y_{j + 1}^{t - 1}\), …, \(\dot Y_p^{t - 1}\)) as the currently complete data except Yj;

8. Draw \(\dot \phi _j^t\) ~ P(\(\phi _j^t\)|\(Y_j^{{\mathrm{obs}}}\), \(\dot Y_{ - j}^t\), R); this step obtains a new regression model parameter;

9. Draw imputations \(\dot Y_j^t\) ~ P(\(Y_j^{{\mathrm{mis}}}\)|\(Y_j^{{\mathrm{obs}}}\), \(\dot Y_{ - j}^t\), R, \(\dot \phi _j^t\)); this is the imputation step;

10. End repeat over j;

11. End repeat over t;

12. Repeat steps 3–11 m − 1 more times to obtain m complete sets;

13. (optional) Apply the analysis model (linear mixed-effects regression model) and calculate estimates (exponentiated) and variances;

14. (optional) Combine the results by Rubin’s rule to obtain mean estimates (exponentiated) and variance (including within-imputation variance and between-imputation variance).
We chose predictive mean matching (PMM) as the benchmark method for the cross-sectional imputation of continuous variables because it is a hot-deck method, in which values are imputed using existing values from complete cases matched with respect to some metric15. In this study, we used Type 1 matching with a Bayesian β and a stochastic matching distance30. For each missing value, PMM finds a set of observed values (e.g., five donors) from all complete cases whose predicted values are closest to the predicted value for the missing entry, and takes the observed value of one of these donors as the imputed value for that entry. Imputed values from PMM are therefore restricted to the observed values. For the PMM-FCS approach, we also evaluated the mean and standard deviation of each laboratory value after each iteration (n = 10 for GNSIS; n = 15 for HF, owing to a higher level of missingness) to ensure statistical convergence; the Monte Carlo iterative procedure does not apply to monotone imputation. For FCS, the default of 500 iterations was selected. We utilized the latest implementation of the PMM-FCS and PMM-monotone algorithms in MICE30,41.
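A sketch of the cross-sectional PMM-FCS run and its convergence check (R; the numbers of imputations and iterations follow the text, and the remaining arguments are mice defaults):

```r
library(mice)

imp_pmm <- mice(labs,
                method    = "pmm",   # predictive mean matching (Type 1 matching by default)
                donors    = 5,       # candidate donors per missing entry
                m         = 50,      # repeated imputations
                maxit     = 10,      # 10 iterations for GNSIS (15 were used for HF)
                seed      = 2021,
                printFlag = FALSE)

# Trace plots of the mean and standard deviation of each imputed variable across
# iterations, used to check convergence of the FCS sampler
plot(imp_pmm)

# One completed dataset (use action = "all" to get the full list of m sets)
labs_complete1 <- complete(imp_pmm, 1)
```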
Multi-level multivariate missing imputation
EHR data can be regarded as multi-level time-series data. We considered the repeated measurements at the individual level as level one data (see the “Level one model” in Eq. (3)). Covariates such as TIME (i.e., before or after the index date, dummy coded) can be treated either at level one only (Level one model) or at both level one and level two (i.e., a random intercept in Eq. (4) and a random slope in Eq. (5)).
We used the MICE 2l.pan42 or 2l.norm43 algorithms for the imputation, based on an assumption of homogeneous or heterogeneous within-group (i.e., patient ID) variances in the level one data, respectively43. We defined the cluster variable (C) as “ID”. We compared the two-level model with the cross-sectional PMM model to determine whether the mixed model yielded any significant improvement in the prediction of missingness.
Level one model:

$$\mathrm{LAB}_{jic} = \beta_{0c} + \beta_{1c}\,\mathrm{TIME}_{ic} + \boldsymbol{\beta}_{-jc}\,\mathrm{LAB}_{-jic} + \varepsilon_{ic}\qquad(3)$$

Level two model with a random intercept:

$$\beta_{0c} = \alpha_{00} + u_{0c}\qquad(4)$$

Level two model with a random intercept and a random slope for TIME (optional):

$$\beta_{0c} = \alpha_{00} + u_{0c},\qquad \beta_{1c} = \alpha_{01} + u_{1c}\qquad(5)$$

where εic is a value drawn from a normal random vector with mean = 0 and variance = \(\delta _\varepsilon ^2\) for the imputed variable j in each cluster (C); β0c represents the intercept, which carries an additional random error in the random-intercept model: β0c = α00 + u0c is a constant intercept modified by a random error u0c, drawn from a normal random vector with mean = 0 and variance = \(\delta _{u_0}^2\) for each cluster (C); and β−jcLAB−jic represents the group of additive terms derived from all variables other than variable j in each cluster (C) in a stochastic linear regression model. Some variables, e.g., TIME, can have both a fixed (β1c) and a random (α01) effect in the multi-level imputation model; in that case β0c represents a fixed intercept with random error and TIME has a random slope.
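A sketch of how such a multi-level model can be specified in mice (R; the data are assumed to be in long format, `labs_long`, with one row per patient and time block and columns ID, TIME, and the laboratory variables; the predictor-matrix codes follow the mice conventions for 2l.pan, and the exact settings used in the study may differ):

```r
library(mice)   # mice.impute.2l.pan additionally requires the pan package

labs_long$ID <- as.integer(factor(labs_long$ID))   # class (cluster) variable must be integer

pred <- make.predictorMatrix(labs_long)
pred[, "ID"]   <- -2    # -2: class variable (patient ID)
pred[, "TIME"] <-  2    #  2: fixed and random effect for TIME
pred["ID", ]   <-  0    # complete covariates are not imputed
pred["TIME", ] <-  0

meth <- make.method(labs_long)      # "" for complete columns
meth[meth != ""] <- "2l.pan"        # homogeneous within-patient variances
# meth[meth != ""] <- "2l.norm"     # heterogeneous within-group variances instead

imp_2l <- mice(labs_long, method = meth, predictorMatrix = pred,
               m = 50, maxit = 10, seed = 2021, printFlag = FALSE)
```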
Multi-level univariate missing imputation
Multi-level univariate imputation was considered as an alternative approach only when a single continuous variable was assumed to have missing values (univariate missing data). The comorbidity information (in the form of CCS codes for the HF cohort or ICD codes for the GNSIS cohort) was used in a principal component analysis (PCA). Based on the scree plot, the five major PCs, which explained more than 60% of the variance, were selected as auxiliary variables for the univariate imputation.
We applied the multi-level univariate imputation to one missing laboratory value at a time, along with the PCs extracted from the comorbidity matrix. As in the multivariate imputation above, this method can have a level one model (Eq. (6)) and a level two model (a random intercept in Eq. (7) and a random slope in Eq. (8)); a code sketch of this setup follows the model description below.
Level one model for incomplete quantitative variables:

$$\mathrm{LAB}_{jic} = \beta_{0c} + \beta_{1c}\,\mathrm{TIME}_{ic} + \textstyle\sum_{m}\beta_{mc}\,\mathrm{PC}_{mic} + \varepsilon_{jc}\qquad(6)$$

Level two model with a random intercept:

$$\beta_{0c} = \alpha_{00} + u_{0c}\qquad(7)$$

Level two model with a random intercept and a random slope for TIME (optional):

$$\beta_{0c} = \alpha_{00} + u_{0c},\qquad \beta_{1c} = \alpha_{01} + u_{1c}\qquad(8)$$

where all βs are estimated from the complete cases; εjc is determined by the variance of the residual ε, which can be a random draw from the set of sample residuals of the complete cases with mean = 0 and variance = \(\delta _\varepsilon ^2\) for the imputed variable j in each cluster (C); β0c represents the intercept with an additional random error in the random-intercept model: α00 + u0c is a constant intercept modified by a random error u0c, drawn from a normal random vector with mean = 0 and variance = \(\delta _{u_0}^2\) for each cluster (C); and PCmic are the comorbidity-derived principal components used as auxiliary variables. Some variables, e.g., TIME, can have both a fixed (β1c) and a random (α01) effect in the multi-level imputation model, so that β0c represents a fixed intercept with random error and TIME has a random slope.
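For the univariate case, the sketch below combines the comorbidity-derived PCs with the multi-level model, using HbA1c as the target variable as in the case study (R; `comorb` is the binary comorbidity matrix with patient IDs as row names, `labs_long` is the long-format laboratory data as above, and the variable name `hba1c` is hypothetical):

```r
library(mice)

# Latent comorbidity information: first five PCs (explaining >60% of the variance)
pca <- prcomp(comorb, center = TRUE, scale. = TRUE)
pcs <- setNames(as.data.frame(pca$x[, 1:5]), paste0("Dim", 1:5))
pcs$ID <- rownames(comorb)

# Replicate each patient's PC scores across time blocks and keep only HbA1c
dat <- merge(labs_long[, c("ID", "TIME", "hba1c")], pcs, by = "ID")
dat$ID <- as.integer(factor(dat$ID))

pred <- make.predictorMatrix(dat)
pred[, "ID"] <- -2                  # cluster variable; TIME and PCs default to fixed effects
pred["ID", ] <- 0

meth <- make.method(dat)
meth["hba1c"] <- "2l.pan"           # only HbA1c is imputed (univariate missing data)

imp_uni <- mice(dat, method = meth, predictorMatrix = pred,
                m = 50, maxit = 10, seed = 2021, printFlag = FALSE)
```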
A case of hemoglobin A1c
HbA1c has been included as one of the major predictive variables in many diagnostic and prognostic models for cardiometabolic diseases and related complications. The high level of missingness of HbA1c in EHRs limits its use in prediction models because of the reduced sample size. The inclusion of imputed HbA1c in the prediction model for post-ischemic-stroke mortality proved important in our previous study using the GNSIS dataset44. Missing HbA1c values were imputed with the multi-level multivariate imputation approach as well as with the multi-level univariate imputation approach in which the comorbidities were taken as latent variables. HbA1c is connected to other metabolic diseases (comorbidities)45 and is therefore an ideal laboratory variable for univariate imputation using PCs from the comorbidity matrix as latent variables.
Model evaluation
In both the HV and HC experiments, we held out 50 observed values for each laboratory variable before the index date and calculated the errors between the observed and predicted values. We repeated each process up to 50 times and calculated the mean, standard error (SE), and 95% confidence interval of the predicted values for each holdout, from which we calculated the coverage rate (CR) and average width (AW). We used the normalized RMSE (nRMSE) to place this error metric on the same scale across laboratory variables. The stability of the mean and SE of the nRMSE, which reflects the propagation of uncertainty in the imputed holdouts over a sequence of repeated MI, was also assessed. Levene's test was used to determine whether the nRMSE from two compared imputation algorithms, e.g., 2l.pan-FCS and PMM-FCS, had equal variance. The Shapiro–Wilk test was applied to test the normality of the difference in nRMSE for each comparison. The nRMSEs derived from the different algorithms were compared using an unpaired t-test with Bonferroni correction for multiple testing. The algorithm with the smallest nRMSE was considered the optimal approach for that laboratory variable.
The evaluation metrics include the following measures:
1. Root mean square error (RMSE): RMSE penalizes larger errors and is sensitive to extreme values. We normalized the RMSE by the standard deviation, δ (see Eq. (9)).

$$\mathrm{nRMSE} = \frac{1}{\delta}\sqrt{\frac{\sum_{i = 1}^{n}\left(\hat Y_i - Y_i\right)^2}{n}}\qquad(9)$$

Note: Yi represents the holdout values and \(\hat Y_i\) is the corresponding imputed value.

2. Coverage rate (CR): CR represents the proportion of confidence intervals that contain the true (held-out) value. We calculated the mean CR for each subject after 50 repeated imputations.

3. Average width (AW): AW represents the average width of the confidence intervals and is an indicator of statistical efficiency. We calculated the mean AW for each subject after 50 repeated imputations.
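A sketch of the three metrics for one laboratory variable (R; `truth` holds the 50 held-out values and `draws` is a 50 x 50 matrix of imputed values, one row per repeated imputation and one column per holdout; taking the 95% interval directly from the imputation draws is a simplification of fully pooled MI intervals):

```r
# nRMSE (Eq. 9): RMSE normalized by the standard deviation of the held-out values
nrmse <- function(imputed, truth) sqrt(mean((imputed - truth)^2)) / sd(truth)

point <- colMeans(draws)                             # per-holdout mean imputed value
ci_lo <- apply(draws, 2, quantile, probs = 0.025)
ci_hi <- apply(draws, 2, quantile, probs = 0.975)

nRMSE <- nrmse(point, truth)
CR    <- mean(truth >= ci_lo & truth <= ci_hi)       # coverage rate
AW    <- mean(ci_hi - ci_lo)                         # average interval width
```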
Over-imputation scatter plots for each laboratory variable were generated as graphical diagnostic tools46 to assess the suitability of the different imputation algorithms.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The data analyzed in this study are not publicly available due to privacy and security concerns. The data may be shared with a third party upon execution of a data sharing agreement; reasonable requests should be addressed to V. Abedi or S.Y.
Code availability
Code, additional metadata, summary plots, and information can be found at https://github.com/TheDecodeLab/Imputation-LaboratoryValues-EHR_v2.0.
References
Abedi, V. et al. Novel screening tool for stroke using artificial neural network. Stroke 48, 1678–1681 (2017).
Abedi, V. et al. Using artificial intelligence for improving stroke diagnosis in emergency departments: a practical framework. Ther. Adv. Neurol. Disord. 13, 1756286420938962 (2020).
Chen, D. et al. Deep learning and alternative learning strategies for retrospective real-world clinical data. NPJ Digit. Med. 2, 43 (2019).
Noorbakhsh-Sabet, N., Zand, R., Zhang, Y. & Abedi, V. Artificial intelligence transforms the future of health care. Am. J. Med. 132, 795–801 (2019).
Razavian, N. et al. A validated, real-time prediction model for favorable outcomes in hospitalized COVID-19 patients. NPJ Digit. Med. 3, 130 (2020).
Konerman, M. A. et al. Machine learning models to predict disease progression among veterans with hepatitis C virus. PLoS ONE 14, e0208141 (2019).
Abedi, V. et al. Prediction of long-term stroke recurrence using machine learning models. J. Clin. Med. 10, https://doi.org/10.3390/jcm10061286 (2021).
Misra, D. et al. Early detection of septic shock onset using interpretable machine learners. J. Clin. Med. 10, https://doi.org/10.3390/jcm10020301 (2021).
Ayilara, O. F. et al. Impact of missing data on bias and precision when estimating change in patient-reported outcomes from a clinical registry. Health Qual. Life Outcomes 17, 106 (2019).
van Ginkel, J. R., Linting, M., Rippe, R. C. A. & van der Voort, A. Rebutting existing misconceptions about multiple imputation as a method for handling missing data. J. Pers. Assess. 102, 297–308 (2020).
Ford, B. in Incomplete Data in Sample Surveys, Theory and Bibliographies Vol. 2 (Part IV) (eds. W. Madow, H. Nisselson, & I. Olkin) 185–207 (Academic Press, 1983).
Doove, L., Van Buuren, S. & Dusseldorp, E. Recursive partitioning for missing data imputation in the presence of interaction effects. Comput Stat. Data Anal. 72, 12 (2014).
Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39, 38 (1977).
Arbuckle, J. L. in Advanced structural equation modeling: Issues and Techniques (eds. G. A. Marcoulides & R. E. Schumacker) (Lawrence Erlbaum Associates, 1996).
Rubin, D. B. Multiple Imputation for Nonresponse in Surveys. (Wiley, 1987).
Yoshikawa, A., Li, J. & Meltzer, H. Y. A functional HTR1A polymorphism, rs6295, predicts short-term response to lurasidone: confirmation with meta-analysis of other antipsychotic drugs. Pharmacogenomics J. 20, 260–270 (2020).
van Buuren, S., Boshuizen, H. C. & Knook, D. L. Multiple imputation of missing blood pressure covariates in survival analysis. Stat. Med. 18, 681–694 (1999).
van Buuren, S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat. Methods Med. Res. 16, 219–242 (2007).
Raghunathan, T. E., Lepkowski, J. M., Van Hoewyk, J. & Solenberger, P. A multivariate technique for multiply imputing missing values using a sequence of regression models. Surv. Methodol. 27, 11 (2001).
Schafer, J. L. Analysis of Incomplete Multivariate Data. (Chapman & Hall, 1997).
Frank Liu, G. & Zhan, X. Comparisons of methods for analysis of repeated binary responses with missing data. J. Biopharm. Stat. 21, 371–392 (2011).
Buuren, S. V. & Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. J. Stat. Software 45, https://doi.org/10.18637/jss.v045.i03 (2011).
Luo, Y., Szolovits, P., Dighe, A. S. & Baron, J. M. Using machine learning to predict laboratory test results. Am. J. Clin. Pathol. 145, 778–788 (2016).
Waljee, A. K. et al. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open 3, https://doi.org/10.1136/bmjopen-2013-002847 (2013).
Hu, Z. et al. Strategies for handling missing clinical data for automated surgical site infection detection from the electronic health record. J. Biomed. Inf. 68, 112–120 (2017).
Luo, Y., Szolovits, P., Dighe, A. S. & Baron, J. M. 3D-MICE: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data. J. Am. Med. Inf. Assoc. 25, 645–653 (2018).
Cook, N. R. Imputation strategies for blood pressure data nonignorably missing due to medication use. Clin. Trials 3, 411–420 (2006).
Yucel, R. M. Multiple imputation inference for multivariate multilevel continuous data with ignorable non-response. Philos. Trans. A Math. Phys. Eng. Sci. 366, 2389–2403 (2008).
Huque, M. H. et al. Multiple imputation methods for handling incomplete longitudinal and clustered data where the target analysis is a linear mixed effects model. Biom. J. 62, 444–466 (2020).
van Buuren, S. Flexible Imputation of Missing Data. 2nd edn, (Chapman & Hall/CRC, 2018).
Yuan, K.-H. & Savalei, V. Consistency, bias and efficiency of the normal-distribution-based MLE: The role of auxiliary variables. J. Multivar. Anal. 124, 353–370 (2014).
Lee, K. J. & Carlin, J. B. Multiple imputation for missing data: fully conditional specification versus multivariate normal imputation. Am. J. Epidemiol. 171, 624–632 (2010).
Chaudhary, D. et al. Obesity and mortality after the first ischemic stroke: Is obesity paradox real? PLoS ONE 16, e0246877 (2021).
Chaudhary, D. et al. Trends in ischemic stroke outcomes in a rural population in the United States. J. Neurol. Sci. 422, 117339 (2021).
Li, J. et al. Polygenic risk scores augment stroke subtyping. Neurol. Genet. 7, https://doi.org/10.1212/NXG.0000000000000560 (2021).
Chen, R., Stewart, W. F., Sun, J., Ng, K. & Yan, X. Recurrent neural networks for early detection of heart failure from longitudinal electronic health record data: implications for temporal modeling with respect to time before diagnosis, data density, data quantity, and data type. Circ. Cardiovasc. Qual. Outcomes 12, e005114 (2019).
Welch, C. A. et al. Evaluation of two-fold fully conditional specification multiple imputation for longitudinal electronic health record data. Stat. Med. 33, 3725–3737 (2014).
Nevalainen, J., Kenward, M. G. & Virtanen, S. M. Missing values in longitudinal dietary data: a multiple imputation approach based on a fully conditional specification. Stat. Med. 28, 3657–3669 (2009).
Abedi, V. et al. Increasing the density of laboratory measures for machine learning applications. J. Clin. Med. 10, https://doi.org/10.3390/jcm10010103 (2020).
Rubin, D. B. Inference with missing data. Biometrika 63, 11 (1976).
Van Buuren, S. & Groothuis-Oudshoorn, K. mice: multivariate imputation by chained equations in R. J. Stat. Softw. 45, 67 (2011).
Schafer, J. L. & Yucel, R. M. Computational strategies for multivariate linear mixed-effects models with missing values. J. Computational Graph. Stat. 11, 21 (2002).
Kasim, R. M. & Raudenbush, S. W. Application of Gibbs sampling to nested variance components models with heterogeneous within-group variance. J. Educ. Behav. Stat. 23, https://doi.org/10.2307/1165316 (1998).
Abedi, V. et al. Predicting short and long-term mortality after acute ischemic stroke using EHR. J. Neurol. Sci. 427, https://doi.org/10.1016/j.jns.2021.117560 (2021).
Grundy, S. M. et al. Diagnosis and management of the metabolic syndrome: an American Heart Association/National Heart, Lung, and Blood Institute Scientific Statement. Circulation 112, 2735–2752 (2005).
Bondarenko, I. & Raghunathan, T. Graphical and numerical diagnostic tools to assess suitability of multiple imputations and imputation models. Stat. Med. 35, 3007–3020 (2016).
Acknowledgements
This study was partially funded by the National Institute of Health (NIH) grant No. R56HL116832 and by support from Geisinger Health System. V. Abedi and R.Z. had financial research support from the Bucknell University Initiative Program, Roche–Genentech Biotechnology Company, and the Geisinger Health Plan Quality fund during the study period. The authors would like to thank Dr. Donna M. Wolk, Division Chief of the Diagnostic Medicine Institute at Geisinger Health System, for insightful discussion around the use of laboratory data variables.
Author information
Contributions
Conception and design of the study: J.L. and V. Abedi. Acquisition and analysis of data: J.L., D.C., S.M., and A.A. Developing the code: J.L. Interpretation of the findings: J.L., V. Abedi, X.S.Y., D.C., V. Avula, S.S., M.Y., H.H., S.F.W., A.A., and R.Z. Drafting a significant portion of the manuscript or figures: J.L., V. Abedi, and X.S.Y. Mr. Ardavan passed away (10/2020) during the last phases of the study and was only able to review an earlier version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, J., Yan, X.S., Chaudhary, D. et al. Imputation of missing values for electronic health record laboratory data. npj Digit. Med. 4, 147 (2021). https://doi.org/10.1038/s41746-021-00518-0