Prediction of chronic kidney disease progression using recurrent neural network and electronic health records

Chronic kidney disease (CKD) is a progressive loss in kidney function. Early detection of patients who will progress to late-stage CKD is of paramount importance for patient care. To address this, we develop a pipeline to process longitudinal electronic heath records (EHRs) and construct recurrent neural network (RNN) models to predict CKD progression from stages II/III to stages IV/V. The RNN model generates predictions based on time-series records of patients, including repeated lab tests and other clinical variables. Our investigation reveals that using a single variable, the recorded estimated glomerular filtration rate (eGFR) over time, the RNN model achieves an average area under the receiver operating characteristic curve (AUROC) of 0.957 for predicting future CKD progression. When additional clinical variables, such as demographics, vital information, lab test results, and health behaviors, are incorporated, the average AUROC increases to 0.967. In both scenarios, the standard deviation of the AUROC across cross-validation trials is less than 0.01, indicating a stable and high prediction accuracy. Our analysis results demonstrate the proposed RNN model outperforms existing standard approaches, including static and dynamic Cox proportional hazards models, random forest, and LightGBM. The utilization of the RNN model and the time-series data of previous eGFR measurements underscores its potential as a straightforward and effective tool for assessing the clinical risk of CKD patients concerning their disease progression.

of the study with minimal risk to participants.Longitudinal patient EHRs from 01/01/2006 to 07/01/2019 are retrieved from the University of Chicago Medical Center.We include all patients who have experienced at least one stable stage II/III period, defined as having two eGFR values ≥ 30 and ≤ 89 that are at least 90 days apart and that any eGFR value between them must also fall within this range.A total of 82,667 patients meets the criterion.For each patient, the extracted EHRs include four types of clinical variables including demographics, vital information, lab tests, and health behaviors.We choose clinical variables that are likely to be available and can potentially be linked to CKD progression.Table 1 lists all the variables in the retrieved data.These variables are associated with major CKD risk factors and co-morbidities that may convey additional risks.Transformations have been performed for some variables, which are detailed in Sect. 1 of the Supplementary Information.Predictive models are constructed using two sets of variables: all variables (see Table 1) and essential variables (indicated by italic font in Table 1).All demographics and vital information variables are included in the essential set.Eight lab test variables with the highest frequency are also included in the essential set.In the case of health behavior variables, only the primary indicators of drug, alcohol, and smoking usage are part of the essential set, whereas variables specifying the types of drugs and smoking are excluded.These essential variables are mainly associated with CKD risk factors.
In our study, all patients were at stages II/III at the beginning of their longitudinal EHRs.Some patients later progressed to stage IV/V and are considered cases.In contrast, the controls are those who never progressed to late stages by the end of the study time (07/01/2019).To qualify as a case, a patient must have experienced at least one stable stage IV/V period, as indicated by having two eGFR values ≤ 29, separated by at least 90 days, and with any eGFR value in between also being ≤ 29.For case patients, the start time of the first stable stage IV/V period is taken as the transition point to the late stage, denoted as T tra .Case patients are also required to have a minimum of four eGFR values, with the first two values not belonging to a stable stage IV/V period.Control patients, on the other hand, do not have any stable stage IV/V period and are also required to have at least four eGFR values.Based on these selection criteria, 1,968 case patients and 70,877 control patients have been identified.
The longitudinal EHRs of each patient are transformed into time series, structured as feature vectors at sequential time points.Figure 1 provides an illustration of a patient's time series.The time points are denoted as T 1 , . . ., T N , each representing a time interval of t w days (e.g., t w = 7 ).T n is the midpoint of the nth time interval, where n ∈ {1, . . ., N} .At each time point, a feature vector is constructed, comprising the values of all variables.T n is also included in the vector as an additional feature.If multiple readings of a lab test or vital information Table 1.EHR variables collected for CKD progression prediction.The italic variables form the essential set.variable are recorded within a time interval, their average is used to create the feature vector.For health behavior variables, the latest reading in a time interval is used in the feature vector.Please refer to Sect. 2 of the Supplementary Information for more details on constructing time series and imputing missing values.

Vital information
In our analysis, we focus on the prediction of CKD progression from early to late stages within a specific time period denoted as t pre (e.g., 90 or 365 days).For a case patient, only feature vectors representing patient information before the time T tra − t gap are included in the analysis, where t gap is the length of a gap period (e.g., 7 days).See Fig. 2 for an illustration of t gap , which is inserted between the prediction window (indicated by the blue block in Fig. 2b) and the time points used for predictive modeling (indicated by green blocks in Fig. 2b).The gap period is introduced to prevent potential cases where CKD progression is determined based on lab results measured at the last time point (T M in Fig. 2b).In addition, we require that valid case patients must have the transition point falling within the prediction window and have at least five time points in their time series.If a patient's record contains more than 100 time points, only the latest 100 time points are used for analysis.For control patient time series, we remove the latest feature vectors spanning a period of t pre to ensure that the remaining feature vectors correspond to a disease process that will not progress during t pre .To create a dataset for predictive analysis, we match each case patient with four control patients based on patient race, sex, age, and time series length whenever sufficient qualified control patients are available.Please refer to Sect. 3 in the Supplementary Information for detailed information about matching case and control patients.Supplementary Table 1 summarizes the matched data generated using all variables, with t w = 7 days, t gap = 7 days, and t pre = 365 days.Supplementary Fig. 1a shows the distribution of the time series length of case patients, with a peak at 100 due to the truncation of long patient time series to 100 time points.The control patient sequence length follows the same distribution (not shown) because each case patient is matched with four control patients of the same sequence length.Supplementary Fig. 1b shows the distribution of the time duration between the transition point and the last time point of feature vectors used in the analysis for case patients.The distribution shows a peak at the lower end but also has a long thick tail extending beyond 100 days.
Stage III CKD can be further divided into two sub-stages: stage IIIa (45 ≤ eGFR ≤ 59) and stage IIIb (30 ≤ eGFR ≤ 44).A more challenging task is predicting the progression of patients from Stage II or Stage IIIa to Stages IV and V.This prediction would enable the earlier detection of high-risk patients before they reach stage IIIb.To apply our modeling approach to this task, we have generated data while excluding patient information in stage IIIb.For more details about the data generation process, please refer to Sect. 4 of the Supplementary Information.
As a result of the data preprocessing, we generate multiple datasets that include matched case and control patients.These datasets may or may not include patient stage IIIb information and may contain all variables or only essential variables.We also create datasets with multiple prediction window sizes.All these datasets are generated using t w = t gap = 7 days.Supplementary Table 2 provides information on the number of case and control patients in the matched data generated under different settings.It is worth noting that excluding stage IIIb data significantly reduces the number of valid case and control patients.To prepare the data for predictive modeling, we assign a binary classification label to each patient.Patients whose CKD condition progresses to stages IV/V within the prediction window (i.e., case patients) receive a label of 1, while patients whose CKD condition does not progress to stages IV/V within the prediction window (i.e., control patients) receive a label of 0. for training the prediction model, one for validation to select the dropout rate and implement early stopping to avoid overfitting, and one for testing.The testing data fold is used to assess prediction performance using the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), and the Matthew's correlation coefficient (MCC) as performance metrics.AUROC evaluates the accuracy of ranking case patients above control patients based on prediction results.AUPRC has been demonstrated to be more informative when evaluating prediction performance on imbalanced data 34 .MCC provides a balanced assessment of type I and type II errors following classification through thresholding the probabilistic prediction outcomes using a cutoff of 0.5.A total of 100 random cross-validation trials are performed.See Sect. 5 in the Supplementary Information for a detailed explanation of LSTM RNN model implementation and training.

Construction and evaluation of LSTM RNNs and baseline prediction models
For comparison, we also implement the static and dynamic Cox proportional hazards models, which are widely used as a standard in practice 5,18,19 .In both models, the time-to-CKD progression is used as the response variable, and a Cox model is applied to regress the time-to-CKD progression on a set of covariates.Regression coefficients are estimated via the maximum likelihood estimation.These coefficients are used to compute the estimated hazard function for each patient by plugging in the patient's covariate values.The estimated hazard function is then used to compute the survival function at any future time T for the patient, which is the prob- ability of CKD progression at time T .By applying a probability threshold, we can predict the binary status of CKD progression at time T for the patient.In the static Cox model, the covariate values at the first time point are used to fit the model, following the approach used in the original publication 5 .Conversely, the dynamic Cox model incorporates both time-dependent and time-independent covariates 19 .Demographics variables are taken as time-independent covariates, while all other variables are taken as time-dependent covariates.For timedependent covariates, their longitudinal time-course measurements are used to fit the dynamic Cox model, a crucial distinction from the static Cox model.Estimation of the dynamic model is conducted through maximum likelihood, and the input data must be organized in a counting-process style 35 .For predicting CKD progression under the dynamic model, the process remains similar to the static Cox model, except that the dynamic model incorporates values of time-dependent covariates at sequential time points.Both Cox models are trained and evaluated using the same cross-validation data partitions as those used for the analysis of LSTM RNNs.However, the training and validation sets are combined for fitting the Cox models since there is no requirement for hyperparameter tuning or early stopping of model training.The prediction performance of Cox models is evaluated on the testing set using the C-statistic (equivalent to AUROC), AUPRC, and MCC.A 0.5 cutoff is applied on the predicted probability of CKD progression to calculate the MCC.Our implementation of the Cox proportional hazards models follows the methodologies detailed in the original publications 5,18,19 , which provide more information about the methodology.
We also include two conventional machine learning methods, namely random forest 36 and LightGBM 37 , as baseline models for comparison.Random forest constructs multiple decision trees on random subsets of data and employs the average of their predictions as the final outcome 36 .LightGBM is a fast gradient boosting decision tree algorithm that uses techniques of gradient-based one-side sampling and exclusive feature bundling to expedite model training 37 .Both of these models are trained using patient data at the latest time point to predict the progression risk.We use the same data partitions of training, validation, and testing sets as those used in the analysis of LSTM RNNs to perform cross-validation evaluations for random forest and LightGBM.For random forest, the training set and the validation set are combined for model training, as no hyperparameter tuning or early stopping of model training is necessary.AUROC, AUPRC, and MCC measurements are calculated to evaluate the prediction performance.A cutoff of 0.5 is applied on the probabilistic prediction outcome to compute MCC.See Sect. 5 in the Supplementary Information for more details of implementing and training random forest and LightGBM models.

Evaluation and comparison of prediction performance
We conduct an evaluation and comparison of prediction performance between LSTM RNN and baseline methods across two distinct prediction windows: 365 days and 90 days.Table 2 shows the mean and standard deviation of prediction performance derived from cross-validation trials.In all different settings of prediction window and variable usage, LSTM RNNs consistently achieve an average AUROC exceeding 0.95, accompanied by a standard deviation of less than 0.01, which indicates a high and robust prediction performance.Moreover, the obtained average AUPRC always exceeds 0.83, while the average MCC always exceeds 0.72, further affirming the accuracy of classification outcomes.Importantly, in all analysis schemes, our proposed LSTM RNN models consistently outperform all baseline methods evaluated using all three performance metrics.The only exception is that the dynamic Cox proportional hazards model achieves the same average prediction performance as the LSTM RNN model does for predicting disease progression within 90 days when evaluated using AUPRC.Comparing the use of all variables against only essential variables, employing all variables yields a better prediction performance for LSTM RNN and the two Cox proportional hazards models, but not for random forest and LightGBM.Interestingly, the performance enhancement achieved by LSTM RNNs over the Cox models is more pronounced in the context of the 365-day prediction window compared to the 90-day prediction window.Between the two Cox proportional hazards models, the dynamic model exhibits superior prediction performance relative to the static model.This observation aligns with prior findings suggesting that longitudinal EHRs furnish more predictive information than cross-sectional data for CKD progression prediction 19 .Furthermore, the prediction performances of random forest and LightGBM surpass those of both static and dynamic Cox proportional hazards models when evaluated using AUROC and MCC.
For the more challenging task of predicting CKD progression with the exclusion of stage IIIb data, we explore prediction windows spanning 365 days (1 year), 1095 days (3 years), and 1825 days (5 years).Table 3 shows the mean and standard deviation of prediction performance obtained through cross-validation.Our proposed LSTM RNN consistently achieves an average AUROC of around 0.9, an average MCC surpassing 0.53, and an average AUPRC over 0.68, across different prediction window lengths and variable utilization scenarios.Its performance remains comparable whether employing all variables or essential ones exclusively.Importantly, LSTM RNNs consistently outperform all baseline methods across different configurations of prediction windows and variable usage, as evidenced by its higher average AUROC, MCC, and AUPRC measurements.Compared with the previous prediction task that incorporates stage IIIb data, the performance of all models drops, likely due to the more difficult prediction task and the constraint posed by the smaller dataset available for modeling analysis (Supplementary Table 2).Between the two Cox models, the dynamic model once again generally outperforms the static model.Comparison between random forest and LightGBM reveals that random forest achieves higher average AUROC and AUPRC, while LightGBM achieves a better average MCC.This discrepancy potentially indicates their different capabilities in rank-ordering case and control patients, as well as in making classification predictions.
The results presented in Tables 2 and 3 show that our LSTM RNN model performs better when the prediction window becomes larger.One possible explanation for this trend is that with a larger prediction window, a greater number of valid case patients can be identified and included in the analysis.Consequently, more control patients can be matched with the case patients and included in the analysis.Supplementary Table 2 illustrates that an increased amount of data is available for predictive modeling as the prediction window size increases.As a result, the predictive performance of LSTM RNNs improves.www.nature.com/scientificreports/

Variable selection and importance evaluation
The results in Tables 2 and 3 show that the prediction performance of LSTM RNN remains similar or slightly improved when using all variables compared to using only the essential variables.This leads to two subsequent questions for further investigation.(1) What prediction performance can be achieved using even fewer variables?
(2) What are the most important features?To answer these questions, we focus on the prediction task with a 365day prediction window and stage IIIb data included.We apply a sequential forward variable selection procedure to identify the most predictive variables.This procedure begins by identifying the most predictive single variable, followed by the incremental addition of one variable at a time to the model.At each step, an exhaustive search is applied to identify the variable that, when added, yields the highest prediction performance.Time is always included as an input feature since our prediction is based on time series data.For a detailed description of the variable selection procedure, please refer to Sect.6 of the Supplementary Information.Figure 4 shows the average AUROC of models built using different numbers of variables and evaluated via cross-validation.Notably, eGFR is consistently identified as the most predictive single variable in 100% of the cross-validation trials (see Supplementary Table 3).It achieves an average AUROC of 0.957 with a standard deviation of 0.009 in cross-validation, indicating a high and stable accuracy when using this single variable.The prediction performance increases slightly as additional variables are added, eventually reaching a level of accuracy comparable to that achieved by the model employing all variables (as indicated by blue line in Fig. 4).To assess the difference in prediction performance between selected variables and all variables, we employ a two-sided pairwise t-test.The obtained p-value is adjusted for multiple tests using the Benjamini-Hochberg method 38 .Our analysis reveals that the performance difference is statistically significant (adjusted p-value ≤ 0.05) when utiliz- ing no more than 16 variables (as illustrated in Fig. 4).Supplementary Table 3 shows the frequencies of variable selection across cross-validation trials during the sequential forward search process.When selecting two variables for prediction, smoking status emerges as the second most frequently selected variable, with a frequency of 26.67%, which is significantly lower than the 100% selection frequency of eGFR.In summary, longitudinal eGFR is the best predictor enabling our models to achieve a high level of accuracy.The inclusion of additional variables yields only a marginal improvement in predictive accuracy.

Race-specific prediction performance
We divide the patient cohort into three racial groups, including African American, white, and others, to investigate the variation in prediction performance among these racial groups.We calculate the prediction performance (AUROC) for each racial group in all analysis schemes.Table 4 presents the mean and standard deviation of AUROC values across cross-validation trials for each racial group.The maximum performance difference between racial groups is also calculated and shown in Table 4.In the analyses that include stage IIIb data, the difference in average AUROC between racial groups is ≤ 0.008, indicating only small variations among races.In the analyses without stage IIIb data, the difference in average AUROC between racial groups is ≤ 0.064.The bigger difference in prediction performance is probably due to the smaller sample size when stage IIIb data are excluded, which may lead to increased variation between racial groups due to random effects in small sample size cases.

Conclusion and discussion
We have introduced a novel approach for predicting the progression of CKD from stages II/III to stages IV/V, based on longitudinal patient EHRs.Our model can identify high-risk patients using commonly available clinical variables.This method combines an EHR preprocessing pipeline with an AI predictive model, collectively yielding a commendable accuracy in forecasting CKD progression.The EHR preprocessing steps integrate various clinical variables and convert them into time series data suitable for modeling using RNNs.Our analysis reveals that using the time series of a single variable, eGFR, the RNN model achieves an average AUROC of 0.957 for predicting disease progression within a year.With the inclusion of additional clinical variables for  Compared with existing methods for predicting CKD progression, our approach offers advantages in three aspects.Firstly, we utilize longitudinal patient EHRs to predict CKD progression, in contrast to the prevalent use of cross-sectional observations in most previous studies.Longitudinal EHRs provide more abundant information regarding disease progression than snapshot data.To support effective disease modeling based on longitudinal EHRs, we have developed a novel data processing pipeline to process and integrate various EHR variables, converting them into well-structured time series data suitable for subsequent modeling analysis.It is worth noting that this EHR processing pipeline is not specific to the task of CKD progression prediction.It is a general framework applicable to processing any longitudinal EHR data of demographics, vital information, lab tests, and health behaviors, for various analysis purposes.Secondly, to the best of our knowledge, our work is the first attempt of using LSTM RNNs for predicting CKD progression.Existing related works usually use Cox proportional hazards models and conventional machine learning methods for predictions.LSTM RNNs are powerful tools for detecting temporal relationships and making predictions about future events.Our analysis results, as presented in Tables 2 and 3, demonstrate the superior performance of LSTM RNNs over other baseline methods across various CKD progression prediction scenarios.Thirdly, our variable selection analysis reveals that LSTM RNNs constructed on the longitudinal records of a single clinical variable, eGFR, can achieve an average AUROC of 0.957.In contrast to previous studies that rely on multiple clinical variables for predicting CKD progression 5,[18][19][20]23,26 , the simplicity of our approach, demanding information of only one clinical variable, renders its adoption for practical purposes both achievable and straightforward.
Our work achieves a high accuracy for predicting CKD progression from stages II/III to stages IV/V, which is of paramount importance to patient care.Our prediction target is CKD progression from early stages to late stages, which is underexplored in previous works.The majority of existing works within this domain have primarily concentrated on predicting progression to end-stage kidney disease or kidney failure 5,[18][19][20][21][22][23][24][25] .Our study holds the potential to significantly enhance early detection of high-risk patients, facilitating timely medical interventions to potentially stop disease advancement in its initial phases.All patients with CKD should receive guideline recommended care such as urine testing for proteinuria, blood pressure control, angiotensin blockade, and testing for glycemic control.Patients with high risk of progression to late stages can be started earlier on newer medications proven to reduce CKD progression, like SGLT-2 inhibitors, finerenone, and GLP-1 agonists.Despite their proven efficacy, these medications are often underutilized due to elevated costs or clinical inertia [39][40][41] .An accurate prediction of CKD progression to late stages not only refines risk stratification, but also provides evidences for clinicians to initiate high-risk patients on these medications earlier, given the anticipated greater benefit 42 .Such predictions may motivate patients at a high risk of CKD progression to persist with these new medications, despite potential regimen complexities, side effects, and higher co-pays 43 .This can also encourage patients to make lifestyle changes essential for reducing the risk of CKD progression 44 .Lastly, given a shortage of nephrology appointments, primary care clinicians can leverage the predicted risk to decide whether to refer patients to nephrology early or delay referral 45,46 .With an accurate prediction of CKD progression, physicians can avoid making ad hoc clinical decisions that can result in either insufficient care for those who will ultimately progress to more advanced CKD or unnecessary treatments and costs for patients who will not progress.Accurate prediction of disease progression risk could significantly facilitate individualized decision making, enabling early and appropriate patient care, and substantially reduce healthcare costs.
A major advantage of the EHR dataset used in our analysis is its enriched population of minority patients, especially African Americans.Some previous prediction analyses had a low representation of African Americans despite their higher risk of CKD progression 5,18,20 .This distinctive characteristic introduces a novel perspective on the potential application of the proposed model within these specific patient populations.However, the limitation inherent to utilizing only one dataset as an illustration lies in the absence of validation across additional patient cohorts.To address this limitation, our future endeavors will involve collecting more EHR data of CKD patients, including data of other racial groups, to make our models more representative and evaluate the generalizability of our prediction models to other patient cohorts.
Additional future work has been considered and planned to follow the current study.Firstly, other types of EHRs can be added for predictive modeling, such as patient diagnostic ICD (International Classification of Diseases) codes, procedures, medications, and clinical notes, which may provide additional power for modeling disease progression.The utilization of embeddings for ICD codes, procedure codes, medications, and clinical notes, as explored by prior research 30,32,33,[47][48][49][50][51] , can be adapted to enhance our neural network models.Secondly, the proposed EHR processing pipeline and prediction model need to be integrated into the EHR system at medical centers for an ideal application environment.This integration would facilitate real-time predictions, thereby providing valuable decision support for patient care.Thirdly, to validate the clinical efficacy and prediction capabilities of our model, a prospective study, such as a clinical trial, is imperative.Such an endeavor would substantiate the practical utility of our model and its potential benefits for patient care.Fourthly, the proposed data processing and modeling approach can also be extended to other prediction targets, such as kidney failure and patient mortality.Relevant data need to be collected, and the data processing pipeline and prediction models need to be implemented for predicting these prognosis and progression markers.Lastly, leveraging the predicted CKD progression risks and patient medical records, we can develop methods to identify patients who have received insufficient care or unnecessary treatments.After identifying these patients, the influence of risk prediction performance on their population size can also be investigated.

Figure 1 .
Figure 1.Illustration of assembling a time series of feature vectors based on EHRs.

Figure 3 Figure 2 .
Figure 3 shows the architecture of two LSTM RNNs used for modeling using either all variables or essential variables only.Both models consist of a single LSTM layer and multiple dense layers.The model that utilizes only essential variables has fewer dense layers and nodes compared with the model using all variables, due to the reduced number of input features for essential variables.All dense layers in both models use ReLU activation functions and dropout mechanisms.The dropout rate remains the same across all dense layers within a model, which is the only hyper-parameter subject to tuning during model training.The dropout rate is selected from [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7].The output layer comprises two nodes corresponding to the two classes in the data, i.e., case and control patients.It employs the softmax activation function, ensuring that the output values of the two nodes sum to one and form probabilistic prediction outcomes.The weighted cross-entropy loss function is used for model training.In this context, the weight for control patients is set to 1, while the weight for case patients is determined by the sample size ratio between control patients and case patients in the training set.We conduct tenfold cross-validation for training and evaluating the prediction models.Eight data folds are used

Figure 3 .
Figure 3. Architectures of LSTM RNN models used in the analysis.(a) Model architecture used for analysis based on all variables.(b) Model architecture used for analysis based on essential variables.

Figure 4 .
Figure 4. Prediction performance of using different numbers of variables identified in the sequential forward search.Dots indicate the average AUROC obtained using selected variables across cross-validation trials.The blue line indicates the average AUROC obtained using all variables.Green indicates that the performance difference between selected variables and all variables is statistically significant (adjusted p-value ≤ 0.05), while red indicates the performance difference is not statistically significant (adjusted p-value > 0.05).

Table 2 .
Comparison on prediction performance of LSTM RNN and baseline methods.Numbers before and in the parentheses are mean and standard deviation, respectively.

Table 3 .
Comparison on prediction performance of LSTM RNN and baseline methods when stage IIIb data are excluded.Numbers before and in the parentheses are mean and standard deviation, respectively.

Table 4 .
Race-specific prediction performance and its difference among racial groups.For prediction performance, numbers before parentheses are average AUROC and numbers in parentheses are standard deviations.The maximum prediction performance difference is the largest difference in average AUROC among three pairwise comparisons, i.e., African American vs. white, African American vs. others, and white vs. others.

data excluded Prediction window (in days) Prediction performance (AUROC) Maximum performance difference between groups African American White Others
prediction, the average AUROC increases to 0.967.In both scenarios, the standard deviation of AUROC across cross-validation trials remains below 0.01, indicating a consistently high prediction performance.Our models outperform existing methods, including static and dynamic Cox proportional hazards models, random forest, and lightGBM, assessed by the AUROC, AUPRC, and MCC performance metrics.When stage IIIb data are excluded to simulate the challenge of early detection of high-risk patients, our proposed models maintain an average AUROC hovering around 0.9.