Prediction of the Mortality Risk in Peritoneal Dialysis Patients using Machine Learning Models: A Nation-wide Prospective Cohort in Korea

Herein, we aim to assess mortality risk prediction in peritoneal dialysis patients using machine-learning algorithms to support accurate prognosis. A total of 1,730 peritoneal dialysis patients in the CRC for ESRD prospective cohort from 2008 to 2014 were enrolled in this study. Classification algorithms, including neural networks, were used to predict N-year mortality. The survival hazard ratio was estimated by machine-learning algorithms based on survival statistics and was compared with conventional algorithms. A survival-tree algorithm yielded the most accurate prediction model and outperformed a conventional method, Cox regression (concordance index 0.769 vs 0.745). Among various survival decision-tree models, the modified Charlson Comorbidity Index (mCCI) was selected as the best predictor of mortality. For peritoneal dialysis patients with a high mCCI (>4) who were aged ≥70.5 years, the predicted survival hazard ratio was 4.61 compared with the overall study population. Among the various algorithms using longitudinal data, logistic regression reached an AUC of 0.804. In addition, the deep neural network significantly improved performance to 0.841. In the final machine learning-based model that we propose, mCCI and age were interrelated as notable risk factors for mortality in Korean peritoneal dialysis patients.

The Table of Contents for Supplemental Material

Treatment of missing values
  Table S1. The number of missed values in each observation
Modelling process with data splitting
  Table S2. Longitudinal measurement for time-sequential information in the study cohort using deep neural network algorithm
Approach to classification problems using individual learners and ensemble variants
  Table S3. Performance of the 5-year prediction model by conventional decision tree with imputation, and without weighting methods in PD patients
  Table S4. Performance of the 5-year prediction model by conventional decision tree with weighting methods in PD patients
  Weighting method for classification
  Logistic regression
  Decision-tree
  Neural network
Approach to survival problems using individual learners and ensemble variants
  Table S5. Performance of the prediction models for mortality by survival statistics with imputation methods in PD patients
Deep learning algorithm process including recurrent neural network with autoencoder imputation
  Figure S1. The longitudinal data management for the recurrent neural network (RNN) and long short-term memory network (LSTM)
  Figure S2. The missing value learning (a) utilizing the autoencoder, (b) the inference process, (c) form combined with RNN

Treatment of missing values
Our data were collected over 7 years, from 2008 to 2014, so some missing values were inevitable. Of the total 1,730 observations, 38.5% contained at least one missing value. Table S1 shows the number of observations for each number of missing values. To manage these missing values, we used the following two methods: complete case analysis, using only the existing attributes; and imputation with, in our case, Multivariate Imputation by Chained Equations (MICE).

Table S1. The number of missed values in each observation

Missing values   Frequency   Cumulative frequency
0                1063        1063
1                226         1289
2                251         1540
3                81          1621
4                43          1664
5                13          1677
6                9           1686
7                4           1690
8                1           1691
9                1           1692
10               1           1693
11               22          1715
12               14          1729

Modelling process with data splitting
For the experiments, we split our data into training (70%) and test (30%) sets. Because the quantity of data was limited, we performed 5-fold cross-validation on the training set to prevent our model from overfitting. After the cross-validation, we evaluated our model using the test set. We repeated this procedure with five different seeds and measured model performance using the concordance index as the main criterion. We also applied deep learning using longitudinal data.
The repeatedly measured data include 24-hour urine volume, RAAS blockade use, and dialysis efficiency (weekly Kt/V). Under the study protocol of this cohort, Kt/V was measured 3 months after study enrollment; the detailed protocol is presented in Table S2.
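The splitting scheme described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the seed values 0 to 4, and the way folds are formed are assumptions, since the text does not specify them.

```python
import random


def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and test lists."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_indices, k=5):
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


n_patients = 1730  # cohort size reported in the text
for seed in range(5):  # five seeds; the actual seed values are an assumption
    train, test = train_test_split_indices(n_patients, 0.3, seed)
    for fit, validation in k_fold_indices(train, k=5):
        pass  # fit a model on `fit` and check it on `validation` here
    # evaluate the selected model once on `test` here
```

Keeping the test indices out of every fold means the 30% test set is touched only once per seed, after model selection is finished.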

Approach to classification problems using individual learners and ensemble variants
We employed widely used individual learning models (classification and regression trees, and logistic regression) 2,3 and ensemble learning models (bagging and random forest) 1,4 . Our methods were detailed in a recently published study 5 . To predict survival at N years after PD initiation, we conducted various experiments using different algorithms. In this section, we introduce several classification algorithms and their ensemble variants. Besides the classification models, we also present machine-learning algorithms based on survival statistics. The performance of the machine-learning algorithms for classification is compared in Table 2, Table S3, and Table S4, according to test performance measured by the area under the curve (AUC) with different settings.
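For reference, the AUC used to compare the classifiers can be computed directly from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below is a plain pairwise implementation under that definition; the function name and the toy inputs are illustrative and not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: label 1 = died within the horizon, scores = predicted risk.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```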

Weighting method for classification
To cast the survival analysis problem as a classification problem, it was necessary to fix a prediction period. We set the period to 5 years.
Thus, we redefined our problem as "whether a patient survives 5 years after PD initiation". This definition produced right-censored data 5 , which were handled either by dropping them or by applying the weighting method proposed by Zupan et al. 6 . The weighting method created two copies, labeled 0 and 1, of each right-censored observation and assigned each copy a probability as a weight, based on the survival function.
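One way to realize this weighting is sketched below. It assumes that the survival function is a Kaplan-Meier estimate and that label 1 means "died within the horizon"; neither detail is stated in the text, and the function names are hypothetical. A patient censored at time t before the horizon T receives a label-0 copy with weight S(T)/S(t) and a label-1 copy with the complementary weight.

```python
def kaplan_meier(times, events):
    """Return S(t), the Kaplan-Meier estimate of surviving beyond time t."""
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    steps = []  # (death time, survival probability just after it)
    survival = 1.0
    for t in death_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        steps.append((t, survival))

    def S(t):
        value = 1.0
        for step_time, step_survival in steps:
            if step_time > t:
                break
            value = step_survival
        return value

    return S


def weighted_labels(times, events, horizon):
    """Return (patient index, label, weight) rows; label 1 = died by horizon."""
    S = kaplan_meier(times, events)
    rows = []
    for i, (t, e) in enumerate(zip(times, events)):
        if e == 1 and t <= horizon:
            rows.append((i, 1, 1.0))  # death observed within the horizon
        elif t >= horizon:
            rows.append((i, 0, 1.0))  # known to be alive at the horizon
        else:
            # Censored before the horizon: split into two weighted copies.
            p_survive = S(horizon) / S(t) if S(t) > 0 else 0.0
            rows.append((i, 0, p_survive))
            rows.append((i, 1, 1.0 - p_survive))
    return rows


# Toy follow-up in years (event 1 = death, 0 = censored), horizon = 5 years.
rows = weighted_labels([1, 2, 3, 6, 7], [1, 0, 1, 0, 1], horizon=5)
```

The resulting weights can be passed to any classifier that accepts per-sample weights, so no censored patient has to be dropped.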

Logistic regression
One of the most common machine-learning algorithms is logistic regression. It is a generalized linear model (GLM) used for classification problems. Whereas a linear regression model assumes that the dependent variable follows a normal distribution, logistic regression assumes that it follows a Bernoulli distribution. Hence, logistic regression converts a linear combination of independent variables into the probability of a binary-valued outcome using the logistic (inverse-logit) function, formulated as $\pi(X) = 1 / (1 + \exp(-w^{\top} X))$, where $w$ is the coefficient vector and $\pi(X)$ indicates the probability of the dependent variable, $y$, being in class 1 given the independent variables, or simply $p(y=1 \mid X)$ 7 . A logistic regression model is trained to minimize a predefined cost function which, in our case, was defined as $L(\hat{y}, y) = \sum_{i} \left( -y_i \log \hat{y}_i - (1 - y_i) \log(1 - \hat{y}_i) \right)$, where $\hat{y}$ is equivalent to $p(y=1 \mid X)$. Further, to avert overfitting, which prevents a model from generalizing to unseen data, we also applied Lasso and Ridge regularization, which penalize the cost function using $\|w\|_1$ and $\|w\|_2^2$, respectively.
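The cost function and its two penalties can be written out as a short sketch. The intercept `b`, the penalty strengths `l1` and `l2`, and the toy data are assumptions added for illustration; the text does not report the regularization strengths that were used.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, b, x):
    """p(y=1 | x) for coefficient vector w and intercept b."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def penalized_cost(w, b, X, y, l1=0.0, l2=0.0):
    """Cross-entropy cost plus Lasso (l1) and Ridge (l2) penalties on w."""
    eps = 1e-12  # keeps log() finite when a probability saturates
    loss = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(w, b, xi), eps), 1.0 - eps)
        loss += -yi * math.log(p) - (1 - yi) * math.log(1.0 - p)
    lasso = l1 * sum(abs(wj) for wj in w)
    ridge = l2 * sum(wj * wj for wj in w)
    return loss + lasso + ridge


# Toy data: two standardized features per patient, label 1 = died.
X = [[0.5, -1.0], [-0.3, 0.8]]
y = [1, 0]
plain = penalized_cost([1.0, -2.0], 0.0, X, y)
penalized = penalized_cost([1.0, -2.0], 0.0, X, y, l1=0.1, l2=0.01)
```

Setting `l2=0` gives the Lasso cost and setting `l1=0` gives the Ridge cost; in both cases large coefficients are penalized, which is what limits overfitting.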

Decision-tree
The decision-tree algorithm, another commonly used classification algorithm, is a simple and intuitive yet robust machine-learning algorithm.
It is easier to implement a decision-tree algorithm, and to interpret its results, than many other machine-learning methods. Further, it is robust because of its non-linear nature 4 . We employed the classification and regression tree (CART) algorithm, which is a specific type of decision-tree algorithm. CART forms a binary tree and gradually expands its leaf nodes to maximize a purity measurement or, equivalently, minimize an impurity measurement. Among three commonly used impurity measurements, we chose the Gini index, which measures the impurity of internal nodes. The algorithm expands until it meets stopping rules specified as hyperparameters 4 .
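The split search that CART performs at each node can be illustrated for a single numeric attribute and a binary outcome. This is a simplified sketch, not the implementation used in the study: real CART repeats the search over every attribute and recurses on the children. The function names and the toy ages are illustrative.

```python
def gini(labels):
    """Gini impurity of a node holding binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(values, labels):
    """Return (threshold, weighted child impurity) of the best binary split.

    Candidate thresholds are midpoints between consecutive distinct values.
    The threshold is None when no split lowers the parent's impurity.
    """
    n = len(labels)
    best = (None, gini(labels))
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        impurity = len(left) / n * gini(left) + len(right) / n * gini(right)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Toy example: age in years against a 0/1 death label.
ages = [50, 60, 65, 72, 80, 85]
died = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(ages, died)
```

Using midpoints between observed values is one common convention, and it is why tree thresholds often fall on half-units.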
To enhance the performance of individual algorithms, ensemble methods are often employed. These methods are machine-learning algorithms that combine multiple base learners with the aim of improving the predictive performance of the given base model. In this paper, we used bootstrap aggregating, also known as bagging 2 , and random forest 3 as ensemble methods. Bagging consists of multiple base models trained independently on bootstrapped samples, each of the same size as the training dataset. At inference time, it aggregates the output predictions by averaging for regression and by voting for classification. The random forest algorithm adds more randomness to bagging. It not only bootstraps samples but also randomly chooses a fixed number of attributes among all those available and finds the best split using only them 1 . In this way, it improves the accuracy of the output predictions. We chose CART as the base learner for both bagging and random forest 4 .
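Bagging with majority voting can be sketched as below. To keep the example short, the base learner is a one-split tree (a stump) on a single attribute rather than a full CART, and the number of models and the seed are arbitrary choices. A random forest would additionally draw a random subset of attributes at each split, which cannot be shown with one attribute.

```python
import random


def fit_stump(values, labels):
    """Fit a one-split tree on one attribute and return a predict function."""
    best = None
    for threshold in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        left_label = round(sum(left) / len(left))
        right_label = round(sum(right) / len(right)) if right else left_label
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda v: left_label if v <= threshold else right_label


def bagging_fit(values, labels, n_models=25, seed=0):
    """Train each stump on a bootstrap sample of the same size as the data."""
    rng = random.Random(seed)
    n = len(values)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([values[i] for i in idx],
                                [labels[i] for i in idx]))
    return models


def bagging_predict(models, value):
    """Aggregate the base predictions by majority vote."""
    votes = sum(model(value) for model in models)
    return 1 if 2 * votes > len(models) else 0


ages = [50, 55, 60, 65, 72, 78, 80, 85]
died = [0, 0, 0, 0, 1, 1, 1, 1]
ensemble = bagging_fit(ages, died)
```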

Neural network
A neural network is a network of neurons that aims to recognize underlying relationships in data through a process that imitates the way a human brain operates. It consists of input, hidden, and output layers. The input layer corresponds to the variables of the input data. After the network takes the input data through the input layer, the data are passed into a hidden layer, which linearly combines them and transforms the result using a nonlinear function, also known as an activation function. The output of the hidden layer is then passed into either the next hidden layer or the output layer. A neural network can approximate a function for both classification and regression problems. In our case, it was designed to solve the binary classification problem.

In general, a neural network can be formulated mathematically as a composition of layer-wise linear combinations and nonlinear activation functions. The network is trained by minimizing a loss function that serves as a proxy for its performance, which, in our case, is classification accuracy.
Although there are many options for a loss function, cross-entropy loss is generally used in a classification task; it has the same form as the cost function given for logistic regression. Because the cross-entropy loss of a neural network is nonconvex in the network parameters, a global minimizer cannot be computed using analytical optimization methods. Instead, numerical optimization methods, such as gradient descent or its variants, are used to approximate a minimizer.
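A minimal sketch of such a network and its training loop is given below. It assumes a single hidden layer with tanh activation, a sigmoid output, and full-batch gradient descent; the layer size, learning rate, number of steps, and toy data are illustrative choices, as the text does not report the architecture that was used.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def init_params(n_inputs, n_hidden, seed=0):
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(params, x):
    """Hidden layer: linear combination + tanh. Output layer: sigmoid."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(params["W1"], params["b1"])]
    output = sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]
    return hidden, sigmoid(output)


def cross_entropy(params, X, y):
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(forward(params, xi)[1], eps), 1.0 - eps)
        total += -yi * math.log(p) - (1 - yi) * math.log(1.0 - p)
    return total / len(X)


def gradient_step(params, X, y, lr=0.5):
    """One full-batch gradient-descent update via hand-written backprop."""
    n = len(X)
    g_W1 = [[0.0] * len(X[0]) for _ in params["W1"]]
    g_b1 = [0.0] * len(params["b1"])
    g_w2 = [0.0] * len(params["w2"])
    g_b2 = 0.0
    for xi, yi in zip(X, y):
        hidden, p = forward(params, xi)
        delta_out = (p - yi) / n  # d(loss) / d(output pre-activation)
        g_b2 += delta_out
        for j, h in enumerate(hidden):
            g_w2[j] += delta_out * h
            delta_hidden = delta_out * params["w2"][j] * (1.0 - h * h)
            g_b1[j] += delta_hidden
            for k, xk in enumerate(xi):
                g_W1[j][k] += delta_hidden * xk
    for j in range(len(params["w2"])):
        params["w2"][j] -= lr * g_w2[j]
        params["b1"][j] -= lr * g_b1[j]
        for k in range(len(params["W1"][j])):
            params["W1"][j][k] -= lr * g_W1[j][k]
    params["b2"] -= lr * g_b2


# Toy binary problem (logical OR), used only to show that the loss falls.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 1, 1, 1]
params = init_params(n_inputs=2, n_hidden=3)
loss_before = cross_entropy(params, X, y)
for _ in range(500):
    gradient_step(params, X, y)
loss_after = cross_entropy(params, X, y)
```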

Approach to survival problems using individual learners and ensemble variants
As is characteristic of observational cohort datasets, much of the data is censored. Censored data are often omitted for the sake of simplicity, but this degrades the performance of a model because patients with insufficient follow-up are lost. An alternative solution is to treat censored data as event-free samples (classification) and their follow-up times as survival times (regression).
Both of these solutions, however, introduce bias that is amplified when the rate of event occurrence is low. To avoid such bias and include all censored data, we modeled a Survival Decision Tree (SDT) algorithm using survival statistics 5,8 .

As described in the previous subsection, a general decision-tree algorithm recursively finds the best attribute to split a node using an impurity measurement such as the Gini index or entropy index, which measures impurity in a classified outcome. In contrast, SDT uses survival statistics as its split criterion and expands its nodes to maximize the improvement in that criterion. As with a general decision-tree method, SDT expands until it meets stopping rules. For our experiments, we set the model to stop splitting when either a split did not improve the fit by a certain threshold or the depth of any node reached a certain threshold. Through these stopping rules, we prevented the model from overfitting the training dataset. To boost the performance of the SDT model, we applied ensemble methods in a similar manner to the classification models. We employed both bagging and random forest with SDT as the base model 8 . Table 3 and Table S5 show the final results of the survival models as the concordance index (C-index).
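The C-index reported for the survival models can be illustrated with Harrell's pairwise definition, sketched below. The treatment of tied risk scores (counted as 0.5) and the exclusion of pairs with tied times are common conventions assumed here; the text does not say which variant was used.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when patient i died strictly before
    patient j's follow-up time. It is concordant when patient i also
    has the higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot be the earlier event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy follow-up in years, event 1 = death, and a model's predicted risk.
c_index = concordance_index(
    times=[2, 4, 6, 8],
    events=[1, 1, 0, 1],
    risk_scores=[0.9, 0.3, 0.5, 0.1],
)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives the scale on which 0.769 and 0.745 are compared.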

Deep learning algorithm process including a recurrent neural network with an autoencoder imputation
Because the performance of the models described above remained unsatisfactory, we tried to strengthen the model by addressing two problems of our CRC-ESRD cohort with deep learning: (i) the time-sequential, longitudinal nature of the observational data was handled with deep learning algorithms such as the recurrent neural network (RNN) and long short-term memory network (LSTM); and (ii) missing data were managed by an autoencoder (AE) (Table 5).
(i) The first feature of the longitudinal observational cohort is the presence of time-variable attributes. Changes in these attributes might play an important role in predicting the target variable. The recurrent neural network (RNN) is a type of artificial neural network in which the connections between units form a cyclic structure. 9 This structure allows states to be stored inside the neural network to model time-variable dynamic attributes. Unlike conventional feed-forward artificial neural networks, the RNN can process sequence-type inputs using internal memory. Thus, the RNN can process data with time-variable characteristics. In a vanilla RNN, gradients cannot be propagated properly during training when the input sequence is long, because they either vanish or explode. This is called the problem of long-term dependencies (LTD) 10 . To solve this problem, a special case of the RNN, the LSTM, was introduced. An LSTM unit consists of an input gate, an output gate, a forget gate, and a memory cell. The process is shown in Figure S1, which presents the structure when applying RNN/LSTM to the classification model; $X$ is a static variable, and $X_t$ is a time-dependent variable. Under the study protocol for our cohort, the time-dependent variables were traced at 0/3/12 months (Table S2, Figure S1). For a patient with tracked values, the unit made predictions according to these changes ($P(\text{Death} \mid X, X_0, X_3)$ or $P(\text{Death} \mid X, X_0, X_3, X_{12})$).
Figure S1. The longitudinal data management for the RNN and LSTM.
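The forward pass of an LSTM over the 0/3/12-month visits can be sketched as follows. Several details are assumptions made for illustration and are not taken from the study: the static variables are concatenated to every visit's input, bias terms are omitted, the weights are random and untrained, and all names and toy values are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(w, v):
    return sum(a * b for a, b in zip(w, v))


class LSTMCell:
    """One LSTM layer: input, forget, and output gates plus a memory cell."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        size = n_inputs + n_hidden

        def matrix():
            return [[rng.uniform(-0.1, 0.1) for _ in range(size)]
                    for _ in range(n_hidden)]

        self.n_hidden = n_hidden
        self.W_i, self.W_f, self.W_o, self.W_c = (
            matrix(), matrix(), matrix(), matrix())

    def step(self, x_t, h_prev, c_prev):
        z = list(x_t) + list(h_prev)
        i = [sigmoid(dot(w, z)) for w in self.W_i]    # input gate
        f = [sigmoid(dot(w, z)) for w in self.W_f]    # forget gate
        o = [sigmoid(dot(w, z)) for w in self.W_o]    # output gate
        g = [math.tanh(dot(w, z)) for w in self.W_c]  # candidate memory
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h, c


def predict_death(cell, w_out, static_x, visits):
    """P(Death | X, X_0, ..., X_t) for however many visits are available."""
    h = [0.0] * cell.n_hidden
    c = [0.0] * cell.n_hidden
    for x_t in visits:
        h, c = cell.step(list(static_x) + list(x_t), h, c)
    return sigmoid(dot(w_out, h))


# Made-up standardized values: 2 static and 3 time-dependent variables
# (for example urine volume, RAAS blockade use, weekly Kt/V).
static_x = [0.8, 1.2]
visits = [[0.1, 1.0, -0.2], [-0.3, 1.0, 0.1], [-0.9, 0.0, 0.4]]
cell = LSTMCell(n_inputs=5, n_hidden=4)
w_out = [0.2, -0.1, 0.3, 0.05]
p_two_visits = predict_death(cell, w_out, static_x, visits[:2])  # months 0, 3
p_three_visits = predict_death(cell, w_out, static_x, visits)    # 0, 3, 12
```

Because the hidden state is carried from visit to visit, the same cell handles patients with two or three recorded visits without any change to the model.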
(ii) The second unavoidable feature of an observational cohort is the existence of missing data. When data contain missing values, the simplest processing method is a complete case analysis that omits the observations with missing values. However, this method can cause two major problems. The first is that statistical significance can be lost because the size of the data decreases, and the second is that the model can become biased because the population distribution changes. To solve these problems, we used an autoencoder (AE), which is a neural network trained to reproduce its input value as its output value. If the number of nodes in the hidden layer is set smaller than that of the input layer, the AE learns a compact representation of the input. This constraint makes the AE learn how to express data efficiently, and the AE can then be used to express information that includes missing values, as shown in Figure S2. In the training process, some input variable values were randomly removed, and the AE was trained to restore the original values. In the inference process, the encoded value of the input was used regardless of whether missing values were present. Figure S2(c) shows the overall structure in which the AE is combined with RNN/LSTM.
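The random removal step used to train the AE can be sketched as below. Filling removed entries with 0.0, returning an explicit mask, and scoring the reconstruction only on observed entries are assumptions for illustration; the text does not describe how removed or missing values were encoded.

```python
import random


def mask_inputs(row, drop_probability=0.2, rng=None, missing_value=0.0):
    """Return (corrupted_row, mask) for one training example.

    Entries already missing (None) and entries dropped at random are
    replaced by `missing_value`; mask[j] is 1 where the value was kept.
    """
    if rng is None:
        rng = random.Random(0)
    corrupted, mask = [], []
    for value in row:
        keep = value is not None and rng.random() >= drop_probability
        corrupted.append(value if keep else missing_value)
        mask.append(1 if keep else 0)
    return corrupted, mask


def reconstruction_loss(original, reconstructed):
    """Mean squared error over the entries that were actually observed."""
    pairs = [(o, r) for o, r in zip(original, reconstructed) if o is not None]
    return sum((o - r) ** 2 for o, r in pairs) / len(pairs)


# One patient row with a genuinely missing second attribute.
row = [1.0, None, 3.0, 4.0]
corrupted, mask = mask_inputs(row, drop_probability=0.5, rng=random.Random(1))
```

During training, the AE would receive `corrupted` as input and be penalized with `reconstruction_loss(row, output)`, so it learns to restore values it was not shown.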
Among the various algorithms, logistic regression had the best AUC, at 0.804. Using these longitudinal data, the AUC of the decision tree (DT) was also improved to 0.801 (Figure 5). Our proposed deep learning model achieved an AUC of 0.840 when using only LSTM and 0.858 when combined with an autoencoder (Table 5).