Neural networks versus logistic regression for 30-day all-cause readmission prediction

Heart failure (HF) is one of the leading causes of hospital admissions in the US. Readmission within 30 days after an HF hospitalization is both a recognized indicator of disease progression and a source of considerable financial burden to the healthcare system. Consequently, the identification of patients at risk for readmission is a key step in improving disease management and patient outcomes. In this work, we used a large administrative claims dataset to (1) explore the systematic application of neural network-based models versus logistic regression for predicting 30-day all-cause readmission after discharge from an HF admission, and (2) examine the additive value of patients' hospitalization timelines on prediction performance. Based on data from 272,778 patients (49% female) with a mean (SD) age of 73 (14) years and 343,328 HF admissions (67% of total admissions), we trained and tested our predictive readmission models following a stratified 5-fold cross-validation scheme. Among the deep learning approaches, a model combining a recurrent neural network (RNN) with a conditional random field (CRF), denoted RNNCRF, achieved the best readmission prediction performance with 0.642 AUC (95% CI, 0.640–0.645). Other models, such as those based on RNNs, convolutional neural networks and CRFs alone, had lower performance, with a non-timeline-based model (MLP) performing worst. A competitive model based on logistic regression with LASSO achieved a performance of 0.643 AUC (95% CI, 0.640–0.646). We conclude that data from patient timelines improve 30-day readmission prediction, that logistic regression with LASSO performs on par with the best neural network model, and that the use of administrative data results in competitive performance compared to published approaches based on richer clinical datasets.


Recurrent neural network (RNN)
An RNN computes a hidden vector at each time step (i.e. a state vector $h_t$ at time $t$) representing a history or context summary of the sequence, using the current input and the hidden state vector from the previous time step. This allows the model to learn long-range dependencies, where the network is unfolded as many times as the length of the sequence it is modeling. Equation 1 shows the computation of the hidden vector $h_t$ using the input $x_t$ and the previous hidden vector $h_{t-1}$:

$$h_t = \phi(W_{hx} x_t + W_{hh} h_{t-1} + b_h) \qquad (1)$$

where $\phi$ is a non-linear transformation such as $\mathrm{ReLU}(z) = \max(0, z)$ or $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$. To compute the outcome $\hat{y}_t$ at time $t$, an affine transformation followed by a non-linear function is applied to the state vector $h_t$, as described in Equation 2:

$$\hat{y}_t = \sigma(W_{yh} h_t + b_y) \qquad (2)$$

The non-linear operator $\sigma$ can be either the sigmoid function $\sigma(z) = \frac{1}{1+e^{-z}}$ applied to a scalar input $z \in \mathbb{R}$, or its generalization, the softmax function, applied to a vector $z \in \mathbb{R}^K$: $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$ for $i = 1, \ldots, K$. As a result, the outcome $\hat{y}_t$ represents a probability distribution over the set of possible labels $V_{label}$ at time $t$. The matrices $W_{hh} \in \mathbb{R}^{D_h \times D_h}$ and $W_{hx} \in \mathbb{R}^{D_h \times d}$, together with the bias vectors, constitute the model weights $\theta$ to be optimized, where $D_h$ and $d$ are the dimensions of the $h_t$ and $x_t$ vectors, respectively. Note that the weights are shared across the whole network (see Figure 1 for an RNN representation).
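To make the unfolding concrete, the following is a minimal NumPy sketch of Equations 1 and 2; the dimensions, weight initialization and synthetic input are illustrative assumptions, not values from our models.

```python
# A minimal sketch of the RNN recurrence (Eqs. 1-2) using NumPy.
# D_h, d and K = |V_label| below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
D_h, d, K = 32, 64, 2

W_hh = rng.normal(scale=0.1, size=(D_h, D_h))   # hidden-to-hidden weights
W_hx = rng.normal(scale=0.1, size=(D_h, d))     # input-to-hidden weights
b_h  = np.zeros(D_h)
W_yh = rng.normal(scale=0.1, size=(K, D_h))     # hidden-to-output weights
b_y  = np.zeros(K)

def softmax(z):
    z = z - z.max()                              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(x_seq):
    """Unfold the RNN over a sequence of T input vectors x_t (Eqs. 1-2)."""
    h = np.zeros(D_h)                            # initial hidden state h_0
    y_hats = []
    for x_t in x_seq:
        h = np.tanh(W_hx @ x_t + W_hh @ h + b_h)     # Eq. 1
        y_hats.append(softmax(W_yh @ h + b_y))       # Eq. 2
    return y_hats

# One synthetic patient timeline with T = 5 events:
y_hats = rnn_forward(rng.normal(size=(5, d)))
print(y_hats[-1])   # probability distribution over V_label at the last event
```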

Long short-term memory (LSTM)
Long short-term memory (LSTM) [1,2] belongs to the family of gated memory cell approaches, which modify the basic RNN by replacing the standard neurons/units in the hidden layer with gated/memory cells to generate the hidden state vector $h_t$, as described by the equations below. Moreover, LSTM introduces a new cell state vector $c_t$ that contributes to the decision mechanism on which parts of the history to keep or forget. The computation of the output $\hat{y}_t$ at time $t$ remains the same as explained in the RNN section.

$$i_t = \sigma(W_{i,hx}\, x_t + W_{i,hh}\, h_{t-1} + b_i) \qquad \text{(input gate)}$$
$$f_t = \sigma(W_{f,hx}\, x_t + W_{f,hh}\, h_{t-1} + b_f) \qquad \text{(forget gate)}$$
$$o_t = \sigma(W_{o,hx}\, x_t + W_{o,hh}\, h_{t-1} + b_o) \qquad \text{(output gate)}$$
$$\tilde{c}_t = \phi(W_{\tilde{c},hx}\, x_t + W_{\tilde{c},hh}\, h_{t-1} + b_{\tilde{c}}) \qquad \text{(new state/memory cell)}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad \text{(cell state)}$$
$$h_t = o_t \odot \phi(c_t) \qquad \text{(hidden state)}$$

where $\sigma$ is the sigmoid function, $\phi$ is the $\tanh$ function, and $\odot$ denotes element-wise multiplication.
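The gating logic above can be sketched in a few lines. This is a minimal sketch of one LSTM step; the weight layout (one matrix per gate acting on the concatenation $[h_{t-1}; x_t]$) is an illustrative convenience, not our implementation.

```python
# A minimal sketch of one LSTM step following the equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W/b map gate names ('i', 'f', 'o', 'c') to (D_h, D_h + d) matrices
    and (D_h,) bias vectors acting on the concatenation [h_prev; x_t]."""
    z = np.concatenate([h_prev, x_t])
    i = sigmoid(W["i"] @ z + b["i"])          # input gate
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate
    o = sigmoid(W["o"] @ z + b["o"])          # output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # new state/memory cell
    c_t = f * c_prev + i * c_tilde            # cell state: keep vs. forget
    h_t = o * np.tanh(c_t)                    # hidden state
    return h_t, c_t

# Toy usage with random weights:
rng = np.random.default_rng(0)
D_h, d = 4, 3
W = {k: rng.normal(scale=0.1, size=(D_h, D_h + d)) for k in "ifoc"}
b = {k: np.zeros(D_h) for k in "ifoc"}
h, c = lstm_step(rng.normal(size=d), np.zeros(D_h), np.zeros(D_h), W, b)
```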

Gated recurrent unit (GRU)
The gated recurrent unit (GRU) [3] presents a similar approach to LSTM but with a simpler model that modifies the computation mechanism of the hidden state vector $h_t$ through the equations below:

$$z_t = \sigma(W_{z,hx}\, x_t + W_{z,hh}\, h_{t-1} + b_z) \qquad \text{(update gate)}$$
$$r_t = \sigma(W_{r,hx}\, x_t + W_{r,hh}\, h_{t-1} + b_r) \qquad \text{(reset gate)}$$
$$\tilde{h}_t = \phi(W_{\tilde{h},hx}\, x_t + W_{\tilde{h},hh}\, (r_t \odot h_{t-1}) + b_{\tilde{h}}) \qquad \text{(candidate hidden state)}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \qquad \text{(hidden state)}$$

where $D_h$ and $d$ are the dimensions of the $h_t$ and $x_t$ vectors, respectively, and the operator notation has the same meaning as described in the LSTM section.
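For comparison with the LSTM sketch, here is a minimal sketch of one GRU step; the weight layout mirrors the LSTM sketch and is likewise an assumption for illustration.

```python
# A minimal sketch of one GRU step following the equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, b):
    """W/b map gate names ('z', 'r', 'h') to (D_h, D_h + d) matrices and
    (D_h,) biases acting on a concatenated hidden state and input."""
    zx = np.concatenate([h_prev, x_t])
    z = sigmoid(W["z"] @ zx + b["z"])                     # update gate
    r = sigmoid(W["r"] @ zx + b["r"])                     # reset gate
    h_tilde = np.tanh(W["h"] @ np.concatenate([r * h_prev, x_t]) + b["h"])
    return (1.0 - z) * h_prev + z * h_tilde               # new hidden state
```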

RNN objective function
In our work, we use RNN to refer to the RNN, LSTM and GRU models targeting the sequence labeling view of the problem. We define the loss for the $i$-th sequence at each time step by the cross-entropy loss

$$l_t^{(i)} = -\sum_{c \in V_{label}} y_{t,c}^{(i)} \log \hat{y}_{t,c}^{(i)} \qquad (3)$$

so that the loss for the $i$-th sequence (i.e. the $i$-th patient's timeline/trajectory) is defined by the average loss over the sequence length $T_i$:

$$L_i = \frac{1}{T_i} \sum_{t=1}^{T_i} l_t^{(i)} \qquad (4)$$

Given that our focus is on 30-day all-cause readmissions after HF hospitalization, our model's focus should be on the index events (i.e. claims/events where HF is the primary diagnosis of hospitalization). Hence, the objective function can be modified to reflect this requirement. We modify the loss over the $i$-th sequence (see Eq. 4) by defining an average loss for non-index and index events separately and then taking a convex combination of both losses parametrized by $\alpha$. The parameter $\alpha$ is determined using a validation set. Our modification is inspired by the work done in [4]:

$$L_i = \alpha \, \frac{\sum_{t=1}^{T_i} \mathbb{1}_{HF}(x_t^{(i)})\, l_t^{(i)}}{\sum_{t=1}^{T_i} \mathbb{1}_{HF}(x_t^{(i)})} + (1-\alpha) \, \frac{\sum_{t=1}^{T_i} \left(1-\mathbb{1}_{HF}(x_t^{(i)})\right) l_t^{(i)}}{\sum_{t=1}^{T_i} \left(1-\mathbb{1}_{HF}(x_t^{(i)})\right)} \qquad (5)$$

where $\mathbb{1}_{HF}(\cdot)$ is an indicator function that is equal to 1 when the feature vector $x_t^{(i)}$, representing the event at time $t$ for the $i$-th sequence, has HF as the primary diagnosis (we refer to this loss as Convex_HF_NonHF). A second variation (Uniform_HF) of the objective function in Eq. 5 is to consider only the index events in a patient's timeline, computing the average cross-entropy loss over the index events only.

A third variation (Convex_HF_lastHF) is to take a convex combination between all index events and the last index event in the timeline:

$$L_i = \alpha \, L_i^{lastHF} + (1-\alpha) \, L_i^{HF} \qquad (6)$$

where $L_i^{lastHF}$ is the cross-entropy loss for the last HF event (i.e. the last index event in the patient's timeline) and $L_i^{HF}$ is the average cross-entropy loss over all index events. The previous variations follow the sequence labeling approach, since the loss function for a sequence is defined as a composite of losses from different times/events in a patient's timeline. A final variation, which adopts the sequence classification view of the problem, is to define an objective function that focuses only on the last HF event (LastHF) by computing the loss at the last target event we aim to predict.

Lastly, the objective function for the whole training set $D_{train}$ is defined by the average loss across all $N$ sequences in $D_{train}$ plus a weight regularization term (i.e. $l_2$-norm regularization) applied to the model parameters $\theta$:

$$L(D_{train}) = \frac{1}{N} \sum_{i=1}^{N} L_i + \lambda \|\theta\|_2^2 \qquad (7)$$

In practice, training occurs using mini-batches, where computing the loss function and updating the parameters/weights occur after processing each mini-batch of the training set.
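As an illustration, the following is a minimal sketch of the Convex_HF_NonHF loss (Eq. 5) for a single timeline, assuming the per-event cross-entropy losses and an index-event mask have already been computed; it is not our training code.

```python
# A minimal sketch of the Convex_HF_NonHF objective (Eq. 5) for one timeline.
import numpy as np

def convex_hf_nonhf_loss(ce_losses, is_hf, alpha):
    """ce_losses: (T,) cross-entropy loss per event in one timeline.
    is_hf: (T,) indicator, 1 where HF is the primary diagnosis (index event).
    alpha: convex-combination weight, chosen on a validation set."""
    ce_losses = np.asarray(ce_losses)
    is_hf = np.asarray(is_hf, dtype=bool)
    loss_hf = ce_losses[is_hf].mean() if is_hf.any() else 0.0       # index events
    loss_non = ce_losses[~is_hf].mean() if (~is_hf).any() else 0.0  # non-index events
    return alpha * loss_hf + (1.0 - alpha) * loss_non

# Example: a 4-event timeline where events 2 and 4 are HF index events.
print(convex_hf_nonhf_loss([0.3, 0.9, 0.2, 1.1], [0, 1, 0, 1], alpha=0.7))
```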

RNN with conditional random fields (CRF)
A CRF models the conditional probability of a label sequence $y$ given its corresponding sequence of observation vectors $x$ (i.e. $p(y_1, y_2, \ldots, y_T \mid x_1, x_2, \ldots, x_T)$) using a parametrized global feature vector $F(x, y) \in \mathbb{R}^J$ that maps input/output sequences to a $J$-dimensional vector. As a result, the conditional probability of an output sequence given its input sequence of observations is

$$p(y \mid x) = \frac{\exp\left(\theta^\top F(x, y)\right)}{\sum_{y' \in Y} \exp\left(\theta^\top F(x, y')\right)} \qquad (8)$$

where $Y$ is the set of all label sequences, the denominator is the normalizer (commonly referred to as the partition function $Z$), and $\theta$ is the weight vector corresponding to the global feature vector $F(x, y)$. The common definition of the feature vector $F$ uses the first-order Markov assumption in order to make inference and model training tractable [5]. That is,

$$F(x, y) = \sum_{t=1}^{T} f(y_{t-1}, y_t, x, t) \qquad (9)$$

where the global feature vector $F$ is the sum of a local feature vector $f$ applied at each time step until the end of the sequence. The local vector $f$ has the same dimension as $F$ (i.e. $\in \mathbb{R}^J$) and has access to the whole observation sequence $x$ and to the current and previous states/outputs $y_{t-1}$ and $y_t$ [5]. Generally, increasing the model order $k$ (i.e. $k \geq 2$) leads to computational complexity that is exponential in $k$. However, recent work, as in [6][7][8], showed that under the assumption of label pattern sparsity, the use of higher-order models (i.e. models with $k \geq 2$) is feasible without incurring exponential complexity in the training and inference algorithms [9].

We denote the output features of the RNN layer by $z = [z_1, z_2, \cdots, z_T]$, representing the sequence of output features computed from the input sequence $x$ (both sequences have equal length). The potential functions in the CRF layer are computed using $z$ along with the label sequence $y$. In our work, we experimented with two potential functions:

1. RNNCRF (Unary) computes unary potentials $\psi_{y_t}(z_t)$ by passing the RNN output feature vector $z_t$ at time $t$ through a linear affine map and applying a non-linear transformation, resulting in a vector of size equal to the number of classes $|V_{label}|$ for each $z_t$. The pairwise potential is modeled using a transition parameter matrix $A(y_{t-1}, y_t)$ of size $|V_{label}| \times |V_{label}|$ representing the transition score from one outcome class to another. The total score is

$$F(z, y) = \sum_{t=1}^{T} \left( \psi_{y_t}(z_t) + A(y_{t-1}, y_t) \right) \qquad (10)$$

2. RNNCRF (Pairwise) computes pairwise potentials $\psi_{y_{t-1} y_t}(z_t)$ using a linear affine map followed by a non-linear element-wise operation, generating an output of size $|V_{label}| \times |V_{label}|$, similar to the approach reported in [10]. The total score is

$$F(z, y) = \sum_{t=1}^{T} \psi_{y_{t-1} y_t}(z_t) \qquad (11)$$
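To illustrate the unary-potential scoring in Eq. 10 and the tractability afforded by the first-order assumption, here is a small NumPy sketch of the sequence score and of the partition function computed with the forward algorithm. The shapes are illustrative, and omitting an explicit start transition is a simplifying assumption.

```python
# A sketch of the RNNCRF (Unary) score F(z, y) and the partition function Z.
import numpy as np

def sequence_score(psi, A, y):
    """psi: (T, K) unary potentials psi_{y_t}(z_t), K = |V_label|.
    A: (K, K) transition scores A(y_{t-1}, y_t).  y: (T,) label sequence."""
    score = psi[0, y[0]]                      # no explicit start transition
    for t in range(1, len(y)):
        score += psi[t, y[t]] + A[y[t - 1], y[t]]
    return score

def log_partition(psi, A):
    """Forward algorithm: log Z over all label sequences in O(T * K^2)
    time, instead of enumerating the K^T possible sequences."""
    log_alpha = psi[0]                                    # (K,)
    for t in range(1, psi.shape[0]):
        m = log_alpha.max()                               # logsumexp trick
        log_alpha = psi[t] + m + np.log(np.exp(log_alpha - m) @ np.exp(A))
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())

# log p(y | x) = F(z, y) - log Z, for toy potentials and a binary label set:
rng = np.random.default_rng(0)
psi, A = rng.normal(size=(5, 2)), rng.normal(size=(2, 2))
y = np.array([0, 1, 0, 0, 1])
print(sequence_score(psi, A, y) - log_partition(psi, A))
```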

Input features $x_t$
Each claim/event in a patient's timeline is represented by a feature vector $x_t$ encoding the characteristics of the hospitalization event and the corresponding patient. The feature vector $x_t$ is composed of:

• Diagnoses: every claim in the dataset includes 25 ordered fields, each registering a patient diagnosis category based on the CCS grouper [11] during the corresponding hospitalization event. We first extracted the set $V_{diagnosis}$ representing the diagnoses having at least 1000 counts/occurrences registered in the HF dataset. Then, we constructed the following vectors:
  1. $x_{diag1}$, a one-hot encoded vector $\in \{0,1\}^{|V_{diagnosis}|}$ representing the diagnosis category registered in the primary diagnosis field
  2. $x_{diag2}$, a one-hot encoded vector $\in \{0,1\}^{|V_{diagnosis}|}$ representing the diagnosis category registered in the secondary diagnosis field
  3. $x_{diag3}$, a one-hot encoded vector $\in \{0,1\}^{|V_{diagnosis}|}$ representing the diagnosis category registered in the tertiary diagnosis field
  4. $x_{countdiag}$, a vector $\in \mathbb{R}^{|V_{diagnosis}|}$ representing the count of diagnosis categories registered in all 25 diagnosis fields

• Procedures: every claim in the dataset includes 15 ordered fields, each registering a patient's administered procedure category based on the CCS grouper during the corresponding hospitalization event. We first extracted the set $V_{procedures}$ representing the top procedures having at least 1000 counts registered in the HF dataset. Then, we constructed the following vectors:
  1. $x_{proc1}$, a one-hot encoded vector $\in \{0,1\}^{|V_{procedures}|}$ representing the procedure category registered in the primary procedure field
  2. $x_{proc2}$, a one-hot encoded vector $\in \{0,1\}^{|V_{procedures}|}$ representing the procedure category registered in the secondary procedure field
  3. $x_{proc3}$, a one-hot encoded vector $\in \{0,1\}^{|V_{procedures}|}$ representing the procedure category registered in the tertiary procedure field
  4. $x_{countproc}$, a vector $\in \mathbb{R}^{|V_{procedures}|}$ representing the count of procedure categories registered in all 15 procedure fields

• Body-system chronic conditions: every claim in the dataset includes 25 ordered fields, each representing a body-system chronic condition indicator, categorizing ICD-9-CM diagnosis codes as chronic or not [12]. We refer to the list of body-system categories by the set $V_{bchronic}$, which includes 18 categories. We constructed the following vectors:
  1. $x_{bchronic1}$, a one-hot encoded vector $\in \{0,1\}^{|V_{bchronic}|}$ representing the body-system chronic condition indicator category listed in the primary field
  2. $x_{bchronic2}$, a one-hot encoded vector $\in \{0,1\}^{|V_{bchronic}|}$ representing the body-system chronic condition indicator category listed in the secondary field
  3. $x_{bchronic3}$, a one-hot encoded vector $\in \{0,1\}^{|V_{bchronic}|}$ representing the body-system chronic condition indicator category listed in the tertiary field
  4. $x_{countbchronic}$, a vector $\in \mathbb{R}^{|V_{bchronic}|}$ representing the count of body-system chronic condition indicator categories registered in the 25 fields

• External cause of injury codes: every claim in the dataset includes 4 ordered fields, each detailing an injury code (E-code) based on the CCS software, which categorizes all ICD-9-CM diagnosis codes into 20 categories [12]. We refer to the list of E-code categories by the set $V_{ecode}$, which includes 20 categories. We constructed the following vectors:
  1. $x_{ecode1}$, a one-hot encoded vector $\in \{0,1\}^{|V_{ecode}|}$ representing the E-code injury category listed in the primary field
  2. $x_{countecode}$, a vector $\in \mathbb{R}^{|V_{ecode}|}$ representing the count of E-code injury categories registered in the 4 fields

• Procedure classes: every claim in the dataset includes 15 ordered fields, each describing a broad category code (i.e. class) based on a categorization of the ICD-9-CM procedure codes [12]. We refer to the list of procedure broad categories by the set $V_{procedureclass}$, which includes 4 categories. We constructed the following vector:
  1. $x_{countpclass}$, a vector $\in \mathbb{R}^{|V_{procedureclass}|}$ representing the count of procedure class categories registered in the 15 fields

• Comorbidity conditions: every claim in the dataset includes 29 binary fields $\in \{0,1\}$, each an indicator of a specific comorbidity determined by the AHRQ comorbidity software [13]. The software determines comorbidities that are most likely present prior to the hospitalization event [13]. We refer to the comorbidity categories encoded in the 29 binary variables by the set $V_{comorbid}$. We constructed the following vector:
  1. $x_{comorbid}$, a vector $\in \{0,1\}^{|V_{comorbid}|}$ where each component corresponds to one of the 29 binary fields representing the presence/absence of a comorbidity condition

• Major diagnostic category (MDC): assigned by the HCFA DRG grouper algorithm during the processing of the HCUP dataset [12]. We refer to the list of MDC categories by the set $V_{mdc}$ and constructed the following vector:
  1. $x_{mdc}$, a one-hot encoded vector $\in \{0,1\}^{|V_{mdc}|}$ representing the MDC code/category

• Risk of mortality subclass: every claim in the dataset includes a field that measures the risk of mortality subclass based on All Patient Refined Diagnosis Related Groups, assigned using software developed by 3M Health Information Systems [12]. The measure consists of five categories, which we refer to by the set $V_{riskmortal}$. We constructed the following vector:
  1. $x_{riskmortal}$, a one-hot encoded vector $\in \{0,1\}^{|V_{riskmortal}|}$ representing the risk of mortality subclass category

• Severity of illness subclass: every claim in the dataset includes a field that measures the severity of illness subclass based on All Patient Refined Diagnosis Related Groups, assigned using software developed by 3M Health Information Systems [12]. The measure consists of five categories, which we refer to by the set $V_{severity}$. We constructed the following vector:
  1. $x_{severity}$, a one-hot encoded vector $\in \{0,1\}^{|V_{severity}|}$ representing the severity of illness subclass category

• Major operating room procedure indicator: $x_{orproc} \in \{0,1\}$, a binary variable indicating whether a major operating room procedure was reported on discharge

• Number of chronic conditions: $x_{nchronic} \in \mathbb{R}$, a variable representing the count of unique chronic diagnoses reported on discharge

• Socio-demographics: every claim is associated with a patient and includes information regarding the patient's age, gender, income and place/location of residence. We constructed a vector $x_{sociodem}$ representing the concatenation of these variables.

• Event/claim info: every claim includes the length of stay of the hospitalization event, whether the admission was on a weekend, and the discharge month. We also computed the time difference between the hospital admission of the current claim/event and the discharge of the previous claim/event for all events in a patient's timeline. We constructed a vector $x_{event}$ representing the concatenation of these variables together with the number of admission events, $x_{countevents} \in \mathbb{R}$, a variable representing the number of admission events in a patient's timeline up to and including the current admission event.

Hence, the feature vector $x_t$ is the concatenation of all these variables, encoding the characteristics of both the event and its corresponding patient at one time step of the patient's trajectory (a toy sketch of this encoding is given below).
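The following toy sketch illustrates the one-hot/count encoding pattern used throughout this list, shown for the diagnosis block only; the vocabulary and claim contents are made up for illustration, not drawn from the CCS grouper.

```python
# A toy sketch of assembling part of x_t from one-hot and count sub-vectors.
import numpy as np

V_diagnosis = ["HF", "arrhythmia", "renal_failure", "diabetes"]  # toy vocabulary

def one_hot(category, vocab):
    """One-hot encode a single categorical field over a fixed vocabulary."""
    v = np.zeros(len(vocab))
    if category in vocab:
        v[vocab.index(category)] = 1.0
    return v

def count_vector(categories, vocab):
    """Count occurrences of each vocabulary entry across all fields."""
    v = np.zeros(len(vocab))
    for c in categories:
        if c in vocab:
            v[vocab.index(c)] += 1.0
    return v

# One toy claim: the first three diagnosis fields, plus a repeat among
# the remaining fields (standing in for the full 25-field record).
claim_diags = ["HF", "renal_failure", "diabetes", "diabetes"]
x_diag_block = np.concatenate([
    one_hot(claim_diags[0], V_diagnosis),      # x_diag1
    one_hot(claim_diags[1], V_diagnosis),      # x_diag2
    one_hot(claim_diags[2], V_diagnosis),      # x_diag3
    count_vector(claim_diags, V_diagnosis),    # x_countdiag
])
print(x_diag_block)
```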

RNN model
The set of all possible hyperparameter configurations (i.e. choices of values for the hyperparameters) for the RNN models is reported in Table 1. These hyperparameters controlled the network architecture design represented in Figure 2.

RNNSS model
The hyperparameter configurations for the RNNSS model are reported in Table 2; they controlled the network architecture design depicted in Figure 2.

RNNCRF model
Similar to the RNN models, the set of all possible hyperparameter configurations for the models using RNN with CRF is reported in Table 3, along with the network architecture design in Figure 3.

CRF and Neural CRF models
CRF-only and Neural CRF models' hyperparameter configurations are reported in Tables 4 and 5, respectively, along with the network architecture/design in Figure 4.

Table 3: RNNCRF hyperparameter options (see Figure 3)

Table 7: Logistic regression with $l_2$-norm regularization

Logistic regression
Hyperparameter options for the logistic regression models ($l_1$- and $l_2$-regularized) are reported in Tables 6 and 7, respectively.
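As a point of reference, a LASSO-regularized logistic regression baseline of this kind can be sketched with scikit-learn; the C grid, synthetic data and scoring below are illustrative stand-ins, not the settings of Table 6.

```python
# A sketch of a LASSO (l1-regularized) logistic regression baseline tuned
# by AUC under stratified 5-fold cross-validation, on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))        # stand-in feature matrix
y = rng.integers(0, 2, size=1000)      # stand-in readmission labels

lasso_lr = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},   # inverse regularization strength
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
lasso_lr.fit(X, y)
print(lasso_lr.best_params_, lasso_lr.best_score_)
```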

Logistic regression
The analysis of feature importance is reported in Figure 8, which shows the normalized coefficients of the trained LASSO models averaged across all folds.

RNNCRF model
For the best neural model (RNNCRF), we report the analysis of feature importance according to an approach previously reported in [14]. In short, we iterated over all features attached to the last HF event, and computed the probability of readmission with a feature present or absent. The difference between both probabilities allowed us to quantify a feature's importance across the five folds (we call this metric diff_prob see Fig. 9). Additionally, we computed another variation of the same metric by incorporating the percentage of occurrence of each feature (i.e. when the feature is present) in the computation. In other words, we weighted the computed differences by the percentage of time each feature was present in the dataset (referred to diff_prob_weighted, see Fig. 10). A third variation, is computing a ratio (for every feature) dividing the average difference in probability (diff_prob) by the average value of the feature when it was present and again weighted by the percentage of occurrence of the feature (ratio_diff_prob_weighted, see Fig. 11). Figure 12 reports the ROC and average ROC curves of the best models across all five-folds.
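The toy sketch below illustrates the diff_prob and diff_prob_weighted computations; the linear `toy_model` stands in for the trained RNNCRF's readmission probability and is purely hypothetical.

```python
# A sketch of the perturbation-based diff_prob importance metrics.
import numpy as np

def diff_prob(predict_proba, x_last, feature_idx):
    """predict_proba: callable mapping a feature vector to P(readmission).
    x_last: feature vector of the last HF (index) event."""
    x_on, x_off = x_last.copy(), x_last.copy()
    x_on[feature_idx], x_off[feature_idx] = 1.0, 0.0
    return predict_proba(x_on) - predict_proba(x_off)    # present vs. absent

def diff_prob_weighted(predict_proba, X_last, feature_idx):
    """Average diff_prob over patients, weighted by how often the feature
    is present in the dataset (X_last: one last-HF-event row per patient)."""
    diffs = np.array([diff_prob(predict_proba, x, feature_idx) for x in X_last])
    presence = X_last[:, feature_idx].mean()             # fraction present
    return diffs.mean() * presence

# Toy usage with a linear model as a stand-in for the trained network:
w = np.array([0.8, -0.5, 0.1])
toy_model = lambda x: 1.0 / (1.0 + np.exp(-(x @ w)))
X = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
print(diff_prob_weighted(toy_model, X, feature_idx=0))
```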