Early prediction of diagnostic-related groups and estimation of hospital cost by processing clinical notes

As healthcare providers receive fixed amounts of reimbursement for given services under DRG (Diagnosis-Related Groups) payment, DRG codes are valuable for cost monitoring and resource allocation. However, coding is typically performed retrospectively post-discharge. We seek to predict DRGs and the DRG-based case mix index (CMI) early in an inpatient admission, using routine clinical text, to estimate hospital cost in an acute setting. We examined a deep learning-based natural language processing (NLP) model that automatically predicts per-episode DRGs and the corresponding cost-reflecting weights on two cohorts (paid under Medicare Severity (MS) DRG or All Patient Refined (APR) DRG), without human coding effort. It achieved macro-averaged area under the receiver operating characteristic curve (AUC) scores of 0·871 (SD 0·011) on MS-DRG and 0·884 (0·003) on APR-DRG in fivefold cross-validation experiments on the first day of ICU admission. When extended to simulated patient populations to estimate average cost-reflecting weights, the model increased its accuracy over time and obtained absolute CMI errors of 2·40% (1·07%) and 12·79% (2·31%), respectively, on the first day. As the model adapts to variations in admission time and cohort size, and requires no extra manual coding effort, it shows potential to help estimate costs for active patients and support better operational decision-making in hospitals.

Cost weights for the DRG groups were obtained via official publications on governmental websites * , and the fiscal year 2013 versions were used for both MS-DRG and APR-DRG. These documents list all DRG groups and their corresponding relative weights, based on which payers reimburse hospitals. We treated each DRG group as an independent target for MS-DRG, and a DRG group concatenated with its severity score as a target for APR-DRG, because each DRG group in MS-DRG corresponds directly to a cost weight, whereas in APR-DRG the weight maps to the combination of DRG group and severity. As an example for APR-DRG, DRG group 139 (Other Pneumonia) has four subgroups categorized by patient severity ranging from 1 to 4, each with a unique cost weight, and each can thus be treated as a unique DRG target. This enabled us to handle the extra component of patient severity and mortality in APR-DRG and to focus on the final reimbursement rate. Then, for each DRG dataset, we obtained the ground rules for weights that map a DRG target or DRG code $y_a \in Y$ (including its description) for patient $p_a$ to a relative payment weight, denoted as $rw_a = \mathrm{drgRule}(y_a)$.
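As a concrete illustration, the drgRule lookup can be sketched as a plain table from DRG target to relative weight; the codes and weights below are placeholders for illustration only, not the actual published fiscal year 2013 values.

```python
# Hypothetical sketch of the drgRule mapping: the published weight tables are
# read into a lookup from DRG target to relative payment weight. All codes and
# weights here are illustrative placeholders, not real published values.

# MS-DRG: one weight per DRG group.
MS_DRG_WEIGHTS = {"193": 1.42, "194": 0.96}

# APR-DRG: the target is the DRG group concatenated with its severity (1-4),
# e.g. the four subgroups of DRG 139 (Other Pneumonia).
APR_DRG_WEIGHTS = {"139-1": 0.48, "139-2": 0.67, "139-3": 0.94, "139-4": 1.51}

def drg_rule(target, weight_table):
    """Map a DRG target y_a to its relative payment weight rw_a."""
    return weight_table[target]

rw = drg_rule("139-3", APR_DRG_WEIGHTS)
```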

Supplementary Method 1. Clinical text-based model
In this section we formalize the problem of DRG classification and weight prediction and describe the adopted modeling strategies. The deep learning architecture to model text was inspired by the work 1 on ICD tagging. For our task of DRG prediction, we model the available data between hospital admission and the $T$-th hour after ICU admission (which we refer to as hours post admission (HPA)) of the patient $p_a$. We use such data as input to predict the DRG code obtained after patient discharge, which we denote as the target $y_a \in Y$ and which can be mapped to a relative weight $rw_a$ based on existing payment rules, as described in the main article and the last section.
The clinical reports $N_i$ assigned to a patient $p_a$ before a given time threshold $T_i$ are used in the modeling. The reports are tokenized and concatenated into a single text composed of words (tokens) $w_1, w_2, \ldots, w_n$. These words are then embedded with pre-trained word embeddings 2 to form the input representation of the clinical text, which we denote as $e_{1:n} = [e_1, e_2, \ldots, e_n] \in \mathbb{R}^{d_e \times n}$, with $d_e$ being the embedding dimension. This represents the information about the early patient stay.
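The embedding step can be sketched as a simple table lookup; in this toy example a random matrix stands in for the pre-trained word embeddings, and the vocabulary and tokens are invented for illustration.

```python
import numpy as np

# Toy sketch of building e_{1:n}: each token indexes a row of an embedding
# matrix (a random matrix stands in for the pre-trained word embeddings), and
# the columns of the result correspond to token positions.
vocab = {"patient": 0, "admitted": 1, "with": 2, "pneumonia": 3}
d_e = 4                                   # embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d_e))    # stand-in for pre-trained embeddings

tokens = ["patient", "admitted", "with", "pneumonia"]
e = np.stack([E[vocab[w]] for w in tokens], axis=1)   # shape (d_e, n)
```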
Subsequently, a number of one-dimensional convolutions are performed to adaptively extract the phrases most informative for making predictions. Here the classic setting is to have convolutional filters with multiple feature maps as well as filters of different window sizes. Specifically, a filter $f_j$, composed of parameters $w_j \in \mathbb{R}^{u_i \times d_e}$ and $b_j \in \mathbb{R}$, performs convolutions by rolling through the input sequence $e_{1:n}$ with a fixed-size text window, or kernel size, $u_i \in U$. This step produces a feature map, and with $F$ filters the whole convolution operation encodes the text as $C_T = [c_1, c_2, \ldots, c_n] \in \mathbb{R}^{|F| \cdot |U| \times n}$. Latent text features can then be obtained either via max-pooling the feature map or by applying an attention mechanism. For attention-based feature pooling in CAML, 1 each label $\ell$ is assigned a vector $v_\ell$, which queries the feature map matrix $C_T$ to produce an attention weight distribution $\alpha_\ell = \mathrm{softmax}(C_T^\top v_\ell)$ over the positions $i \in \{1, \ldots, n\}$, where $\mathrm{softmax}(x)_i = \exp(x_i) / \sum_j \exp(x_j)$. Features for DRGs are then computed as $h_\ell = C_T \alpha_\ell$. In CAML, this representation is then transformed with prediction weights to produce the output $O_T$. Notice we do not apply a sigmoid transformation here as in 1, since DRG prediction is a single-label classification, in contrast with the multi-label ICD prediction that CAML was originally designed to solve. With that, the distribution of probabilities over DRG codes is $\hat{y} = \mathrm{softmax}(O_T) \in \mathbb{R}^{|Y|}$, with $|Y| = 745$ for our current experiment, one dimension per DRG code as stated in the article. The final DRG prediction for patient $p_a$ with input data at the $T$-th hour of ICU stay is then $\hat{y}_a = \mathrm{argmax}_i(\hat{y})$, and the relative weight prediction is $\hat{rw}_a = \mathrm{drgRule}(\hat{y}_a)$.
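The per-label attention pooling can be sketched in a few lines of numpy (a minimal illustration of the operation, not the authors' implementation; shapes are arbitrary):

```python
import numpy as np

# Minimal numpy sketch of CAML-style per-label attention pooling. C is the
# convolved text representation with one column per token position; v_l is
# the query vector for a single label l.

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def label_attention(C, v_l):
    """C: (d, n) feature map; v_l: (d,) label query.
    Returns the attention-pooled feature h_l of shape (d,)."""
    alpha = softmax(C.T @ v_l)   # (n,) attention weights over token positions
    return C @ alpha             # weighted sum of the feature-map columns

rng = np.random.default_rng(0)
C = rng.normal(size=(8, 20))     # 8 = |F|*|U| feature rows, 20 token positions
v = rng.normal(size=8)
h = label_attention(C, v)        # pooled feature for this label
```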
To train the model for DRG classification, we minimize the cross entropy loss between the target DRG $y$ and the predicted likelihood $\hat{y}$ as our training objective. If we parameterize the model by $\theta$, the cross entropy loss for a sample is $L(\theta) = -\log \hat{y}[y_a]$, i.e., the negative log-probability assigned to the true DRG code, and this is aggregated over all samples in a training batch to update $\theta$ with a gradient method.
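The single-label cross entropy objective follows the standard definition and can be illustrated directly (a sketch, not the authors' training code):

```python
import numpy as np

# Standard single-label cross entropy: the loss is the negative log-probability
# the model assigns to the true DRG code.

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def cross_entropy(logits, target_idx):
    """logits: (|Y|,) model outputs O_T; target_idx: index of the true DRG."""
    return -np.log(softmax(logits)[target_idx])

logits = np.array([2.0, 0.5, -1.0])
loss_right = cross_entropy(logits, 0)   # model is confident and correct
loss_wrong = cross_entropy(logits, 2)   # mass concentrated on the wrong code
```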

Supplementary Method 2. DRG Prediction using Structured data
Besides using clinical text, we also explored using structured clinical measurements to make early DRG predictions in the ICU setting. A range of works in recent years has examined using rich measurement data to predict patient outcomes and intervention needs, such as in-hospital mortality 3 and circulatory failure 4 in the ICU.
Here we followed the feature selection and preprocessing strategies introduced by the MIMIC-Extract 5 pipeline, a robust benchmark for processing clinical structured data for machine learning research, which provides 104 curated and aggregated clinical variables to construct patient time-series. We refer readers to the original paper for more information on data completeness and preprocessing procedures such as unit conversion and outlier handling. For each clinical variable at a specific time step (an hourly bin), the mean value of the recordings in that hourly bin and a mask indicating whether the value was imputed were then used as features. In our experiment, we used data from the first 24 hours of the ICU stay as input, resulting in 24 time steps, each with 208 (104 × 2) features. Missing values were first imputed in a carry-forward manner, where the most recent measured values were used to fill the missing ones. If there were no previous values, the missing values were set to the mean for the individual. Finally, if a variable was never observed for a patient, the value was set to the global mean of the training set during cross validation (for example, the global mean over the 4/5 of the data used for training in that fold).
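The three-stage imputation for a single variable's hourly series can be sketched as follows (a simplified illustration of the strategy described above, not the MIMIC-Extract code itself):

```python
import numpy as np

# Simplified sketch of the imputation order described above: carry-forward,
# then the patient's own mean, then the training-set global mean. The mask
# marks which hourly bins were imputed.

def impute_series(values, global_mean):
    """values: 1-D array of hourly means, with np.nan for missing bins."""
    out = values.astype(float).copy()
    mask = np.isnan(out).astype(float)       # 1 where the value is imputed
    # 1) carry forward the most recent observed value
    for t in range(1, len(out)):
        if np.isnan(out[t]):
            out[t] = out[t - 1]
    # 2) leading gaps with no previous value: use the patient's own mean
    if np.isnan(out).any() and not np.isnan(out).all():
        out[np.isnan(out)] = np.nanmean(values)
    # 3) variable never observed for this patient: training-set global mean
    if np.isnan(out).all():
        out[:] = global_mean
    return out, mask

vals = np.array([np.nan, 7.2, np.nan, 7.5])
filled, mask = impute_series(vals, global_mean=7.0)
```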
The preprocessed features were processed by a single-layer LSTM, and its last hidden state (equivalent to $O_T$) was fed to a classifier to predict the DRG code. The training setting for this model was similar to that of the text-based model: the cross entropy loss was minimized to find the most likely DRG target for each sample. The implementation details are introduced in the next section.
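The measurement-based model can be sketched as follows; this is a toy numpy rendition of a single LSTM layer followed by a linear classifier (random parameters, shapes matching the setup above), not the PyTorch implementation used in the experiments.

```python
import numpy as np

# Toy sketch of the measurement-based model: a single-layer LSTM rolls over
# 24 hourly steps of 208 features; its last hidden state (equivalent to O_T)
# feeds a linear classifier over the DRG codes. Parameters are random.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_hidden(x, W, U, b, hidden):
    """x: (T, d) time series. Returns the last hidden state h_T, shape (hidden,)."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for t in range(x.shape[0]):
        z = W @ x[t] + U @ h + b                  # all four gates at once
        i, f, g, o = np.split(z, 4)               # input, forget, cell, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

rng = np.random.default_rng(0)
T, d, hidden, n_drg = 24, 208, 32, 745
W = rng.normal(scale=0.1, size=(4 * hidden, d))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

x = rng.normal(size=(T, d))                       # one patient's time series
h_T = lstm_last_hidden(x, W, U, b, hidden)        # last hidden state
logits = rng.normal(scale=0.1, size=(n_drg, hidden)) @ h_T   # DRG scores
```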

Supplementary Note 2. Implementation and hyper-parameter setting
We conducted all the experiments using PyTorch. Both text-based and measurement-based models were trained with the Adam optimizer and an early-stopping setting, where training stopped if there was no improvement in the micro-averaged F1 score on the validation set for 10 consecutive epochs. Since the training of neural network models is stochastic and we implemented early stopping, a validation set is needed to indicate when the model is sufficiently trained. To address this, we used 10% of the data during training as an early-stopping set to guide training. For example, when optimizing the hyper-parameters in each of the five folds of cross validation, 20% of the cross validation data was used for model selection and 80% was used to train the model; of the latter, 10% (or 8% of the cross validation data, or 7.2% of the whole dataset that includes the hold-out test set) was actually used as an inner validation set to implement early stopping. Similarly, when the model was retrained on all cross validation data after the best hyper-parameter setup was determined, 10% of that data (or 9% of the whole dataset) was used for early stopping and was not seen by the model during learning.
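The nested-split arithmetic can be made explicit as fractions of the whole dataset; the 10% hold-out test share below is inferred from the stated numbers (8% of the cross validation data equaling 7.2% of the whole dataset implies the cross validation data is 90% of it), so treat it as a reading of the text rather than a documented figure.

```python
# Making the nested-split arithmetic explicit, as fractions of the whole
# dataset. The 10% hold-out test share is inferred from the stated numbers
# (8% of CV data = 7.2% of the whole dataset implies CV data = 90%).
whole = 1.0
cv_data = 0.90 * whole                 # data entering cross validation
hold_out_test = whole - cv_data        # inferred 10% hold-out test set

# Hyper-parameter search within a fold:
model_selection = 0.20 * cv_data       # 18% of the whole dataset
train = 0.80 * cv_data                 # 72% of the whole dataset
early_stop_inner = 0.10 * train        # 8% of CV data, 7.2% of the whole

# Retraining on all CV data with the chosen hyper-parameters:
early_stop_retrain = 0.10 * cv_data    # 9% of the whole dataset
```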
For hyper-parameter search, it was impossible to tune all possible settings given the wide range of hyper-parameters and the computational complexity. Therefore, for the text-based model, we referenced the settings of the original CAML implementation and tuned a limited set of combinations of learning rate, filter size, number of filters, dropout rate, and weight decay for each fold of cross validation on each DRG dataset. For the LSTM, we tuned the learning rate, weight decay, dropout rate, and hidden size of the RNN, as well as the batch size. The batch size for the text-based model was fixed at 32. More implementation details can be found in the released repository.

Supplementary Method 3. Modeling with regressional objective
If we are not interested in the DRG group itself and only care about the final cost or DRG weight, it is possible to bypass the classification step and predict the weight directly as a regression problem. Here we describe our experiments modeling DRG relative weights with CAML, where the label in each sample was changed from the DRG code to the DRG weight.
Following the notation of the previous sections, $O_T$ is the hidden state extracted by CAML to represent the input clinical text. Instead of feeding it to a classifier, $O_T$ was further processed by two additional feed-forward layers with ReLU activation, the first with 100 hidden units and the last with a single unit, which outputs a real number, $\hat{rw}_a = \mathrm{MLP}(O_T)$, as the result. To train the model, we minimized the Mean Square Error (MSE) objective, $\mathrm{MSE} = \frac{1}{A} \sum_{a=1}^{A} ||rw_a - \hat{rw}_a||^2$ (4), where $A$ is the total number of patient stays in a training batch. For evaluation, we calculated MAE (Mean Absolute Error), RMSE (Root Mean Square Error), and R2 to show how accurately a model assesses DRG weight. As a comparator, we listed the model for DRG classification, where weights were mapped from the predicted DRGs. Based on the results in Supplementary Table 1, we observe that it is feasible to directly predict the cost weight in a regression manner, bypassing DRG prediction, though the relatively high errors again showed the challenge of accurately estimating cost for each individual patient. The performance of directly modeling the payment weight was in fact superior to translating the weight from predicted DRG codes, indicating the capacity of the model to further fit the specific cost information in the data. However, this strategy cannot provide DRG groupings, which makes the prediction less interpretable and potentially less desirable in certain scenarios. In addition, obtaining weights by mapping from classification allows perfectly predicted cases (when the model makes the right classification) and can contribute to offsetting over- and under-predicted weights. Nevertheless, Supplementary Table 1 demonstrates the potential to extend this experimental setup of modeling routine clinical data to predict DRG-based cost to payment systems other than DRG, when past payment records can be obtained.
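The evaluation metrics follow their standard definitions, which can be written out directly (a sketch with toy values, not the authors' evaluation code):

```python
import numpy as np

# Standard definitions of the regression metrics used above: MAE, RMSE, and
# R2 for predicted vs. true DRG relative weights.

def mae(rw, rw_hat):
    return np.mean(np.abs(rw - rw_hat))

def rmse(rw, rw_hat):
    return np.sqrt(np.mean((rw - rw_hat) ** 2))

def r2(rw, rw_hat):
    ss_res = np.sum((rw - rw_hat) ** 2)
    ss_tot = np.sum((rw - np.mean(rw)) ** 2)
    return 1.0 - ss_res / ss_tot

rw = np.array([0.5, 1.0, 2.5, 4.0])       # toy true weights
rw_hat = np.array([0.7, 0.9, 2.0, 4.4])   # toy predictions
```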
Here are the detailed CMI errors used to plot Figure 1 in the main manuscript, for the two cohorts respectively. CMI: Case Mix Index; HPA: Hour Post Admission.

Supplementary Method 4. Text-based DRG Prediction with BERT and ClinicalBERT
Here we provide results of applying pre-trained models to early DRG prediction based on clinical notes. We examined the BERT-base model (12 layers, uncased) 6 , a Transformer-based model pre-trained on large quantities of text, and fine-tuned it to predict DRGs based on progress notes. We followed our previous preprocessing steps to remove de-identification placeholders from the concatenated reports and applied BERT tokenizers to tokenize the text string, ensuring the original text inputs were the same across the different models. Given BERT's capacity, the number of input tokens is limited to 512, including the two special tokens [CLS] and [SEP] at the start and end of an input sequence. The [CLS] token was fed to a feed-forward layer to produce the model output in both training and testing, following BERT's standard fine-tuning pattern for sequence classification. We again trained (or fine-tuned) five different models in the five-fold cross-validation experiments, with the difference that we did not tune the hyper-parameters but instead followed the recommended practice for classification-oriented fine-tuning 6,7 , which is to set the learning rate to 2e-5 and the batch size to 16. In this case, each of the five BERT models saw slightly different training data (as 10% was randomly sampled for early stopping), but the same training data as its CAML counterpart. In short, we ensured that CAML and BERT were compared fairly using the same training and testing dataset splits, though BERT consumes shorter input than CAML. Finally, we also fine-tuned the domain-adapted version of BERT, ClinicalBERT (cased) 8 , which was further pre-trained on biomedical text (initialized from BioBERT) and clinical notes. We refer readers to the original publications for more details on BERT and ClinicalBERT.
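The 512-token input constraint can be sketched as a simple truncation-and-wrap step (an illustration of the budget, not the tokenizer's actual API; the ids 101/102 are the conventional uncased BERT vocabulary ids for [CLS] and [SEP], used here as placeholders):

```python
# Sketch of fitting tokenized note text into BERT's 512-token budget,
# including the [CLS] and [SEP] special tokens. The ids 101/102 are the
# conventional uncased BERT vocabulary ids, used here as placeholders.
CLS_ID, SEP_ID, MAX_LEN = 101, 102, 512

def build_bert_input(token_ids):
    """Truncate the note tokens and wrap them with the special tokens."""
    body = token_ids[: MAX_LEN - 2]   # leave room for [CLS] and [SEP]
    return [CLS_ID] + body + [SEP_ID]

ids = build_bert_input(list(range(1000, 3000)))   # an over-long note
```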
The table presents the results of the CNN-based model (CAML) and the pre-trained Transformer-based models (BERT and ClinicalBERT) on predicting all test DRGs in both the MS-DRG and APR-DRG cohorts, which contain 1977 cases (369 unique labels) and 2747 cases (517 unique labels), respectively. The results are summarized as mean (standard deviation) over the performances of the five models on the hold-out test set. Best performances are highlighted in boldface.