DRG-LLaMA: tuning LLaMA model to predict diagnosis-related group for hospitalized patients

In the U.S. inpatient payment system, the Diagnosis-Related Group (DRG) is pivotal, but its assignment process is inefficient. This study introduces DRG-LLaMA, an advanced large language model (LLM) fine-tuned on clinical notes to enhance DRG assignment. Using LLaMA as the foundational model and optimizing it through Low-Rank Adaptation (LoRA) on 236,192 MIMIC-IV discharge summaries, our DRG-LLaMA-7B model exhibited a noteworthy macro-averaged F1 score of 0.327, a top-1 prediction accuracy of 52.0%, and a macro-averaged Area Under the Curve (AUC) of 0.986, with a maximum input token length of 512. The model surpassed prior leading models in DRG prediction, showing relative improvements of 40.3% and 35.7% in macro-averaged F1 score compared to ClinicalBERT and CAML, respectively. Applied to base DRG and complication or comorbidity (CC)/major complication or comorbidity (MCC) prediction, DRG-LLaMA achieved top-1 prediction accuracies of 67.8% and 67.5%, respectively. Additionally, our findings indicate that DRG-LLaMA's performance correlates with increased model parameters and input context lengths.


Introduction
The emergence of LLMs, such as GPT-3 (Brown et al. 2020) and InstructGPT (Ouyang et al. 2022), has brought about a transformative shift in the landscape of Natural Language Processing (NLP). These LLMs have demonstrated exceptional capabilities across many NLP tasks in the general domain. However, the integration of LLMs into the medical field remains at a nascent stage within the academic community. Recent instances of progress highlight their significant potential, including OpenAI's GPT-4 (Nori et al. 2023), Google's Med-PaLM 2 (Singhal et al. 2023), and Google DeepMind's Med-PaLM M (Tu et al. 2023). GPT-4 and Med-PaLM 2 have achieved impressive performance on the United States Medical Licensing Examination (USMLE), and Med-PaLM M can even classify radiology images. Nonetheless, the medical domain introduces elevated concerns regarding safety and privacy, necessitating detailed analysis of the performance and limitations of LLMs to address inherent risks such as hallucination, bias, and reasoning deficiencies (Au Yeung et al. 2023).
Since its inception by Medicare in 1983, the DRG has served as the foundation for the inpatient prospective payment system in the United States (Quinn 2014). Each distinct DRG code is delineated by a particular set of patient attributes, including principal diagnosis, specific secondary diagnoses, procedures, sex, and discharge status (CMS 2016). Traditionally, the assignment of DRGs constitutes a labor-intensive manual endeavor undertaken by coding specialists, typically subsequent to a patient's discharge. Given the pivotal role of DRGs and their bundled metrics (e.g., case-mix index, geometric length of stay) in the operational and financial performance of hospitals, there is pressing interest in accurate early prediction of DRGs during a patient's hospitalization, which is vital for efficacious resource planning and allocation. The task of DRG prediction presents distinct challenges compared to automated International Classification of Diseases (ICD) coding. This distinction stems from the nature of the task: DRG assignment is multi-class classification, where one DRG code is assigned to each visit, in contrast to the multi-label classification of ICD coding, where multiple codes may apply to a single visit (Kaur, Ginige, and Obst 2022). Additionally, the hierarchical structure of the codes, such as the presence of a principal diagnosis in DRGs, and the context of utilization in hospital operations further differentiate the two tasks (CMS 2016). Previous studies have showcased advancements in DRG classification accuracy through various machine learning algorithms (Gartner et al. 2015) and deep neural networks (Islam et al. 2021). More recently, a deep learning-based NLP model leveraging adjusted Convolutional Attention for Multi-Label Classification (CAML) has been applied to predict DRGs from clinical notes, yielding promising outcomes (Mullenbach et al. 2018; Liu et al. 2021).
Given LLMs' remarkable natural language understanding and generation capabilities, we hypothesized that an LLM could be applied to effectively predict DRGs directly from clinical notes. In this work, we present DRG-LLaMA, a fine-tuned LLM derived from LLaMA (Touvron et al. 2023a). DRG-LLaMA is trained on discharge summaries from the MIMIC-IV dataset for the task of DRG prediction. We approached DRG prediction from two perspectives: 1) as a single-label classification task, where the model makes an end-to-end prediction of the DRG label, and 2) as a two-label classification task, where the model predicts the base DRG and CC/MCC status as two separate labels, followed by inference of the final DRG label from these two components. Our work revealed superior performance of DRG-LLaMA in DRG prediction compared to the previously reported leading models CAML (Liu et al. 2021) and ClinicalBERT (Alsentzer et al. 2019).

Study cohort
A summary of the study cohort and data preprocessing steps is shown in Figure 1. We focused on hospital stays with Medicare severity-DRGs (MS-DRGs) within the MIMIC-IV dataset. The "brief hospital course" section of each discharge summary was extracted to serve as input text. We also filtered out low-quality discharge summaries and rare DRGs with fewer than 2 occurrences in the cohort. 90% of the data was allocated to the training set and the remaining 10% to the testing set, with the partitioning stratified by DRG. The training and testing sets contain 738 and 723 unique DRG labels, respectively. There was no significant difference in average word counts between the training and testing sets (398 vs. 399; p = 0.51, two-sided t-test). The distribution of cases per DRG is imbalanced, with a median of 124.5 cases per DRG in the training set (Supplementary Figure 1).

DRG prediction as a single-label classification task
We present the results with a maximum input token size of 512 in Table 1. DRG-LLaMA consistently outperformed ClinicalBERT and CAML across all evaluation metrics, with the most notable contrast seen in the macro-F1 score (a relative improvement of 40.3% and 35.7% compared to ClinicalBERT and CAML, respectively). The top-1 and top-5 prediction accuracy achieved by our fine-tuned DRG-LLaMA-7B model was 52.0% and 84.8%, respectively. When only the most frequent 300 DRGs were considered, the top-1 accuracy improved to 55.7%, and it further increased to 69.4% for the 30 most frequent DRGs. As expected, DRG-LLaMA's performance declined on less frequent DRGs (Figure 2a). Compared to CAML, ClinicalBERT achieved higher AUC and top-1 prediction accuracy but a lower macro-averaged F1 score. High AUC scores were obtained for all models because the many infrequent DRG classes yield a high number of true negatives for negative class predictions (Liu et al. 2021).
We investigated DRG-LLaMA's performance across varying model sizes and input context lengths (Table 2), observing a consistent improvement in all evaluation metrics with larger models and longer input contexts, measured in maximum token numbers. The optimal configuration, utilizing a 13B LLaMA model and a maximum input token size of 1024, achieved a top-1 prediction accuracy of 54.6%, a top-5 prediction accuracy of 86.5%, and a macro-F1 score of 0.361.

DRG prediction as a two-label classification task
In the two-label approach, we first dissect each DRG into two distinct components: a base DRG label and a CC/MCC label (denoting complication or comorbidity / major complication or comorbidity). This dissection was based on the composition delineated within the MS-DRG v34.0 definitions manual (CMS 2016). The five distinct CC/MCC labels are: "without CC/MCC", "with CC", "with MCC", "without MCC", and "not applicable". As an example, in DRG code 53, "spinal disorders and injuries without CC/MCC", "spinal disorders and injuries" is the base DRG label, while "without CC/MCC" is the CC/MCC label. Following this mapping, the 738 DRG codes were converted into a combination of 340 base DRG labels, each paired with one of the five CC/MCC labels. Results of the two-label approach using DRG-LLaMA-7B with a maximum input token size of 512 are shown in Table 3. The top-1 prediction accuracy for base DRG and CC/MCC reached 67.8% and 67.5%, respectively, suggesting that predicting the principal diagnosis or procedure without considering CC/MCC is a significantly easier task on its own. Upon integrating a mapping rule designed to infer DRGs from the combination of base DRG and CC/MCC labels, accuracy reached 51.5% across all DRGs. Notably, this performance was comparable to the 52.0% accuracy attained in the single-label approach using the same base model, showing that the LLM was able to achieve state-of-the-art performance via either classification setting.
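This dissection can be pictured as a suffix match on the DRG description. The sketch below is a simplified illustration of our own (the study's actual rules, covering more split scenarios, are in Supplementary Method 2):

```python
# Simplified sketch of dissecting a DRG description into a base DRG and a
# CC/MCC label by suffix matching. Illustration only: the study's actual
# dissection rules (Supplementary Method 2) handle more split scenarios.

CC_MCC_SUFFIXES = [
    ("without cc/mcc", "without CC/MCC"),
    ("with mcc", "with MCC"),
    ("with cc", "with CC"),
    ("without mcc", "without MCC"),
]

def dissect_drg(description: str) -> tuple[str, str]:
    """Return (base_drg, cc_mcc_label) for a DRG description string."""
    text = description.strip()
    lower = text.lower()
    for suffix, label in CC_MCC_SUFFIXES:
        if lower.endswith(suffix):
            base = text[: -len(suffix)].rstrip(" ,")
            return base, label
    # DRGs with no CC/MCC split receive the "not applicable" label
    return text, "not applicable"
```

For the example above, `dissect_drg("spinal disorders and injuries without CC/MCC")` yields the base label "spinal disorders and injuries" paired with the CC/MCC label "without CC/MCC".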

Error analysis
As noted above, a correlation exists between the number of training cases and prediction performance. The accuracy of DRG prediction depends on various factors. DRGs with a top-5 prediction accuracy exceeding 80% are typically associated with a median of 309 training cases per label; in contrast, DRGs with a top-5 accuracy below 20% are associated with a median of only 17 training cases per label (as shown in Figure 2b). However, other factors, such as the type of DRG, also affect prediction performance. For instance, of the DRGs with a top-1 prediction accuracy of 100%, 8 out of 9 are surgical DRGs, which have distinct hospital courses that make them easier for the model to comprehend (as listed in Supplementary Table 2). We randomly selected 10 samples from the subset where the model produced erroneous predictions within its top ten outcomes for manual error analysis (as listed in Table 4). Broadly, the identified errors were categorized as follows: erroneous CC/MCC classification (1/10), correct information needed for DRG prediction unavailable (1/10), difficulty in selecting the correct base DRG (3/10), inadequate clinical concept extraction (4/10), and an isolated case of a plausibly incorrect DRG label (1/10). Certain errors, like inadequate clinical concept extraction, indicate the model's weaknesses. Other errors, such as the difficulty in selecting the base DRG, likely stem from the intricacies of the DRG assignment rules. Furthermore, errors such as the unavailability of correct information required for DRG prediction underscore the limitations of relying solely on discharge summaries for DRG predictions.

Discussion
Large language model context: Language models based on the transformer architecture, either pretrained or fine-tuned on biomedical corpora, have demonstrated efficacy across a spectrum of NLP benchmarks in the biomedical realm (Lee et al. 2020; Huang, Altosaar, and Ranganath 2020; Gu et al.). Toward deploying a local LLM, we used LLaMA, a robust and openly accessible foundational LLM with parameters ranging from 7 billion to 65 billion (Touvron et al. 2023a). Instruction-following models fine-tuned from LLaMA, such as Alpaca (Taori et al. 2023) and Vicuna (Chiang et al. 2023), exhibit performance on par with GPT-3.5. Within the medical context, several groups have directed their efforts toward fine-tuning LLaMA. Notable examples are ChatDoctor (trained on authentic patient-physician dialogues), HuaTuo (fine-tuned with a Chinese medical knowledge graph), and PMC-LLaMA (fine-tuned on biomedical academic papers) (Wang et al. 2023; Li et al. 2023; Wu et al. 2023). These LLaMA-based models focused on medical question answering, yielding encouraging outcomes.

Impact of DRG prediction:
In this study, we demonstrated superior performance of the fine-tuned LLaMA on the text classification task of DRG prediction. Previous studies have underscored the effectiveness of diverse machine learning algorithms and deep neural networks for DRG prediction within healthcare systems outside the United States (Gartner et al. 2015; Islam et al. 2021); these studies used structured data as input variables rather than clinical text. More recently, the CAML model exhibited a superior ability to predict DRGs (Liu et al. 2021): using clinical notes exclusively, it surpassed the performance of a Long Short-Term Memory (LSTM) model using structured clinical variables (Liu et al. 2021). Compared with ClinicalBERT, CAML provided improved F1 scores but lower AUC (Liu et al. 2021; Alsentzer et al. 2019). We observed that DRG-LLaMA outperformed both of these prior leading models.
Remarks on DRG prediction results: ClinicalBERT and CAML already stand as robust baselines, with the added benefit of much faster training times (Supplementary Table 1). While BERT-based models have a maximum input length of 512 tokens, CAML has the flexibility to handle longer contexts (Devlin et al. 2018; Liu et al. 2021). We also observed that DRG-LLaMA's performance improved with larger models and longer input context lengths. Interestingly, a recent study revealed that LLMs perform best when pertinent information is positioned at the beginning or end of the input context, with performance declining as the input context expands (Liu et al. 2023). In our constrained experiments, conducted with a maximum input token limit of up to 1024, we have yet to encounter this limitation. In our study, the performance of both the baseline models and DRG-LLaMA surpassed the outcomes reported in prior research (Liu et al. 2021). Beyond the substantially larger training dataset in MIMIC-IV compared to MIMIC-III (236,192 vs. 17,815), it is plausible that this enhanced performance is predominantly linked to our strategic input data selection.
The study by Liu et al. (2021) included only clinical notes charted up to 48 hours post-admission or 48 hours after ICU admission. In the MIMIC-III database, a large portion of records during this time window comprises nursing and radiology notes, potentially lacking the pivotal admission History of Present Illness (HPI) notes. In contrast, our methodology used discharge summaries as the input data source. A discharge summary is a comprehensive clinical narrative encapsulating pivotal events, diagnostics, and treatments during hospitalization. To accommodate the input token limitations of LLaMA, we focused exclusively on the "brief hospital course" section of the summary, intentionally excluding other segments such as physical examinations, radiology, laboratory results, and the medication list. Additionally, to enhance data consistency, we formulated an algorithm to address discrepancies in DRG nomenclature and assignments across different years.
Nuance of DRG prediction task: In the context of the DRG system, a DRG code comprises a base DRG and a CC/MCC status. The base DRG represents the principal diagnosis (for medical cases) or procedure (for surgical cases) leading to the patient's admission, while the CC/MCC status captures the presence and severity of complications or comorbidities.

We manually reviewed 10 cases for error analysis; for each case, we extracted the most pertinent medical problems and their narratives from the discharge summaries. For instance, in Case 2 in Table 4, despite the discharge summary providing a more comprehensive discussion of gastrointestinal bleeding than of acute renal failure, the latter was deemed the correct base DRG. This selection is guided by the DRG assignment rules, a factor extending beyond what is directly evident within the discharge summary.

Limitations of our work: Our study has several limitations. 1) We were limited by the constraints of the MIMIC-IV dataset and could only use discharge summaries as input data, which are available only after the patient is discharged from the hospital. An effective alternative for early DRG prediction would be to utilize HPI notes and/or Emergency Department (ED) notes; this approach has the potential to significantly impact hospital operations.
The "assessment and plan" section in HPI notes is similar in structure to the "brief hospital course" in discharge summaries. Thus, LLMs might find it easier to extract information related to the principal diagnosis from these notes, given their earlier timestamp in the hospitalization process.
2) We were also restricted by computational resource limitations, so we could only experiment with LLaMA models up to a parameter size of 13 billion, and we could not perform an extensive hyperparameter search. The largest LLaMA models have 65 billion parameters.

Conclusion and future work:
The results presented in this study highlight the potential of adapting LLMs for medical purposes, particularly in predicting DRGs. Future research should involve collaborating with healthcare systems and utilizing admission notes to enable early DRG prediction. Additionally, our findings suggest that experiments utilizing the latest LLMs, including the recently launched 70-billion-parameter LLaMA-2 model with a maximum context length of 4096 tokens (Touvron et al. 2023b), should be considered. Finally, a crucial area for exploration concerns the practical implications of such DRG prediction, particularly when integrated into existing hospital coding workflows.

Methodology

Dataset and Preprocessing
We conducted a study using the publicly available MIMIC-IV dataset, which comprises 431,231 unique hospital admissions from 299,712 patients admitted to an ICU or the ED of the Beth Israel Deaconess Medical Center in Boston, Massachusetts (Johnson et al. 2023). The dataset covers the period from 2008 to 2019. We used regular expressions to extract the "brief hospital course" section from each discharge summary as input text. We then filtered out discharge summaries of low quality, identified by either duplicated content or containing fewer than 40 words.
Our focus was on hospitalizations with MS-DRGs. However, the Centers for Medicare & Medicaid Services adjusts MS-DRG regulations annually, resulting in varying DRG assignments for identical conditions over time within the MIMIC-IV dataset (Johnson 2023). To address this discrepancy, we designed an algorithm based on clinical knowledge to harmonize MS-DRG codes across different time points to a unified version (Supplementary Method 1). We selected MS-DRG version 34.0, published in 2016, which included a total of 757 DRG codes, 738 of which were present in our dataset (CMS 2016). We allocated 90% of the data to the training set and the remaining 10% to the testing set, stratified by DRG codes.
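The stratified 90/10 split can be sketched as grouping records by DRG label and sampling within each group. This is a simplified stand-in for standard routines such as scikit-learn's `train_test_split(..., stratify=labels)`; all names here are ours:

```python
import random
from collections import defaultdict

def stratified_split(records, label_of, test_frac=0.1, seed=0):
    """Sketch of a 90/10 train/test split stratified by DRG label.

    Simplified stand-in for library routines such as scikit-learn's
    train_test_split(..., stratify=labels); names here are ours.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[label_of(rec)].append(rec)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        # Rare DRGs (< 2 occurrences) were filtered out upstream, so every
        # group can contribute at least one test case.
        n_test = max(1, round(len(group) * test_frac))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```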

Model Development
We performed fine-tuning of the LLaMA model using discharge summaries and DRG codes in the context of a classification task. Our approach included two distinctive strategies (also shown in Figure 3).

Single-label approach
In this approach, the model generates a single-label multi-class prediction of the DRG code from a training set of natural-text discharge summaries and labels (T_SUM,i, y_i) ∈ D. First, T_SUM is tokenized with the LLaMA tokenizer into K = tokenize(T_SUM), where K is a list of indices into the learnable embedding weights. Let LLM(·) be a function that outputs an embedding for each token after running the transformer model. The raw logits are then calculated as ŷ = LLM(K)_{-1}, i.e., the embedding of the last token, passed through a linear output layer, serves as the raw logit score of each DRG code, ŷ ∈ R^738. Note that these logit scores are the raw, unnormalized outputs of the last layer of the LLM; before an activation such as the softmax converts them to probabilities, the values produced by the network are referred to as logits.
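As a toy illustration of this readout (a stub `llm_embeddings` stands in for the transformer, with two hidden dimensions instead of thousands and an arbitrary head instead of the real 738-way DRG head), the last-token pooling plus linear layer can be sketched as:

```python
# Toy illustration of last-token pooling: llm_embeddings stands in for the
# transformer (real embeddings have thousands of dimensions, and the real
# head maps to 738 DRG logits). All names and dimensions here are ours.
def llm_embeddings(token_ids):
    # Stand-in: one embedding vector per input token.
    return [[float(t), float(t) * 0.5] for t in token_ids]

def classify(token_ids, head_weights):
    """Raw, unnormalized logits from the last token's embedding."""
    last = llm_embeddings(token_ids)[-1]           # last-token embedding
    return [sum(w * x for w, x in zip(row, last))  # linear output layer
            for row in head_weights]
```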
The conventional categorical cross-entropy loss for multi-class classification is used, i.e., a classic multi-class problem with loss L = −log(softmax(ŷ)_y), where the target DRG y is an integer between 0 and 737 (we use an integer representing a specific DRG code for simplicity).

Two-label approach
In contrast, the two-label approach entails the model initially predicting the base DRG and the CC/MCC status as two separate classification tasks; a mapping rule is subsequently applied to derive the DRG code. Details on the dissection and inference processes from DRGs to base DRGs and CC/MCC status, and vice versa, can be found in Supplementary Method 2. This approach used a loss function configured as the cross-entropy loss of the base DRG plus half of the cross-entropy loss of the CC/MCC status.
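These loss terms can be written directly from the logits. A framework-free sketch in plain Python, for illustration only (the actual training used the Huggingface framework):

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Categorical cross-entropy for one example: -log p(target)."""
    return -math.log(softmax(logits)[target])

def two_label_loss(base_logits, y_base, cc_logits, y_cc):
    # Two-label objective: CE(base DRG) + 0.5 * CE(CC/MCC status).
    return cross_entropy(base_logits, y_base) + 0.5 * cross_entropy(cc_logits, y_cc)
```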
To enable ease of implementation, we used an output logit dimension of ŷ ∈ R^(340+5), indexing the first 340 dimensions as ŷ_base = ŷ_{0,...,339} and the last 5 dimensions as ŷ_CC = ŷ_{340,...,344}. At inference time, we take the base DRG and CC/MCC predictions as the argmax of their respective logits.
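A minimal sketch of this inference-time split (function names are ours, not from the study's code):

```python
def argmax(xs):
    # Index of the largest value (ties broken by the first occurrence).
    return max(range(len(xs)), key=lambda i: xs[i])

def predict_two_label(logits, n_base=340, n_cc=5):
    """Split the joint logit vector of length n_base + n_cc and take the
    argmax of each part, as done at inference time."""
    base_idx = argmax(logits[:n_base])
    cc_idx = argmax(logits[n_base:n_base + n_cc])
    return base_idx, cc_idx
```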
Subsequently, we apply the mapping rule, as detailed in Supplementary Method 2, to derive the final DRG prediction from base DRG and CC/MCC labels.
Addressing Computational Constraints via LoRA Training
Given the constraints of available computational resources, an extensive hyperparameter search was not viable. Instead, we focused on exploring performance across diverse model sizes and token lengths. We used LoRA during training, which involves freezing the pretrained model weights and incorporating trainable rank-decomposition matrices into each layer of the transformer architecture (Hu et al. 2021). LoRA training of the attention mechanism is illustrated in Figure 3.
As a quick summary, assume an original weight matrix W_0 ∈ R^{d×k}. LoRA adds a low-rank update to the original weights: W_0 + ΔW, with ΔW = BA, where B ∈ R^{d×r}, A ∈ R^{r×k}, and r ≪ min(d, k) to constrain the dimensionality of the new weights and preserve original model performance. Training is performed only on ΔW; the original model weights are frozen. For further cost savings while preserving performance, we adapt only the weights of the attention mechanism.
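The parameter savings follow directly from the dimensions: a full update of W_0 trains d·k values, while LoRA trains only d·r + r·k. A small sketch (the 4096 × 4096 example is our assumption, chosen to resemble typical attention weight shapes; it is not a figure from the paper):

```python
def lora_param_counts(d, k, r):
    """Trainable parameters: the full update of a d x k weight vs. LoRA's
    low-rank factors B (d x r) and A (r x k)."""
    full = d * k
    lora = d * r + r * k
    return full, lora

# Example with an assumed 4096 x 4096 attention weight and rank r = 8:
# LoRA trains 2 * 4096 * 8 = 65,536 values instead of ~16.8 million.
full, lora = lora_param_counts(4096, 4096, 8)
```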
Training Details
Model training adopted the standard Huggingface training framework and its sequence classification module (Wolf et al. 2020). Since LLaMA is a decoder-only (causal) model, we follow the traditional approach of using the embedding of the last token for classification, as other causal models (e.g., GPT-2 (Radford et al. 2019)) do. The logit score of each DRG label was calculated from this linear output layer, and probabilities of DRGs could be derived using a softmax function.
We referenced the training protocol of Alpaca-LoRA (Wang 2023). Our model was trained using cross-entropy loss with the Adam optimizer (learning rate = 2 × 10^-5, weight decay = 0.01) for 3 epochs on all training data with a batch size of 4. LoRA parameters were configured with r = 8, an alpha value of 8, and a dropout rate of 0.05. All attention blocks were included in the LoRA target modules.
The training regimen for all DRG-LLaMA models was executed on a single Nvidia RTX A6000 GPU with 48 GB of graphics memory.

Baseline Models
As baseline models for benchmarking, we selected CAML (Mullenbach et al. 2018; Liu et al. 2021) and ClinicalBERT (Alsentzer et al. 2019). CAML is an adjusted convolutional neural network (CNN): clinical notes are tokenized and embedded with pre-trained word embeddings to form input representations, which are passed to a neural network with one-dimensional convolutions whose features are pooled using an attention mechanism. In line with the approach detailed in Liu et al. (2021), our training of CAML used early stopping when there was no improvement in micro-averaged F1 score for 10 consecutive epochs, with a maximum of 50 epochs. All default hyperparameters were kept, except for max seq length, which was set to 512.
ClinicalBERT was built upon BioBERT, a domain-specific BERT model pre-trained on PubMed abstracts and full-text articles from PubMed Central (Lee et al. 2019). ClinicalBERT performed further pre-training of BioBERT using 2 million clinical notes from MIMIC-III (Johnson et al. 2016). In our fine-tuning of ClinicalBERT, we conducted three training epochs, the same as for DRG-LLaMA. We set a learning rate of 2 × 10^-5 and a batch size of 16, consistent with previously recommended practice for classification-oriented fine-tuning of BERT (Devlin et al. 2018; Adhikari et al. 2019).

Statistical analysis
We used the implementation from Liu et al. (2021) to calculate AUC and F1 score with both macro- and micro-averaging for the predictive models. We also report the accuracy of DRG prediction at the top one, five, and ten results. Standard deviations were calculated using a bootstrapping procedure with 30 iterations; for each bootstrap iteration, we randomly resampled the whole sample size from the testing set with replacement. The smoothing spline fit in Figure 2a was performed using the npreg package in R with the generalized cross-validation method and default parameters (Helwig 2021).
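The bootstrapping procedure can be sketched in a few lines (the metric and sample values below are placeholders, not data from the study):

```python
import random
import statistics

def bootstrap_sd(metric, samples, iterations=30, seed=0):
    """Standard deviation of a metric over bootstrap resamples, each drawn
    from the testing set with replacement at the full sample size."""
    rng = random.Random(seed)
    values = []
    for _ in range(iterations):
        resample = [rng.choice(samples) for _ in samples]
        values.append(metric(resample))
    return statistics.stdev(values)
```

For example, passing per-case top-1 correctness indicators (0/1) and a mean metric yields the bootstrap standard deviation of top-1 accuracy.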

Figure 1: Flow diagram of the cohort processing steps.

Figure 2: Relationship between training cases per DRG and prediction accuracy by DRG-LLaMA. Results from DRG-LLaMA-7B with a maximum input token size of 512. (a) Scatter plot of top-5 prediction accuracy versus DRG rank by number of training cases. The Y-axis is the top-5 prediction accuracy of each DRG label; the X-axis is the rank of the 723 DRGs by their number of training cases, where the DRG ranked 1st has the most training cases and the DRG ranked 723rd has the least. Black dots indicate individual DRGs. The solid line represents the smoothing spline estimated relationship (equivalent degrees of freedom: 6.35; R^2: 0.434); the gray shaded area denotes a 95% Bayesian confidence interval for the smoothing spline estimated function. As expected, DRG-LLaMA's performance declined on less frequent DRGs. (b) Boxplot of training cases per DRG, grouped by prediction accuracy. DRGs are grouped by range of top-5 prediction accuracy as shown on the X-axis; the Y-axis is the number of training cases per DRG. The green line represents the median value; the box limits show the interquartile range (IQR) from the first (Q1) to third (Q3) quartiles; the whiskers extend to the furthest data point within Q1 - 1.5*IQR (bottom) and Q3 + 1.5*IQR (top). DRG groups with better prediction performance generally have a greater number of training cases, although there is large variance in the number of training cases within the best-performing group.

Figure 3: An illustration of both approaches we tested: Single Label Prediction, which directly predicts the DRG code from the text, and Two Label Prediction, which breaks the classification into two tasks. The two predictions are then combined using filtering rules (discovered from data for each DRG) at inference time for the final DRG prediction. LoRA training is used to train the LLM due to computational constraints.

Table 1: Main results on DRG prediction with a max input token size of 512. F1 and AUC scores were calculated using the macro-averaged or micro-averaged method as shown in the header; notably, in a multi-class classification problem, the micro-averaged F1 score equals top-1 prediction accuracy when labels of all classes are considered (Grandini, Bagli, and Visani 2020). Accuracy @1, @5 and @10 measure whether the top-1, top-5 and top-10 predictions by the model contain the correct DRG code, respectively. Standard deviations are shown in parentheses and were calculated using a bootstrapping procedure. Top DRGs are selected based on the number of cases per DRG in the dataset; number (%) of cases represents hospital stays covered by the given DRG group in the testing set. Bolded scores denote the best performance with respect to the task. DRG-LLaMA outperformed ClinicalBERT and CAML across all evaluation metrics, with better performance on more frequent DRGs. DRG denotes diagnosis-related group, AUC denotes area under the receiver operating characteristic curve, and ACC denotes accuracy.

Table 2: DRG-LLaMA performance with different model and max input token sizes. Experiments were performed on LLaMA with 7 billion and 13 billion parameters. Bolded scores denote the best performance. DRG-LLaMA's performance consistently improved with larger models and longer input contexts.

Table 3: Main results on DRG prediction as a two-label task with a max input token size of 512.

GPT-4 has demonstrated a notable capacity for comprehending and reasoning with clinical knowledge: without domain-specific fine-tuning or specialized prompt crafting, it exceeded the passing score on the USMLE by over 20 points and set a new state of the art (Nori et al. 2023). On this premise, it is plausible to speculate that, once attuned to the medical domain, an LLM could deliver robust performance across diverse NLP tasks, including the prediction of DRGs.

Table 4: Examples of incorrect DRG predictions.

There are 77 base DRGs with no CC/MCC splits (examples in Supplementary Note 1) (CMS 2016). We experimented to resemble this structure through a two-label DRG prediction strategy. Surprisingly, the top-1 accuracy for CC/MCC stands at 67.5%, similar to the 67.8% of the base DRG, despite the considerably smaller label count (5 labels in CC/MCC vs. 340 labels in base DRG). These unexpected results likely stem from the noisy nature of CC/MCC assignment. For instance, the DRG code "pulmonary edema and respiratory failure" does not have a CC/MCC split; therefore, a hospital stay with this DRG code may truly contain an MCC, but the MCC would not be labeled as positive in the training set. To address this challenge, we formulated rules in both the DRG dissection phase (extracting base DRGs and CC/MCC from DRGs) and the inference phase (deriving DRGs from base DRGs and CC/MCC). These rules cater to various split scenarios, thus improving accuracy. Implementing such rules culminated in a final DRG prediction accuracy close to single-label prediction (51.5% vs. 52.0%).

Remarks on error analysis: Our error analysis also revealed intriguing observations. While certain vulnerabilities (e.g., erroneous CC/MCC classification and inadequate clinical concept extraction) present opportunities that could theoretically be addressed through larger LLMs and more data, other challenges likely stem from inherent limitations within our training data setup.