An explainable machine learning framework for lung cancer hospital length of stay prediction

This work introduces a predictive length of stay (LOS) framework for lung cancer patients using machine learning (ML) models. The proposed framework handles imbalanced datasets in classification-based approaches using electronic health records (EHR). We utilized supervised ML methods to predict lung cancer inpatients' LOS during ICU hospitalization using the MIMIC-III dataset. The Random Forest (RF) model outperformed the other models across the three framework phases. With clinically significant feature selection, the over-sampling methods (SMOTE and ADASYN) achieved the highest AUC results (98%, 95% CI: 95.3–100%, and 100%, respectively). The combined over-/under-sampling methods achieved the second-highest AUC results (98%, 95% CI: 95.3–100%, and 97%, 95% CI: 93.7–100%, for SMOTE-Tomek and SMOTE-ENN, respectively). The under-sampling methods (ENN and Tomek Links) reported the lowest AUC results (50%, 95% CI: 40.2–59.8%, for both). Using the ML explainability technique SHAP, we explained the outcome of the predictive model (RF with the SMOTE class balancing technique) to identify the clinical features that contributed most to predicting lung cancer LOS. Our promising framework allows ML techniques to be employed in hospital clinical information systems to predict the LOS of lung cancer admissions to the ICU.


S2.1 Lung Cancer LOS Attribute Characteristics
The attributes associated with the lung cancer patients comprised complete blood count with differential, white blood count (WBC), vital signs, laboratory tests, demographics, and medications, as seen in Table 1. We extracted these variables from previous works [7,8] for further data processing.
A clinical oncologist involved in this study affirmed the relevance of the selected attributes to lung cancer. In total, we included 75 attributes. Other clinical variables that were not significant to lung cancer LOS were dropped from the table according to the inclusion criteria (Figure 1) [9].

S2.2 Benefits and drawbacks of the class balancing techniques

SMOTE
-Improved performance in low-dimensional data [10].
-Increases training time and the memory needed to hold the training data [11] (computational cost).

ADASYN
-Enhances the learning of the sample distribution in a more efficient way [12].
-Does not sacrifice the minority or the majority class in preference for the other [13].
-Risk of generating many false positives, because the generated synthetic data may be very similar to the majority class.

ENN
-Reduces the number of training data samples and improves storage and model run time.
-May ignore useful information that might be important for building the rule model.
-Low performance in binary class prediction.
-Many samples are removed if the decision boundary is unclear.

SMOTE-ENN
-Good performance in small datasets [14].
-Can remove more examples in depth and adjust the class distribution [11].
-Introducing artificial minority class examples too deep in the majority class may lead to overfitting.
-Suffers from overfitting in high-dimensional datasets.

S2.3 LOS Lung Cancer Models' Description
The inclusion mechanism considered only ICU-hospitalized patients. At the first screening, we excluded all patients who died in the hospital from the inclusion protocol. Further, we dropped from the inclusion criteria all events with a missing HADM_ID or ICUSTAY_ID (the unique ID linking each ICU stay to a hospital admission, HADM_ID). We applied an additional inclusion criterion comprising the diagnosis codes for lung cancer hospitalizations (162.x) identified by the International Classification of Diseases, 9th revision (ICD-9). Accordingly, we included 119 lung cancer patients in our study from the whole dataset. Figure 1 shows the inclusion protocol for the lung cancer patients in this study.

S4.1 Data Imputation
We applied the null functions from the Pandas library in Python [15] to verify and eliminate records with frequent missing values for each admission. This decision was coordinated with the clinical oncologist and applied only to records whose entries suffered from so many missing values that no missing-value-handling technique could replace them (Figure 3). We disregarded any event (lung cancer admission) that lacked clinical insight, to avoid a negative impact on the prediction models' performance and the overall research aim. Missing values that did not distort the overall picture of an event/admission were imputed with the variable's median [7].
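A minimal Pandas sketch of this step (the column names here are hypothetical; the study's actual variables are listed in Table 1, and record-level drops were made case by case with the oncologist, so only the null check and median imputation are shown):

```python
import numpy as np
import pandas as pd

# Hypothetical admissions with missing entries.
df = pd.DataFrame({
    "wbc":        [6.1, np.nan, 7.4, 5.9],
    "heart_rate": [88.0, 92.0, np.nan, np.nan],
})

missing_per_column = df.isnull().sum()   # inspect missingness per variable
df_imputed = df.fillna(df.median())      # per-variable median imputation
```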

S4.2 Discretizing the Target Class (LOS)
Discretization is the process of transforming numeric/continuous variables into nominal/categorical variables (bins), as used in several artificial intelligence (AI) studies such as [16,17]. In this research, we binned LOS (a continuous variable) into a binary learning approach based on previous studies, discretizing the label "LOS" into two classes: label zero (0) for a short length of stay (Short LOS, 0-6 days) and label one (1) for a long length of stay (Long LOS, 7+ days), as shown in Figure 2.
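The binning rule can be sketched as follows:

```python
def discretize_los(los_days):
    """Binary LOS label: 0 = Short LOS (0-6 days), 1 = Long LOS (7+ days)."""
    return 0 if los_days <= 6 else 1

labels = [discretize_los(d) for d in [2, 6, 7, 15]]   # -> [0, 0, 1, 1]
```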

S4.3 Categorical Variable Transformation
In this research, we implemented the one-hot-encoding (or "nominal encoding") method [8] to transform the independent categorical variables before building the LOS predictive models. This method transforms categorical attributes into binary indicator attributes, which improves machine learning models' performance.
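A minimal sketch of one-hot encoding with Pandas; the `admission_type` column is a hypothetical example, not necessarily one of the study's variables:

```python
import pandas as pd

# Hypothetical categorical attribute for three admissions.
df = pd.DataFrame({"admission_type": ["EMERGENCY", "ELECTIVE", "URGENT"]})

# One indicator column per category, binary per admission.
encoded = pd.get_dummies(df, columns=["admission_type"])
```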

S4.4 Features selection with Clinical Significance
We disregarded the variables without clinical significance from the inclusion criteria. The inclusion and exclusion decisions for the clinically significant variables were affirmed by the clinical oncologist. This decision was necessary for the study. Firstly, it ensures that all clinical features of lung cancer are considered and that unimportant features are eliminated; feature selection with clinical significance (CS) thus frames the length of stay prediction from a clinical perspective. Secondly, it helped to reduce the features' dimensionality and improve the machine learning models' performance in the baselining stage. The disadvantage of this approach is that it may discard weak associations between independent and dependent variables and thereby affect the models' predictive performance. AI-based cancer studies [24,25] have utilized feature selection with the CS approach in machine learning predictive tasks. Table 1 shows the variables selected with the clinical significance approach.

S4.5 Features Selection with Recursive Feature Elimination (RFE)
The principle of the RFE technique [26] is to select features recursively by removing a small set of attributes per loop; the process repeats until the weakest features have been eliminated. Features are ranked by the model's coefficients (coef) or feature importances, and the optimal feature set is attained using cross-validation. RFE has been utilized in cancer-based studies such as [27-29]. The RFE algorithm is shown in Figure 3.
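To illustrate the principle (not the study's exact implementation), the following sketch ranks features by the magnitude of least-squares coefficients and recursively drops the weakest one per loop:

```python
import numpy as np

def rfe(X, y, n_keep):
    """RFE sketch: fit a least-squares model, drop the feature with the
    smallest absolute coefficient, repeat until n_keep features remain."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        coef, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
        keep.pop(int(np.argmin(np.abs(coef))))   # eliminate the weakest
    return keep

# Synthetic data: only features 0 and 2 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + 0.01 * rng.normal(size=200)
selected = rfe(X, y, n_keep=2)   # keeps the two informative features
```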

S5.1 SMOTE
The Synthetic Minority Oversampling Technique (SMOTE) is an over-sampling method for imbalanced datasets in classification problems. It creates synthetic minority class examples based on the existing ones: it picks a point from the minority class, finds its nearest neighbours by the Euclidean distance between data points in the feature space, and interpolates new points between them.
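A minimal NumPy sketch of generating one synthetic SMOTE example (illustrative only; the study presumably used a library implementation):

```python
import numpy as np

def smote_sample(X_min, k=2, rng=None):
    """Pick a minority point, choose one of its k nearest minority
    neighbours (Euclidean distance), and interpolate at a random
    position on the segment between them."""
    rng = rng if rng is not None else np.random.default_rng(0)
    i = rng.integers(len(X_min))
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                        # uniform in [0, 1)
    return X_min[i] + gap * (X_min[j] - X_min[i])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # toy minority class
synthetic = smote_sample(X_min)  # lies between two existing minority points
```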

S5.2 ADASYN
The Adaptive Synthetic (ADASYN) algorithm [13] is an over-sampling method similar to the SMOTE technique. For a given minority feature vector x_i, it generates a number of synthetic samples proportional to the number of nearby samples that do not belong to the same class as x_i. This helps to deal with outliers, especially when generating new synthetic training examples.
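The adaptive allocation step, which distinguishes ADASYN from SMOTE, can be sketched as follows (illustrative NumPy code, not the study's implementation; the variable names are my own):

```python
import numpy as np

def adasyn_allocation(X, y, G, k=3):
    """For each minority sample, r_i = (# majority neighbours among its
    k nearest neighbours) / k; normalise to r_hat and allocate
    g_i = r_hat_i * G synthetic examples to that sample."""
    minority = np.where(y == 1)[0]
    r = np.empty(len(minority))
    for idx, i in enumerate(minority):
        dists = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(dists)[1:k + 1]        # exclude the point itself
        r[idx] = np.sum(y[nn] == 0) / k
    r_hat = r / r.sum()
    return np.round(r_hat * G).astype(int)

# Minority sample at [0.5, 0.5] sits inside the majority cluster,
# so it receives most of the synthetic budget.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [0.5, 0.5], [5.0, 5.0], [5.5, 5.5]])
y = np.array([0, 0, 0, 1, 1, 1])
g = adasyn_allocation(X, y, G=10)
```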

S5.3 Edited Nearest Neighbors (ENN)
The under-sampling with Edited Nearest Neighbours (ENN) method applies the nearest-neighbours algorithm: it edits the dataset by removing the samples that do not agree sufficiently with their neighbourhood [30]. In ENN, the nearest neighbours of each sample are computed, and if the selection criterion is not satisfied, the sample is removed. This noise-removal process is applied to each sample in the class to be under-sampled.
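An illustrative NumPy sketch of the ENN editing rule, using majority vote of the k nearest neighbours as the agreement criterion (a common choice, not necessarily the study's exact one):

```python
import numpy as np

def enn_filter(X, y, k=3):
    """Keep a sample only if the majority vote of its k nearest
    neighbours agrees with its own label."""
    keep = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(dists)[1:k + 1]        # exclude the point itself
        if np.sum(y[nn] == y[i]) > k / 2:      # neighbourhood agrees
            keep.append(i)
    return np.array(keep)

# Two clean clusters plus one noisy point (index 7) inside the wrong cluster.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.05, 0.05],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [0.02, 0.02]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
kept = enn_filter(X, y)   # the noisy sample (index 7) is removed
```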

S5.4 Tomek Links
The Tomek Links algorithm [31] is an under-sampling method that detects pairs of observations (two samples x and y from different classes) that are close to each other; such a pair is called a Tomek link. Samples x and y form a Tomek link if there is no sample z such that d(x, z) < d(x, y) or d(y, z) < d(x, y); in other words, x and y are each other's nearest neighbours.
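A Tomek link is thus a pair of opposite-class samples that are each other's nearest neighbour. An illustrative NumPy sketch of detecting such pairs:

```python
import numpy as np

def tomek_links(X, y):
    """Return pairs (i, j) of opposite-class samples that are each
    other's nearest neighbour (Tomek links)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                 # nearest neighbour of each sample
    links = []
    for i in range(len(X)):
        j = int(nn[i])
        if int(nn[j]) == i and y[i] != y[j] and i < j:
            links.append((i, j))
    return links

X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [2.05, 0.0], [5.0, 0.0]])
y = np.array([0, 0, 0, 1, 1])
pairs = tomek_links(X, y)   # only the cross-class mutual pair (2, 3)
```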

S5.5 SMOTE+ENN
The SMOTE+ENN algorithm [14] is a hybridization technique (a combination of over- and under-sampling) in which SMOTE over-samples the minority class and ENN performs extensive data cleaning: samples misclassified by their nearest neighbours (NN) are removed [32] from both classes (Short LOS and Long LOS). This results in a clearer, more concise separation between the classes.

S5.6 SMOTE-Tomek
The SMOTE+Tomek algorithm [14] is a hybridization technique (a combination of over- and under-sampling) that combines the over-sampling (SMOTE) and under-sampling (Tomek Links) techniques in order to achieve optimized classifier performance. SMOTE+Tomek first applies SMOTE to the minority class (e.g., Long LOS) to balance the distribution, then identifies and removes the majority class (Short LOS) examples that appear in Tomek links.

S5.7 Pseudocode for LOS Lung Cancer Framework
ADASYN steps (continued):
6. Calculate the number g_i of synthetic data examples required for each minority example x_i: g_i = ȓ_i × G, where ȓ_i is the normalized density distribution and G is the total number of synthetic examples to generate.
7. For each minority class data sample x_i, generate g_i synthetic data samples: randomly choose one minority data sample (example) within the neighbourhood of x_i and interpolate between the two.

SMOTE-ENN steps:
1. Over-sampling using SMOTE.
2. Cleaning using ENN.

SMOTE-Tomek steps:
1. Over-sampling using SMOTE.
2. Cleaning using Tomek links.

Figure 4: Pseudocode for the six class balancing methods.

S6.1 Random Forest
The Random Forest (RF) algorithm [33] is an ensemble learning model and a classification-based method. The RF model works by generating random subsets from the original dataset (bootstrapping). A decision tree is fitted on each subset, and at each node of a tree only a random set of features is considered when deciding the best split. The final output (prediction) is obtained by aggregating the predictions from all decision trees (majority vote for classification). To summarise, the model operates by randomly selecting data points and features and then building multiple trees (a forest).
The RF classifier was adopted in this study for the LOS lung cancer predictive framework with the Gini index (IG [34]):

Gini = 1 − Σ_{i=1}^{C} p_i²   (1)

where p_i is the proportion of samples that belong to class i at a particular node and C is the number of classes.
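The Gini impurity used for node splits can be sketched as:

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity at a node: 1 - sum_i p_i^2, where p_i is the
    proportion of samples of class i at the node (Eq. 1)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

split_impurity = gini_index([0, 0, 1, 1])   # -> 0.5 for a 50/50 node
```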

S6.2 XGBoost
The eXtreme Gradient Boosting (XGBoost) algorithm [35] is an ensemble-based learning (boosting) model. XGBoost is an implementation of gradient-boosted decision trees [36] designed for performance and speed. It uses a more regularized model formalization to control over-fitting, giving it better performance [35].
Considering a dataset d_c with m features and n examples, where d_c = {(x_i, y_i)} (x_i ∈ R^m, y_i ∈ R, i = 1, 2, ..., n), the XGBoost model can be described as follows [37]:

ŷ_i = Σ_{k=1}^{K} f_k(x_i), f_k ∈ F,

where F = {f(x) = w_{q(x)}} (q: R^m → {1, 2, ..., T}, w ∈ R^T) is the set of CART decision tree structures, q is the tree structure mapping a sample to its leaf node, T is the number of leaf nodes, and w is the vector of real-valued leaf-node scores.
When constructing the XGBoost model, an optimizer is needed to establish the optimal model. The objective function of XGBoost is therefore divided into an error function term L and a model complexity function Ω, and is written as:

Obj = Σ_{i=1}^{n} L(y_i, ŷ_i) + Σ_{k=1}^{K} Ω(f_k), with Ω(f) = γT + (1/2) λ ‖w‖²,

where γ and λ are the regularization terms (on the number of leaves and on the leaf weights, respectively); both are adjustment parameters to prevent the model from overfitting. At step t, the objective function is expressed as:

Obj^(t) = Σ_{i=1}^{n} L(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t),

where ŷ_i^(t−1) is the predicted value of the (t−1)th model and f_t(x_i) is the new function added at the tth step. Obj is a scoring function used to evaluate the model; the smaller the Obj value, the better the model.

S6.3 Logistic Regression
Logistic regression (LR) [38,39] is a statistical model that uses the logistic function to predict a dependent variable from the independent variables. It is used in machine learning for binary predictive tasks (classification). The logistic function is formulated as follows:

σ(z) = 1 / (1 + e^(−z)),

where z is a linear combination of the independent variables and their coefficients.
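A minimal sketch of the logistic function:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score z to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

p = sigmoid(0.0)   # -> 0.5 at the decision boundary
```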

S7.1 Accuracy:
Denotes the ratio of correct predictions to the total number of predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy alone does not describe the predictive story in imbalanced-class datasets; therefore, other metrics are used to evaluate imbalanced data, such as Precision, Sensitivity, G.mean and IBA.

S7.2 Precision:
Refers to the proportion of positive classifications that are actually correct (also called the positive predictive value, 'PPV'):

Precision = TP / (TP + FP)

S7.3 Sensitivity (Recall):
Measures the proportion of actual positives that are correctly classified (also called the true positive rate, 'TPR', or Recall):

Sensitivity = TP / (TP + FN)   (10)

S7.4 Specificity:
Measures the proportion of actual negatives that are correctly classified (the true negative rate, 'TNR'):

Specificity = TN / (TN + FP)

S.7.5 F1-Score:
It can be interpreted as the harmonic mean of precision and recall:

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

F1-Score = 1 is the best possible value, and an F1-Score close to 0 is the worst value.

S7.6 Index Balanced Accuracy (IBA):
The generalized IBA [40] weights a suitable measure M to evaluate performance on imbalanced datasets; the weighting factor favours results with better classification rates on the minority class:

IBA_α(M) = (1 + α · Dom) · M

where Dom is the dominance, Dom = TPR − TNR, within the range [−1, +1]. Dom estimates the relation between TPR and TNR: the closer Dom is to 0, the more balanced the two individual rates are. Dom is weighted by the factor α ≥ 0 to reduce its influence on the result of the particular metric M.

S7.7 Area Under the ROC Curve (AUC)
The AUC measures the area under the Receiver Operating Characteristic (ROC) curve, which plots TPR against FPR, where TPR is the true positive rate and FPR is the false positive rate.

S7.8 Geometric Mean Score (G.Mean):
The G.mean [41] aims to maximize the accuracy of each of the classes while keeping these accuracies balanced:

G.mean = √(TPR × TNR)

This study uses accuracy for the models' benchmarking (baselining) performance evaluation, and Precision, Sensitivity, Specificity, AUC, IBA, and G.mean for the class-balancing performance evaluation.
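The metrics above can be computed from a binary confusion matrix as follows (a sketch; the IBA here is applied to the squared G.mean, a common choice, with an assumed weighting factor α = 0.1):

```python
import math

def los_metrics(tp, fp, tn, fn, alpha=0.1):
    """Binary-classification metrics used in this study, computed from
    confusion-matrix counts. alpha is the IBA weighting factor
    (0.1 is an assumed, commonly used value)."""
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    precision   = tp / (tp + fp)                    # PPV
    sensitivity = tp / (tp + fn)                    # recall / TPR
    specificity = tn / (tn + fp)                    # TNR
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    dom = sensitivity - specificity                 # dominance
    iba = (1 + alpha * dom) * g_mean ** 2           # IBA of the squared G.mean
    return accuracy, precision, sensitivity, specificity, f1, g_mean, iba

scores = los_metrics(tp=40, fp=10, tn=30, fn=20)
```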

S8.1 Experiments Setup
We conducted the evaluation and implementation of the methods on a computer with an Intel Core i7 CPU at 1.90 GHz and 16 GB of RAM. We used Python 3 for the development and deployment of all framework steps.

S8.2 Baseline Stage with Cross-Validation
The first phase is the benchmarking stage (models baselining) with cross-validation (k-fold = 10). In this phase, we assessed the three proposed predictive models (RF, XGBoost, and LR) with the CS feature selection method and with RFE in three varieties (Top 20, Top 40, and Top 60 features). The outcome of the first phase is the model with mean accuracy ≥ 85%. The second phase evaluates the performance of the candidate model using the six class balancing methods. The research framework integrates all these phases in the pipeline and fits the outperforming model for further clinical interpretation using SHAP machine learning explainability (third phase). A total of 119 unique lung cancer patients met the inclusion criteria (Figure 1) of our study sample. We verified the cohort selection on the proposed models (RF, XGBoost, and LR) using cross-validation (k-fold = 10, test split 30%), with the mean accuracy and the standard deviation (std) error of the mean performance as the evaluation metrics. We report the baselining analysis results in Figure 5 and Figure 6. As seen in Figure 5, the RF achieved the best predictive results (k-fold = 10, mean accuracy 87.4%) following the CS feature selection procedure. Moreover, the RF classifier attained the highest mean accuracy among the RFE top-feature selections (87.3%) with RFE (Top 60 features).
The XGBoost classifier showed a minor fluctuation, with mean accuracy ranging from 82.3% to 82.8% across the two feature selection methods (RFE and CS). In contrast, the logistic regression classifier's mean accuracy retreated as more features were added in each RFE evaluation (k-fold = 10). This trend was confirmed with all features (CS), where the LR achieved the lowest mean accuracy (81.6%).
The std error of the mean performance for RF remained stable (9.9%) with RFE top features (40, 60) and with all features (CS), respectively; the RFE (Top 20 features) evaluation had the smallest std error of the mean performance (9.7%). The XGBoost classifier showed an improving trend in the reported std error of the mean performance: while it recorded a 10% std error with RFE (Top 40 features), it achieved an optimized std of 9.6% with the CS feature selection procedure (k-fold = 10). The LR classifier had a relatively higher std error of the mean performance than RF and XGBoost.
While LR is the fastest model to train, XGBoost and RF needed more time to report their cross-validation results (k-fold = 10) (Figure 7). Hence, RF and XGBoost are the more computationally costly models for the data input and number of features in this study. From the results reported during the baseline phase, RF was designated for further detailed analysis in the next (class-balancing) performance evaluation phase. Table 4 shows the lung cancer dataset before class balancing (WCB) and after utilizing the various class balancing techniques. The second column of the table shows the percentage of short LOS to long LOS for each class-balancing approach. We expect better predictive performance from the balanced data. Specific metrics (section S7) were used to evaluate the classification performed on the balanced datasets.

S8.5 Class-Balancing with (SHAP)
The Random Forest prediction outcomes with the four class balancing methods (Figure 8) were unlocked and explained using SHapley Additive exPlanations (SHAP) [42]. SHAP explains the prediction for an instance x by computing each feature's contribution to that prediction; it is thus a method for explaining individual predictions. The TreeExplainer function [43], which implements the TreeSHAP algorithm [42], was used to visualize and explain the output of the Random Forest (ensemble) tree model.