Introduction

In 2020, primary liver cancer was the sixth most commonly diagnosed cancer and the third leading cause of cancer-related deaths worldwide1. The majority of liver cancers are hepatocellular carcinomas (HCC). Given that most HCC patients also suffer from liver dysfunction, the benefits of treating the cancer must be weighed against the potential harms of medical interventions. Consequently, treatment decisions for HCC patients are highly multifactorial, with physicians taking into account not only the tumor burden but also the extent of liver dysfunction and performance status. The Barcelona Clinic Liver Cancer (BCLC) algorithm, a widely used staging system2, has been endorsed in clinical practice guidelines3,4. However, initial treatment choices in real-world clinical practice often differ substantially from the recommendations of the BCLC system, particularly for patients in East Asia5,6. Possible reasons for this discrepancy include etiological and ethnic differences, which strongly influence post-treatment prognosis, as well as differences in preferred treatments and medical reimbursement schemes.

A clinical decision support system (CDSS) is an information system designed to improve healthcare delivery by enhancing medical decisions with targeted clinical knowledge, patient information, and other health information7,8,9. Recently, artificial intelligence (AI) has been applied to CDSS to predict post-treatment prognosis for patients using machine learning (ML) or artificial neural networks10,11. Previously, we developed an ML-based CDSS to recommend an initial treatment option and predict post-treatment survival for HCC patients12. This model performed well in an institutional patient cohort. In this study, we externally validated the previous model using multi-center datasets from eight institutions in South Korea and made modifications to ensure its effective utilization across multiple institutions.

Results

Demographic and baseline clinical characteristics of patients

Table 1 presents the demographic and baseline clinical characteristics for all datasets. A full list of patient characteristics for each center in the external validation datasets is described in the Appendix (pp 10–12). Excluding patients who underwent other therapies and transplantation, the final study populations for the internal and external validation datasets included 935 and 1750 patients, respectively. Initial treatment options included radiofrequency ablation or percutaneous ethanol injection therapy (RFA or PEIT) in 6.8% to 21.6% of patients, resection in 3.7% to 35.8%, trans-arterial chemoembolization (TACE) in 34.8% to 64.8%, TACE combined with external beam radiotherapy (EBRT) in 0% to 24.4%, sorafenib treatment in 0% to 7.4%, and supportive care in 3.1% to 16.7% of patients.

Table 1 Baseline characteristics of the patients in the internal and external validation datasets.

Treatment recommendation model for multi-center setting

The seven top-performing classifiers, sorted by accuracy, were the C-support vector machine (SVM) with a linear kernel, which exhibited the highest mean accuracy and recall, followed by the Gaussian process, random forest, extra-trees, histogram-based gradient boosting, light gradient boosting machine, and multi-layer perceptron classifiers (Table 2). However, the SVM alone performed worse than the ensemble voting machines that included three to seven top-performing classifiers. There were no significant differences between voting classifiers according to the number of top-performing classifiers (Appendix p 15). Compared to the previous cascaded model12, the mean accuracy of the ensemble voting machine increased for both the internal and external datasets (Appendix p 15). The mean accuracy of external validation using the ensemble voting machine was 55.34%, lower than the 67.27% achieved on the internal dataset.

Table 2 Top 7 classifiers sorted by accuracy for internal dataset.

The results of individual training with each institutional dataset demonstrated higher mean accuracy and recall compared to those of external validation in all centers except for one, Severance Hospital (Appendix p 16). The mean accuracy for each institutional dataset was 66.53%, which was higher than that of the external validation, 55.34%. Table 3 shows the performance of the ensemble voting machine when the second treatment option is considered correct in both internal and external datasets. When the second option was considered correct, the accuracy of external validation increased to 86.06%, similar to that of the internal dataset and individual training for the external datasets. After model calibration, there was a slight decrease in mean accuracy, which was more pronounced in the external validation datasets (Appendix pp 18–19). The decrease in accuracy can be attributed to significant internal variations and diversity within the datasets. Conversely, the standard deviation decreased after model calibration, indicating that calibration contributed to more consistent results.

Table 3 Comparison of performance considering second treatment options.

Prediction of individual post-treatment survival and risk stratification for each treatment

Table 4 displays the performance of the survival prediction model, which achieved better results in the internal dataset than in the external validation dataset. The integrated time-dependent area under the receiver operating characteristic curve (iAUC) ranged from 72.63 to 86.18, with TACE showing the best performance. Figure 1 provides simulation examples of risk stratification for specific treatments using the model. In the group of patients who actually underwent resection, the actual survival rate of those recommended for resection by the first-stage model (Fig. 1a) was higher than that of the group recommended for TACE (Fig. 1b, c). Moreover, among patients who were initially recommended for TACE by the first-stage model but actually underwent resection (Fig. 1b, c), the survival prediction curves obtained by setting TACE as the treatment input variable in the survival prediction model corresponded more closely with the actual survival curves than those obtained with resection. These findings demonstrate that our model has the capability to predict the prognosis of patients accurately. In the supplementary experiments, we used propensity score matching to achieve a closer match between the two groups. We then evaluated the model’s performance and analyzed the reasons for the observed changes in its performance (Appendix pp 20–25). Furthermore, we used models trained separately with two institutional datasets to simulate different treatment choices and subsequent survival prediction outcomes for new data from another center, as shown in Fig. 2.

Table 4 Performance of survival prediction of internal and external datasets.
Fig. 1: Actual (=Ground truth) and predicted (=Prediction) survival curves according to a treatment input variable in the second-stage model.

a Patients who were recommended resection in the first-stage model and actually underwent the resection. In the second-stage model, the treatment input variable was set as resection. b, c Patients who were recommended TACE in the first-stage model but underwent resection in actual practice. In the second-stage model, the treatment input variable was set as resection in b and TACE in c.

Fig. 2: An example of simulations on the new data using models trained individually with each institutional dataset.

Discussion

In this study, we developed a two-stage ML-based model for treatment recommendation and post-treatment survival prediction for patients with HCC, which was validated using multi-center datasets. The performance of external validation on treatment recommendation using the ensemble voting machine was inferior to the results obtained from the internal dataset and those trained individually with each institutional dataset. We speculate that this performance decline stems from differing treatment preferences among institutions, which prompted us to propose a second treatment option in addition to the first, derived from the probabilities of the ensemble voting machine. Furthermore, the risks associated with each treatment can be stratified by predicting individual survival based on the recommended treatment and other prognostic variables.

The strength of our CDSS model lies in its ability to not only provide treatment suggestions but also prognoses associated with those treatments. This model can serve as a supplementary system to the current guidelines, assisting physicians in determining treatment options for HCC. Additionally, it can help explain the rationale behind treatment choices to patients. In particular, treatment selection can be further supported by predicting individualized survival graphs for patients with various treatment options available or who are at the borderline of the guidelines. Furthermore, this model can be beneficial in aiding treatment decisions for inexperienced doctors or in situations where multidisciplinary care poses challenges.

However, developing a CDSS for treatment decisions in HCC presents significant challenges. One of the primary factors contributing to these challenges is the intricacy of determining the optimal treatment option due to the heterogeneous nature of HCC and underlying chronic liver disease. The optimal treatment option for HCC may vary depending on individual patients’ medical conditions, insurance coverage, and willingness to undergo treatment. Consequently, the management of HCC in real clinical practice often deviates from the treatment options recommended by the widely accepted BCLC staging system13,14. In this study, we developed a CDSS model that emulates treatment decisions made by clinical physicians to address the substantial disparities observed between actual treatment decisions and the recommended treatment guidelines. However, one of the most significant issues associated with this type of CDSS is the rationality and appropriateness of the judgments made by the clinical physicians used as the reference standard. To mitigate these concerns, we provided alternative options and corresponding survival rates in addition to the optimal treatment option. Moreover, we recommend using this data-driven CDSS as a complement to the current recommendation-based guidelines.

Another notable limitation of the current data-driven CDSS is its inherently data-restrictive nature. Over the past several decades, cancer treatment has undergone significant advancements, and the expansion of extensive genetic and clinical databases, supported by efficient computing systems, has accelerated the pace of treatment advances and shortened the cycle for updating treatment guidelines. However, data-driven models ultimately rely on past databases to derive conclusions. Consequently, if newly developed treatment options are inadequately represented or absent in the training dataset, the CDSS model may fail to select those treatments or accurately predict patient survival. During the study period, sorafenib, a protein kinase inhibitor with activity against many protein kinases including vascular endothelial growth factor receptors, platelet-derived growth factor receptors, and RAF kinases15,16, was the first-line recommended treatment for advanced-stage HCC according to the BCLC staging system. However, due to issues related to insurance coverage, sorafenib was not widely utilized in actual clinical practice. Consequently, it was rarely recommended by the current model, and the reliability of treatment classification and survival prediction for that treatment was low. Similarly, newly incorporated treatments for advanced-stage HCC in the recent BCLC guidelines, such as atezolizumab-bevacizumab, durvalumab-tremelimumab, and lenvatinib, were entirely absent from the training dataset. Therefore, the existing model is unable to predict these newly incorporated treatments.

To overcome these limitations, physicians must acknowledge the constraints of the system and consistently update it with new data, with the assistance of the model developer. The field of cancer treatment is a constantly evolving domain where new therapies emerge to replace previous treatment modalities, and innovative treatment approaches continue to arise. Therefore, it is crucial to periodically update the model with new data and analyze whether the results align with physicians’ opinions and the latest treatment guidelines (Appendix p 26). This periodic analysis should also include the variable selection process. In a previous study, we selected 20 key features from the initial 61 pretreatment variables based on the importance scores calculated from the automated classifier model12. Feature reduction helps to select the most influential features by providing unique knowledge for inter-class discrimination17,18. However, the excluded features in this process could be significant factors in determining treatment choices for certain patients within the internal dataset or patients in external datasets. Furthermore, the factors that determine treatment and subsequent prognosis are constantly changing as new diagnostic methods, genetic biomarkers, and novel treatments are developed. Therefore, it is essential to select new deterministic features, evaluate their prognostic effects, and retrain the model with the modified features for continuous learning. Automatic data updates from an electronic medical record database and tumor segmentation using deep learning techniques from computed tomography (CT) or magnetic resonance imaging (MRI) may be useful for this process19,20.

Recently, several studies have explored the use of ML or deep learning to predict the prognosis of HCC patients, incorporating radiomics, proteomics, and genomics21,22,23,24,25,26. However, it remains challenging to predict initial treatment choices or post-treatment survival based solely on radiomic or genomic features extracted from diagnostic images or tissues at a specific time point. Furthermore, the effective integration of high-dimensional data such as images or genomic profiles with more concise clinical variables remains an unresolved issue. In this study, we employed clinically established prognostic factors to predict initial treatment choices and post-treatment survival of HCC patients. However, further research is needed to explore whether the combination of radiomics, proteomics, and genomic features with these established prognostic factors can enhance the accuracy of predictions.

In machine learning, overfitting is a prevalent issue that occurs when a model becomes excessively specialized to the training data, resulting in degraded performance on new, unseen data. The decreased performance of our model on external datasets is primarily due to overfitting to the internal dataset on which it was trained. Specifically, the diversity and variability in treatment choices among institutions for patients with similar characteristics inevitably contributed to this performance decline. Numerous approaches have been proposed to address the issue of overfitting27,28. Increasing the diversity of the training data by collecting various examples or using data augmentation techniques can enhance the generalization of the model. Moreover, reducing the model’s complexity by limiting the number of parameters and employing regularization techniques can effectively prevent overfitting. In this study, we improved the model performance by utilizing an ensemble voting approach instead of a single machine learning algorithm. To reduce the complexity of the model, we minimized hyperparameter tuning for each algorithm. Furthermore, considering the significant variation in treatment patterns among different institutions, we proposed the inclusion of alternative treatment options rather than providing only one treatment option that may be overfitted to the internal dataset. However, to substantially improve performance on external validation datasets, it may be more effective to incorporate external datasets into the training dataset or to conduct individual training.

Medicine is not only a science, but also a social and psychological subject. Wilkinson et al. argued that the majority of health states and events are so complex that we can only understand them probabilistically, and chance can never be predicted at the individual level29. Recent studies on Watson for Oncology, an AI assistant decision system developed by IBM Corporation with the help of top oncologists from Memorial Sloan Kettering Cancer Center, also commonly mention the system’s limitations and complementary roles30,31. Nonetheless, in an era of accelerated developments in new diagnostic methods, biomarkers, and treatments, selecting the appropriate treatment for individuals based solely on simple guidelines is becoming increasingly challenging. Therefore, incorporating our system as a complementary resource for physician decision-making, alongside the existing BCLC guidelines, can help physicians make more individualized treatment choices. Furthermore, our proposed system can be utilized as a virtual simulation tool for comparing institutional treatment choices and post-treatment prognoses in advance. Although digital twin technology is still in its early stages of development in the healthcare and medicine fields32,33,34, our system can serve as an example of its potential application.

In conclusion, we developed and validated an ML-based model that offers initial treatment recommendations and predicts post-treatment survival for HCC patients, utilizing multi-center datasets. Several experiments were conducted to ensure the model’s applicability in a multi-center setting, and we have addressed various issues and strategies for implementing this system within an actual clinical environment. Our proposed CDSS can assist the process of making treatment decisions for HCC patients by providing data-driven predictions in conjunction with existing guidelines.

Methods

Internal dataset and pre-treatment variables

For the purposes of training and internal validation, we used a prior dataset, which comprised 1021 newly diagnosed HCC patients at the Asan Medical Center in South Korea between January and October 201012. All enrolled patients were diagnosed with HCC using liver protocol CT or MRI or liver biopsy in accordance with the current guidelines of the American Association for the Study of Liver Diseases35. Exclusion criteria were as follows: (a) patients with a history of prior HCC treatment (n = 356), (b) patients with metastatic liver cancer or secondary malignancies that might affect survival (n = 36), (c) patients with combined hepatocellular cholangiocarcinoma (n = 21), (d) patients with incidentally detected HCC after transplantation (n = 7), and (e) patients who underwent cytotoxic chemotherapy, other targeted agents (brivanib, sunitinib, erlotinib), or combined treatments (n = 83).

We retrospectively collected a total of 61 pre-treatment demographic, clinical, and imaging variables, initial treatment information, and survival status of 1021 HCC patients from an institutional database (Appendix p 4). To select the key variables, we used importance scores calculated by the previously described automated classifier models (Appendix pp 5–9), resulting in a selection of 20 variables, including 14 patient-related factors (age, body mass index, Child-Pugh class, presence of varix, presence of ascites, Eastern Cooperative Oncology Group score, hemoglobin level, platelet count, albumin level, prothrombin time, alanine aminotransferase level, total bilirubin level, creatinine level, and alpha-feto protein level) and 6 tumor-related factors (tumor number, maximal tumor size, tumor distribution, presence of portal vein invasion, presence of metastasis, and RFA feasibility)12. Overall survival was defined as the time between HCC diagnosis and death from any cause.

Previously, the initial treatment options were grouped into eight categories: RFA/PEIT, surgical resection, TACE, TACE combined with EBRT, sorafenib treatment, supportive care, transplantation, and other therapies that did not fit into the other seven predefined categories. In this study, we excluded the other-therapies and transplantation categories: the former because of significant heterogeneity among institutions, and the latter because of large differences in patient numbers between institutions owing to the severe shortage of deceased liver donors. Furthermore, because one of the eight external centers recorded a maximal tumor size of 10 cm for any tumor larger than 10 cm, we applied the same rule uniformly to the remaining centers, including the internal dataset, capping any size exceeding 10 cm at 10 cm.
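As a minimal illustration (not the study's actual preprocessing code), the capping rule used to harmonize maximal tumor size across centers can be sketched as follows:

```python
# Illustrative sketch of the tumor-size harmonization rule: any maximal
# tumor size exceeding 10 cm is recorded as 10 cm, mirroring the convention
# of the one external center that reported sizes this way.
def cap_tumor_size(sizes_cm, cap=10.0):
    """Return tumor sizes with any value above `cap` recorded as `cap`."""
    return [min(float(s), cap) for s in sizes_cm]

print(cap_tumor_size([3.2, 10.0, 14.7]))  # values above 10 cm become 10.0
```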

External datasets

We collected datasets from eight institutions for external validation of the algorithm, including Korea University Guro Hospital (n = 138), Seoul National University Bundang Hospital (n = 193), Samsung Medical Center (n = 439), Seoul National University Hospital (n = 224), Catholic Medical Center (n = 162), Severance Hospital (n = 148), Chung-ang University Hospital (n = 171), and Inha University Hospital (n = 275). All datasets were compiled from a retrospective database of HCC patients diagnosed in each institution between January 2010 and December 2012. De-identified information on 20 pre-treatment key variables, initial treatment information, and survival status were obtained at the patient level. The protocols of this study were approved by the Institutional Review Board of Asan Medical Center (IRB number: 2017-0188) and the study was approved by each institutional review board of all the participating institutions. The requirement for informed consent from patients was waived due to the retrospective nature of the study. All methods were performed in accordance with the relevant guidelines and regulations.

Model development

The ML algorithm consists of two stages: The first stage recommends initial treatment, and the second stage predicts post-treatment survival, as illustrated in Fig. 3. In the first stage, we modified the previous cascaded random forest model12 to an ensemble voting machine to improve the model’s flexibility and classification performance. Initially, we evaluated 19 ML algorithms for the development of the ensemble voting approach. After sorting the classifiers by mean accuracy, we trained ensemble voting machines using the top-performing three, five, and seven classifiers for the internal dataset. These ensemble voting machines were then compared with the top-performing classifier itself. Ultimately, we applied an ensemble voting machine involving the top five performing classifiers and evaluated its performance on both the internal and external validation datasets. Detailed information regarding the development of the ensemble voting machine is described in the Appendix (pp 13–15). Finally, we compared its performance to that of the previous cascaded random forest model.
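The probability-averaging ("soft" voting) idea behind such an ensemble can be sketched as follows; the treatment labels and probability values here are illustrative, not taken from the study's data or its actual implementation:

```python
import numpy as np

# Minimal sketch of soft (probability-averaging) voting, the idea behind an
# ensemble voting machine: each member classifier outputs a class-probability
# matrix, the matrices are averaged, and the highest average wins.
TREATMENTS = ["RFA/PEIT", "Resection", "TACE", "TACE+EBRT", "Sorafenib", "Supportive"]

def soft_vote(prob_matrices):
    """Average class-probability matrices from several classifiers and
    return the ensemble probabilities plus the predicted class per patient."""
    avg = np.mean(prob_matrices, axis=0)   # shape: (n_patients, n_classes)
    return avg, avg.argmax(axis=1)

# Two toy classifiers disagreeing on a single patient:
p1 = np.array([[0.1, 0.6, 0.3, 0.0, 0.0, 0.0]])  # favors resection
p2 = np.array([[0.1, 0.2, 0.7, 0.0, 0.0, 0.0]])  # favors TACE
avg, labels = soft_vote([p1, p2])
print(TREATMENTS[labels[0]])  # "TACE": averaged probability 0.5 beats 0.4
```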

Fig. 3: Overall workflow.

We conducted two additional experiments to enhance the usability of this model in a multi-center setting. First, we trained the model using each institutional dataset (individual training) and compared its performance with that of external validation. Second, we derived the first and second treatment options from the ensemble voting machine’s results using the highest and second-highest probabilities, respectively, and recomputed the evaluation metrics while also treating the second treatment option as a correct answer. The final performance of the ensemble model after model calibration was assessed in both the internal and external validation datasets. The process of model calibration is described in the Appendix (p 18). All experiments for the internal dataset and each institutional dataset were trained and tested using a five-fold cross-validation stratified by the treatment, while external validation was performed using the model trained with the entire internal dataset.
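How a second treatment option can be derived from the class probabilities, and how accuracy can be recomputed with the second option also counted as correct ("top-2 accuracy"), can be sketched as follows; the probability matrix and labels are illustrative only:

```python
import numpy as np

# Sketch: derive first and second treatment options from the highest and
# second-highest class probabilities, then score predictions with the
# second option also counted as correct.
def top2_options(probs):
    """Indices of the highest- and second-highest-probability classes per patient."""
    order = np.argsort(probs, axis=1)[:, ::-1]  # classes by descending probability
    return order[:, 0], order[:, 1]

def top2_accuracy(probs, y_true):
    first, second = top2_options(probs)
    hits = (first == y_true) | (second == y_true)
    return hits.mean()

probs = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.5, 0.3],
                  [0.6, 0.1, 0.3]])
y_true = np.array([1, 1, 2])  # the top-1 prediction misses two of the three
print(top2_accuracy(probs, y_true))  # 1.0: every true class is in the top two
```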

For the second stage, we trained the random survival forest algorithm with 21 variables, including 20 key variables and the initial treatment information, to predict post-treatment survival. Finally, we employed our two-stage model to simulate risk stratification for each treatment by predicting individual post-treatment survival based on the treatment recommendation results from the first stage model. TACE and resection, the two treatments with the highest proportion in the internal dataset, were used as illustrative examples for risk stratification.
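The risk-stratification simulation can be illustrated conceptually: hold a patient's key variables fixed, vary only the treatment input, and compare the predicted risks. In this sketch, `predict_risk` is a hypothetical stand-in for the trained random survival forest, not the study's actual model:

```python
import numpy as np

# Conceptual sketch of the second-stage "what-if" simulation. The real model
# is a random survival forest over 21 variables; `predict_risk` below is a
# made-up scoring function used purely to show the control flow.
TREATMENTS = ["RFA/PEIT", "Resection", "TACE", "TACE+EBRT", "Sorafenib", "Supportive"]

def predict_risk(features, treatment_idx):
    # Hypothetical risk score for illustration only.
    weights = np.linspace(1.0, 2.0, num=len(TREATMENTS))
    return float(np.sum(features) * weights[treatment_idx])

def simulate_treatments(features):
    """Predicted risk for each candidate treatment, lowest risk first."""
    risks = {t: predict_risk(features, i) for i, t in enumerate(TREATMENTS)}
    return sorted(risks.items(), key=lambda kv: kv[1])

patient = np.array([0.4, 1.2, 0.7])  # toy feature vector
for treatment, risk in simulate_treatments(patient):
    print(f"{treatment}: predicted risk {risk:.2f}")
```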

Statistical analysis

The baseline characteristics of patients in the internal dataset and the external datasets were compared using the chi-square test for categorical variables. For continuous variables, we assessed normality using the Shapiro-Wilk W test and, based on the results, employed either the Student t-test or the Mann-Whitney U test for comparison. A per-patient analysis was used to evaluate the treatment recommendation model, with metrics including accuracy, macro-average recall, weighted-average precision, weighted-average F1 score, Cohen’s kappa score, and the Matthews correlation coefficient36. For the survival prediction model, we utilized three metrics to evaluate performance: Harrell’s concordance index, iAUC, and the integrated Brier score37. To evaluate the survival curve, we estimated it using a Kaplan-Meier fitter and compared it to those predicted by our model38. The open-source scikit-learn package version 0.23.239, scikit-survival package version 0.15.040, and scipy package version 1.5.4 were used for model development and statistical analyses.
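For reference, the Kaplan-Meier product-limit estimator that underlies the "actual" survival curves can be written compactly; this is a self-contained sketch with synthetic times and event flags, not the study's code:

```python
import numpy as np

# Self-contained sketch of the Kaplan-Meier product-limit estimator for
# right-censored survival data. `events` is 1 for death, 0 for censoring.
def kaplan_meier(times, events):
    """Return (event_times, survival_probabilities)."""
    times = np.asarray(times, float)
    events = np.asarray(events, int)
    order = np.argsort(times)
    times, events = times[order], events[order]
    surv, curve_t, curve_s = 1.0, [], []
    n_at_risk = len(times)
    for t in np.unique(times):
        mask = times == t
        d = events[mask].sum()            # deaths at time t
        if d > 0:
            surv *= 1.0 - d / n_at_risk   # product-limit update
            curve_t.append(t)
            curve_s.append(surv)
        n_at_risk -= mask.sum()           # deaths and censored leave the risk set
    return np.array(curve_t), np.array(curve_s)

# Toy data: survival drops to 0.8, 0.6, 0.3 at times 2, 3, 5.
t, s = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
print(dict(zip(t, np.round(s, 3))))
```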