Introduction

Esophageal cancer (EC) is the eighth most common malignancy and the sixth leading cause of cancer-related death worldwide1. Esophageal squamous cell carcinoma (ESCC) accounts for almost 80% of all EC cases and is one of the deadliest cancers because of its highly aggressive nature and poor survival rate2. Chemoradiotherapy (CRT) is one of the most effective treatments for ESCC because it can potentially be curative and is less invasive than surgery; in contrast, esophagectomy remains highly invasive and is sometimes associated with postoperative morbidity and mortality3,4,5. Moreover, CRT is widely applicable to early-stage and locally advanced ESCC, even at the palliative stage. Despite the effectiveness of CRT for ESCC, a certain population of patients who undergo CRT experience recurrence within a relatively short period6. Resistance to CRT is one of the major causes of treatment failure in patients with ESCC7. However, the molecular basis of CRT resistance is very complex, and it is extremely challenging to identify and decode the mechanism of CRT resistance using a basic biological approach. Therefore, it is necessary to find optimal clinical biomarkers that can distinguish responding from non-responding patients with ESCC.

Radiomics is a new quantitative approach to the analysis of medical images. Information on a large number of image features within tumors, including their spatial and temporal heterogeneity, can be used to create diagnostic, prognostic, and predictive models. Radiomics analysis is performed by extracting quantitative features from multimodality medical images, such as ultrasound (US), computed tomography (CT), magnetic resonance (MR), and positron emission tomography (PET) scans8,9,10,11,12. Recently, advances in CT analysis have enabled high-level quantitative evaluation of features and pixel-based textures for tumor characterization13. Furthermore, machine-learning algorithms of artificial intelligence (AI) applied to CT images are increasing the power of radiomics to predict treatment response and prognosis14. Such advances have opened a new era of radiomics-based biomarker discovery, which can reveal in-depth tumor characteristics. In addition, the availability of large amounts of medical data, together with advanced AI-based computerized image analysis, has paved a new path toward the identification of more precise and robust biomarkers.

Therefore, in this study, using a systematic and comprehensive biomarker discovery process with AI, we compared CT-based radiomics features of ESCC between responders and non-responders, established a novel, non-invasive radiomics prediction model, and then validated the model in an additional cohort. Moreover, using Kaplan–Meier survival analysis, we evaluated the model’s performance in predicting the prognosis of ESCC patients treated with CRT. Finally, we performed univariate and multivariate Cox regression analyses to show the superiority of the AI-based radiomics model.

Methods

Patients and study design

This was a retrospective single-institution study at Tokushima University Hospital (Tokushima, Japan). We consecutively enrolled a total of 50 patients with pathologically proven ESCC who underwent CRT as first-line treatment from February 2009 to September 2019, and generated the datasets on February 24, 2022. Of these patients, 6 were excluded because of a lack of clinical information such as accurate survival time, and the remaining 44 were analyzed. Among the 44 patients, 27 were admitted and received CRT in the Department of Gastroenterology and Oncology, and 17 in the Department of Thoracic, Endocrine Surgery and Oncology of Tokushima University Hospital. Because the ratio of training cohort to validation cohort sample sizes is reported to be approximately 6:4 (ref. 15), we used the former group as the training cohort and the latter as the validation cohort.

To identify a novel CT-based radiomics model associated with the CRT response in ESCC patients, we designed this study in 3 phases: a discovery phase for the selection of candidate radiomics features and construction of the prediction model, a validation phase with an independent CRT clinical validation cohort to assess the performance of the radiomics prediction score as a CRT response marker, and a development phase with the validation cohort (n = 17) and all enrolled CRT patients (n = 44) to assess and advance our CRT response marker as a prognosis-prediction marker as well (Fig. 1).

Figure 1

Study design for the identification and validation of the CT-based radiomics model for predicting response to and survival following CRT in ESCC. Among 50 patients with esophageal squamous cell carcinoma (ESCC) who underwent CRT, we excluded 6 patients from this study due to lack of clinical information such as accurate survival time. Ultimately, 44 patients were enrolled. Radiomics features were extracted and selected from the CT images for each patient. We created 5 machine learning models using radiomics features from the training cohort (n = 27) and selected the best model using ROC analysis. We evaluated the best-performing prediction model using a validation cohort (n = 17). Survival analysis was performed using the validation cohort (n = 17) and all cases (n = 44).

Cancer staging was performed according to the Union for International Cancer Control TNM staging system (8th Edition). All patients underwent esophageal endoscopy for endoscopic and histological evaluation of the effect of CRT 1 month after completion of CRT. The median follow-up time of all the patients analyzed was 63.4 months (95%CI 46.5–80.2). This study was conducted in accordance with the Declaration of Helsinki and approved by the ethics committee of Tokushima University Hospital. Informed consent was obtained from all patients prior to the collection of any data.

CT imaging protocol

All patients were examined using a 16-detector row Aquilion LB model CT scanner (Toshiba, Tokyo, Japan). The CT scanning parameters included a tube voltage of 120 kV, tube current auto, pixel size 0.976 × 0.976 mm2, and slice thickness 2.5 mm. All raw data were reconstructed with a 0.625 mm section thickness for the routine axial CT images. No patient received intravenous contrast medium.

Chemoradiotherapy

During the whole course of radiotherapy, patients underwent 2 cycles of chemotherapy. The chemotherapy regimens used were as follows: NF (nedaplatin 70 mg/m2 + 5-fluorouracil (5-FU) 700 mg/m2) for 16 cases, FP (cisplatin 60 mg/m2 + 5-FU 600 mg/m2) for 8 cases, DNF (docetaxel 30 mg/m2 + nedaplatin 50 mg/m2 + 5-FU 400 mg/m2) for 8 cases, and DFP (docetaxel 25 mg/m2 (weekly) + cisplatin 6 mg/m2 (days 1–5) + 5-FU 370 mg/m2) for 12 cases. All patients received radiotherapy at a dose of 50.4 Gy/28 Fr (n = 27) or 60 Gy/30 Fr (n = 17).

Treatment evaluation

A responder to CRT was defined as a patient in whom ‘complete response (CR) of primary lesion in the radiation field’ was maintained for more than 1 year. Evaluation of the CRT response was performed 1 month after the completion of CRT by CT scan and endoscopy. CT scans were then taken every 3–6 months for 2 years, and approximately every 6 months thereafter in all patients. According to the criteria from the 11th edition of the Esophageal Cancer Handling Regulations, endoscopic primary lesions were evaluated as follows: (1) all endoscopic findings suggestive of neoplastic lesions have disappeared, (2) no cancer is detected pathologically by endoscopic biopsy of the primary lesion that was present before treatment, (3) the entire esophagus can be observed by endoscopy, (4) there are no endoscopic findings suggestive of active esophagitis (no swelling alteration, no white moss). A complete response was achieved when all of the above findings (1) to (4) were satisfied16. Patients who did not meet the response definition were categorized as non-responders.

Feature extraction

We extracted radiomics features from each pretreatment CT image. A schematic illustration of the process of extracting radiomics features is shown in Supplementary Fig. S1. First, the volume of interest (VOI), which is equivalent to the gross tumor volume (GTV) in radiotherapy treatment planning, was manually delineated by the same radiologist (T.K.) to avoid inter-observer delineation variability. The VOI was set on the three-dimensional (3D) CT image for all patients, and then 8 features depending only on the shape and size of the VOIs were extracted. A 3D wavelet transform was applied to each CT dataset to decompose it into 8 components for extraction of the histogram and texture features. All decomposed images as well as the original image were resampled isotropically with a 2-mm scale and were requantized with a 25-Hounsfield unit bin size. Then, 10 × 9 (90) histogram features were extracted from the 8 component images and the original image. Similarly, 42 × 9 (378) texture features were extracted. Thus, 476 features (8 + 90 + 378) were extracted per case from the original and wavelet-filtered images. For feature extraction, we used modified MATLAB programming tools for radiomics feature extraction17,18.
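The feature count and the requantization step can be illustrated with a minimal Python sketch. This is not the MATLAB code used in the study. The function names are hypothetical, and anchoring the bins at the minimum intensity inside the VOI is an assumption, because the text states only the 25-Hounsfield unit bin size.

```python
import math

# Feature counts stated in the text.
N_SHAPE = 8        # shape- and size-based features from the VOI
N_HISTOGRAM = 10   # histogram features per image
N_TEXTURE = 42     # texture features per image
N_IMAGES = 1 + 8   # original image plus 8 wavelet components


def total_feature_count():
    """Total number of radiomics features per case."""
    return N_SHAPE + (N_HISTOGRAM + N_TEXTURE) * N_IMAGES


def requantize(hu_values, bin_width=25.0):
    """Map intensities to 1-based bin indices of fixed width.

    Assumption: bins start at the minimum intensity in the VOI.
    The source specifies the bin size only, not the anchor.
    """
    lowest = min(hu_values)
    return [int(math.floor((v - lowest) / bin_width)) + 1 for v in hu_values]


if __name__ == "__main__":
    print(total_feature_count())                       # 476
    print(requantize([-20.0, 0.0, 4.9, 30.0, 80.0]))   # [1, 1, 1, 3, 5]
```

A fixed bin width, rather than a fixed number of bins, keeps the meaning of one gray level constant across patients, which is one common reason to choose it for CT data.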

Feature selection

In the discovery phase, we selected candidate radiomics features from the CT images that were associated with the CRT response in the training cohort. For each feature, we calculated the area under the curve (AUC) for response by receiver operating characteristic (ROC) analysis using c-statistics, and selected the radiomics features that were significantly associated with responders (AUC ≥ 0.7, p < 0.05). Furthermore, to avoid redundancy among the selected features, we used Pearson’s correlation coefficient analysis and reduced the feature space by discarding features that were highly correlated with others. In this study, we used r ≥ 0.7 (p < 0.05) as the threshold for pairwise correlation19,20,21.
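The two-step selection described above can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not the procedure actually run in the study: the feature names and values are made up, the AUC is computed as the Mann-Whitney probability with higher values taken to indicate responders, the p-value filters are omitted, and redundancy is removed greedily by keeping the feature with the highest AUC among correlated features.

```python
import math


def auc(values, labels):
    """AUC as the probability that a responder (label 1) has a higher
    value than a non-responder (label 0); ties count as 0.5."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def select_features(features, labels, auc_min=0.7, r_max=0.7):
    """features: dict mapping feature name -> list of values per patient.

    Step 1 keeps features with AUC >= auc_min.
    Step 2 walks the candidates from highest to lowest AUC and discards
    any feature whose r with an already kept feature is >= r_max.
    """
    scored = {name: auc(vals, labels) for name, vals in features.items()}
    candidates = sorted((n for n in scored if scored[n] >= auc_min),
                        key=lambda n: scored[n], reverse=True)
    kept = []
    for name in candidates:
        if all(pearson_r(features[name], features[k]) < r_max for k in kept):
            kept.append(name)
    return kept


# Made-up toy data: 3 responders (1) and 4 non-responders (0).
LABELS = [1, 1, 1, 0, 0, 0, 0]
FEATURES = {
    "f_a": [9, 8, 7, 3, 2, 4, 1],
    "f_b": [9.1, 8.2, 6.9, 3.1, 2.2, 4.1, 0.9],  # near-duplicate of f_a
    "f_c": [5, 1, 6, 2, 4, 3, 0],
    "f_d": [1, 5, 2, 4, 3, 6, 0],                # weak discriminator
}

if __name__ == "__main__":
    print(select_features(FEATURES, LABELS))  # ['f_a', 'f_c']
```

In this toy example, f_b is dropped because it is almost perfectly correlated with f_a, and f_d is dropped because its AUC is below 0.7.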

Machine learning

We applied 5 commonly used machine learning algorithms to obtain the best predictive model: Random Forest (RF), Naive Bayes (NB), Ridge Regression (RR), Artificial Neural Network (ANN), and Support Vector Machine (SVM). The models were compared based on ROC curves, and the best-performing prediction model was selected22,23,24. In the validation phase, we evaluated the models constructed in the discovery phase for their ability to discriminate between responders and non-responders by ROC analysis in the validation cohort.

Prognosis analysis

In the development phase, to evaluate whether our radiomics model is also able to predict prognosis, Kaplan–Meier analysis was performed to compare progression-free survival (PFS) and overall survival (OS) between the high-prediction score group and the low-prediction score group of the RF model. The data of all the patients (n = 44) were used, and the p value was calculated by the log-rank test. PFS was defined as the time from the date of CRT initiation to the date of first radiologic confirmation of tumor progression or death from any cause. OS was defined as the time from the date of CRT initiation to the date of death from any cause. The follow-up endpoint was set at February 24, 2022. To find possible factors associated with PFS and OS, we used a Cox proportional hazards model for univariate and multivariate analyses.
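The survival comparison rests on the Kaplan–Meier product-limit estimator and the two-group log-rank test. The following Python sketch shows both on made-up follow-up times, not on patient data from this study. The function names are hypothetical, and the p value uses the chi-square distribution with 1 degree of freedom.

```python
import math


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times: follow-up in months; events: 1 = event observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        leaving = sum(1 for tt, _ in data if tt == t)
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
        i += leaving
    return curve


def median_survival(curve):
    """First time at which estimated survival is 0.5 or lower."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square, p value)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})
    o_minus_e = 0.0
    var = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        o_minus_e += d_a - d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = o_minus_e ** 2 / var
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square, 1 df
    return chi2, p_value


# Made-up data in months (1 = event, 0 = censored).
HIGH_T, HIGH_E = [20, 30, 40, 50, 60], [1, 0, 1, 0, 0]
LOW_T, LOW_E = [2, 4, 4, 6, 9, 12], [1, 1, 0, 1, 0, 1]

if __name__ == "__main__":
    print(median_survival(kaplan_meier(LOW_T, LOW_E)))    # 6
    print(median_survival(kaplan_meier(HIGH_T, HIGH_E)))  # None
    print(logrank_test(HIGH_T, HIGH_E, LOW_T, LOW_E))
```

Censored patients leave the risk set without lowering the survival estimate, which is why the median is reported as not reached when the curve never falls to 0.5.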

Statistical analysis

The CR rate of CRT in patients with esophageal squamous cell carcinoma was expected to be 29.6%, according to a previous study25. Assuming an AUC of 0.9 for our radiomics algorithm for CRT response, the required sample size of the validation cohort was calculated to be 17, with 80% power and a 5% significance level, using Medcalc statistical software. In general, the ratio of validation cohort to training cohort sample sizes is reported to be approximately 4:6 (ref. 15). Therefore, we set the validation and training cohort sample sizes at 17 and 27, respectively, totaling 44 patients.

Statistical differences were analyzed using the χ2 test, Fisher’s exact test, or Student’s t-test. All statistical analyses were performed using R software version 4.0.3, Medcalc statistical software (v.12.7.7., Medcalc Software bvba, Ostend, Belgium), GraphPad Prism version 9.0 (GraphPad Software, San Diego, CA), and JMP software (10.0.2., SAS Institute, Cary, NC). Pearson’s correlation coefficient (r) was used to evaluate the linear relationship between 2 variables. For time-to-event analysis, survival estimates were calculated using Kaplan–Meier analysis, and groups were compared by the log-rank test. ROC curves were established to discriminate between CRT responders and non-responders, and Youden’s index was used to determine the optimal cutoff threshold of the prediction score for predicting the CRT response. The prediction score was calculated using the RF model, as described in “Supplementary methods”. According to this formula, a higher score indicates a higher likelihood of a good response, whereas a lower score indicates a higher likelihood of a poor response. The AUCs were compared using DeLong’s test. All p values were 2-sided, and those less than 0.05 were considered statistically significant.
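Cutoff selection by Youden’s index (J = sensitivity + specificity − 1) can be sketched in Python as follows. The scores and labels are made up for illustration and are not the prediction scores of this study. The function name is hypothetical, and ties in J are resolved here by keeping the lowest cutoff, which is an assumption.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J, sensitivity, specificity) that maximizes J.

    A case is called a responder when score >= cutoff.
    labels: 1 = responder, 0 = non-responder.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < c and y == 0)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1.0
        if best is None or j > best[1]:
            best = (c, j, sens, spec)
    return best


# Made-up prediction scores: 4 responders and 5 non-responders.
SCORES = [0.45, 0.31, 0.22, 0.19, 0.18, 0.16, 0.15, 0.12, 0.11]
LABELS = [1, 1, 1, 0, 0, 1, 0, 0, 0]

if __name__ == "__main__":
    print(youden_cutoff(SCORES, LABELS))  # (0.22, 0.75, 0.75, 1.0)
```

In this toy example, one responder with a score of 0.16 falls below the chosen cutoff, so sensitivity is 0.75 while specificity is 1.0.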

Results

Patient characteristics

The clinical characteristics of the patients are shown in Table 1. A total of 44 patients were enrolled in this study, including 27 in the training cohort and 17 in the validation cohort. The mean ages were 73.4 years (range 56–96 years) and 68.6 years (range 47–88 years), respectively. The majority of patients were male (92.6% and 76.5%, respectively), and most had T3/4 tumors (85.1% and 70.6%, respectively). In both groups, most patients had clinical stage IV disease without metastatic lesions (M0), that is, locally advanced disease. There were 6 (22.2%) CRT responders in the training cohort and 6 (35.3%) in the validation cohort. No significant difference was observed in any of the factors between the 2 groups.

Table 1 Patient characteristics.

Radiomics features associated with CRT responders

To select the optimal predictive radiomics features associated with the tumor response to CRT, we calculated the AUC for each of the 476 features in the training cohort of ESCC patients. We selected 110 radiomics features with an AUC ≥ 0.7. We then calculated correlation coefficients among those features, and grouped features with a pairwise correlation coefficient of r ≥ 0.7 into 12 groups. The 12 groups and their constituent features are shown in Supplementary Table S1. The feature with the highest AUC in each of the 12 groups was selected as a CRT susceptibility predictor for ESCC: LLLEnergy, HHLVariance, HLHKurtosis (histogram-based features) and LHLRP, HHHLZE, HHLZP, LHLGHomogeneity2, HHLGContrast, ROIGCorrelation, HLLLRE, ROISRE, and HLHLRE (texture-based features). None of the 12 features belonged to the “shape and size-based” features, suggesting that the influence of errors in tumor region delineation should be relatively small. The AUC values of all 12 radiomics features are shown in Supplementary Table S2.

Radiomics models for CRT response

We used 5 machine learning models (RF, NB, RR, ANN, SVM) to construct radiomics models based on the 12 features. The ROC curves for the CRT response in the training cohort are shown in Fig. 2A. The RF model achieved the highest AUC (0.99), although no significant difference in AUC was observed among the 5 models. All radiomics prediction scores for the RF model in the training cohort are shown in Supplementary Table S3 (max: 0.449, min: 0.113). The optimal cutoff value for the RF prediction score was set at 0.19 based on the Youden index; a prediction score ≥ 0.19 represents effectiveness of CRT, and a prediction score < 0.19 represents ineffectiveness of CRT. The accuracy, sensitivity, and specificity of the RF model in the training cohort were 96.3%, 100.0%, and 95.2%, respectively. The other models also showed relatively high diagnostic performance, with an AUC of 0.98 for the NB model, 0.96 for the RR model, 0.97 for the ANN model, and 0.95 for the SVM model. Using these 5 machine learning models, we then evaluated ROC curves in the validation cohort (Fig. 2B). Among the 5 models, only the RF model showed a significantly higher AUC than the ANN and SVM models by DeLong’s test (p = 0.01, RF vs SVM; p = 0.033, RF vs ANN), whereas the NB and RR models did not show any significant difference. Thus, the RF model showed the highest performance (AUC 0.92; accuracy, 82.4%; sensitivity, 83.3%; specificity, 90.0%) in the validation cohort. All prediction scores of the RF model in the validation cohort are shown in Supplementary Table S4. The NB and RR models also showed high AUC values of more than 0.8.
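The training-cohort accuracy, sensitivity, and specificity reported above are consistent with the counts used in the following Python sketch. These counts are inferred here from the reported percentages and from the 6 responders and 21 non-responders in the training cohort. They are not stated in the text, and the function name is hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity in percent."""
    total = tp + fn + tn + fp
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
    }


# Inferred counts (assumption): all 6 responders scored >= 0.19,
# and 20 of 21 non-responders scored < 0.19.
TRAINING = diagnostic_metrics(tp=6, fn=0, tn=20, fp=1)

if __name__ == "__main__":
    print({k: round(v, 1) for k, v in TRAINING.items()})
    # {'accuracy': 96.3, 'sensitivity': 100.0, 'specificity': 95.2}
```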

Figure 2

ROC curves plotted by prediction models for each machine learning algorithm. The diagnostic abilities of 5 machine learning models (RF, NB, RR, ANN, and SVM) were evaluated using ROC curves in the training (A) and validation (B) cohorts. A. Among the 5 machine learning models, the RF model exhibited the highest AUC (0.99 [95%CI 0.86–1.00]), although the differences among the 5 models were not significant. B. The RF model showed the highest AUC (0.92 [95%CI 0.71–0.99]), which was significantly higher than those of the ANN and SVM models by DeLong’s test (p < 0.05). The NB and RR models did not show any significant difference compared with any of the 5 models.

Survival analysis

Since the RF model had the highest prediction performance in the validation phase, we performed survival analyses comparing the high-prediction score group and the low-prediction score group in the RF model. In all patients, the PFS in the high-prediction score group was significantly longer than that in the low-prediction score group (55.6 vs 5.9 months; HR:0.25 [95%CI 0.11–0.52]; p < 0.001) (Fig. 3A). Similarly, the OS in the high-prediction score group was significantly longer than that in the low-prediction score group (100.4 vs 13.4 months; HR:0.26 [95%CI 0.10–0.57]; p < 0.001) (Fig. 3B). The univariate and multivariate Cox regression analyses of factors associated with PFS and OS are shown in Tables 2 and 3. The T stage, lymph node metastasis and radiomics prediction score were significantly associated with both PFS and OS in the univariate analysis. Furthermore, multivariate analysis revealed significant differences in lymph node metastasis (HR:0.41 [95%CI 0.19–0.83]; p = 0.013) and radiomics prediction score (HR:0.35 [95%CI 0.14–0.77]; p = 0.009) in Table 2, and the T stage (HR:0.26 [95%CI 0.06–0.79]; p = 0.014), lymph node metastasis (HR:0.34 [95%CI 0.15–0.70]; p = 0.003) and radiomics prediction score (HR:0.44 [95%CI 0.17–0.98]; p = 0.056) in Table 3. Similar results were obtained in the Kaplan–Meier analysis and the univariate and multivariate analyses in the validation cohort (Supplementary Fig. S2, Tables S5 and S6). Thus, the radiomics prediction score was shown to be an important prognostic factor for ESCC patients treated with CRT.

Figure 3

Kaplan–Meier analysis of PFS and OS comparing high- and low-prediction score groups of ESCCs in the RF model. All patients (n = 44) were analyzed using the RF model and categorized into high- or low-prediction score groups, and Kaplan–Meier curves were drawn. A. Kaplan–Meier curves of PFS. The median PFS in the high-prediction score group was significantly longer than that in the low-prediction score group (55.6 vs 5.9 months; HR:0.25 [95%CI 0.11–0.52]; p < 0.001). B. Kaplan–Meier curves of OS. The median OS in the high-prediction score group was significantly longer than that in the low-prediction score group (100.4 vs 13.4 months; HR:0.26 [95%CI 0.10–0.57]; p < 0.001).

Table 2 Univariate and multivariate analyses of possible factors associated with PFS.
Table 3 Univariate and multivariate analyses of possible factors associated with OS.

Discussion

In this study, we performed a comprehensive CT-based radiomics analysis to identify candidate features for CRT response from 27 ESCC patients in a training cohort and subsequently identified 12 radiomics features for the CRT response. In addition, we developed a radiomics prediction model for the CRT response with 5 commonly used machine learning algorithms. We then validated the high diagnostic performance of the model using an independent CRT cohort of 17 ESCC patients. Furthermore, we expanded the survival evaluation and showed that the model can predict PFS as well as OS. This is the first study to propose a CT-based radiomics model associated with a high initial response as well as a long-term response after CRT in ESCC patients. Notably, we showed that the radiomics prediction score had superior survival predictability compared with serum SCC-Ag, the most commonly used conventional clinical serological marker for ESCC.

In previous studies, the CRT response was evaluated only a few months after treatment, and radiomics features were analyzed based on such short-term responses because most patients in these studies underwent surgical resection19,21. In the present study, however, we defined a CRT responder as a patient with ‘CR of primary lesion in the radiation field maintained for more than 1 year’. Consequently, our model could successfully predict the long-term response after CRT (median PFS, 55.6 months). Furthermore, whereas most previous studies analyzed only patients who underwent neoadjuvant CRT (i.e., resectable patients), our study included a variety of patients from the early stage to the palliative stage and with both resectable and unresectable cancers. Owing to our systematic and comprehensive biomarker approach using the medical data of a total of 44 patients, our radiomics model provided greater predictability and higher diagnostic accuracy (AUC: 0.92, p < 0.001) in comparison with these previous studies26,27,28,29,30. Furthermore, the greatest strength of our study is that our radiomics model could predict not only the response to CRT but also the prognosis of ESCC patients who received CRT.

All 5 machine learning algorithms (RF, NB, RR, ANN, SVM) used in this study produced predictive models with high accuracy rates, especially the RF, NB, and RR models. Our data suggest that the 12 selected features can appropriately predict the CRT response. In particular, the RF model showed the best performance among the models. The RF algorithm uses a number of decision trees and improves prediction accuracy by averaging their outputs in the case of regression and by majority voting in the case of classification31. The RF algorithm can also be used with a wide range of sample sizes, including small sample sizes. These characteristics may make the RF algorithm suitable for the analysis of our data, which came from a relatively small sample covering a wide range of stages (Stage I-IV).

A limitation of this study is that the sample size was comparatively small and that it was a single-institution retrospective analysis, although radiomics studies with small sample sizes at a single institution, similar to our study, have been reported28,32,33. Therefore, large multicenter prospective cohort studies are needed to establish the generalizability, robustness, and clinical usefulness of our model. Another limitation is that inter-observer consistency was not evaluated in this study. Intraclass correlation coefficient analysis for this model should be performed in the future.

In conclusion, we used a comprehensive biomarker discovery process with 2 independent clinical cohorts to develop and validate a novel CT-based radiomics model for predicting the response to CRT as well as the prognosis of ESCC patients after CRT. Our RF-based radiomics model using 12 radiomics features, which is clinically useful, cost-free, and non-invasive, may contribute to more effective treatment strategies as a promising and personalized decision-making tool to decrease ESCC mortality.