Introduction

Personalized treatment strategies can play an essential role in oncological patient management as they are expected to improve outcomes of patient populations with heterogeneous treatment response. In particular, for patients with locally advanced rectal cancer (LARC), the response to neoadjuvant chemoradiotherapy (nCRT) varies widely, ranging from pathological complete response (pCR) with no viable remaining tumour cells to persisting disease (pathological non-responders: pNRs)1. There is increased interest in the application of organ-preserving and low-morbidity surgeries or watch-and-wait strategies, for patients with clinical complete response (cCR) after neoadjuvant or total neoadjuvant CRT2,3. These strategies require validated biomarkers that allow for an early and accurate identification of this patient population. Several studies have been analysing molecular data, such as gene expressions, mutations, and single nucleotide polymorphisms as potential biomarkers of response to nCRT in LARC4,5,6. The inclusion of non-invasive biomarkers from clinical imaging may further increase the robustness and accuracy of corresponding prognostic models.

Radiomic analyses employ classical statistics and modern machine learning algorithms to identify biomarkers based on multimodality imaging and have shown a great potential for treatment outcome prediction in different cancer entities7,8,9. For predicting patient’s response to nCRT and long-term outcomes including freedom from distant metastases (FFDM) and overall survival in LARC, radiomics models were widely developed on features extracted from T2-weighted (T2w) magnetic resonance imaging (MRI)10,11,12,13,14,15, and multiparametric MRI (mpMRI)16,17,18,19,20. Few studies have considered radiomic features extracted from computed tomography (CT) imaging21,22, positron emission tomography (PET)23,24, or a combination of CT and MRI features25. Although the results of these analyses are encouraging, important aspects, such as assessing feature robustness, were not always considered and external validation was rarely performed.

One key challenge in radiomics is the selection of features that correlate well with the endpoint of interest26. Feature classes of different complexity are commonly extracted: (i) morphological features that describe the shape of the region of interest (ROI), (ii) first-order features (FO) that describe the voxel intensity distribution, (iii) second-order texture features (SOT) that describe statistical inter-relationships between neighbouring voxels, and (iv) higher order features, where (i)–(iii) are extracted after applying transformations on the base images. In several studies, morphological and first order (MFO) features extracted from pre-treatment T2w MRI12,16,27,28 had a high association to treatment response in LARC. Other studies considered SOT features only13,29,30 or in combination with MFO and SOT features11,14,15,17. However, it is generally unclear which feature classes are more relevant and generalizable for predicting treatment outcomes in patients with locally advanced rectal cancer.

In this manuscript, we present two studies related to the described open questions of radiomics for LARC: (i) In the modelling study, we identified and independently validated novel radiomic signatures for the prognosis of tumour response to nCRT and FFDM in patients with LARC using a multicentre retrospective cohort of the German Cancer Consortium—Radiation Oncology Group (DKTK-ROG). In particular, we investigated the prognostic value of different feature classes and developed multimodal radiomics signatures combining pre-treatment CT and T2w MRI with clinical characteristics. (ii) In the external validation study, we aimed to validate radiomics signatures that were previously developed by other researchers to predict tumour response to nCRT or FFDM in LARC using our multicentre data.

Methods

Patient data

In this multicentre retrospective study, data of 190 patients were collected from four partner sites within the DKTK-ROG and divided into training and validation data based on the site (122 and 68 patients, respectively). Ninety-four out of 122 patients of the training data were treated at the University Hospital Carl Gustav Carus Dresden between 2006 and 2014. The remaining 28 patients were treated at the Klinikum rechts der Isar Munich between 2007 and 2013. In the validation data, 12 out of 68 patients were treated at the University Hospital Freiburg between 2008 and 2013, while the remaining 56 patients were treated at the University Hospital Frankfurt between 2007 and 2015. All patients had a histopathologically confirmed diagnosis of LARC and underwent nCRT followed by surgery. Additional inclusion criteria for our study were the availability of pre-treatment T2w MRI, treatment planning CT with sufficient image quality (e.g. without strong streaking artifacts, patient motion or scanner distortions), and endpoint information. Ethical approval for the multicentre retrospective analyses was obtained from the Ethics Committee at the Technische Universität Dresden, Germany (BO-EK-385082020). The requirement for individual informed consent was waived owing to the retrospective nature of the study.

The considered endpoints were tumour response to nCRT and freedom from distant metastases (FFDM). Tumour response was determined by expert pathologists from the work-up of the surgical specimens. The patients were stratified into two groups based on the tumour regression grade (TRG): responders (corresponding to TRG 3 and 4, labelled as 1) and non-responders (corresponding to TRG 0–2, labelled as 0) following Dworak et al.31. For the external validation study, where we aimed to validate radiomic signatures from the literature, patients were stratified to match the stratification indicated in the respective manuscript. A detailed description of the TRG is presented in Supplementary Table S1. The survival endpoint FFDM was calculated from the first day of nCRT to the day of event or censoring. For patients with observed distant metastases, the event time was accompanied by an event indicator variable of 1, whereas for patients without an event, the last follow-up time was used together with an event indicator variable of 0.

Study design

In our modelling study, we developed and independently validated radiomic signatures for the prognosis of tumour response and FFDM in patients with LARC based on different radiomic feature classes. Figure 1a summarizes the design of this study. Imaging features were computed from the gross tumour volume (GTV) individually on the treatment-planning CT and pre-treatment T2w MRI, including morphological and first-order features (MFO), second-order texture features (SOT), and intensity features of Laplacian of Gaussian (LoG) transformed imaging. The features were filtered for stability under small image perturbations and clustered. In order to assess which image modality is more suitable for the prediction of the endpoints and which feature class has the highest prognostic value, four radiomic models were developed on the training data individually for each imaging modality based on (i) MFO, (ii) SOT, (iii) LoG, and (iv) all features, i.e. the combination of MFO, SOT, and LoG features. In an additional analysis, the selected features from CT and T2w MRI were combined for each of the cases (i) to (iv) to assess the benefit of multimodal radiomic models. The performance of each signature was then validated on the independent validation data using the area under the curve (AUC) and the concordance index (CI) for the prognosis of tumour response and FFDM, respectively. Details of image processing and modelling are described in the following paragraphs.

Figure 1
figure 1

(a) Design of the modelling study. Treatment plan computed tomography (CT) and pre-treatment T2w magnetic resonance imaging (MRI) data were collected from 4 centres and divided into training and validation data. MRI data was preprocessed and gross tumour volume (GTV) was delineated, which was then transferred to CT images after rigid registration. Different feature classes were extracted from both modalities and signatures were developed on training data for tumour response prediction to neoadjuvant chemoradiotherapy (nCRT) and freedom from distant metastases (FFDM) in a cross-validation setting. These signatures were validated independently for both endpoints. (b) Design of the external validation study. Studies were identified via free search using Google scholar and PubMed and excluded if the inclusion criteria were not fulfilled. Information regarding image processing, radiomics workflow, and the best performing radiomics signature was extracted as reported. Image processing and feature extraction was reproduced using MIRP34. Finally, validation was performed either on the pooled training and validation data if model parameters were reported in the study or the model was re-trained on the training data and validated on the validation data.

In our external validation study, we identified and validated radiomics biomarkers proposed for the prediction of tumour response to nCRT or FFDM from the literature (see Fig. 1b). A free search was carried out using google scholar and PubMed until October 2021.

The following free search keywords were used: ‘rectal cancer’ OR ‘Locally advanced rectal cancer’, ‘radiomics’, ‘response prediction’ OR ‘response to neoadjuvant chemoradiotherapy’, ‘distant metastases prediction’ OR ‘prognosis’, ‘deep learning’, ‘machine learning’. The studies were reviewed for eligibility based on the following criteria: (1) radiomics analysis on pre-treatment T2w MRI or CT without contrast agent, (2) radiomics features extracted from primary tumour (GTV), (3) normo-fractionated nCRT (dose 45–55 Gy) followed by surgery, (4) clear radiomics workflow and definition of finally used features available. The search and inclusion of studies were supervised by two reviewers (A.Z., S.L.). The following data were extracted from the included studies: (1) sample size and distribution to training and validation dataset (if any), (2) nature of study, i.e. single centre or multicentre, (3) clinical characteristics of patient cohort (4) used imaging modality, (5) reference standard for TRG, (6) image pre-processing workflow, (7) feature extraction geometry, i.e. 3D, 2D, or largest slice, (8) applied feature extraction framework, (9) final classification/regression model or statistical test, (10) features included in final model, (11) final model parameters (if any), and (12) reported results. The studies were arranged in chronological order of year of publication.

Image acquisition

Imaging datasets were retrieved from the picture archiving and communication system (PACS) in the respective centres and pseudonymized centrally. Staging T2w MRI were acquired before nCRT with either a 1.5 T or a 3 T scanner. Patients received a CT scan for treatment planning prior to radiotherapy. Supplementary Table S2 summarizes MR and CT image acquisition and reconstruction parameters for training and validation data. The GTV was delineated for each patient on T2w transversal MR images by an experienced radiation oncologist and confirmed by a radiologist. CT images were coregistered with MRI using rigid registration in RayStation 8B SP2 (RaySearch Laboratories, Stockholm, Sweden) and the GTV was transferred to the CT.

Image preprocessing, and feature extraction

Supplementary Figure S1 illustrates the process of image preprocessing used in the modelling study as previously described26. First, MR images were corrected for background phase variations that arise due to magnetic field inhomogeneities. This was achieved by creating a mask of the soft tissue region in the image using the Canny Edge detection algorithm and multiplying the true image with the mask, setting all the background phase variations to zero32. N4ITK bias correction method was used to minimize the bias field effect in MR images33. Image intensities were scaled using the 95th percentile of image intensities, i.e. 5% of the highest image intensities were ignored, representing potential outliers. Further image preprocessing followed by feature extraction was carried out using the MIRP Python toolkit (version 1.1.3)34. MR and CT image voxels were resampled to an isotropic voxel size of 1.0 × 1.0 × 1.0 mm3 using trilinear interpolation in order to adjust the voxel spacing and slice thickness between the datasets. In CT images, the GTV was re-segmented to cover only soft tissue voxels between − 150 and 180 Hounsfield units, removing voxels containing air and bone. A set of LoG filters with 5 different kernel widths (1 mm, 2 mm, 3 mm, 4 mm, 5 mm) was applied individually to the base MRI and CT images. The five response maps were averaged to a single image.

After image pre-processing, imaging features were computed. A set of 25 morphological and 57 first-order intensity-based features (MFO features) was extracted from the 3D GTV on the treatment planning CT and on the pre-treatment T2w MRI, respectively. In addition, 95 second-order texture features (SOT features) were calculated for every modality. Finally, the same 57 first-order intensity-based features were extracted from the GTV on the LoG transformed images. This resulted in a total of 234 features extracted from each imaging modality. SOT features were extracted from the 3D GTV based on the grey level co-occurrence matrix (GLCM), grey level run length matrix (GLRLM), grey level size zone matrix (GLSZM), grey level distance zone matrix (GLDZM), neighbourhood grey tone dependence matrix (NGTDM), and neighbouring grey level dependence matrix (NGLDM). Image pre-processing and feature extraction in MIRP were implemented according to the recommendations of the Image Biomarker Standardisation Initiative (IBSI)35,36. The definitions used to calculate the features can be found in the IBSI reference manual. Image processing parameters used for feature extraction are summarized in Supplementary Table S3.

In order to obtain reproducible results, imaging features have to be stable under small image perturbations, as e.g. caused by differing acquisition parameters or positioning uncertainties37. We evaluated feature robustness by applying the following image augmentations based on the training data: adding Gaussian noise (mean 0, standard deviation as present in the image), random volume changes of the GTV (0%, − 15%, 15%), and translations (0.0, 0.25, and 0.75 mm) in all three spatial dimensions. All combinations of these perturbations were considered, leading to 81 perturbed images for each original dataset. The intra-class correlation coefficient (ICC) was calculated with a 95% confidence interval, quantifying the similarity of feature values under different perturbations for every feature. Features with the lower boundary of the 95% confidence interval of the ICC below 0.8 were removed37.

The redundancy of features in MRI and CT was individually mitigated by hierarchical clustering, including (i) MFO features only, (ii) SOT features only, (iii) LoG features (statistical and intensity histogram) only, and (iv) all features, corresponding to the analyses based on the different feature classes. The Spearman correlation coefficient (\(\rho\)) was used as a similarity metric with average linkage as a criterion for merging two clusters; \(\rho \ge\) 0.8 was defined for placing features into the same cluster. The feature with the highest mutual information with the endpoint was selected as the representative for each cluster.

For our external validation study, features were extracted from T2w MRI or CT data using MIRP. The features reported in each individual study were mapped to their closest synonyms in the IBSI manual. A feature was excluded from validation analysis if (i) it was not defined in the IBSI manual or (ii) MIRP cannot extract it. In that case, the remaining features were considered as candidates for validation. Image pre-processing (e.g. image interpolation, image normalization, bias correction) and feature extraction parameters (e.g. feature extraction in 2D, 3D or from the largest tumour area, discretization used for histogram or texture features, LoG or wavelet transformations) were replicated for each study if indicated. If feature extraction parameters were not mentioned in the study, the settings recommended in the IBSI standard were used.

Radiomics modelling

In our modelling study, we evaluated 12 different radiomic models based on different (combinations of) feature classes and imaging modalities, as shown in Supplementary Fig. S2. First, four radiomic signatures were created individually for T2w MRI and CT based on (i) MFO, (ii) SOT, (iii) LoG, and (iv) all features. Once these signatures were developed, four signatures were created by joining the respective MRI and CT signatures from (i) to (iv).

In order to create the eight single-modality signatures, a workflow containing four major processing steps (Supplementary Fig. S2) was applied after feature clustering using an in-house end-to-end statistical learning software package: (i) feature preprocessing, (ii) feature-selection, (iii) model building with internal validation, and (iv) external validation. Steps (i)–(iii) were first performed using 33 repetitions of threefold stratified cross-validation (CV) nested in the training dataset to identify an optimal signature, i.e. the steps were repeatedly performed on the internal training part and validated on the internal validation part of the cross-validation folds. After identifying the final signature, a final model was developed on the entire training data and validated on the independent validation data.

The following procedure was performed for each of the 99 CV runs: (i) Features were transformed using the Yeo-Johnson transformation to align their distribution to a normal distribution. Afterwards, features were z-transformed to mean zero and standard deviation one. Both transformations were performed on the internal training part and the resulting transformation parameters were applied unchanged to the features of the internal validation part. (ii) Four supervised feature-selection algorithms were considered: minimal redundancy maximum relevance (MRMR)38, mutual information maximization (MIM)39, elastic-net40, and univariate regression. To avoid potential overfitting, only the five most relevant features were selected. (iii) The features selected by each of these methods were used to build prognostic models on the internal training part, which were validated on the internal validation part. Multivariable logistic regression was applied for the prognosis of tumour response and Cox regression for FFDM. Average model performance was assessed by the median cross validation AUC and CI for tumour response and FFDM prognosis, respectively, for every feature selection method.

After the cross-validation procedure, the final radiomic signature was created as follows. For each of the above-mentioned feature selection methods, the occurrence of every feature in the 99 modelling steps was counted and features were ranked according to their occurrence across the cross-validation folds. Features with occurrence ≥ 50% in at least 75% of the feature selection methods were selected and the cumulative occurrence of each feature was calculated as a sum of its occurrences. If a subset of these features showed a Spearman correlation \(\rho\) > 0.5 on the entire training data, only the feature with the highest cumulative occurrence was considered. A detailed example of the feature selection scheme is presented in Supplementary Sect. 1, including Supplementary Tables S4S6. The resulting radiomic signature was then used to build prognostic models on the entire training data and (iv) the trained model was applied to the independent validation data.

For creating the four joint signatures combining CT and MRI, the selected signatures in each feature class were pooled together and the same procedure as described in the last paragraph was performed: clusters with \(\rho >0.5\) were reduced to one feature, models were trained on entire training data and validated on external validation data. Finally, clinical features that were significantly associated to tumour response in univariable logistic regression or to FFDM in univariable Cox regression were added to the selected radiomic signature (Supplementary Table S7).

In our external validation study, the pooled training and validation data was used for biomarker validation if a final model was provided in the respective study, or a statistical test was performed for associating the considered biomarker to the endpoint of interest. Otherwise, the given radiomic features were used to re-train a predictive model on the training data, which was subsequently validated on the validation data. Clinical features were combined with imaging biomarkers if mentioned in the study.

Statistical analysis

The following baseline clinical parameters were available: gender, age, tumour localization, UICC stage, grading, T stage, N stage, surgery type, chemotherapy type. Categorical variables of the clinical data were compared between the training and validation data by the χ2 test whereas continuous variables were compared using the Mann–Whitney-U test.

Associations between the final model predictions and the endpoints were evaluated by the AUC for tumour response and by the concordance index (CI) for FFDM prognosis. The estimated value and the 95% confidence interval of these metrics were computed using the bias-corrected bootstrap confidence interval method on 400 bootstraps of the data41. For creating a confusion matrix based on the final signature for tumour response prediction, an optimal cutoff was selected on the training data using Youden index and transferred to the validation data. For association with FFDM, patients were stratified into an optimally separated low and a high-risk group using an optimal cutoff on the training data that was based on maximally selected rank statistics42. The cutoff was transferred to the validation data and FFDM of stratified groups was assessed with Kaplan Meier curves compared with the log-rank test.

Calibration for the prediction of tumour response to nCRT and FFDM was assessed via the Hosmer–Lemeshow goodness of fit test (HL test)43 and Greenwood Nam d’Agostino test (GND test)44, respectively. Correlations between features were assessed by the Spearman correlation coefficient (\(\rho\)). All tests were two-sided with a significance level of 0.05. The importance of individual features in the final signature was assessed by univariate fitting of a logistic regression (tumour response) or Cox regression (FFDM) and computing Wald-test p-values. All analyses were performed in R version 4.0.3.

Results

Modelling study: CT and MRI predict tumour response and FFDM

Patient characteristics of the training and validation data are summarised and compared in Table 1. Patients in the training data had a higher tumour grading (p = 0.001) and higher UICC stage (p < 0.001). Patients of the validation data were treated with a higher dose (p < 0.001). The endpoints tumour response and FFDM were similar for training and validation data (p = 0.13 and p = 0.25, respectively). In univariate analysis, a significant association was observed only between clinical T stage (cT) and tumour response (Supplementary Table S7).

Table 1 Patient, tumour, and treatment characteristics for the training and validation data.

For radiomics modelling, 234 radiomic features were extracted from the GTV in the T2w MR and in the CT imaging dataset. Stability analysis reduced these to 208 features (MFO: 74, SOT: 82, LoG: 52) and 222 (MFO: 76, SOT: 95, LoG: 51) for MRI and CT, respectively. Clustering of correlated features further reduced the feature number to (i) MRIMFO:24, CTMFO:22; (ii) MRISOT:16, CTSOT:19; (iii) MRILoG:14, CTLoG:15; and (iv) MRIAll:39, CTAll:47.

Table 2 presents the results for the prognosis of tumour response, including the names of finally selected features. In internal cross validation, models based on CT data showed better prognostic performance than models based on MRI. Among feature classes, SOT features showed a high prognostic value (MRI: AUCSOT = 0.68, AUCMFO = 0.57, AUCLoG = 0.57, AUCAll = 0.65; CT: AUCSOT = 0.70, AUCMFO = 0.65, AUCLoG = 0.64, AUCAll = 0.67). This result, however, did not translate to the independent validation data, where SOT features performed poorly. Here, the overall best performance was achieved by LoG features for both imaging modalities (MRI: AUCLoG = 0.66, CT: AUCLoG = 0.61). Joint MRI + CT signatures performed almost similar to MRI only signatures in independent validation for all four models.

Table 2 Median area under the curve (AUC) values for cross validation (CV) and external validation for tumour response prediction based on MRI, CT, joint MRI + CT, and imaging combined with clinical T stage. Values in parenthesis represent the 95% confidence interval.

The clinical model containing only cT stage achieved training and validation AUCs of 0.60. Combining cT stage with the combined signature from MRI and CT achieved the best validation result with an AUC of 0.70. At a threshold of 0.248 this signature was able to accurately classify 16/21 responders and 20/47 non-responders (Supplementary Fig. S3). Figure 2 shows receiver operating characteristic (ROC) curves and the corresponding calibration plots for this signature on training and validation data. All features represented independent information (Supplementary Fig. S4) and significantly contributed to the prediction in training (p < 0.05), while only MR_log_stat_min was significant in validation (p = 0.04). The MRI feature log_stat_min (IBSI:1GSF) represents the minimum intensity, while the CT feature log_ih_max_grad_fbn_n32 (IBSI:12CE) represents the gradient of the discretised histogram (32 bins) within the GTV on the LoG transformed image. Image-based interpretation of these features is presented in Fig. 3. In the non-responder group, MR_log_stat_min showed relatively low values, which translates to the existence of bright voxels in the GTV on the original baseline T2w MRI (Fig. 3b). In comparison, responders showed no such high grey values (Fig. 3a). Box plots of these features (Yeo-Johnson transformed and z-score normalized) in the two response groups are shown in Supplementary Fig. S5.

Figure 2
figure 2

(a) Receiver operating characteristics (ROC) curves and (b) calibration plots for tumour response prognosis in training (left) and validation (right) resulting from best performing joint signature combining clinical T stage and Laplacian of Gaussian (LoG) features from T2w-MRI and CT. For calibration, data (thick lines) and 95% confidence intervals (shaded regions) are shown together with linear regression lines (solid lines) that follow the optimal expectation (dashed lines). Density of expected probabilities is shown above the calibration plot.

Figure 3
figure 3

Representative images from MRI (a,b) and CT (c,d) with corresponding Laplacian of Gaussian (LoG) transformed images from two patients (P) in the two response groups, i.e. responder: P1 and non-responder: P2 on the training data. Red contours mark the gross tumour volume (GTV). P1 (responder: TRG = 4) showed an overall homogenous appearance on the baseline MRI. On the contrary, P2 (non-responder: TRG = 1) showed a more heterogeneous GTV with a low stat_min value on the LoG transformed MR image, which corresponds to some high pixel intensities on the baseline MRI. Similarly, a more homogenous GTV (excluding the air voxels) can be seen in P1 compared to P2 on the baseline and LoG transformed CT slices, possibly causing low gradients in the intensity histogram for the responder.

Table 3 presents the results for the prognosis of FFDM, including the names of finally selected features. Median follow up time in training and validation data was 49.1 (5.7–111.8) months and 29.5 (1.2–94.1) months, respectively. Most of the metastases occurred until 24 months after treatment (training: 76%, validation: 56%). Until that time, 7 patients (training: 5 validation: 2) were lost to follow-up because of death, i.e. the competing risk of death was small. In internal cross validation, models based on MRI data showed a better prognostic performance than models based on CT. Among feature classes, LoG features showed a somewhat higher prognostic value (MRI: CILOG = 0.65, CIMFO = 0.60, CISOT = 0.59, CIAll = 0.60, CT: CILOG = 0.52, CIMFO = 0.47, CISOT = 0.51, CIAll = 0.46). In external validation, CT-based features showed a slightly higher performance compared to MRI. While both SOT and LoG features achieved similar prognostic value on MRI data (MRI: CISOT = 0.57, CT: CILoG = 0.57), the overall best prognostic performance in CT was achieved by SOT features (CT: CISOT = 0.69). No additional benefit was achieved by joining the MRI and CT signatures. Patient stratification into groups at low and high risk of distant metastases was performed based on the SOT models for each modality, i.e. for MRI, CT, and joint MRI + CT. While the CT and MRI + CT-based signatures achieved a significant patient stratification in independent validation (p < 0.01), this was not the case for the MRI-based signature (p = 0.68). Kaplan–Meier curves and corresponding calibration plots for the best performing CT signature are shown in Fig. 4 and for the MRI and MRI + CT signatures in Supplementary Fig. S6. The definition and interpretation of selected features with corresponding optimal thresholds for patient stratification are presented in Supplementary Table S8.

Table 3 Median concordance index (CI) values for cross-validation (CV) and external validation for FFDM prediction in MRI, CT, and joint MRI + CT. Values in parenthesis represent the 95% confidence interval.
Figure 4
figure 4

Kaplan–Meier (top) and calibration plots (bottom) on training (left) and validation (right) data for the prediction of FFDM using the three best performing CT-based SOT features, resulting in significant patient stratifications (p < 0.01). For calibration, data (thick lines) and 95% confidence intervals (shaded regions) are shown together with linear regression lines (solid lines) that should follow the optimal expectation (dashed lines). Density of expected probabilities is shown above the calibration plot.

Supplementary Table S9 contains model and transformation parameters for the best performing signatures developed for tumour response and FFDM prediction. Training was performed on the entire training data.

External validation study: most previous studies could not be validated

In total, 34 studies were identified as relevant based on their titles and abstracts. All identified studies were performed on patients with LARC that were treated with nCRT followed by surgery with the aim of predicting tumour response using radiomics. 23 studies were excluded after full text review due to following reasons: 3 studies used contrast enhanced CT data that was not available in our dataset21,45,46, 4 studies used both pre and\or post treatment data47,48,49,50, 5 studies used pre-treatment multiparametric MRI (mpMRI) to develop a final signature with no standalone T2w MRI signature being reported17,18,51,52,53, 2 studies did not report any final signature22,30, 3 studies could not be reproduced as the radiomics workflow or feature definition was not clearly explained25,54,55, 1 study was excluded as the considered ROI was not the primarytumour56, 3 studies were excluded as authors reported failure of radiomics to predict the outcome of interest57,58,59, 2 studies were excluded as the reported signature was computed from feature maps, which are currently not supported by MIRP28,60. Finally, eleven studies were included for external validation analysis. All of them used T2w MRI for predicting tumour response and were published between 2015 and 2020. One study was prospective, nine were retrospective, and three were multicentric. Two of these multicentre studies considered clinical features and imaging biomarkers.

Our external validation results are summarized in Table 4. The considered biomarkers and their corresponding synonyms together with image processing and feature extraction details for included studies are summarized in Supplementary Appendix 2 and Supplementary Table S10, clinical characteristics of the studies are given in Supplementary Table S11. Except for one study, none of the included studies could be validated, i.e. they showed p-values above 0.05 and/or a training/validation AUC significantly below the reported value in the study with a 95% confidence interval including the value 0.5. The only study that could be validated is by Petkovska et al.14. An acceptable performance was observed on our pooled data (AUC = 0.64 [0.51–0.77]). Supplementary Figure S7 shows the calibration plot for this study. In a study by Chidbaram et al.27, pathological complete responders showed a significant association with tumour volume delineated on T2w image (Mann–Whitney-U test p = 0.013). This was somewhat confirmed in our analysis, where we observed a statistical trend (p = 0.061). However, radiomics analyses are not needed to assess the tumour volume. For the study by Antunes et al.11, the random forest (RF) model created on a single feature was not successful on our training data but achieved an acceptable performance on the validation data (AUC: Train, Validation = 0.48, 0.63). Still, on the pooled training and validation data the selected feature was insignificant (Mann–Whitney-U test p = 0.12).

Table 4 Overview of studies included in validation analysis. For all included studies, patients were treated with nCRT followed by resection. Radiomics analysis was reported on pre-treatment T2w MRI with features extracted from the primary tumour region. The column validation approach indicates whether model coefficients or statistical tests were applied on the pooled training and validation data (Pooled) or the model was re-trained on the training data and validated on the validation data (train/valid). AUC area under a curve (with 95% confidence interval in brackets), MRI magnetic resonance imaging, nCRT neoadjuvant chemoradiotherapy.

Discussion

In this study, we developed and validated radiomics signatures incorporating pre-treatment T2w MRI and treatment planning CT imaging features for the prediction of tumour response to nCRT and FFDM in patients with LARC. The discriminative performance of MFO, SOT, LoG, and combination of all features was independently validated for each imaging modality and their combination. Clinical T stage combined with LoG features from CT and MRI showed the best validation performance for the prediction of tumour response (AUC = 0.70), while SOT features from CT showed best performance for FFDM (CI = 0.69). Furthermore, we aimed to externally validate previously published radiomics signatures developed for tumour response prediction based on our multicentre data. Remarkably, no significant results were obtained, except for one study by Petkovoska et al.14 (AUC = 0.64), which overall indicates a potential lack of reproducibility for radiomics studies (see below).

Considering MRI-based multicentre radiomic studies with an independent validation for patients with LARC, the prognostic performance of our best performing signature (AUC = 0.70) was similar to the results of Antunes et al.11 (AUC = 0.71), but lower than results presented by Cusumano et al.12 (AUC = 0.79) and Dinapoli et al.10 (AUC = 0.75), who also assessed tumour response to nCRT in LARC patients using T2w MRI data. Antunes et al.11 used features extracted from laws kernels and gradient organization responses. In our validation analysis, only skewness-laws features could be validated. The corresponding feature used by Antunes et al.11 was not significant in training and showed a statistical trend in validation (p = 0.055). Dinapoli et al.12 used first-order intensity histogram-based features, while the study by Cusumano et al.12 additionally used fractal features in the final signature to build the model. Both studies also combined clinical features (cT and cN) with the radiomics signature. In our validation study, these signatures did not show a good performance (AUC < 0.60).

Single centre retrospective studies have also shown promising results for tumour response prediction in LARC. De Cecco et al.16 and Caruso et al.13 showed a significant association (p < 0.05) of FO statistical and GLCM features, respectively, with tumour response to nCRT on small cohorts (≤ 15 subjects). However, in our validation analysis, no significant association has been found for these features (p > 0.05). Coppola et al.28 showed that heterogeneity of local skewness is associated to tumour response (AUC = 0.90). Ferrari et al.60 showed that complete responders have higher GLCM energy and good responders have high expression of histogram features (AUC = 0.87). These studies could not be validated as the features were extracted from feature maps, which are currently not supported in MIRP. More recent studies showed the association of SOT features with tumour response prediction. The studies by Pizzi et al.30 and Petresc et al.15 showed an AUC of 0.79 and 0.80 in internal validation, respectively. However, validating the results of Petresc et al.15 on our multicentre data was not successful (AUC = 0.48).

Fewer studies have investigated the performance of CT imaging for tumour response prediction to nCRT using patient populations treated with standard procedures, i.e. nCRT followed by TME21,22,57,59, or combined CT and MR imaging25,61. Bibault et al.22 developed a model for the prognosis of tumour response with radiomics features extracted from treatment plan CT data using deep neural networks (DNN) with an AUC of 0.72. Chee et al.21 demonstrated that pre-treatment contrast enhanced CT-based FO features were associated with tumour response prediction (responders showed low entropy, high uniformity, and low standard deviation). Other studies indicated an overall poor performance of CT features for predicting tumour response. Exemplarily, Rao et al.59 and Hamerla et al.57 showed that CT features were not able to predict tumour response. Regarding the combination of CT and MRI, Zhang et al.61 used MFO and SOT features extracted from pre-treatment CT and MRI and achieved an AUC of 0.87, while Li et al.25 showed that contrast enhanced CT and multimodality MRI is able to achieve an AUC of 0.93. While these studies showed promising results, they mostly lacked external validation.

Model performance may be improved by including additional imaging time points, other MRI sequences, or PET. Exemplarily, Jeon et al.28 used delta-radiomic features extracted from pre- and post-nCRT T2w MRI to build predictive signatures for treatment outcomes in LARC. Their signature showed significant risk group stratification for FFDM (p < 0.05). Chiloiro et al.62 also used delta radiomics to predict FFDM as binary outcome with an AUC of 0.78. To the best of our knowledge, no study was yet performed to predict FFDM combining pre-treatment MRI and treatment-planning CT for LARC. Gianni et al.29 showed that radiomic signatures based on PET, T1w MRI, and apparent diffusion coefficient (ADC) images had an increased performance for tumour response prediction (AUC = 0.86) compared to PET only (AUC = 0.84) and T1w MRI only (AUC = 0.72).

In radiomics analyses, numerous features of different complexity can be extracted and frequently their number is larger than the study population, which can lead to substantial model overfitting and difficult feature interpretability. In internal cross-validation, we observed that more complex SOT features showed a high performance for tumour response prediction, while LoG transformed intensity features showed a high performance for the prediction of FFDM. However, in external validation, the opposite behaviour was observed, i.e. LoG transformed statistical and intensity histogram features showed a high performance for the prediction of tumour response, while SOT features showed a somewhat higher performance for FFDM prediction. Also, it is noteworthy that the performance trend of feature classes in internal and external validation was similar for both modalities, i.e. similar feature classes were predictive for both CT and MRI. Specifically, we discovered one MRI-based statistical feature, i.e. log_stat_min, which was predictive for tumour response to nCRT. This feature represents the minimum intensity on LoG transformed images, which is closely related to the maximum intensity (i.e. stat_max) on baseline images. We analysed the predictive performance of both features separately using univariate logistic regression. In training, stat_max was less predictive (AUC = 0.57) then log_stat_min (AUC = 0.64), while both features showed similar performance in validation with an AUC of 0.66. The high association of LoG transformed intensity features with the training data can be attributed to the fact that the LoG kernels help to reduce large variations in the signal, which can be detected within a single image slice (e.g. irregularities due to magnetic field, respiratory motion, or patient movement). Further, we interpret log_stat_min as a potential biomarker for tumour response prediction to nCRT based on the fact that a tumour normally is represented by low to intermediate signal intensity on T2w MRI, excluding the intestinal lumen48,63. The increased expression of log_stat_min in non-responders indicates the presence of high intensities within the GTV on baseline T2w MRI, possibly indicating an aggressive or resistive tumour resulting in incomplete remission.

One major issue in radiomics analyses is feature reproducibility and the lack of consensual guidelines on which features have to be extracted from clinical imaging data. In our validation study, we experienced limited reproducibility of published literature. Only 32% of the eligible literature could be assessed for their validation performance with our data/methods, mostly due to the use of different software implementations and underreporting of methods employed for radiomics analysis of LARC. Important details such as image processing for feature extraction (e.g. discretization for intensity and texture features), final signatures together with their interpretation and final models were not always provided. Thus, there is a strong need of standard radiomics process for signature definition for both reproducibility and progression of radiomics towards clinical application. The IBSI35 aims to establish such a consensus and reporting guidelines for image processing and feature extraction. Although some studies have used large cohorts for radiomics analyses in LARC, external validation was rarely performed. Only 4 studies10,11,12,56 have used retrospective multicentre cohorts with a maximum of 3 data centres involved, which may lead to a low generalizability of the presented radiomic signatures. To tackle such problems, in our multicentre study, we have established and externally validated radiomics signatures in accordance with the IBSI guidelines and we report parameters and algorithms used for their extraction, transformation, stability analysis, and modelling.

In addition to the lack of standardization in the radiomics workflow, there is lack of standardized imaging protocols as well. This can obstruct the successful validation of radiomics models, e.g. for imaging from MR scanners of different vendors or different magnetic field strengths, because such differences may lead to the extraction of differently distributed features64. Standardization at hardware level is costly, thus there is a need to develop generalizable models by incorporating data from different scanners and protocols. We addressed this issue by using multicentre data independent of vendor and imaging protocols for training and validation. Furthermore, we observed significant differences between the clinical characteristics of our pooled cohort and the external cohorts included in the validation study (mainly clinical T and N stage). These differences may explain part of the observed reduced performance of the published models in our external validation analysis.

Limitations of this study are its retrospective nature and the relatively low number of patients in the training and validation data. In addition, there is a class imbalance due the smaller number of events for both endpoints, leading to wide confidence intervals in Tables 2 and 3 often including the value 0.5, i.e., the external validation results have a relatively large uncertainty. We aimed to mitigate this problem by internal cross-validation (CV) on the training data for feature selection. A threefold CV approach was used and repeated 33 times, ensuring that each fold contained sufficient events for training and validation and that the finally considered average model performance was sufficiently robust. A common strategy used in machine learning to tackle the problem of imbalanced data is random undersampling of the majority class. We tested this procedure during stratified splitting of training data in internal cross-validation. We did not observe significant differences in feature selection for both endpoints and therefore do not present the results from these experiments.

In conclusion, in the present modelling study, we developed and independently validated radiomic signatures for the prognosis of tumour response to nCRT and FFDM in patients based on T2w MR and CT imaging. We studied feature classes of differing complexity and observed that a combination of LoG transformed intensity features from MRI and CT together with clinical T stage (cT) led to the highest prognostic value for the prediction of nCRT, while CT-based SOT features performed well in external validation for FFDM. In our external validation study, only one of the radiomics signatures could be validated. This indicates an overall lack of reproducibility and the need for standardization in radiomics procedure and reporting before its prospective clinical application.