Radiomic prediction of radiation pneumonitis on pretreatment planning computed tomography images prior to lung cancer stereotactic body radiation therapy

This study developed a radiomics-based predictive model for radiation-induced pneumonitis (RP) after lung cancer stereotactic body radiation therapy (SBRT) on pretreatment planning computed tomography (CT) images. For the RP prediction models, 275 non-small-cell lung cancer patients consisted of 245 training (22 with grade ≥ 2 RP) and 30 test cases (8 with grade ≥ 2 RP) were selected. A total of 486 radiomic features were calculated to quantify the RP texture patterns reflecting radiation-induced tissue reaction within lung volumes irradiated with more than x Gy, which were defined as LVx. Ten subsets consisting of all 22 RP cases and 22 or 23 randomly selected non-RP cases were created from the imbalanced dataset of 245 training patients. For each subset, signatures were constructed, and predictive models were built using the least absolute shrinkage and selection operator logistic regression. An ensemble averaging model was built by averaging the RP probabilities of the 10 models. The best model areas under the receiver operating characteristic curves (AUCs) calculated on the training and test cohort for LV5 were 0.871 and 0.756, respectively. The radiomic features calculated on pretreatment planning CT images could be predictive imaging biomarkers for RP after lung cancer SBRT.

Overall scheme. Figure 1 illustrates the overall workflow of the proposed scheme for RP prediction. First, the ROIs for calculating radiomic features were extracted from the treatment planning data. Four ROIs were created by extracting lung volumes excluding gross tumor volume (GTV) irradiated with more than 0, 5, 10, www.nature.com/scientificreports/ and 20 Gy for each patient, which were defined as LV0, LV5, LV10, and LV20, respectively. Second, a total of 486 radiomic features, which consisted of 54 original radiomic features (14 histogram-based and 40 texture features) and 432 wavelet-based radiomic features with 8 wavelet decompositions (54 features × 8 wavelet decomposition filters), were calculated from each ROI. Third, the significant features were selected as a signature (set of selected significant features) for each ROI, and then RP predictive models with signatures for each ROI were built to classify patients with and without grade ≥ 2 RP using a least absolute shrinkage and selection operator (LASSO) logistic regression. Finally, the constructed models for each ROI were evaluated with training and test cases by areas under the receiver operating characteristic (ROC) curve (AUC), sensitivity, specificity, and accuracy.

ROI extraction.
The ROIs for calculating radiomic features were extracted using structure data for total lung volumes excluding the GTV and dosimetric data obtained from the treatment planning data. Four ROIs were created by extracting lung volumes irradiated with more than 0, 5, 10, and 20 Gy for each patient, which were defined as LV0, LV5, LV10, and LV20. The image processing was performed using in-house software with MATLAB 2019a (MathWorks).  (Table S1). Then, 432 wavelet-based radiomic features were derived from the same 54 features as the original features on each of the eight wavelet decomposition images 22 . The wavelet transform can decompose multiscale local lung texture patterns related to RP and non-RP in an image into several low-and high-frequency components 23 . The decomposition was performed by applying either a low-pass filter (scaling function, L) or a high-pass filter (wavelet function, H) in the x, y, or z direction. The eight wavelet decomposition filters consisted of a combination of three using either a low-pass filter (L) or a high-pass filter (H) in each direction. Figure 2 shows the CT images with RP decomposed based on the wavelet analysis. The original image shows the texture patterns different from 7 (HLL to HHH) wavelet decomposition images, although the LLL image is similar to the original image. In this study, we assumed that radiomic features on original images could represent lung texture properties different from wavelet-based radiomic features provided by the wavelet decomposition images.   24 , that linear dependence does not imply dispensability, and individual dispensability does not imply pairwise dispensability, we decided to employ the original CT images. Nevertheless, to avoid the risk of overfitting on the RP prediction model, the significant features among 486 radiomic features were reduced to a number of features using a LASSO logistic regression with MATLAB 2019a (MathWorks) 25 . This process was repeated 1000 times for each ROI. The radiomic features with the highest frequency were extracted from the 486 radiomic features to build the RP predictive model for each ROI 26 . The RP grades were annotated by 1 for RP = 2 or above and 0 for otherwise, as the teacher data to be inputted into the logistic regression models. The logistic regression models were constructed with the radiomic signatures for each ROI to classify patients with and without grade ≥ 2 RP.

RP predictive model with dose-volume histogram parameters. A logistic regression model with
four dose-volume histogram (DVH) parameters of the lung volumes receiving more than 5, 10, and 20 Gy (V5, V10, V20) and MLD was also constructed for comparisons between DVH and the radiomics models. The most frequently selected significant DVH parameter combination using a LASSO logistic regression, which was similar to the signature construction mentioned above, was used for the RP predictive model.

Construction of an ensemble averaging model with imbalanced datasets adjustment strategy.
In this study, only 22 (9%) of the 245 training patients had grade ≥ 2 RP. Imbalanced datasets cause performance loss in the classification model 27 . To address the issue of imbalanced data, this study used an imbalance adjustment strategy adapted from that described by Schiller et al. 28 . As shown in Fig. 3a, the data were partitioned into a collection of balanced subsets. Thus, 10 subsets consisting of all 22 RP cases and 22 or 23 randomly extracted non-RP cases were created from the imbalanced training dataset of 245 patients. The recommended number of features should be generally smaller than around one-tenth of the number of training cases to avoid the overfitting problem 29,30 . Additionally, in leave-one-out cross-validation performed beforehand for each subset with an increasing number of top features for RP prediction, the predictive models with the top four features showed the highest performance (Fig. S1). Therefore, for each subset with each ROI, the top four significant features were selected for the construction of signatures, and 10 predictive models were built with the signatures using LASSO logistic regression. Significant DVH parameter combination was also selected for each subset using LASSO logistic regression for the RP predictive model. Finally, as shown in Fig. 3a, an ensemble averaging model was newly built by averaging the RP probabilities of the 10 predictive models constructed from 10 different subsets made from 245 training cases for each ROI.
RP predictive model training and testing. The ensemble averaging model for the RP prediction was considered to be trained with 245 training cases. Four significant features were selected in each predictive model www.nature.com/scientificreports/ for each ROI using LASSO logistic regression. As shown in Fig. 3b, the built ensemble averaging model was tested with 30 test cases in the same manner as the model training. The RP predictive model was evaluated according to the AUC, sensitivity, specificity, and accuracy. The sensitivity, specificity, and accuracy are given by and where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively. The AUC was obtained from the area under the ROC curve, which was a plot of sensitivity against (1-specificity) by changing the discrimination threshold of a classifier system.
Ethical statement. This retrospective study was performed with the ethical approval of the institutional review board of our hospital. Written informed consent was obtained from all subjects within the dataset collected in our hospital. All of the methods were carried out in accordance with the Declaration of Helsinki. Table 2 shows AUCs, sensitivity, specificity, and accuracy of the ensemble averaging model for 245 training and 30 test cases. The AUCs of the RP predictive model for the training cohort with DVH parameters and radiomic signatures for LV0, LV5, LV10, and LV20 were 0.703, 0.868, 0.871, 0.905, and 0.890, respectively. The AUCs for the test cohort were 0.290, 0.557, 0.756, 0.602, and 0.608, respectively. All radiomic models showed higher performance than DVH model. Table 3 shows the top four radiomic features selected most frequently for 10 subsets for each ROI. The radiomic feature of "correlation" computed with GLCM on the original images was selected as the signature for each ROI.

Discussion
Using the radiomic features for lung ROIs dosimetrically segmented from the pretreatment planning CT images of 275 NSCLC patients, we found that the radiomic predictive models to classify patients with and without grade ≥ 2 RP performed well. In the training cohort, the AUC for the ensemble averaging model with LV10 signatures using the top four radiomic features reached the maximum value of 0.905. In the test cohort, the radiomic predictive model for LV5 reached the highest AUC of 0.756. This model for LV5 also showed a high www.nature.com/scientificreports/ AUC of 0.871 in the training cohort. Based on these results, the radiomic predictive model for LV5 was considered the best model. The prediction results of the test cohort were lower than those of the training cohort. In particular, in terms of low sensitivity, the number of RP cases in the training and test cohorts might be insufficient. The difference between CT equipment and breathing methods during image acquisition may also have affected the radiomic features. The CT scans were performed on free breathing in the training cohort, while breath-hold techniques were used in the test cohort.
The "correlation" computed with GLCM on the original images was selected as one of the frequently selected features for each ROI in Table 3. Correlation is a measure of how correlated a pixel is to its neighbor over the whole image. Figure 4 shows a bar graph of "correlation" values on the original images for LV5 of RP and non-RP cases in the training cohort and an example of pretreatment planning CT images of RP and non-RP cases. The values of "correlation" of RP cases were significantly higher than those of non-RP cases. These results indicate that the "correlation" on the original images could quantify the RP characteristics different from the one on the wavelet decomposition images and might be one of the imaging biomarkers for RP after lung cancer SBRT.
Previous studies on RP prediction are summarized in Table 4. The previous studies often used DVH parameters such as V20 and MLD as risk factors for RP prediction [8][9][10][11]14,31 . Various clinical factors and biomarkers such as cytokines, single nucleotide polymorphisms (SNPs), and microRNA have also been used for RP prediction 14,31 . In the field of radiomics, Cunliffe et al. proposed that dose-dependent texture changes between pre-and post-RT CT images could classify patients with and without grade ≥ 2 RP. When multiple features were combined in a classifier, AUC increased significantly (from 0.59 to 0.84) 12 . Moran et al. 13 found that changes in radiomic features calculated from follow-up CT images after SBRT for 14 patients were significantly correlated with post-SBRT lung injury scores provided by a radiation oncologist, and that the AUCs using GLCM texture features ranged from 0.689 to 0.750.
Previous studies used differences in radiomic features between pre-and post-treatment CT images for RP prediction 12,13 . However, this study predicted RP risk using only pretreatment planning CT images. Therefore, before new patients receive radiation therapy, it may be possible to determine the RP risk by applying treatment planning data to our RP predictive models. In addition, this method is reasonable in terms of clinical application as it requires only treatment planning data without additional clinical examinations.
This study has two limitations. First, only 22 (9%) of the 245 training patients included in this study had grade ≥ 2 RP. The imbalanced data was also a factor reducing the predictive model performance. To address these issues, the balanced subsets were sampled, and the ensemble averaging model was constructed using the 10 predictive models obtained from each subset. Second, we did not evaluate the repeatability and reproducibility of the radiomic features since we used only pretreatment planning CT. Traverso et al. reported that only radiomic features with high repeatability and reproducibility should be used in predictive models to reduce the risk of false-positive associations 32 . Therefore, to reduce the influence of radiomic feature variation on RP prediction as much as possible, we calculated the radiomic features under the same conditions for image acquisition settings, www.nature.com/scientificreports/ image reconstruction algorithm, digital image preprocessing, and software used to extract radiomic features in the training cohort. Moreover, the constructed models were tested in a separate test cohort, which was scanned on another equipment to validate repeatability and reproducibility. Nevertheless, these problems should be considered a limitation because texture features were less reproducible than histogram features 32 and 11 of the 16 radiomic features selected as signatures for 10 subsets with four ROIs, shown in Table 3, were texture features.
In conclusion, the results of this study demonstrated the potential of RP predictive models after lung cancer SBRT using radiomic features for lung ROIs segmented by dosimetric information on pretreatment planning CT images. All radiomic models showed higher performance than the DVH model. The radiomic predictive model for LV5 was considered as the best model with a high AUC of 0.871 and 0.756 in both the training and test cohorts. Radiomic features calculated from pretreatment planning CT images can be used as imaging biomarkers for RP prediction in SBRT treatment planning for lung cancer.  Table 4. AUCs for different studies using different RP prediction strategies. miRNAs, micro RNAs; SNPs, single nucleotide polymorphisms; RF, random forest; SVM, support vector machine; MLP, multilayer perceptron; CFRT, conventional fractionated radiotherapy. RP grades were decided according to the Common Terminology Criteria for Adverse Events. RP scores were assigned based on identification of radiographic changes between pre-and post-RT CT images.