Development of a robust radiomic biomarker of progression-free survival in advanced non-small cell lung cancer patients treated with first-line immunotherapy

We aim to determine the feasibility of a novel radiomic biomarker that can integrate with other established clinical prognostic factors to predict progression-free survival (PFS) in patients with non-small cell lung cancer (NSCLC) undergoing first-line immunotherapy. Our study includes 107 patients with stage 4 NSCLC treated with pembrolizumab-based therapy (monotherapy: 30%, combination chemotherapy: 70%). The ITK-SNAP software was used for 3D tumor volume segmentation from pre-therapy CT scans. Radiomic features (n = 102) were extracted using the CaPTk software. Impact of heterogeneity introduced by image physical dimensions (voxel spacing parameters) and acquisition parameters (contrast enhancement and CT reconstruction kernel) was mitigated by resampling the images to the minimum voxel spacing parameters and harmonization by a nested ComBat technique. This technique was initialized with radiomic features, clinical factors of age, sex, race, PD-L1 expression, ECOG status, body mass index (BMI), smoking status, recurrence event and months of progression-free survival, and image acquisition parameters as batch variables. Two phenotypes were identified using unsupervised hierarchical clustering of harmonized features. Prognostic factors, including PDL1 expression, ECOG status, BMI and smoking status, were combined with radiomic phenotypes in Cox regression models of PFS and Kaplan Meier (KM) curve-fitting. Cox model based on clinical factors had a c-statistic of 0.57, which increased to 0.63 upon addition of phenotypes derived from harmonized features. There were statistically significant differences in survival outcomes stratified by clinical covariates, as measured by the log-rank test (p = 0.034), which improved upon addition of phenotypes (p = 0.00022). We found that mitigation of heterogeneity by image resampling and nested ComBat harmonization improves prognostic value of phenotypes, resulting in better prediction of PFS when added to other prognostic variables.

www.nature.com/scientificreports/ standardization protocols 20 , for radiomics analysis. We account for differences in image parameters by resampling the images to the minimum voxel spacing parameters and using a nested ComBat approach for harmonization 21 by the image acquisition parameters. ComBat is a harmonization method originally developed for genomics that can correct variation in features due to imaging parameters by using empirical Bayes to estimate location and scale parameters to shift data. In previous studies, ComBat has been shown to harmonize radiomic features from different CT protocols as well as reduce percentage of features with significantly different distributions by batch effect 22,23 . While ComBat is fast and easy to use, current implementations of ComBat are only able to harmonize by a single batch effect at a time and are therefore unable to adequately harmonize datasets that are heterogeneous in more than one batch effect. The nested ComBat approach used in our study enables harmonization by multiple batch effects by implementing sequential harmonization. We hypothesize that the addition of the radiomic phenotypes to established clinical biomarkers including PDL1, will improve the accuracy of the prognostic model in the prediction of progression-free survival in the patients with NSCLC. We also hypothesize that mitigation of the heterogeneity introduced by the voxel spacing and image acquisition parameters in radiomic features will improve the prognostic performance of the radiomic phenotypes.

Materials and methods
Study sample and data. This single-centre retrospective, observational study was conducted at the Hospital of the University of Pennsylvania between November 2016 to December 2020. The study was approved by the University of Pennsylvania's Institutional Review Board (IRB) committee (IRB protocol number-848796) under a waiver of informed consent. All methods in this study were in accordance with the Declaration of Helsinki. Patients with stage 4 NSCLC treated with first-line pembrolizumab based therapy at our institution were identified. The details of the monotherapy and combination therapy regimens are included in Supplementary  Table S1. The shooting ranges of the CT scans were the thoracic regions of the patients. The 3D tumor CT volumes were manually segmented by board-certified, fellowship-trained thoracic radiologists S.I.K (with 18 years of clinical experience) and L.R. (with 4 years of clinical experience) using the semi-automated ITK-SNAP software (version 3.6.0) 20 .
Radiomic feature extraction. The impact of heterogeneity introduced by differences in the physical dimensions of the images was mitigated by performing anisotropic resampling. The images were resampled to the minimum value across each of the voxel spacing parameters (x*y*z − 0.54 mm, 0.54 mm, 0.8 mm). A total of 102 radiomic features were extracted from each tumor's entire segmented volume in the resampled images, using CaPTk 19 . The CaPTk software can visualize the overall tumor burden of lung cancer, including metastatic lesions. However, in our study, only primary lesions were analyzed. The radiomic features represent the following eight type of descriptors: (1) Intensity features or first-order statistics-These features capture the voxel grey-level intensities within a neighborhood. (2) Histogram-based features-The computation of these features involves the generation of an intensity histogram by discretization of the original intensity distribution. (3) Volumetric features-These features are computed by utilizing the voxel intensities in the ROI and are based on the relationship between discretized intensity and the fraction of the volume containing the least intensity. (4) Morphologic features-These features describe geometric aspects of a region of interest (ROI), such as area and volume. (5) Gray level run length matrix features-These features are based on quantifying gray level runs as the lengths of consecutive pixels. (6) Neighboring gray tone difference matrix features-They are rotation-independent features based on gray-level relationships between neighboring voxels and aim to capture the coarseness of the overall texture. (7) Gray level size zone matrix features-The grey level size zone matrix (GLSZM) counts the number of groups (or zones) of linked voxels. Voxels are linked if the neighboring voxel has an identical discretized grey level. Whether a voxel classifies as a neighbor depends on its connectedness. (8) Local binary pattern features (LBP)-These features are used to describe the local texture patterns in an image. The LBP works in a block size of 3 × 3, in which the center pixel is used as a threshold for the neighboring pixel, and the LBP code of a center pixel is generated by encoding the computed threshold value into a decimal value. A list of features belonging to each family which were used in the analysis and their formulae can be found in Supplementary Tables S4 and  S5 respectively.
Radiomic feature harmonization. The impact of heterogeneity introduced by the image acquisition parameters (Table 1) was mitigated using a nested ComBat approach. The original ComBat method was developed to harmonize by a single covariate 21 , so we incorporated a "nested" approach to correct for multiple batch effects. The nested approach was initialized with the original radiomic features as the input data and a list of batch effects-in this case, contrast enhancement and kernel resolution, resulting in two batch effects. Age, sex, race, PD-L1 expression, ECOG status, body mass index (BMI), smoking status, recurrence event and months of www.nature.com/scientificreports/ progression-free survival were all protected during harmonization to prevent removal of biological variation of interest. Harmonization order was determined by iterating through all possible permutations of the list of two batch effects. At each iteration, the original input data was sequentially harmonized with ComBat where the original data was harmonized by the first batch effect in the permutation corresponding to the iteration, then the resulting harmonized data was harmonized again by the second batch effect in the permutation, resulting in a feature set that had been harmonized two times. The resulting harmonized feature sets were each assessed for significant differences between distributions for each batch effect using the Anderson-Darling (AD) test 24 at a p value significance level of 0.05. The harmonized feature set with the lowest number of features with detected differences in distribution across all batch effects was selected as the final output. Features remaining significantly affected by batch effects after ComBat harmonization were deemed as non-robust and discarded from further analysis.

Radiomic phenotype identification.
To identify intrinsic imaging phenotypes, unsupervised hierarchical clustering 25 was performed on the harmonized and reduced feature set. The input to the clustering algorithm is a set of feature vectors, consisting of the harmonized features for each scan. The number of distinct clusters k obtained from the unsupervised hierarchical clustering algorithm are interpreted as intrinsic imaging phenotypes in the cohort. An agglomerative approach was used to create a hierarchical clustering of the patients using Euclidean distance for distance between the feature vectors and Ward's minimum variance method as the clustering criterion 26 . The optimal number of distinct phenotypes, k, was determined by assessing the stability and significance of each phenotype for each value of k that was considered. The optimal number of stable phenotypes was determined using consensus clustering 27 , where dataset was sub-sampled and cluster arrangements were determined using varying values of k. For each value of k, the proportion that two patients occupied the same phenotype cluster out of the number of times they appeared in the same subsample was determined and stored in a consensus matrix, from which a cumulative distribution function (CDF) was determined. Cluster stability, determined by the area under the CDF curve, was evaluated for each value of k. Statistical significance of the identified, stable phenotypes was evaluated using the SigClust method 28,29 . Here, the significance of the cluster index, defined as the sum of within-cluster sums of squares about the cluster-mean divided by the total sum of squares about the overall mean was tested against a null distribution, simulated using 10,000 samples from a Gaussian distribution fit to the data. The test was performed at each phenotype split to determine statistical significance (p < 0.05).

Details of prognostic models and their association with survival outcome. Kaplan-Meier (K-M)
curves and log-rank test using the entire cohort assessed the performance of the model built using the radiomic phenotypes derived from the harmonized feature set (where radiomic phenotype cluster was coded as a categorical variable, thus reducing the dimensionality of the original feature set) in separating participants above versus below the median score during the prognosis of progression-free survival (PFS) 30 . The p value from the logrank test indicates if the model can provide a statistically significant (p < 0.05) separation between patients with prognostic scores above and below the median prognostic score. The performance of this model is compared with that of models built using only the clinical covariates (PD-L1 expression, ECOG, BMI and smoking status ( Table 2)) and with a multivariable model built with a combination of the clinical covariates and the radiomic phenotypes derived from the harmonized features, to measure the improvement in prognostic ability caused by the combination of radiomic and established clinical prognostic factors. The performance of the model using the radiomic phenotypes derived from the harmonized radiomic features was also compared with the performance of the model built using the radiomic phenotype derived from the non-harmonized features, to evaluate the impact of harmonization on the prognostic ability of the features. www.nature.com/scientificreports/ To analyze the difference in survival outcomes between patients treated with PEMBRO monotherapy vs. combination therapy (PEMBRO + chemotherapy), we also performed stratified analysis to evaluate K-M curves using the log-rank test to assess the performance of the prognostic models for patients who received monotherapy compared to patients who received combination therapy.
Further, fivefold cross-validated Cox proportional-hazards regression analysis with 200 iterations was performed on all the models described above. The discrimination capacity of the models was assessed using the concordance statistic (c-statistic) 31 .
Assessment of the biological significance of the radiomic phenotypes. The biological significance of the radiomic phenotypes was studied by assessing the ability of the phenotypes to predict PD-L1 expression of the tumors using a random forest classifier. The predictor and outcome variables were divided into training and test sets (70:30 ratio) and fivefold cross-validated AUCs for the prediction of PDL1 expression were computed. The distribution of the continuous clinical variables of PDL1 expression and BMI across patients belonging to the two phenotypes was also compared using boxplots. The relation between the radiomic phenotypes and the categorical clinical variables (PDL1 expression, BMI, ECOG and smoking status) was tested using the chi-square test, with the null hypothesis that the clinical variables are not associated with the phenotypes.

Results
Patient characteristics. The median age of the patients was 67 years (range [38,90] years), 49% of the cohort was male ( Table 3) and 36% of the cohort consisted of current smokers ( Table 2). The entire cohort received first-line therapy with PEMBRO, with 28.9% receiving monotherapy (PEMBRO) and 71.1% receiving combination therapy ( Table 4). The details of the therapy are included in Supplementary Table S1.
Radiomic feature harmonization. The percentage of features with significantly different distributions arising from the batch effects was reduced after applying nested ComBat to the original features (Supplementary  Table S2). A total of 64.7% of the original radiomic features were robust to differences introduced by the batch effects.
Radiomic phenotype identification. From the 102 initially extracted radiomic features, 34.6% of the features were significantly affected by batch effects after ComBat harmonization and were dropped. Two distinct radiomic phenotypes were identified, with 56 tumors in phenotype 1 and 51 tumors in phenotype 2 ( Fig. 1) (p = 0.02 for SigClust test of two clusters versus one). The cumulative density function curves generated from the harmonized and non-harmonized feature sets are included in the Supplementary Fig. S1. The radiomic phenotypes derived from the 102 non-harmonized feature set are included in Supplementary Fig. S3.   (Fig. 2). The multivariable model of PFS, incorporating clinical covariates (PDL1 expression, ECOG status, BMI and smoking status), yielded log-rank p = 0.034 with a statistically significant separation of K-M curves for patients above  www.nature.com/scientificreports/ versus below the median prognostic score. The multivariable model of PFS incorporating radiomic phenotypes with clinical covariates increased the separation of the curves and yielded log-rank p = 0.00022 (Fig. 3). When PFS was analyzed by line of therapy using the multivariable model involving phenotypes and clinical variables, a statistically significant separation of the K-M curves for patients above versus below the median prognostic score was observed for patients treated with combination therapy (log-rank p = 0.0095), whereas there was no appreciable separation for patients who received monotherapy (log-rank p = 0.32) (Fig. 4). We also built a multivariable model combining the radiomic phenotypes derived from the original feature set without ComBat har-   Table S3).
The multivariable PFS model with the radiomic phenotypes and clinical covariates of patients treated with monotherapy yielded a c-statistic of 0.55 (95% CI 0.51-0.61); the model with the radiomic phenotypes and clinical covariates of patients treated with combination therapy yielded a c-statistic of 0.60 (95% CI 0.52-0.62) ( Table 6).
Biological significance of the radiomic phenotypes. Radiomic phenotypes were used to predict PDL1 expression using a Random Forest classifier and obtained a fivefold cross-validated AUC of 0.56 and a non-cross-validated AUC of 0.63 (Fig. 5). The average value of PDL1 expression for patients belonging to phenotype 1 was 34.5% and those belonging to phenotype 2 was 33%. The average value of BMI for patients belonging to phenotype 1 was 26.7 and those belonging to phenotype 2 was 26.8 (Fig. 6).

Discussion
To our knowledge, no previous radiomics-based study has assessed the ability of radiomics to predict outcomes to first-line pembrolizumab in NSCLC. In addition, we address several areas of unmet need in the NSCLC immunotherapy literature including addressing the impact of heterogeneity of image acquisition and reconstruction parameters on the information extracted from the radiomic features and prediction modelling using individual radiomic features. In addition, we defined two radiomic tumor phenotypes which differ in terms of size and tumor.
The tumors in our sample demonstrate differences in terms of smoothness of the tumor surface boundaries between the two radiomic phenotypes. They also demonstrate differences in terms of size, with tumors belonging to phenotype 2 apparently bigger than tumors belonging to phenotype 1 and having potentially more irregular borders, but otherwise visually appearing similar morphologically (Fig. 7A and B). For example, the average tumor volume of the patients belonging to phenotype 1 is 40.2 cm 3 and the average tumor volume of the patients belonging to phenotype 2 is 118.7 cm 3 . The tumor volumes of the representative tumors shown in Fig. 7A belonging to phenotype 1 are 7.6 cm 3 (A1), 5.1 cm 3 (A2) and 5.2 cm 3 . The tumor volumes of the representative tumors shown in Fig. 7B belonging to phenotype 2 are 85.3 cm 3 (B1), 736.6 cm 3 (B2) and 150.8 cm 3 (B3). This observation suggests that while there may not be additional differences that are visually appreciable besides tumor size, the radiomics analysis may be able to capture patterns relevant to response to therapy and prognostic factors that may not be appreciable by the human eye such as internal tumor regions of tumor hypoxia in larger tumors. In addition, larger tumors are known to have a worse prognosis with each increase in cm in diameter corresponding to a higher t-stage in NSCLC staging systems. www.nature.com/scientificreports/ Our study demonstrates that use of a radiomic feature harmonization strategy (ComBat) to control the heterogeneity introduced by voxel spacing and image acquisition parameters improves the prognostic performance of the radiomic phenotypes. The phenotypes based on the features derived from images with heterogeneity in parameters mitigated using resampling and harmonization techniques, when combined with the clinical covariates, gave a c-statistic of 0.63, [0.54, 0.64] and a statistically significant separation (log-rank test p value = 0.00022) between the patients at low risk from those at high risk from therapy. In comparison, the volume of the tumor regions, when combined with the clinical covariates, could only give a c-statistic of 0.58, [0.52, 0.60]. A multivariable model built from the original features without ComBat harmonization yielded a c-statistic of 0.51 (95% CI 0.49-0.54) and could not produce a statistically significant separation (log-rank test p value = 0.16) between patients above versus below the median prognostic score. This suggests that radiomic features derived from those images where the heterogeneity in imaging parameters has been mitigated using resampling and harmonization techniques result in phenotypes with improved signature signal and better prognostic performance. Thus, we conclude that our imaging biomarker can be combined with PDL1 and other established clinical markers to enhance therapy response prediction. The multi-variable prognostic model built by combining our imaging biomarker with clinical markers gives a better prognostic performance compared to the model built by combining clinical markers with tumor volume, another important prognostic marker explored in previous studies 34,35 . Larger studies are warranted to fully validate these findings.
The phenotypes based on the features of patients treated with PEMBRO monotherapy, when combined with the clinical covariates, gave a c-statistic of 0.55, [0.51, 0.61], and could not produce a statistically significant Figure 5. ROC curve for the prediction of PDL1 expression using radiomic phenotypes: radiomic phenotypes were used to predict PDL1 expression using a Random Forest classifier and obtained a fivefold cross-validated AUC of 0.56. Figure 6. Distribution of PDL1 expression and BMI for radiomic phenotypes 1 and 2. The average value of PDL1 expression for patients belonging to phenotype 1 was 34.5% and those belonging to phenotype 2 was 33%. The average value of BMI for patients belonging to phenotype 1 was 26.7 and those belonging to phenotype 2 was 26.8. The differences in the mean values of PDL1 expression and BMI between patients belonging to phenotype 1 and phenotype 2 are not statistically significant.  www.nature.com/scientificreports/ separation (log-rank test p value = 0.32) between the low risk and high risk patients. The phenotypes based on features of patients treated with a combination of PEMBRO and chemotherapy, when combined with the clinical covariates, gave a c-statistic of 0.60, [0.52, 0.62], and gave a statistically significant separation (log-rank test p value = 0.0095) between the patients at low risk from those at high risk from therapy. This indicates that the model can predict progression-free survival for patients treated with combination therapy better than for those treated with monotherapy. There are important limitations to our study. First, our study sample is relatively small. As a proof of concept, it is important to validate our findings with larger cohorts. Also, since most of the patients had adenocarcinoma and the patients with squamous carcinoma were only a small subset of our patient population. Thus, we did not separate patients based on their histology in our analysis. Further, we have performed manual segmentation of the tumor regions. There are studies that indicate the effect of inter-reader variability on tumor segmentation and radiomic feature extraction 36 . However, some recent studies suggest that such variability may not necessarily affect the robustness of all radiomic features 37 . A recent preliminary study showed that radiomic features extracted from segmentations obtained by different human observers tend to be highly correlated and have similar predictive value 38 . Our future work should evaluate the effect of inter-reader segmentation variability on radiomic features and explore fully automated algorithms.

Conclusion
In this study, we developed a novel radiomic biomarker that integrates with PDL1 expression, ECOG status, BMI, and smoking status data to enhance the ability to predict progression-free survival in patients with stage 4 NSCLC, treated with first line PEMBRO monotherapy or combination therapy. We show that mitigation of the heterogeneity introduced by voxel spacing and image acquisition parameters improves the prognostic performance of the radiomic phenotypes. Our future work will explore the reproducibility of our results in a larger cohort of advanced NSCLC patients treated with first-line immunotherapy.