Radiomics-based machine learning models to distinguish between metastatic and healthy bone using lesion-center-based geometric regions of interest

Radiomics-based machine learning classifiers have shown potential for detecting bone metastases (BM) and for evaluating BM response to radiotherapy (RT). However, current radiomics models require large datasets of images with expert-segmented 3D regions of interest (ROIs). Full ROI segmentation is time consuming and oncologists often outline just RT treatment fields in clinical practice. This presents a challenge for real-world radiomics research. As such, a method that simplifies BM identification but does not compromise the power of radiomics is needed. The objective of this study was to investigate the feasibility of radiomics models for BM detection using lesion-center-based geometric ROIs. The planning-CT images of 170 patients with non-metastatic lung cancer and 189 patients with spinal BM were used. The point locations of 631 BM and 674 healthy bone (HB) regions were identified by experts. ROIs with various geometric shapes were centered and automatically delineated on the identified locations, and 107 radiomics features were extracted. Various feature selection methods and machine learning classifiers were evaluated. Our point-based radiomics pipeline was successful in differentiating BM from HB. Lesion-center-based segmentation approach greatly simplifies the process of preparing images for use in radiomics studies and avoids the bottleneck of full ROI segmentation.

Planning-CT images. All planning-CT images were generated using one of three Philips' Brilliance Big Bore RT CT scanners at our institution with the acquisition parameters provided in Table 2. Planning-CT DICOM files were manually de-identified and exported to a secured hard drive from the Eclipse radiation therapy treatment planning software (Varian Medical Systems, Palo Alto, California), into which they had been previously imported for RT planning.
Lesion identification. The planning-CT images of the BM patients were randomly divided into five sets using the Python |random.shuffle| module and were loaded into our custom-written 3D DICOM visualization web application (diCOMBINE 28 ) for lesion identifying. diCOMBINE is an open-source software developed by our group using the Python Flask 29 framework for DICOM 3D visualization and lesion point location labeling. The center points of BM lesions were labeled by an expert team comprising one staff radiation oncologist and four radiation oncology fellows. Each expert was asked to label BM center points in one of the five data sets, and a peer expert was tasked with reviewing them and validating the labels. A total of 631 validated BM center points Table 1. Radiomics-based ML models reported in the literature for distinguishing bone lesions. * Sample size is the total number of samples. † A accuracy, P precision (Specificity), R recall (sensitivity). ‡ In the semiautomated segmentation methods an expert was required to check and modify the computer-segmented ROIs slice-by-slice. www.nature.com/scientificreports/ were thus identified in the BM data set. Similarly, the planning-CT images of the HB patients were randomly divided into three sets and were loaded into diCOMBINE for HB labeling. One staff medical physicist and two medical physics graduate students were asked to identify HB points in one of the data sets each. When identifying HB points, the medical physicists were instructed to avoid non-metastatic skeletal complications (such as surgically-treated bone lesions). An average of four HB points were identified in each planning-CT image. Then, we asked each physicist to independently review and confirm the HB points that one of their peers had labeled. A total of 674 validated HB points were identified in this way. Screenshots of our diCOMBINE 3D lesion labeling web application are presented in Fig. 2. These BM and HB points were used as center points for our automated ROI delineation.

Delineation of regions of interest.
ROIs were automatically delineated in the planning-CT images using geometric shapes centered on the expert-identified point locations. We used four spherical (SP) and five cylindrical along the z-axis (CY) ROIs of various sizes. The characteristics of the ROIs used are specified in Table 3. The size ranges were defined to extend from the size of a large bone lesion ( ∼ 15 mm) 30 to the maximum size of a spinal vertebra ( ∼ 50 mm) 31,32 .
Radiomics feature extraction. The pydicom package (https:// pydic om. github. io/ pydic om/ stable/) was used to read DICOM CT images and normalize pixel data to Hounsfield Units. Then, the normalized CT slices were stored as 3D raster data using the pynrrd package (version 0.4.2) (https:// pypi. org/ proje ct/ pynrrd/ 0.4. 2/). The pynrrd package was also used to generate 3D binary masks from each of the nine ROIs listed in Table 3.
Finally, The open-source PyRadiomics package (version 3.0.1) 33 was used to calculate the 3D quantitative radiomics features. For each of the nine ROIs listed in Table 3, we extracted 107 radiomics features from each of the planning-CT images. We did not apply any filters prior to feature extraction.   Radiomics workflow. Our complete radiomics-based ML workflow is presented in Fig. 3. After extracting radiomics features for each ROI, we scaled the feature space using z-score normalization 38 . Then, we randomly divided the data set into 70% and 30% stratified training and testing sets, respectively. Each stratified set contained approximately the same BM/HB samples ratio as the initial data set. The training set was used for FS, and ML model development using 5-fold cross-validation 39 . The test set was used for the final performance evaluation. In the present study, we examined the performance of 13 FS methods and 12 ML classifiers as shown in Fig. 3 and described in the following sections.  Performance evaluation. The performance of our radiomics-based ML models were measured using the test data set. The standard error of calculations was reported using 5-fold cross-validation on the training data set. We used the area under the receiver operating characteristic curve 60 (AUC) for performance evaluation. Also, we reported precision and recall for our best-performing pipeline. Matplotlib (version 3.4.3) 61 was used to generate figures.

Ethics declarations. This retrospective study was approved by the Research Ethics Board of the McGill
University Health Centre, Montreal, Quebec, Canada, with the waiver of informed consent. We confirmed that all research were performed in accordance with the relevant guidelines and regulations.

Results
Radiomics feature space. A JSON file of the metadata of extracted radiomics features is available in the supplementary dataset in our public repository 62 . The predictive performance of the different FS methods and ML classifiers was evaluated for each ROI on the test set using the AUC, precision, recall, and F-1 scores. Examples of receiver operating characteristic (ROC) curves are presented in Fig. 4 for the a) NB (a poor performance), b) RF (a good performance), and c) GPR (the best performance) ML classifiers on the test data set (red squares) and 5-fold validation set (pink lines) using 20 mm spherical ROI (SP20) with no FS. Note that 20 mm SP ROI was selected for visualization purposes throughout this paper for no particular reason. The effect of using the various geometric ROIs will be presented later in this paper. Raw data values, including confusion matrices, ROC graphs, and performance tables (precision, recall, F-1, ROC-AUC values on training, validation, and test sets) for all ML classifiers on all ROIs are provided in the output data folder in our public repository 62 .
Effect of feature selection. An example of an AUC grid for different combinations of ML classifiers and FS methods is presented in Fig. 5 for the 20 mm SP ROI. As can be seen in Fig. 5, the best results were achieved by the GPR and NNet classifiers with LASSO FS methods. The RFECV, VT, LASSO, and TREE FS methods out-

Effect of geometric ROIs.
Two examples of the effect of using geometric ROIs with different sizes and shapes are presented in Fig. 6. For plot (a), we used no FS. For plot (b), we used the best performing FS method (LASSO). As can be seen in Fig. 6, the size of the ROI had a significant effect on the performance of our radiomics-based ML models. In general, a smaller ROI resulted in superior performance of models. For example, for the GPR classifier with no FS (the rightmost column of Fig. 6a), the AUC was improved from 86 to 94% when we moved from the SP50 to the SP15 ROI. SP15 resulted in the best overall performance when no FS was used. When we employed FS methods, the ensemble ROIs (like E4SP and E9SC) out-performed the single-size ROIs. This was most pronounced for the LASSO method, which is presented in Fig. 6b. The AUC grids for other FS methods are provided in the output data folder in our public repository 62 . Comparing Fig. 6a and 6b revealed that some ML classifiers (like SVM or GPR) were more sensitive to the use of FS than others (like NNet or RF). Also, we noticed that FS was more important when using large ROIs (such as SP50 or CY50) or ensemble ROIs (such as E4SP or E9SC).
To visualize the effect of the size of ROI on the performance of our models, in Fig. 7 we show the AUCs of our best performing ML classifier (GPR) for (a) single geometric ROIs (sorted by volume), and (b) for ensemble ROIs (sorted by total volume). To show the effect of the use of FS, we plotted the results without FS (blue circles), and with our best-performing FS method (LASSO) (red squares). It can be seen that a smaller ROI resulted in a better performance. Also, FS was more important for larger ROIs (like SP50 and CY50) and ensemble ROIs (like E9SC).
The grid of the F-1 scores for the best performing FS method (LASSO) is presented if Fig. 8. The GPR, NNet, and L-SVM classifiers achieved 0.9 F-1 score in detecting BM using the ensemble ROIs. The AUC, precision, recall, and F1 score of our best performing pipeline, corresponding to the E9SC ROI, LASSO FS method, and GPR ML classifier, were 96%, 92%, 91%, and 0.9, respectively. The performance of our models for all combinations of FS methods, ML classifiers, and ROIs are provided in the supplementary dataset 62 .

Discussion
In this study, we investigated the feasibility of using a single-point-based geometric ROI to develop a radiomics pipeline to distinguish BM and HB locations in planning-CT images of cancer patients with BM. We investigated various FS methods, and ML classifiers using point-based geometric ROIs with various shapes and sizes. The time and effort needed for manual 3D segmentation of ROI are significant limitations to achieving large real-world image data sets. This, in turn, hinders the generation of generalizable radiomics-based prognostic ML models 63 for use in the clinic. Another limitation of manual lesion segmentation is inter-observer variability, which has been shown to have a significant impact on the performance and reproducibility of radiomics-based pipelines 64 . Furthermore, manual segmentation tools, designed for radiation therapy treatment planning, intend to load one patient at a time. Therefore, switching between patients is another time-consuming process that slows down the lesion delineation for multiple patients in the research context 65 .
Our in-house-developed open-source 3D DICOM visualization and lesion identification tool (diCOMBINE 28 ) allowed our collaborating radiation oncologists to quickly review planning-CT images of several hundred patients and efficiently identify 676 BM centers. They found diCOMBINE fast and easy to use, allowing each expert to label around 150 lesions per hour. Based on our experts' anecdotal experience, single-point-based geometric ROI delineation was 10-15 times faster than full manual 3D segmentation. These lesion centers were used to generate ROIs automatically. Defining point-based geometric ROIs, instead of full 3D manual segmentation of the ROIs, allowed us to rapidly generate a large sample set, minimize expert imposed uncertainties, and investigate the effect of the size and shape of the ROIs in the performance of our radiomics models. Besides, our point-based radiomics models will allow us to study the feasibility of building an automated BM-identifying pipeline. To the best of our knowledge, no studies on automated BM delineation have been published previously.
Radiomics extracts hundreds of features from an ROI. However, these features are generally highly correlated and contain much noise. Therefore, it is essential to apply proper FS methods to achieve a robust radiomics-based ML pipeline. Among the seven FS methods we examined in this study, we found that PCA and ICA resulted in lower AUC values than the VT, LASSO, and TREE FS methods. One reason for this difference was that VT, LASSO, and TREE methods automatically defined the optimal number of features, while in PCA and ICA, the number of features was predefined. For highly-correlated features, the optimal number of features (f) is roughly proportional to the square root of the sample size (n) 66 . Accordingly, 30 features would appear to be less than the optimal number of features for our sample size ( f = √ n = √ 1305 = 36 ). For studies with small sample sizes, such as Zhang et al. 9 , that used 112 samples, PCA with 10 features seems to be a suitable FS method. We also noticed that the effect of the FS method depends on the selected ML classifier. For ML classifiers that had built-in FS methods (i.e., RF and NNet), applying FS methods in some cases worsened the overall performance of the model. Inversely, for ML classifiers that did not have built-in FS methods (i.e., GPR), adding FS had a significant effect on the performance of the ML classifier. The effect of the FS method was more significant when working with ensemble ROIs that had many more features. For example, the AUC value for the GPR ML classifier using the E9SC ROI (963 features) improved from 0.52 to 0.97 when the LASSO FS method was used, as shown in Fig. 6.
Among the ML methods we examined in this study, we found that GPR, NNet, SVM, and RF resulted in the highest AUC values and F-1 scores. We showed that the GPR classifier outperformed the NNet classifier for most ROIs. However, for the ensemble ROIs (in which the number of features was large), GPR required a We found that our radiomics-based ML models performed slightly better on spherical ROIs compared to cylindrical ROIs of similar volumes. More significantly, we found that the smaller ROIs (15 and 20 mm) resulted in better performance compared to the larger ROIs (30 and 50 mm) (Fig. 6). This might be due to the fact that in larger ROIs there are probably more outlier features captured from bone or organs/tissue surrounding the lesion of interest. Performances of our models did not improve considerably by decreasing the size of ROI below 20 mm, which is roughly the size of a large BM lesion 30 . As can be seen in Fig. 7, our models performed better on the ensemble ROIs compared to the single ROI when used with FS methods. This could be due to having many features in the ensemble ROIs. For example, the E9SC ROI contains 9 ×107 = 963 features. For such a prominent feature space, FS methods become very important.
Although various radiomics pipelines have been previously developed and reported to classify bone lesions, our radiomics-based ML pipeline, reported here, offers several advantages compared to preceding efforts, mainly in the context of palliative radiotherapy planning. First, we, pragmatically, used planning-CT images of cancer patients for extracting radiomics features, whereas prior studies used hybrid modalities or diagnostic-CT images (as listed in Table 1). Hybrid modalities allow the development of high-quality prognostic pipelines. However, these pipelines are less clinically applicable in palliative radiotherapy treatment planning for BM, which is often primarily based on a patient's planning-CT scan. Second, all ML classifiers presented in the prior studies were restricted to full 3D segmentation of the lesion volumes. In the real-world clinical workflow for palliative radiotherapy of BM, it is common to use single-slice or lesion-center-based treatment planning with radiation oncologists often defining treatment field limits rather than lesion contours. Therefore, pipelines that require full 3D segmentation of ROI have limited application in real-world palliative radiotherapy 65 . Moreover, 3D segmentation of the ROI is a time-consuming bottleneck that likely compelled all the prior studies to train and test their radiomics pipelines with limited sample sizes. Training on a small sample size diminishes the generalizability and clinical applicability of a radiomics pipeline. In comparison, our point-based pipeline allowed us to avoid the labor-intensive manual segmentation step and train and test our pipeline on a large data set. Finally, in this study, we investigated the effects of FS methods, and ML classifiers in achieving the optimal prognostic model using geometric ROIs. To the best of our knowledge, no prior study performed such a comprehensive optimization.
Our study had some limitations. First, we selected BM and HB from two sets of separate patients. This selection might drive the risk of potential susceptibility to bias if there is a systematic difference between the two sets of images. However, our rationale for using non-metastatic cancer patients to select HBs was to eliminate the possibility of error in labeling HBs by our medical physicists. Second, our collaborating medical physicists could not identify non-metastatic skeletal complications from metastatic bone lesions. Therefore, the non-metastatic skeletal complications (i.e., surgically-removed lesions or bone islands) were ignored when labeling HB points. A solution for this problem would be using pathology data to identify non-metastatic and metastatic lesions but this would significantly increase the required effort. Third, we used a nearly balanced data set of HB and BM patients in this study. However, having an imbalanced sample ratio is common in many real-world radiation oncology outcome data sets 69,70 . A study with an imbalanced data set is required to evaluate the effect of sample imbalance when building high-performance real-world radiomics-based ML models [71][72][73] . Forth, while using geometric ROIs significantly simplified the lesion delineation procedure, it ignored some lesion details such as size and shape. One alternative that can be explored as future work is to use deep-learning-based ROI segmentation.   Table 3). For this graph, we used our best performing ML classifier (GPR), with our best performing FS method (LASSO), and without FS method. www.nature.com/scientificreports/ Finally, we used single-center planning-CT images from 359 patients in this retrospective study. A multi-center study with a more extensive data set is required to test the generalizability of our radiomics pipeline. Such a big data set would allow us to try more robust deep learning ML classifiers 74,75 to build an AI tool to scan patients' planning-CT images and identify BM lesions automatically. The present work provides strong motivation to pursue such a multi-center study.

Conclusion
We demonstrated that our radiomics-based ML models can successfully distinguish between metastatic and healthy bones in planning-CT images using lesion-center-based geometric ROIs. Our results suggest that the GPR classifier with ensemble ROIs is particularly promising for the differentiation of BM and HB. Optimum pipeline performance was obtained using elimination-based FS methods such as LASSO. Our results demonstrate that radiomics features obtained from a lesion-center-based geometrical ROI may be sufficient to train radiomicsbased ML classifiers to distinguish between bone lesions when full 3D segmented ROIs are not available. This opens the door to big data artificial intelligence research for cancer patients with BM.

Data availability
The supporting dataset is provided as a figshare repository 62 . This repository contains three files: (1) "features-pace_metadata.json.zip" file that includes radiomics features extracted from 1273 spinal lesions (healthy or metastatic) from radiotherapy planning-ct images using geometrical regions of interest (ROIs). (2) "output.zip" folder that contains the results of our radiomics-based machine learning models that were validated and tested using several FS, and ML on single-point-based geometric ROIs with various shapes and sizes. (3) A README. md file that is provided to explain the information about the data structure and file naming patterns.