Radiomics Features Differentiate Between Normal and Tumoral High-FDG Uptake

Identification of FDG-avid neoplasms may be obscured by high-uptake normal tissues, thus limiting inferences about the natural history of disease. We introduce an FDG-PET radiomics tissue classifier for differentiating FDG-avid normal tissues from tumor. Thirty-three scans from 15 patients with Hodgkin lymphoma and 68 scans from 23 patients with Ewing sarcoma treated on two prospective clinical trials were retrospectively analyzed. Disease volumes were manually segmented on FDG-PET and CT scans. Brain, heart, kidney, bladder and tumor volumes were automatically segmented on PET images. Standard uptake value (SUV)-derived shape and first-order radiomics features were computed to build a random forest classifier. Manually segmented volumes were compared to automatically segmented tumor volumes. Classifier accuracy for normal tissues was 90%. Classifier performance varied across normal tissue types (brain, left kidney, bladder, heart and right kidney were 100%, 96%, 97%, 83% and 87%, respectively). Automatically segmented tumor volumes showed high concordance with the manually segmented tumor volumes (R² = 0.97). Inclusion of texture-based radiomics features contributed minimally to classifier performance. Accurate normal tissue segmentation and classification facilitates accurate identification of FDG-avid tissues and classification of those tissues as either tumor or normal tissue.

[18F]Fluoro-deoxy-glucose positron emission tomography (FDG-PET) is used as a diagnostic tool for cancer staging, prognostication and evaluation of treatment response 1. While PET has been shown to be sensitive for the detection of neoplastic tissue across multiple tumor types [2][3][4][5], high FDG accumulation and excretion in normal tissues (i.e., brain, heart, kidneys and bladder) routinely precludes standard uptake value (SUV)-based thresholding approaches for identifying tumors [6][7][8][9]. In many cases, review of both CT and MR is required for differentiating FDG-avid normal tissues from tumor 10. Software-assisted methods of differentiating high-glucose-uptake normal tissues from tumors are needed to facilitate high-throughput analysis of PET studies from clinical trials.
Radiomic feature analysis is a valuable means of evaluating information from FDG-avid tissues [11][12][13][14][15]. The workflow of radiomics analysis includes image segmentation, feature extraction and informatics analysis. Challenges in PET segmentation include image resolution, variability in shape and location of pathologies and image noise 16. Current techniques for segmenting FDG-PET data are categorized as threshold-based, stochastic and learning-based, region-based, boundary-based, or joint segmentation-based. Comparison of PET segmentation techniques is challenging given the lack of standardization. Despite the challenges with auto-segmentation of FDG-avid tissues, quantitative radiomics features have been successfully correlated with disease prognosis and classification [17][18][19][20].
In this study, we retrospectively evaluate the ability of radiomics features derived from SUV and shape data to differentiate FDG-avid normal tissues from tumor tissue using a cohort of Hodgkin lymphoma and Ewing sarcoma patients treated consecutively on two prospective clinical trials. We assess the classification accuracy and
Image Processing. Image intensities (C) were converted to standard uptake values (SUV) based on the patient's body weight (W), injected dose (D), elapsed time (t) and radionuclide half-life (T1/2) in Equation (1), with x, y, z as image Cartesian coordinates. Because different scanners were used, the original image resolution was anisotropic and inconsistent across the collected scans, as shown in Supplementary Table 2. Cubic image resampling was applied to reach an isotropic resolution of 5 × 5 × 5 mm³. After Gaussian filtering for background removal and noise reduction using Equation (2), SUV thresholding with a selected value of 3 was applied in Equation (3) to obtain a binary function I(x, y, z).
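As a rough illustration of the conversion and thresholding steps, the sketch below assumes that Equation (1) takes the standard body-weight SUV form with the injected dose decay-corrected to scan time, and that the threshold is applied as a strict inequality. Both are assumptions, as are the function names and the units (Bq/ml, kg, Bq, minutes). Resampling and Gaussian filtering are omitted.

```python
F18_HALF_LIFE_MIN = 109.77  # physical half-life of fluorine-18, in minutes


def suv_body_weight(conc_bq_per_ml, weight_kg, dose_bq, elapsed_min,
                    half_life_min=F18_HALF_LIFE_MIN):
    """Body-weight SUV for one voxel (assumed form of Equation (1)).

    The injected dose is decay-corrected to scan time, and body weight is
    converted to grams so that SUV is dimensionless under the usual
    1 g/ml tissue-density assumption.
    """
    dose_at_scan = dose_bq * 2.0 ** (-elapsed_min / half_life_min)
    return conc_bq_per_ml * weight_kg * 1000.0 / dose_at_scan


def threshold_mask(suv_values, threshold=3.0):
    """Binary function I: 1 where SUV exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in suv_values]
```

For example, `suv_body_weight(5000.0, 70.0, 350e6, 60.0)` would convert a hypothetical voxel of 5 kBq/ml in a 70 kg patient scanned 60 minutes after a 350 MBq injection.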
Tissue Segmentation. After SUV thresholding, FDG-avid regions (brain, heart, kidneys, bladder and tumor) were automatically segmented by using watershed segmentation 21 followed by morphological closing 22. In the EWS cohort, 10 of 68 scans were manually segmented because of abnormal uptake in the bones, or super-scans. Altogether, 332 regions were identified from 68 scans, averaging 4.88 regions per scan. Before feature computation, all segmented volumes were reviewed to ascertain the accuracy of segmentation. Segmentation performance was evaluated by comparing the segmented volumes with the ground truth established by an assisted algorithm in the treatment planning system. Segmentation performance is reported as sensitivity, specificity, precision, accuracy, Dice similarity coefficient and Jaccard index.
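The six segmentation metrics named above can all be derived from voxel-wise counts of true and false positives and negatives. The following is a minimal sketch using their usual definitions; the function name and the flat 0/1 mask representation are illustrative choices, not taken from the study.

```python
def segmentation_metrics(predicted, reference):
    """Voxel-wise agreement between two binary masks given as flat 0/1 sequences."""
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same number of voxels")
    tp = sum(1 for p, r in zip(predicted, reference) if p and r)
    fp = sum(1 for p, r in zip(predicted, reference) if p and not r)
    fn = sum(1 for p, r in zip(predicted, reference) if not p and r)
    tn = len(predicted) - tp - fp - fn

    def ratio(num, den):
        # Undefined ratios (empty denominators) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, len(predicted)),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "jaccard": ratio(tp, tp + fp + fn),
    }
```

Because background voxels usually far outnumber segmented voxels, specificity and accuracy tend to sit close to 1, so the Dice and Jaccard values are generally the more informative overlap measures.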

Radiomics Feature Computation.
To build the classifier, the radiomics features listed in Supplementary Table 3 were computed for each segmented region, including SUV-based, shape-based and texture-based features. Definitions of each radiomics feature can be found in the paper by Aerts et al. 23. In short, SUV-based features encode information about the SUV distribution in each segmented region. Shape-based features describe the 3D size and shape of the region. Six additional shape features describing the spatial location and orientation of each segmented region were implemented: centroid X, centroid Y, centroid Z, major axis X, major axis Y and major axis Z. The centroid locations were computed as the center of mass of each segmented region. The major axis was computed as the maximum radius vector.
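The six location and orientation features can be sketched as follows. This is one possible reading of the description above: the centroid is taken as the unweighted center of mass of the region's voxels unless weights are supplied, and the major axis as the vector from the centroid to the farthest voxel. The function names, the voxel-index input and the optional SUV weighting are assumptions; the 5 mm spacing follows the resampling described earlier.

```python
import math


def centroid_mm(voxels, spacing_mm=5.0, weights=None):
    """Center of mass of a region given as (x, y, z) voxel indices.

    With weights=None every voxel counts equally; passing per-voxel SUVs
    would give an uptake-weighted centroid instead.
    """
    if weights is None:
        weights = [1.0] * len(voxels)
    total = sum(weights)
    return tuple(
        spacing_mm * sum(w * v[axis] for v, w in zip(voxels, weights)) / total
        for axis in range(3)
    )


def major_axis_mm(voxels, spacing_mm=5.0):
    """Vector from the centroid to the farthest voxel (maximum radius vector)."""
    c = centroid_mm(voxels, spacing_mm)
    radii = [tuple(spacing_mm * v[a] - c[a] for a in range(3)) for v in voxels]
    return max(radii, key=lambda r: math.sqrt(sum(x * x for x in r)))
```

The three components of each returned tuple correspond to the X, Y and Z features.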
Statistical Analysis. Descriptive statistics are reported for all continuous and count data. Continuous data are summarized using the median and range and tested across groups using the Wilcoxon rank sum test. Count data are summarized using frequencies and tested across groups using either Fisher's exact test or the chi-square test. Feature significance is summarized using Student's t-tests. Classification accuracy was described using sensitivity and specificity and analyzed with ANOVA. The significance level for statistical tests was α = 0.05 (p < 0.05). Microsoft Excel 2013 (Redmond, WA) was used for all data management. SAS (V 9.3, Cary, NC) or the GGally package in RStudio (V1.0.136, Boston, MA) was used for analyses. Dimensionality reduction using t-SNE 24 was used to illustrate the relationship between segmented FDG-avid tissues.

Results
Tissue Segmentation. Table 2 lists the total number of regions segmented and their respective ground truth labeling. Among the FDG-avid normal tissues, brain and bladder were present in all HOD and EWS scans. A representative coronal image of a segmented FDG-PET is shown in Fig. 1. Myocardial activity with SUV > 3 was detected in 48 scans (48%). The kidneys were not always apparent as a paired structure: left renal activity was visible in 91 scans (91%) and right renal activity in 86 scans (86%). Tumor was visible in 75 scans (75%). Thyroid and/or salivary glands were visible in only 22 scans (22%), so they were excluded from the tissue classifier. An additional constraint of a single occurrence per tissue in each scan was enforced during classification. Median automatic segmentation performance was 0.72 (95% CI 0.58-0.82) for sensitivity, 0.99 (95% CI 0.98-1.00) for specificity, 0.78 (95% CI 0.71-0.85) for precision, 0.99 (95% CI 0.98-1.00) for accuracy, 0.70 (95% CI 0.63-0.77) for Dice similarity coefficient and 0.52 (95% CI 0.45-0.59) for Jaccard index.
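The text does not say how the single-occurrence constraint was enforced. One simple possibility, shown here purely as an assumption, is a greedy assignment over per-region class probabilities in which each normal tissue label can be used only once per scan. Exempting tumor from the constraint (so that a primary site and metastases can coexist) is also an assumption, and all names are hypothetical.

```python
NORMAL_TISSUES = ("brain", "heart", "left kidney", "right kidney", "bladder")


def assign_unique_labels(region_probs):
    """Greedy labeling in which each normal tissue is used at most once per scan.

    region_probs maps a region id to {label: probability}. Candidate
    (region, label) pairs are visited from most to least probable; a normal
    tissue label already taken by another region is skipped, while "tumor"
    may be assigned to any number of regions. A region whose candidate
    labels are all taken is left out of the result.
    """
    candidates = sorted(
        ((p, region, label)
         for region, probs in region_probs.items()
         for label, p in probs.items()),
        key=lambda item: item[0],
        reverse=True,
    )
    assigned, used = {}, set()
    for _, region, label in candidates:
        if region in assigned or label in used:
            continue
        assigned[region] = label
        if label in NORMAL_TISSUES:
            used.add(label)
    return assigned
```

In this sketch, a second region that also looks most like brain falls back to its next most probable label once the brain label has been claimed by a more confident region.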
Tissue Classification. Using the HOD cases as the training data set, a random forest classifier was built with 28 equally weighted radiomics features, as listed in Figs 2 and 3, with uniform prior probabilities. The robustness of the radiomics features was first evaluated by classifying only the normal tissues. The first row in Table 2 summarizes the results for normal tissue classification. Of the 285 normal tissue regions classified from the EWS test set, one bladder was mislabeled. Next, the classifier was retrained to include 25 tumor volumes from the HOD scans and re-tested with all 332 normal tissue and tumor volumes from the EWS scans. The classifier correctly identified 298 volumes in the EWS cohort, reaching 89.8% accuracy (Table 2). More specifically, the classifier identified 100%, 98%, 93%, 76%, 76% and 85% of brain, bladder, right kidney, heart, left kidney and tumor tissues (primary site and metastases), respectively. Visualization of the random forest classifier is depicted in Fig. 4. Fifty binary decision trees were trained. Figure 4a shows one binary decision tree and its classification features and results. The final classification result was determined by the combined decision of all 50 trees. Figure 4b shows the different thresholds computed by all 50 trees for two features: volume and centroid Z location. To better visualize the classification results, we employed t-SNE plots in Fig. 4c. After dimensionality reduction, segmented brains are distinctly separated from other FDG-avid tissues.
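To illustrate how individual trees with different thresholds on the same feature combine into a single label, the toy sketch below uses one-split trees and a simple majority vote. It is not the trained classifier: majority voting is assumed as the combination rule, the volume thresholds are hypothetical, and real trees are learned from bootstrap samples and split on many features.

```python
from collections import Counter


def make_stump(feature, threshold, label_if_greater, label_otherwise):
    """One-split decision tree: compares a single feature with a threshold."""
    def predict(region):
        return label_if_greater if region[feature] > threshold else label_otherwise
    return predict


def forest_predict(trees, region):
    """Majority vote over the labels predicted by the individual trees."""
    votes = Counter(tree(region) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical volume thresholds in ml, one per tree.
trees = [make_stump("volume_ml", t, "brain", "other")
         for t in (600, 800, 1000, 1200, 1400)]
```

A region of 1100 ml exceeds three of the five thresholds and is therefore labeled brain, even though two trees disagree.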

Classification with EWS as Training Set.
To evaluate the influence of sample size, we used the EWS cohort as the training set and the HOD cohort as the test set and built another random forest classifier with the same feature settings. The robustness of the features was again evaluated with 142 normal tissue regions and reached a 100% classification rate (Table 2). We then tested the classifier with 167 segmented normal tissue and tumor regions from the HOD cohort. Increased sample size resulted in improved labeling accuracy for all tissues except right kidney and bladder (Table 2). In short, the classifier identified 100% of brain, 83% of heart, 96% of left kidney, 87% of right kidney, 97% of bladder and 84% of tumor tissue.

Validation of Tumor Segmentation. After classification, the automatically segmented tumor volumes were compared with manually segmented volumes drawn by a board-certified radiation oncologist using the CT and the attenuation-corrected FDG-PET in MIM (Version 6.7, MIM Software Inc., Beachwood, OH). Supplementary Fig. 1 shows the correlation between manually delineated and automatically segmented tumor volumes. The coefficient of determination demonstrates the high consistency of the automatic segmentation method relative to the manually segmented volumes (R² = 0.97). The first-order coefficient of 1.06 indicates that automatic segmentation slightly overestimates tumor volume. The constant offset of −3.5 indicates that automatic segmentation may not identify small lesions (<4 ml).

[Figure: After image processing and SUV thresholding, FDG-PET avid tissues, such as brain (red), tumor (green), heart (dark red), right kidney (dark blue), left kidney (light blue) and bladder (yellow), are segmented for radiomics feature computation to construct random forest classifiers.]

Misclassified Regions.
Figures 2 and 3 illustrate the respective distributions of all SUV-based and shape-based features. Recognition of the brain when training with either dataset reached 100% because of the large volume of this organ relative to other tissues, as shown in Fig. 3. Bladder and tumor volumes were at the lower end of the range of brain volumes but could be differentiated from brain by using other features, such as compactness 1. Heart misclassification posed a unique challenge, regardless of the training dataset. This may be due to insufficient samples and inconsistent FDG uptake in the heart, which often led to varied segmentation patterns, as demonstrated in Fig. 5. Across all scans, 48 exhibited high uptake, with 36 displaying the entire heart and 12 displaying only the apex. The recognition rate improved from 76% to 83% after training with the EWS cohort. Three hearts remained misclassified due to the existence of tumors with similar shape. As shown in Fig. 2, only entropy could differentiate heart from other normal tissues and tumor (p < 0.001, α = 0.05). Among shape-based features, compactness 1 and compactness 2 were useful for differentiating heart from other normal tissues (p < 0.001 for each) and from tumors (p < 0.005). The respective recognition rates of left and right kidneys were 96% and 86%, with one and four misclassifications, respectively. As shown in Fig. 2, kurtosis and skewness (p < 0.001 for each) were useful SUV-based radiomics features for identifying kidneys. Centroid Y (posterior-anterior) also had high distinctive power between kidneys and other FDG-avid tissues. For distinguishing between left and right kidneys, only the Centroid X location had high distinctive power (p < 0.001).
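The first-order SUV features singled out above (skewness, kurtosis and entropy) can be sketched with their common textbook definitions. The exact formulas used in the study follow Aerts et al. and may differ in detail; in particular, the histogram bin width for entropy and the non-excess form of kurtosis are assumptions here.

```python
import math
from collections import Counter


def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third standardized moment; undefined for constant input."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Non-excess kurtosis (a normal distribution gives 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def entropy_bits(values, bin_width=0.5):
    """Shannon entropy of the SUV histogram; bin_width is an assumed setting."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Each function takes the list of SUVs inside one segmented region; a region with perfectly uniform uptake has zero variance, so skewness and kurtosis would need special handling in that case.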
High FDG uptake, indicated by maximum SUV, mean SUV, median SUV, root mean square and mean absolute deviation, significantly distinguished bladder from other FDG-avid tissues, as shown in Fig. 3 (p < 0.001). Moreover, centroid location in the superior-inferior direction (Centroid Z) also distinguishes bladder from other normal tissues along the z axis, as bladder is typically the lowest segmented tissue in the scans. Given the inter-individual differences in patient height, only relative location could be obtained and used in the tissue classifier. The variable scan parameters used for PET scans among institutions also complicate the use of the centroid Z metric. Among HOD and EWS scans, 8 and 64 scans were full-body and 25 and 4 scans encompassed only head to pelvis. Potential solutions to centroid Z inconsistencies include height normalization, standardization of scan length and image alignment.

Classification Power of Radiomics Features.
Table 3 lists the top five features with the most discriminatory influence in the tissue classifier. Shape features, such as volume and compactness, were useful for distinguishing brain from tumor. A similar trend was noted for heart-tumor comparisons. Kidney identification was improved with the inclusion of SUV-based features, such as skewness and kurtosis. Only one shape feature, Centroid Z, added to the discriminatory capability of the classifier for bladder vs. other tissue.

Inclusion of Texture Features. We then evaluated the ability of texture-based features extracted from the SUV measurements (including gray level co-occurrence matrix and gray level run length matrix) to add to the discriminatory power of the tissue classifier. Supplementary Table 3 lists the 34 texture features computed. Definitions for each texture-based feature can be found in Aerts et al. 23. The overall performance increased by 0.6%, with only two more hearts being correctly identified. To accommodate the additional features, the number of trees was doubled compared to previous classifiers. The overall performance did not improve with the associated increase in the number and complexity of trees, indicating that the classifier had reached its performance limit with an out-of-bag error of 0.06.

Discussion
Accurate tissue segmentation is essential for radiomics feature computation. Although thresholding methods have limitations, several studies [25][26][27] have proposed iterative thresholding for accurate PET segmentation. In this study, an SUV threshold of 3 combined with watershed segmentation was shown to have both high specificity in FDG-PET imaging and an excellent coefficient of determination relative to manual tissue segmentation. Similar techniques achieve comparable segmentation performance in PET phantom studies 28.
Tissues with inconsistent uptake and anatomical heterogeneity, such as brain and heart, may be over-segmented. In brain, SUV may vary across grey matter (high SUV) relative to white matter (low SUV). Analogously, in heart, the left ventricle has a higher SUV than the right ventricle and atria. To overcome the possibility of over-segmentation and under-segmentation, we applied image resampling and Gaussian smoothing to provide a more uniform SUV across the entirety of the segmented organ. This approach has been reported to improve the spatial resolution of radiopharmaceutical uptake estimates and the signal-to-noise ratio of FDG-PET 29.
Conversely, infrequent abnormal uptake in normal tissues may lead to spurious under-segmentation. We observed this in the context of abnormal large intestine uptake, which led to under-segmentation of the intestine and bladder. Under-segmentation of these structures may be remedied by applying automatic Otsu thresholding 30 to adjust the local SUV threshold.
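A minimal sketch of Otsu's criterion applied to the SUVs of one local region is given below. Instead of the usual fixed-width histogram, this variant tries every gap between consecutive distinct values as a split and keeps the one that maximizes the between-class variance; that simplification, the function name and the midpoint convention for the returned threshold are choices made for illustration only.

```python
def otsu_threshold(values):
    """Threshold that maximizes between-class variance (Otsu's criterion).

    Every gap between consecutive distinct values is tried as a split, so no
    histogram binning is needed. Returns None if all values are identical.
    """
    data = sorted(values)
    n = len(data)
    total = sum(data)
    best_score, best_threshold = -1.0, None
    running = 0.0
    for i in range(1, n):
        running += data[i - 1]
        if data[i] == data[i - 1]:
            continue  # not a boundary between distinct values
        w0, w1 = i / n, (n - i) / n
        mu0, mu1 = running / i, (total - running) / (n - i)
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_score = score
            best_threshold = (data[i - 1] + data[i]) / 2.0
    return best_threshold
```

Applied to the voxels around a merged intestine and bladder region, a locally derived threshold of this kind could replace the global value of 3 when separating the two structures.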
Partial and/or variable patterns of FDG uptake in normal tissues had minimal negative impact on classifier performance. Kidney and heart FDG avidity are examples of this phenomenon, and representative images are shown in Fig. 5. This issue may lead to incomplete segmentation. To improve kidney segmentation, computed tomography (CT)-based segmentation techniques can be incorporated [31][32][33] by transferring the segmented masks to the co-registered CT-based attenuation correction (CTAC) scans completed at the time of PET acquisition. Although CTAC studies frequently lack the resolution of diagnostic CTs for each body region, they likely contain enough information about each corresponding tissue to supplement iterative image segmentation methods and limit the impact of partial segmentation due to varied FDG uptake. Ramos et al. 34 have also suggested using iterative image reconstruction as a method to achieve uniform SUV in FDG-avid tissues. Although the segmented normal organ volumes often underestimated the total organ size and volume, complete organ delineation was not the purpose of our methodology; FDG-avid regions were characterized for the sole purpose of differentiating them from tumor. Feature stability should also be considered when selecting radiomics features for use as classifiers 31. Shinohara et al. 35 demonstrated that intensity normalization in MR may enable multimodality comparisons and longitudinal evaluation of MR studies. These methods may also improve the stability of MR- and PET-derived radiomics features. Additional methods, such as height normalization, voxel dimension standardization, PET acquisition standardization and SUV normalization 36, may stabilize FDG-PET radiomics features and improve classifier performance and FDG-avid tissue segmentation.
Successful separation of normal tissue and tumor can facilitate volume delineation for radiotherapy planning and tracking tumor response and disease burden 37 .
Tumor segmentation accuracy was excellent using our classifier. Manually segmented tumor volumes showed a high correlation with automatically segmented tumor volumes, with an R² of 0.97. Segmentation performance was superior in cases with solid tumors compared to nodular cases (ANOVA, p = 0.01). Incorporation of CT images and local thresholding may improve classifier performance. Limitations in CT z-plane resolution were partially overcome by resampling the CTs to 0.25 × 0.25 mm in the treatment planning system, thus aiding manual segmentation to a reasonable level of accuracy. Additional inaccuracies in manual delineation were secondary to region-directed but not tumor-directed CT scans. In some cases, tumor would span the head and neck region down into the chest, resulting in partial overlap and registration difficulties. Automatic tumor segmentation based on FDG-PET was completed on a per-scan basis, thus overcoming any potential errors in registration or partial overlap across studies. Future studies comparing different tumor types will need to carefully evaluate the relationship between radiotracer-avid tumor and tumor heterogeneity to avoid over- or underestimating tumor burden. Additional studies in other tumor systems are required to determine whether other tumor types have sufficiently divergent radiomic feature profiles to facilitate accurate differentiation of normal tissue and tumor.
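The volume comparison can be reproduced with an ordinary least-squares line and its coefficient of determination. The sketch below assumes that manual volumes are placed on the x axis and automatic volumes on the y axis, which the text does not state explicitly; the function name is illustrative.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept, with R^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```

Under that axis assumption, the reported slope of 1.06 and offset of −3.5 give a fitted automatic volume of zero at a manual volume of about 3.3 ml, which is in line with the observation that lesions smaller than 4 ml may be missed.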
In this study, texture features in FDG-PET appear to have low discriminating power between normal FDG-avid tissues and tumor. Nevertheless, texture analysis in CT and MRI has been demonstrated to facilitate differentiating tumor behavior and response [38][39][40][41][42][43]. The varied reproducibility of each feature across studies confounds the utility of textural features 44,45. Future large-scale studies should continue to evaluate the longitudinal stability of PET- and CT-derived radiomics features in normal tissues to reduce the likelihood that an excessive number of radiomic features will over-fit predictive prognostic models. Furthermore, the proposed image processing pipeline facilitated segmentation of the entire organ despite the nonuniform signals within kidneys and heart. This work might extend to segmenting tumor volumes in other diseases or in alternative PET radiotracer studies 46. The classifier has been made available to the public at https://goo.gl/LzJyjY, with pseudocode shown in Supplementary Fig. 2.

Conclusion
This study utilized SUV-based thresholding for FDG-avid tissue identification, followed by radiomics feature extraction and random forest-based machine learning, to develop a classifier that could distinguish FDG-avid normal tissues from tumor. We achieved an overall 90% classification rate using predominantly shape-based radiomics features for identifying tumor and normal tissues. Normal tissues that could be identified in a reproducible manner based on first-order radiomics features alone derived from SUV data include bladder, heart, kidneys and brain. SUV textural features did not improve classifier performance. Application of such SUV-based iterative thresholding methods with subsequent tissue classification could aid in the analysis and interpretation of automatically segmented FDG-avid tumor regions on longitudinal PET studies from clinical trials by removing the confounding contribution of FDG-avid normal tissues from the analysis.