Differentiation of low and high grade renal cell carcinoma on routine MRI with an externally validated automatic machine learning algorithm

Pre-treatment determination of renal cell carcinoma aggressiveness may help guide clinical decision-making. We aimed to differentiate low-grade (Fuhrman I–II) from high-grade (Fuhrman III–IV) renal cell carcinoma using radiomics features extracted from routine MRI. A total of 482 pathologically confirmed renal cell carcinoma lesions from 2008 to 2019 in a multicenter cohort were retrospectively identified. Of these, 439 lesions with information on Fuhrman grade from 4 institutions were divided into training and test sets with an 8:2 split for model development and internal validation. Another 43 lesions from a separate institution were set aside for independent external validation. The performance of TPOT (Tree-Based Pipeline Optimization Tool), an automatic machine learning pipeline optimizer, was compared to that of a hand-optimized machine learning pipeline. The best-performing hand-optimized pipeline was a Bayesian classifier with Fisher Score feature selection, achieving an external validation ROC AUC of 0.59 (95% CI 0.49–0.68), accuracy of 0.77 (95% CI 0.68–0.84), sensitivity of 0.38 (95% CI 0.29–0.48), and specificity of 0.86 (95% CI 0.78–0.92). The best-performing TPOT pipeline achieved an external validation ROC AUC of 0.60 (95% CI 0.50–0.69), accuracy of 0.81 (95% CI 0.72–0.88), sensitivity of 0.12 (95% CI 0.14–0.30), and specificity of 0.97 (95% CI 0.87–0.97). Automated machine learning pipelines can perform equivalently to or better than a hand-optimized pipeline on an external validation set when non-invasively predicting Fuhrman grade of renal cell carcinoma using conventional MRI.


Scientific Reports | (2020) 10:19503 | https://doi.org/10.1038/s41598-020-76132-z | www.nature.com/scientificreports/

Recently, machine learning-based CT radiomics has been applied to the prediction of Fuhrman grade with good results [7][8][9]. Radiomics, an emerging field in medical imaging, has seen rapidly growing use in clinical decision support [10][11][12]. With a high volume of radiomic features extracted, feature choice in pipeline creation critically influences the results of final disease prediction or classification [12,13]. However, selecting the optimal pipeline requires extensive testing. TPOT (Tree-Based Pipeline Optimization Tool) is an automated machine learning (autoML) tool that automatically chooses the optimal machine learning pipeline and has been shown to outperform standard ML [14][15][16].
The goal of the current study was to predict RCC grading using MR-based radiomics and compare performance of autoML with expert manual pipeline optimization on an external validation set.

Methods and materials
Patient cohort. Patients with histologically confirmed RCCs with available Fuhrman grading (I–IV) from two large academic centers in the United States (HUP and MAY), two hospitals in the People's Republic of China (SXH and PHH), and The Cancer Imaging Archive (TCIA) were retrospectively identified. The study was approved by the Institutional Review Boards of HUP, MAY, SXH, and PHH. Under the agreement to use TCGA/TCIA data, IRB approval of our study was waived for TCIA. The inclusion criteria were (1) pathologically confirmed RCC with reported histological Fuhrman grade; (2) available preoperative MRI including T2-weighted (T2) and T1 contrast-enhanced (T1C) sequences; and (3) image quality adequate for analysis, without motion or other artifacts. The exclusion criteria were (1) patients with WHO/ISUP grading; (2) patients diagnosed through biopsy; (3) patients with no reported Fuhrman grade; and (4) patients with an incomplete or inappropriate imaging protocol. If dynamic enhancement was performed, the earliest phase on the T1C sequence was chosen. Our final cohort consisted of 482 RCC lesions (374 lesions from HUP, 43 lesions from MAY, 39 lesions from TCIA, 11 lesions from SXH, and 15 lesions from PHH). Histopathological diagnosis in the form of Fuhrman grade was obtained for all 482 tumors after surgical excision. RCCs were grouped into low grade (grades I and II) and high grade (grades III and IV).
Tumor segmentation. MR images of all patients were loaded into 3D Slicer software (v4.6). Three-dimensional regions of interest were manually drawn slice-by-slice on the T2 and T1C sequences by an abdominal radiologist (Y.Z.) with 5 years of experience reading abdominal MRI [17].
Image pre-processing. Preprocessing of the lesion images involved N4 bias field correction and intensity normalization using ANTs and SimpleITK, respectively. The training set images were scaled to 200 × 200 pixels using bilinear interpolation and augmented with horizontal/vertical flip, shear, and zoom transformations to add variability to the set.
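The resizing and flip augmentations described above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`resize_bilinear`, `flip_horizontal`, `flip_vertical`) are hypothetical, images are represented as nested lists for simplicity, and the corner-aligned sampling convention is an assumption, since the text does not specify one. N4 correction, intensity normalization, shear, and zoom are omitted.

```python
def resize_bilinear(img, out_h, out_w):
    """Resize a 2D image (list of rows) with bilinear interpolation.

    Assumption: corner-aligned sampling, i.e. the four corner pixels of the
    input map exactly onto the four corner pixels of the output.
    """
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        # Fractional source row for this output row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            # Fractional source column for this output column.
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Mirror the image top to bottom."""
    return img[::-1]


# Toy 2 x 2 "image"; the study would call resize_bilinear(img, 200, 200).
toy = [[0.0, 10.0], [20.0, 30.0]]
resized = resize_bilinear(toy, 3, 3)
augmented = [resized, flip_horizontal(resized), flip_vertical(resized)]
```

In practice a library routine would normally replace the explicit loops; the sketch only makes the interpolation weights visible.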
Training, validation, and testing. The 43 lesions from MAY were first separated out for use as our external testing set. The remaining 439 lesions in our dataset were partitioned into training and testing sets in a ratio of 8:2. Overall, the training set consisted of 351 lesions, the testing set consisted of 88 lesions, and our external testing set consisted of 43 lesions. The cohort is summarized in Table 1. Radiomics features were extracted from each patient's MRI for both T1C and T2WI sequences. For each image space, 79 non-texture (morphology and intensity-based) and 94 texture features were extracted according to the guidelines defined by the Image Biomarker Standardization Initiative (IBSI) [18]. Each of the 94 texture features was computed 32 times using all possible combinations of the following extraction parameters, a process known as "texture optimization" (REF): (1) isotropic voxels of size 1 mm, 2 mm, 3 mm, and 4 mm; (2) the fixed bin number (FBN) discretization algorithm, with and without equalization; and (3) 8, 16, 32, and 64 gray levels for FBN. A total of (79 + 32 × 94), or 3087, radiomics features were thus computed in this study. All features were normalized using unity-based normalization, and features from T1C and T2WI were combined into one dataset. To reduce the dimensionality of the datasets, radiomics features were selected for training using thirteen different feature selection methods. Ten machine learning classifiers were trained and tested on features from the same splits of patients used in the deep learning methods. The detailed feature selection methods and classifiers used are shown in Supplementary Table 1. Each classifier was trained on the training set thirteen times, once with each of the thirteen feature selection methods, and validated through tenfold cross-validation. Classifiers were trained on 10, 30, 50, and 100 selected features, and performances were compared on the testing set.
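The bookkeeping in this paragraph can be checked with a short sketch. It enumerates the 32 texture extraction settings, reproduces the 3087-feature total and the 351/88 split sizes, and shows unity-based (min-max) normalization. The helper names `split_counts` and `unity_normalize` are hypothetical, and rounding the 8:2 split to the nearest integer is an assumption; the text reports only the resulting counts.

```python
from itertools import product

# Extraction parameters as listed in the text.
voxel_sizes_mm = [1, 2, 3, 4]
equalization = [False, True]      # FBN with and without equalization
gray_levels = [8, 16, 32, 64]

# Every combination of the three parameters is one texture setting.
texture_settings = list(product(voxel_sizes_mm, equalization, gray_levels))

N_NON_TEXTURE = 79
N_TEXTURE = 94
n_features = N_NON_TEXTURE + len(texture_settings) * N_TEXTURE


def split_counts(n_total, train_fraction=0.8):
    """Return (n_train, n_test); rounding to nearest is an assumption."""
    n_train = round(n_total * train_fraction)
    return n_train, n_total - n_train


def unity_normalize(values):
    """Rescale a feature column to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(len(texture_settings))             # 32
print(n_features)                        # 3087
print(split_counts(439))                 # (351, 88)
print(unity_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

When normalization is applied in a train/test setting, the minimum and maximum would ordinarily be taken from the training set only and reused on the test sets; the text does not say which convention was followed.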
In addition to performance, the stability of both classifiers and feature selection methods was recorded. Relative standard deviation (RSD%) was calculated for classifier stability. Each classifier was trained and validated on different sub-samples of the data 100 times, and RSD% was calculated as the standard deviation of AUC divided by the mean of AUC over these 100 trials. A stability measure proposed by Nogueira et al. was used for feature selection stability [19]. This function quantifies the stability of feature selection as the similarity between the feature sets selected by the same method over multiple trials. The same feature selection method was run on varying sets of training data 100 times, selecting 50 features at a time. From these data, the stability function returned a number between 0 and 1, where 1 is most stable (least variance between selected feature sets) and 0 is least stable (most variance between selected feature sets). The performance of the top-performing classifier was then compared to the performance of an automatically optimized machine learning pipeline computed by TPOT, a Tree-Based Pipeline Optimization Tool that searches for the optimal machine learning pipeline for an input dataset through genetic programming. To reduce stochasticity, 10 iterations of the TPOT software were run on the training and testing sets. The best-performing hand-optimized model and the best-performing TPOT pipeline were then tested on the final external testing set.
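Both stability measures described above are simple to compute. The following is a minimal sketch rather than the authors' code: `rsd_percent` and `nogueira_stability` are hypothetical names, the sample (n − 1) standard deviation is assumed for RSD%, and the stability estimator follows the published form of the Nogueira et al. measure, in which each selected set is a collection of feature indices out of `n_features` candidates.

```python
from statistics import mean, stdev


def rsd_percent(aucs):
    """Relative standard deviation of AUC across trials, in percent.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    return 100.0 * stdev(aucs) / mean(aucs)


def nogueira_stability(selected_sets, n_features):
    """Stability of a feature selection method over repeated runs.

    selected_sets: one iterable of selected feature indices per run.
    n_features: total number of candidate features.
    """
    m = len(selected_sets)
    if m < 2:
        raise ValueError("at least two runs are required")
    # How often each feature was selected across the runs.
    counts = [0] * n_features
    for selected in selected_sets:
        for f in set(selected):
            counts[f] += 1
    freqs = [c / m for c in counts]
    # Unbiased sample variance of each feature's selection indicator.
    variances = [m / (m - 1) * p * (1 - p) for p in freqs]
    # Average number of features selected per run.
    k_bar = sum(len(set(s)) for s in selected_sets) / m
    denom = (k_bar / n_features) * (1 - k_bar / n_features)
    if denom == 0:
        raise ValueError("undefined when no or all features are selected")
    return 1.0 - (sum(variances) / n_features) / denom


# Toy example: three runs that always pick the same two of four features.
identical = [{0, 1}, {0, 1}, {0, 1}]
print(nogueira_stability(identical, n_features=4))  # 1.0
print(rsd_percent([0.5, 0.5, 0.5]))                 # 0.0
```

With a small number of runs the estimate can fall below 0 when the selected sets barely overlap; with 100 runs, as in the study, it stays close to the 0 to 1 range described in the text.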

Results
Patient and tumor characteristics. Supplementary Table 2 shows the clinicopathologic characteristics of our cohort. High-grade RCCs were significantly larger than low-grade RCCs (mean size, 4.9 cm vs. 2.7 cm, p < 0.001). Renal vein invasion was found in 34 high-grade RCC lesions, whereas only 11 low-grade RCC lesions presented with this feature (p < 0.001). There was a significant difference in T stage between the two groups (p < 0.001). Lymph node involvement and distant metastasis were more common in high-grade RCC than in low-grade RCC (p = 0.004 and p = 0.001, respectively).
Internal testing results. The

Discussion
Prior to the development of our ML-based MRI radiomics study, manually measured MRI characteristics, such as apparent diffusion coefficient (ADC) measurements, have been used to differentiate low- and high-grade RCC. In such cases, ADC measurements are taken using different region-of-interest (ROI) placement techniques to distinguish between low and high grade. A study using ADC differentiation by Aslan et al. reported an accuracy of 0.75, whereas our study achieved a top accuracy of 0.81 [23]. ML-based radiomics methods have the potential to enhance differentiation based on grade compared to manual ADC computation. In this study, we specifically compared automatic and manually optimized machine learning pipelines using MR-based radiomics in discriminating between low- and high-grade RCCs. The TPOT-exported pipeline and the top manually optimized pipeline achieved similar accuracy. When a machine learning model is implemented as a tool for RCC risk stratification, high specificity is the most important performance measure. With higher specificity, low-risk patients can potentially be offered less invasive alternative treatment to preserve renal function and minimize long-term complications. In our study, TPOT generated a pipeline that had higher specificity than the top manually optimized pipeline on the external test set. Previous studies have investigated the value of CT-based radiomics in distinguishing low from high Fuhrman grade RCC [7][8][9]. Texture features can quantify tumor heterogeneity and were found to be correlated with Fuhrman grade [24]. The referenced studies showed that CT radiomics is useful and promising for non-invasive prediction of Fuhrman grade, but because each cohort came from a single institution, these predictive models were not validated externally in an independent cohort, which makes their generalizability questionable. However, since we do not have the datasets or code of these referenced studies, we cannot make a direct comparison of performance.
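For reference, the performance measures compared above follow directly from the confusion matrix. The sketch below assumes high grade is the positive class, so that specificity is the fraction of low-grade lesions correctly identified; `binary_metrics` is a hypothetical helper, and the labels are illustrative toy values, not data from this study.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return sensitivity, specificity, and accuracy for binary labels.

    Assumption: `positive` marks high-grade lesions.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "sensitivity": tp / (tp + fn),   # high grade correctly flagged
        "specificity": tn / (tn + fp),   # low grade correctly cleared
        "accuracy": (tp + tn) / len(pairs),
    }


# Illustrative labels only: 10 high-grade and 30 low-grade lesions.
y_true = [1] * 10 + [0] * 30
y_pred = [1] * 6 + [0] * 4 + [0] * 27 + [1] * 3
print(binary_metrics(y_true, y_pred))
# {'sensitivity': 0.6, 'specificity': 0.9, 'accuracy': 0.825}
```

Because low-grade lesions are the majority class here, accuracy is dominated by specificity, which is why a model can reach high accuracy while missing many high-grade lesions.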
Overall, our study rigorously evaluated a variety of machine learning approaches and included external validation to assess expected performance on deployment. Additionally, there are a few quantitative differences between the methodologies used in our study and those used in the studies above. Compared to our external test set of 43 patients, these studies predicted Fuhrman grade on a greater number of patients, i.e., 92, 260, and 54. These studies also used more targeted feature selection methodology, computing interclass correlation coefficients (ICC) between feature types, resulting in significantly fewer selected features, i.e., 13, 35, and 4. The hand-optimized pipeline in our study selected 50 features through a holistic statistical approach on all feature types, which streamlines the process and reduces feature bias but may affect performance. Our study also differs from previous radiomics studies in several respects. First, we chose MRI instead of CT. MRI provides multiparametric sequences, which theoretically provide more information than the simple attenuation differences measured in Hounsfield units on CT. Second, we investigated and compared a large group of feature selection methods and classifiers for radiomics-based Fuhrman grade prediction, and the model with the highest performance was then compared with an automatically optimized machine learning pipeline computed by TPOT. Third, our cohort comes from five institutions, one of which was set aside as an independent test set for external validation, which none of the previous studies attempted.
In this study, we investigated 13 different filter-based feature selection methods and 10 machine learning classification methods belonging to 10 different classifier families. We used only filter-based approaches because they are computationally more efficient and less prone to overfitting than wrapper and embedded methods [26,27]. Furthermore, filter methods are classifier independent, which allows separation of feature selection and modeling and could increase the generalizability of each component and hence of the overall analysis [12]. Our results show that the Bayesian classifier yields the highest predictive performance among the 10 classifiers. A Bayesian classifier is fast and simple to train and good at dealing with small datasets, but it has difficulties with complex datasets and shows inferior performance on large datasets [28,29]. The best TPOT-exported pipeline was created using random forest. Random forests have become particularly popular due to several advantages, including fast training times, the ability to use high-dimensional data (where the number of features is significantly larger than the number of patients), and high generalizability, but they have been observed to have a problem with overfitting [12,30]. The best TPOT-exported pipeline performed similarly to the top manually optimized pipeline on the internal test set. On the external test set, both pipelines experienced a slight dip in performance, but TPOT slightly outperformed the manually optimized pipeline. The TPOT performance on the external test set is a strength of our study, suggesting that the performance of autoML may be more generalizable.
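To make concrete why filter methods are classifier independent, the sketch below ranks features by the Fisher score, the selection method paired with the best hand-optimized pipeline, using only the feature values and the class labels. It uses the common textbook definition (between-class scatter divided by within-class scatter, per feature); the exact variant used in the study is not stated, and `fisher_score` and `select_top_k` are hypothetical names.

```python
def fisher_score(column, labels):
    """Fisher score of one feature: between-class over within-class scatter."""
    overall_mean = sum(column) / len(column)
    between = 0.0
    within = 0.0
    for c in set(labels):
        values = [v for v, y in zip(column, labels) if y == c]
        n_c = len(values)
        mean_c = sum(values) / n_c
        var_c = sum((v - mean_c) ** 2 for v in values) / n_c
        between += n_c * (mean_c - overall_mean) ** 2
        within += n_c * var_c
    # A feature that is constant within every class separates them perfectly.
    return between / within if within > 0 else float("inf")


def select_top_k(columns, labels, k):
    """Return indices of the k features with the highest Fisher score."""
    scores = [fisher_score(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy data: feature 0 separates the two grades, feature 1 does not.
labels = [0, 0, 1, 1]
columns = [
    [0.0, 1.0, 10.0, 11.0],
    [0.0, 10.0, 1.0, 11.0],
]
print(select_top_k(columns, labels, k=1))  # [0]
```

No classifier appears anywhere in the scoring, so the same ranked feature list can be passed to any of the ten classifier families; the study selected 10, 30, 50, or 100 features in this manner.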
Limitations of this study include the following. First, the retrospective selection of only patients with an available Fuhrman grade may have resulted in selection bias. Second, Fuhrman grade was taken as recorded in the pathology report of the original pathologist. Review by additional pathologists was not feasible due to missing slides and limited resources. Third, segmentation was performed by a single radiologist with 5 years of experience. Automatic renal tumor segmentation will be incorporated in future work. Fourth, the performance was still suboptimal for real-time clinical use. However, the main goal of our paper was to compare the performance of autoML with that of a manually optimized expert pipeline on external testing.
In this study, TPOT was shown to differentiate low from high histological grade RCC with performance metrics that are slightly better than those of expert manual pipeline optimization on an external validation set. These results suggest that autoML-based MRI radiomics, without the requirement of a machine learning expert, may be a valid strategy to predict RCC characteristics.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.