Prediction of blood supply in vestibular schwannomas using radiomics machine learning classifiers

This study explores radiomics features of multi-parametric magnetic resonance imaging (MRI) and constructs a machine-learning model to predict the blood supply of vestibular schwannoma preoperatively. Preoperative MRI data of patients with vestibular schwannoma were collected retrospectively, and patients were divided into poor and rich blood supply groups according to the intraoperative record. Patients were randomly divided into training and test cohorts (2:1). Stable features were retained using intraclass correlation coefficients (ICCs). Four feature selection methods and four classification methods were evaluated to construct favorable radiomics classifiers. The mean area under the curve (AUC) obtained in the test set was calculated separately for each combination of feature selection method and classifier to compare model performance. The best combination was then compared with differentiation by visual observation, as used in clinical diagnosis. A total of 191 patients were included, and 3918 stable features were extracted from each patient. The combination of least absolute shrinkage and selection operator (LASSO) and a logistic regression model was selected as optimal after comparing AUCs; it predicted the blood supply of vestibular schwannoma under K-fold cross-validation with a mean AUC of 0.88 and an F1-score of 0.83. Radiomics machine-learning classifiers can accurately predict the blood supply of vestibular schwannoma from preoperative MRI data.

research ethics committee of the First Affiliated Hospital of Zhengzhou University, and informed consent was obtained from all subjects or, if subjects are under 18, from a parent and/or legal guardian. All methods were performed in accordance with the relevant guidelines and regulations. All patients underwent preoperative plain and gadolinium-enhanced MRI in our center. The collected MRI sequences included T1-weighted images, T2-weighted images, T2-weighted Fluid-Attenuated Inversion Recovery (T2-FLAIR) images, and T1-weighted gadolinium-enhanced (T1-CE) images. Tumors at Stage I by Koos classification 11 (confined to the internal auditory canal, diameter 1-10 mm) were excluded because the blood supply of such small tumors may be difficult to define with the evaluation criteria of this study, and because they might not yield sufficient radiomics information. All surgical procedures were performed by the same highly qualified neurosurgeon. Patients were placed in a lateral position under general anesthesia for the whole procedure. All procedures used a suboccipital retrosigmoid approach: after the sigmoid sinus and transverse sinus were exposed, the cerebrospinal fluid was fully released and the tumor was resected piecemeal under the microscope. The blood supply of the tumor was recorded and confirmed jointly by the senior neurosurgeon and the assistant. Tumors were recorded as poor blood supply when there was little bleeding during resection, the blood was easily removed with an aspirator, and the microscopic field could always be kept clean. Tumors were recorded as rich blood supply when there was more bleeding during resection and the blood in the microscopic field was difficult to clear completely with a single aspirator.
The amount of intraoperative bleeding, estimated by the neurosurgeon at the end of surgery, was also recorded to assess the reliability of the grouping.
The region of interest (ROI) was drawn on T1-CE separately by two neurosurgeons with over 5 years of clinical experience using 3D Slicer software. Neurosurgeon A drew all ROIs manually, and Neurosurgeon B applied the Nvidia AI-Assisted Annotation (Nvidia AIAA) segmentation module 12 to draw ROIs semi-automatically by placing boundary points and then manually correcting any mis-drawn contours.
Figure. The radiomics workflow in this study. The patient's MRI data were acquired. After uniform preprocessing, two neurosurgeons drew the ROI separately for feature extraction. Samples were grouped by intraoperative record, and the stable features screened by ICC entered the machine-learning process to select the best model, which was then compared with the neurosurgeons' visual observation.
The 5-repeats-3-fold cross-validation method was used: the case samples were divided into a training set and a validation set at a ratio of 2:1, and the validation was repeated 5 times to obtain a total of 15 prediction results for each combination. Receiver operating characteristic (ROC) curves of the cross-validation were plotted on the validation sets for each combination of feature selection method and classifier, and the average area under the curve (AUC) was calculated. To diagnose potential learning problems, the learning curve of the best combination was plotted. The best combination was selected by its performance (AUC). The accuracy, sensitivity, and F1-score of the best model were also calculated; the F1-score considers both precision and sensitivity.
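The 5-repeats-3-fold scheme described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function name `repeated_stratified_kfold`, the random seed, and the stratification by class are assumptions (the text does not state whether the folds were stratified). The toy labels only mirror the group sizes reported in the Results.

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(labels, n_splits=3, n_repeats=5, seed=0):
    """Yield (train_idx, val_idx) pairs for repeated, class-stratified k-fold CV."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            # Deal indices round-robin so every fold keeps the class ratio.
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for k in range(n_splits):
            val = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, val

# Toy labels: 119 rich (1) and 72 poor (0) blood supply cases.
labels = [1] * 119 + [0] * 72
splits = list(repeated_stratified_kfold(labels))
```

Each of the 5 repeats produces 3 train/validation pairs at roughly 2:1, giving the 15 validation results per combination mentioned in the text.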
Artificial judgment. Artificial judgment was performed by two other neurosurgeons with over 5 years of clinical experience who were blinded to the intraoperative record; they identified MRI image features of both cohorts by visual observation. A tumor was predicted to have a rich blood supply when it showed higher signal on T1-CE sequences, multiple flow voids on the tumor surface or in the tumor parenchyma, and a solid structure with few cystic components. A tumor was predicted to have a poor blood supply when it showed lower signal on T1-CE sequences, no obvious tumor-related flow voids, and multiple cystic components. The predictions of the two neurosurgeons were recorded and compared with the gold standard (intraoperative record), and the accuracy, sensitivity, and F1-score were calculated and compared with the predictions of the machine-learning model.
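The performance measures named above follow directly from the confusion matrix of predictions against the intraoperative record. The sketch below is illustrative only: `binary_metrics` is a hypothetical helper, the labels are toy data, and it scores a single positive class (rich blood supply), whereas the paper reports weighted averages.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return precision, sensitivity (recall), F1-score and accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "sensitivity": sensitivity,
            "f1": f1, "accuracy": accuracy}

# Toy example: 1 = rich blood supply, 0 = poor blood supply.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```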

Statistical analysis.
The statistical analysis of baseline data was performed using IBM SPSS Statistics 21.
The quantitative data were analyzed using Student's t test and the qualitative data using Pearson's chi-square test; a p-value < 0.05 was considered statistically significant.
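As an illustration of the two-sample comparison, the t statistic can be recomputed from summary statistics alone. The sketch below plugs in the intraoperative blood loss figures reported in the Results. It assumes the pooled-variance (equal-variance) form of Student's t test and approximates the p-value with a normal distribution, which is reasonable only because the degrees of freedom are large; the original analysis was run in SPSS, so this is not the authors' procedure.

```python
import math
from statistics import NormalDist

def pooled_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t statistic (pooled variance) from group means, SDs and sizes."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean2 - mean1) / se
    # Assumption: normal approximation to the t distribution (large df).
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p_two_sided

# Blood loss in ml: poor group (n = 72) vs. rich group (n = 119).
t_stat, df, p_approx = pooled_t_from_summary(165.3, 156.6, 72, 251.9, 217.9, 119)
```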
Ethics approval. Approval from the institutional ethics committee was obtained for this retrospective study.

Results
Baseline data. Of the 191 patients, 109 were female and 82 were male, aged 20 to 82 years, with a mean age of 50.0 years (SD = 12.1). According to the intraoperative records, 119 patients were assigned to the rich blood supply group and 72 patients to the poor blood supply group. Baseline characteristics of the two groups are shown in Table 1. Sex and age did not differ significantly between the two groups (p > 0.05). The amount of intraoperative bleeding differed significantly (p < 0.05), with a mean of 165.3 ml (SD = 156.6) in the poor blood supply group and 251.9 ml (SD = 217.9) in the rich blood supply group. The distribution of intraoperative blood loss in the two groups is shown in Fig. 2.
Table 1. Sex, age and intraoperative blood loss distribution between the poor blood supply group and the rich blood supply group. Mean ± standard deviation.
Feature stability analysis. After the ICCs were calculated, 3918 features with high stability (ICC > 0.8) were retained for the feature selection process, and features with low stability (ICC < 0.8) were excluded.
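The stability filter can be sketched as below, where each feature has one value per patient from each of the two ROI sets. The ICC form is an assumption: the text does not say which variant was used, so the example uses ICC(2,1) (two-way random effects, absolute agreement, single measurement). The function name and the toy feature values are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a list of rows, one row per subject with k rater values."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-subject mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Toy data: feature values per patient from ROI set A and ROI set B.
features = {
    "stable_feature": [[1.0, 1.1], [2.0, 2.1], [3.0, 2.9], [4.0, 4.0]],
    "unstable_feature": [[1, 3], [2, 1], [3, 4], [4, 2]],
}
stable = [name for name, ratings in features.items() if icc_2_1(ratings) > 0.8]
```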
Feature selection and classification. 4 feature selection methods were combined with 4 classifiers, and a total of 16 combinations were applied. Table 2 shows the number of features extracted by different feature selection methods and their average AUC with different classifier combinations. LASSO had the best performance among the 4 feature selection methods, with an average AUC = 0.76 for its extracted features input to the 4 classifiers. The MLR classifier had the best performance among the 4 classifiers, with an average AUC = 0.76 for the 4 feature selection methods input to MLR for prediction. LASSO and MLR were selected as the best combination. The AUCs of all combinations are shown in Fig. 3A and Table 2. The learning curve (Fig. 3B) shows the change in model scores for the training set and cross-validation as the sample size increases. Figure 3C,D show the performance of this combination in 5-repeats-3-fold validations by ROC curves and the difference in the performance of the 4 classifiers when LASSO is used as the feature selection method.
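For reference, the AUC used to rank the combinations equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. A minimal sketch with toy scores follows; `roc_auc` and `mean_auc` are illustrative helpers, not the functions used in the study.

```python
from statistics import mean

def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auc(fold_results):
    """Average the AUC over (y_true, scores) pairs, one pair per validation fold."""
    return mean(roc_auc(y, s) for y, s in fold_results)

# Toy validation fold: three of the four positive/negative pairs are ordered correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Applying `mean_auc` to the 15 validation folds of one selector/classifier pair gives the kind of average AUC compared in Table 2.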
In LASSO feature selection, the optimal λ = 0.02947 was selected by the bootstrap method, yielding 41 radiomic features with non-zero coefficients. To reduce the effect of overfitting, and following the recommendation of previous studies 14,15, we kept only the 12 features with the largest absolute coefficients, according to the sample size of the training set in this study. The selected features and their coefficient weights in LASSO are listed in Table 3. In the t test + LASSO method, too many features differed at the conventional significance levels (α = 0.05 or α = 0.01) in the t-test, so we tightened the significance level to α = 0.0005 and obtained 96 features; LASSO selection then gave 25 features, of which we again kept only the 12 with the largest absolute coefficient weights. In the Student t-test, we retained 10 features that differed between the two groups at a significance level of 0.00003. In the ANOVA, we set the VarianceThreshold to 1e 19 and obtained 14 features.
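The cap of 12 features is consistent with a rule of at least 10 observations per predictor. The sketch below assumes a training set of about 127 cases (two thirds of 191), which is our reading of the 2:1 split rather than a figure stated in the text; `keep_top_features` and the toy coefficients are hypothetical.

```python
def keep_top_features(coefficients, n_train, obs_per_feature=10):
    """Keep the non-zero features with the largest |coefficient|, capped by sample size."""
    nonzero = {name: w for name, w in coefficients.items() if w != 0}
    cap = n_train // obs_per_feature
    ranked = sorted(nonzero, key=lambda name: abs(nonzero[name]), reverse=True)
    return ranked[:cap]

# Toy LASSO output: 45 features, one of which (feat_20) has a zero coefficient.
toy_coefficients = {f"feat_{i}": (i - 20) / 100 for i in range(45)}
n_train = 191 * 2 // 3  # assumed training-set size under a 2:1 split
selected = keep_top_features(toy_coefficients, n_train)
```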
Model validation and estimation. For the two neurosurgeons' visual prediction of tumor blood supply, the mean weighted-average precision, sensitivity, F1-score, and accuracy were 0.67, 0.63, 0.64, and 0.63, respectively. For the machine-learning model, the mean score of MLR in the 5-repeats-3-fold cross-validation was 0.787, the mean AUC was 0.88 (95% CI 0.806 to 0.932), and the weighted mean precision, sensitivity, F1-score, and accuracy for predicting rich blood supply were 0.87, 0.86, 0.83, and 0.83, respectively. Table 4 compares the two, with the machine-learning model outperforming visual observation on every performance measure.

Discussion
The concept of radiomics was introduced in 2012 to capture intra-tumoural heterogeneity non-invasively 16. It gives clinicians an easily accessible, low-cost means of mining a patient's imaging data for radiomic features that cannot be identified by visual observation, and applies artificial intelligence, machine learning, or statistical approaches to the acquired high-throughput data to guide clinical practice 17.
During surgery, the blood supply of the VS can significantly affect the operation. Untimely bleeding in hemorrhagic tumors makes it difficult for the operator to distinguish important local anatomical structures in the cerebellopontine angle region and makes electrocoagulation more likely to damage adjacent tissue 9. Some literature suggests preoperative embolization for tumors whose imaging indicates a markedly rich blood supply 6,18,19. Furthermore, in cases with a rich blood supply, spare aspirators should be prepared and care should be taken not to let heavy bleeding flow into the subarachnoid space; otherwise, the operator will need to devote considerable effort to hemostasis, which may lead to unscheduled blood transfusions or a longer operation. In clinical practice, the blood supply of the tumor is assessed mainly by visual identification of anatomical features on MRI and the degree of tumor enhancement on T1-CE. However, the visual predictions of the two neurosurgeons in this study showed that this method was not reliable in practice (precision = 0.67, sensitivity = 0.63). It is therefore necessary to develop a tool that predicts the blood supply of VS preoperatively more effectively. In this study, we established a reproducible model by combining a multisequence radiomics method with a machine-learning classifier through repeated sampling and cross-validation, which can preoperatively predict the blood supply of VS more accurately. As the comparison in Table 4 shows, the machine-learning model performed better than visual observation.
Our model holds promise for giving surgeons additional information during preoperative evaluation. For tumors with a rich blood supply, this includes a better estimate of the required operation duration to guide anesthesia, adequate preparation for blood transfusion or autologous blood transfusion, and backup suction sets made ready for the operative field.
The main advantage of radiomics over visual observation is that it extracts features that cannot be identified visually, so it obtains several orders of magnitude more candidate predictors. Take the T1-CE sequences of the poor and rich blood supply groups shown on the left side of Fig. 1 as an example: by visual observation, surgeons can recognize the tumor in the poor blood supply group by its multiple low-signal cystic lesions, while the tumor in the rich blood supply group is a homogeneous parenchymal tumor with higher signal. In practice, however, the MRI of most VS cases is not as typical as this example, and these empirical criteria are not sufficiently reliable. In this study, all morphological radiomics features were excluded during feature selection, and the retained features all related to the grayscale intensity distribution, which gives them a degree of interpretability. At common resolutions and contrasts, visual observation can hardly discern subtle differences between grayscale values. These characteristics are shown in Fig. 4 as pseudo-color maps, where radiomics downscales the two-dimensional grayscale images, then computes and extracts information from the resulting matrices. For example, First-Order features reflect the voxel intensity distribution within the image region defined by the ROI, and the Gray Level Co-occurrence Matrix (GLCM) quantifies gray-level dependencies in an image. A detailed explanation is available in the official documentation (https://pyradiomics.readthedocs.io).
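To make the two feature families concrete, the sketch below computes a few first-order statistics and a GLCM contrast value on a tiny 4 × 4 image with four gray levels. It is a simplified, two-dimensional illustration and does not reproduce the PyRadiomics implementation (which, among other things, bins intensities and aggregates over several directions); all function names are hypothetical.

```python
import math
from collections import Counter

def first_order(values):
    """Mean, energy and Shannon entropy of the gray levels inside an ROI."""
    n = len(values)
    counts = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"mean": sum(values) / n,
            "energy": sum(v * v for v in values),
            "entropy": entropy}

def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalised co-occurrence matrix for one pixel offset."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                matrix[i][j] += 1
                matrix[j][i] += 1  # count each pair in both directions
    total = sum(map(sum, matrix))
    return [[v / total for v in row] for row in matrix]

def glcm_contrast(p):
    """Contrast: co-occurrence probabilities weighted by squared gray-level difference."""
    size = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(size) for j in range(size))

# Toy image with gray levels 0..3.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
stats = first_order([v for row in image for v in row])
contrast = glcm_contrast(glcm(image, levels=4))
```

First-order values depend only on the histogram of gray levels, whereas the GLCM contrast changes if the same gray levels are rearranged spatially, which is the distinction drawn in the text.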
In the selection of patient MRI sequences, we used T1-CE as the background for drawing the ROI because most VS show significant enhancement on T1-CE 20,21, so the tumor boundary can be clearly outlined. To obtain more voxels, we extracted the T1WI, T2WI, and T2-FLAIR sequences, which are commonly used in the diagnosis of VS, and co-registered them to the ROI drawn on T1-CE 21,22. The ADC sequence has proved valuable in applications such as differential diagnosis, treatment response prediction, and prognostic judgment of VS 23,24, especially in distinguishing cystic from solid components in VS 25, so we initially included it in the study. However, the co-registration of ADC to T1-CE sequences was always unsatisfactory in terms of anatomical correspondence, probably because of distorted and uneven data caused by adjacent structures and the inhomogeneous internal structure of the lesion 25, which may increase confounding factors and reduce reproducibility 26. Giordano et al. applied multiple ROIs to reduce structural inhomogeneity of the lesion, but the workload of this method was too large, so we finally discarded the ADC sequences.
Table 4. Comparison of the performance results between the neurosurgeons in visual observation and the machine-learning model. The F1-score (F-score or F-measure) is a measure of a test's accuracy calculated from the precision and recall of the test. Mean ± standard deviation.
To obtain a model with better performance, we combined several feature selection methods and classifiers commonly used in the current medical context and finally selected LASSO for feature selection and MLR for modeling based on AUC. As Table 2 and Fig. 3A show, LASSO was more reliable than the other feature selection methods. This method has been widely used for high-dimensional data processing 27: a penalty term compresses the coefficients of unimportant features to zero, which reduces the dimensionality of the data. The performance of both the MLR and SVM classifiers in our results was satisfactory (0.88 vs. 0.84), and both are applicable to medium-sized samples; MLR probably performed better in this study because it usually performs well on binary data.
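How the L1 penalty drives coefficients exactly to zero can be seen in a small coordinate-descent sketch. It assumes the squared-error form of the LASSO objective, (1/2n)·||y − Xβ||² + λ·||β||₁, and a toy design with one informative and one nearly irrelevant feature; the solver actually used in the study is not described, so this is an illustration only.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / scale
    return beta

# Toy data: y = 2*x1 + 0.1*x2, so x2 carries almost no signal.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2.1, -1.9, 1.9, -2.1]
beta = lasso_coordinate_descent(X, y, lam=0.5)
```

With λ = 0.5 the weak feature is removed entirely while the strong one is only shrunk, which is the dimensionality-reduction behavior described above.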
In machine learning, too many features may lead to overfitting. The recommended ratio of sample size to predictor variables varies between studies from 20:1 to 10:1, but most recommend a minimum of 10 observations per predictor variable 14,15. We followed this principle to reduce the possibility of overfitting. As Fig. 3B shows, the scores of the training set and cross-validation approach a satisfactory value and stabilize as the sample size increases, so we believe that no significant overfitting occurred in the best combination. The transition of radiomics to practical clinical application has always been a critical issue, and one of its most important aspects is the reproducibility of radiomics feature acquisition. To reduce variation among cases and among different sequences of the same case, we applied the same preprocessing to all images, including co-registration, correction, normalization, and resampling with the same parameters. To minimize human variation, two neurosurgeons drew ROIs using different methods; we calculated the ICC of the radiomics features extracted from both sets of ROIs and excluded the features that differed substantially between them. However, these basic processes do not necessarily guarantee practical applicability. Schwier et al. pointed out that the type of image, the pre-processing, and the ROIs used to evaluate features can significantly change the repeatability of certain features, so the recommended practice is to publish as many details of the processing and parameter configuration as possible 28. On this basis, we have published as many details of the whole procedure as possible. The yaml file used in feature extraction and the MLR model constructed in feature classification are available in Supplemental material 1 and Supplemental material 2.
The actual blood supply classification relied mainly on the surgeons' intraoperative assessment. Although subjective errors were inevitable, we standardized the criteria as much as possible within our medical team, and excluded cases that were difficult to distinguish or whose intraoperative descriptions of blood supply were ambiguous. We collected and analyzed the patients' intraoperative blood loss and found that it could not be used as a basis for classifying blood supply. However, intraoperative blood loss differed statistically between the two groups defined by the surgeon's intraoperative assessment, which to some extent supports the reliability of the grouping. In addition, we collected perioperative hemoglobin concentrations from each patient at 6:00 p.m. preoperatively on the day of surgery and at 6:00 p.m. postoperatively on the day following surgery, and the analysis found no statistical or practical significance. Previous studies have shown that methods relying on perioperative blood indicators or on the volume of suction fluid are inaccurate for calculating perioperative blood loss 29-31, which may also explain the negative results of our analysis of perioperative hemoglobin concentration changes. A feasible quantification method is to calculate total intraoperative hemoglobin by measuring the hemoglobin concentration and volume of the suction fluid; this method is costly but relatively reliable and could be used to improve related studies.

Conclusion
Radiomics machine-learning classifiers are an effective method for predicting the blood supply of vestibular schwannoma from preoperative MRI data. They perform better than neurosurgeons' judgment by visual observation of MRI images and can provide more information to help neurosurgeons plan their operative strategy.

Data availability
The datasets used or analysed during the current study are available from the corresponding author on reasonable request.