Differentiating solitary brain metastases from glioblastoma by radiomics features derived from MRI and 18F-FDG-PET and the combined application of multiple models

This study aimed to explore the ability of radiomics derived from both MRI and 18F-fluorodeoxyglucose positron emission tomography (18F-FDG-PET) images to differentiate glioblastoma (GBM) from solitary brain metastases (SBM) and to investigate the combined application of multiple models. The imaging data of 100 patients with brain tumours (50 GBMs and 50 SBMs) were retrospectively analysed. Three model sets were built on MRI, 18F-FDG-PET, and MRI combined with 18F-FDG-PET using five feature selection methods and five classification algorithms. The model set with the highest average AUC value was selected, in which some models were selected and divided into Groups A, B, and C. Individual and joint voting predictions were performed in each group for the entire data. The model set based on MRI combined with 18F-FDG-PET had the highest average AUC compared with isolated MRI or 18F-FDG-PET. Joint voting prediction showed better performance than the individual prediction when all models reached an agreement. In conclusion, radiomics derived from MRI and 18F-FDG-PET could help differentiate GBM from SBM preoperatively. The combined application of multiple models can provide greater benefits.

www.nature.com/scientificreports/
than human eyes can recognize. It offers good performance in assessing the pathophysiology of tumours and distinguishing tumour characteristics 12,13. In recent years, radiomics has made considerable progress in the diagnosis of tumour and non-tumour diseases 14-17. Several studies have differentiated GBM from SBM using radiomics derived from MRI. For example, radiomic features extracted from peritumoral oedema areas in T1-weighted contrast-enhanced imaging (T1C) and T2-weighted imaging (T2) were used to differentiate GBM from SBM 18,19. These studies have shown good potential, but the effectiveness of radiomics based only on MRI is limited. An 18F-FDG-PET examination can reflect the metabolic characteristics of tumours at the molecular level, and it plays an essential role in tumour detection, staging, and efficacy evaluation 20. With the growing precision and personalization of clinical treatment, the value of 18F-FDG-PET in tumours has been increasingly recognized and promoted. Therefore, it is necessary to incorporate 18F-FDG-PET into brain tumour radiomics research. Radiomics based on 18F-FDG-PET has been used to differentiate lymphoma from glioma of the central nervous system 21. In a previous study, Zhang 22 explored different combinations of conventional MRI (cMRI, including T1C and T2), diffusion-weighted imaging (DWI), and 18F-FDG-PET images to establish radiomic models for differentiating SBM from GBM and found that the integrated model based on cMRI, DWI, and 18F-FDG-PET had the highest discriminative power between the two tumours. However, in the clinic, advanced sequences such as DWI are not as readily available as cMRI. Therefore, we hypothesized that radiomics features derived from cMRI combined with 18F-FDG-PET can also differentiate the two tumours better than MRI alone. Some previous studies on radiomics have shown that each classifier has advantages and limitations 23,24, so it is difficult to choose a single optimal model. Therefore, we aimed to build a variety of models and apply them jointly to obtain greater benefits.

Materials and methods
Study population. We retrospectively collected the imaging data of brain tumours in 100 patients (50 SBMs and 50 GBMs). To avoid inconsistencies in image acquisition and scanning parameters, which may affect the radiological characteristics and quantitative analysis 25, all MRI images were acquired on a single MRI scanner and all PET images on a single PET/CT scanner. The inclusion criteria were as follows: (a) glioblastoma or metastasis confirmed by surgery and pathology; (b) preoperative cranial MR imaging, including T2 and T1C, and preoperative cranial 18F-FDG-PET examination; and (c) an interval of no more than two weeks between the preoperative MRI examination and the 18F-FDG-PET examination. The exclusion criteria were as follows: (a) multiple tumours; (b) a history of brain tumour biopsy or treatment before the MRI and 18F-FDG-PET examinations; and (c) unqualified image quality with artefacts, or a tumour size of less than 1 cm. The patient selection process flowchart is shown in Fig. S1.

MRI/18F-FDG-PET protocol. MR images were obtained on a 3.0 T MRI system (Signa HDXT, GE Healthcare, Milwaukee, USA) with an 8-channel head coil. The main parameters of the T1C sequence were as follows: repetition time (TR) = 750 ms, echo time (TE) = 15 ms, slice thickness = 5 mm, and slice interval = 1 mm. The main parameters of the T2 sequence were as follows: TR = 8,000 ms, TE = 140 ms, flip angle = 90°, slice thickness = 5 mm, and slice interval = 1 mm.
A PET/CT scanner (Philips Gemini TF 64 PET/CT scanner) was used for 18F-FDG-PET data acquisition. The participants fasted for at least 4 h before 18F-FDG (produced with a Sumitomo accelerator, Japan, with a radiochemical purity of > 95%) was injected intravenously at a dose of 5.55 MBq/kg, and they then rested in a quiet, dim room for 40-60 min before PET/CT scanning. A PET/CT scan of the head was performed in one bed position (5 min/bed position) with a slice thickness of 2 mm. The 18F-FDG-PET images acquired from the PET/CT system were calibrated on the PET/CT workstation, where the DICOM-format 18F-FDG-PET images were interpolated to double their physical resolution.
Image preprocessing and segmentation. First, MRI and 18F-FDG-PET data were imported into DicomBrowser software (https://nrg.wustl.edu/software/dicom-browser) for data anonymization, and the anonymized images were loaded into 3D-Slicer (version 4.11, https://www.slicer.org) for registration. The T1C and 18F-FDG-PET images were each registered to the T2 images. A radiologist with 5 years of experience delineated the tumour and the oedema area around the tumour on the T2 images. After all delineations were complete, a neuroimaging doctor with 10 years of experience reviewed, modified where necessary, and finalized the delineated area. The region of interest (ROI) was copied to the corresponding layers of the registered T1C and 18F-FDG-PET images. In this way, the mask data for each of the three sequences were formed. Both doctors were blinded to the pathological type of every case.

Feature selection and model building. For all MRI data, the hybrid white-stripe method was used to normalize signal intensity and avoid data heterogeneity bias 26. With reference to the Image Biomarker Standardization Initiative (IBSI), the radiomics features of T2, T1C, and 18F-FDG-PET were extracted using the PyRadiomics package for Python. All features were extracted from the original and derived images. The derived images were produced with a wavelet filter (Wavelet) and a Laplacian of Gaussian filter (LoG). A t test was performed on the features extracted from the GBM and SBM cases to eliminate features with no significant difference. The features retained by the t test were then reduced to effective features using five dimensionality reduction methods: linear discriminant analysis (LDA), principal component analysis (PCA), partial least squares regression (PLS), neighbourhood component analysis (NCA), and least absolute shrinkage and selection operator (LASSO) 27.
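The t-test filtering step described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names, the feature names, the toy values, and the use of a pooled-variance Student's t statistic compared against a fixed critical value are all assumptions made for the example.

```python
import math
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance (equal variances assumed)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def filter_features(gbm, sbm, t_crit):
    """Keep the feature names whose |t| exceeds the two-sided critical value.

    gbm and sbm map a feature name to the list of its values in that class.
    """
    return [name for name in gbm if abs(student_t(gbm[name], sbm[name])) > t_crit]


# Toy values for two hypothetical features; these are not data from the study.
gbm = {"glcm_contrast": [5.1, 4.8, 5.3, 5.0], "firstorder_mean": [1.0, 1.2, 0.9, 1.1]}
sbm = {"glcm_contrast": [3.9, 4.1, 4.0, 3.8], "firstorder_mean": [1.1, 0.9, 1.0, 1.2]}

# 2.447 is the two-sided 5% critical value of Student's t for 6 degrees of freedom.
selected = filter_features(gbm, sbm, t_crit=2.447)
```

In this toy case only the feature whose class means clearly differ survives the filter; the surviving features would then be passed to the dimensionality reduction methods.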
Both LDA and PCA are linear dimensionality reduction methods that project the original n-dimensional dataset onto a new, lower-dimensional dataset through a linear transformation (an orthogonal one in the case of PCA). PLS models the covariance structure between the independent and dependent variables in the two-variable space to achieve dimensionality reduction. NCA uses the Mahalanobis distance as its distance measure; its transformation matrix, which reduces the dimensionality of the original data, is learned by iteratively optimizing classification accuracy. LASSO applies L1-regularized linear regression, which shrinks some of the learned feature weights to exactly zero, thereby producing a sparse feature set and reducing the data dimensionality.

Five classification algorithms were chosen: support vector machine (SVM), logistic regression (LR), K nearest neighbours (KNN), random forest (RF), and adaptive boosting (AdaBoost). SVM performs well in machine learning tasks with small samples 28. The LR classifier runs fast but places higher demands on feature engineering 29. The KNN classification algorithm is simple and effective, but it involves a large number of calculations during classification and requires considerable memory 30. RF reduces the risk of overfitting by averaging decision trees. It is a fairly stable classification method, but its calculation is complex and the model takes more time to train 31. AdaBoost is an important ensemble learning technique that turns a weak learner, whose prediction accuracy is only slightly higher than random guessing, into a strong learner with higher prediction accuracy 32. The disadvantage of this classifier is that it is more sensitive to outliers.
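For reference, the L1-regularized regression behind LASSO can be written in its standard textbook form. The symbols below (n cases, p features, feature vector x_i, label y_i, intercept \beta_0, penalty weight \lambda) are notation introduced here for illustration and are not taken from the paper.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
  \; \frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - \beta_0 - x_i^{\top}\beta \bigr)^2
  \; + \; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Features whose estimated coefficient \hat{\beta}_j equals zero are discarded; a larger \lambda zeroes more coefficients and therefore yields a sparser feature set.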
The entire dataset was split into a training cohort (GBM: n = 39, SBM: n = 41) and a validation cohort (GBM: n = 11, SBM: n = 9) at a ratio of 8:2 by stratified sampling using computer-generated random numbers, and 25 models were generated by applying fivefold cross-validation with the five dimensionality reduction methods and the five classification algorithms. Each model was named by combining the names of its dimensionality reduction method and its classification algorithm, e.g., "LASSO_LR" denotes the combination of the LASSO dimensionality reduction method and the LR classification algorithm. Three model sets were built: the 25 models using both 18F-FDG-PET and MRI data formed the integration set, the 25 models using MRI data alone formed the MRI set, and the 25 models using 18F-FDG-PET data alone formed the PET set.
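The 5 × 5 model grid and its naming scheme can be sketched as follows. The helper names are hypothetical, and the fold generator assumes that fivefold cross-validation is run on the 80-case training cohort with cases already shuffled; the paper does not state these details.

```python
from itertools import product

REDUCERS = ["LDA", "PCA", "PLS", "NCA", "LASSO"]
CLASSIFIERS = ["SVM", "LR", "KNN", "RF", "AdaBoost"]


def model_names(reducers=REDUCERS, classifiers=CLASSIFIERS):
    """Return names such as 'LASSO_LR' for every reducer/classifier pair."""
    return [f"{reducer}_{classifier}" for reducer, classifier in product(reducers, classifiers)]


def kfold_indices(n_samples, n_folds=5):
    """Yield (training, held_out) index lists for contiguous folds.

    Assumes the cases were shuffled beforehand; no shuffling is done here.
    """
    indices = list(range(n_samples))
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0) for i in range(n_folds)]
    start = 0
    for size in sizes:
        held_out = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, held_out
        start += size


names = model_names()             # 25 names, one per model in a set
folds = list(kfold_indices(80))   # 80 = assumed size of the training cohort
```

Each of the three model sets (integration, MRI, PET) would reuse the same 25 names with a different input feature table.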

Scientific Reports
Individual and combined application of models. After the three model sets were built, the receiver operating characteristic (ROC) curve of each model was drawn, and the area under the ROC curve (AUC) was calculated. The differences in the average AUC among the three model sets were compared. The model set with the highest average AUC was selected, and its 25 models were ranked by AUC. To verify the performance stability of models at different AUC levels, 15 models at three AUC levels were selected and divided equally into three groups. To ensure a clear difference in AUC between the groups, we defined the 5 models ranked 1st-5th by AUC as Group A, the 5 models ranked 11th-15th as Group B, and the 5 models ranked 21st-25th as Group C.
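The ranking and grouping rule can be illustrated with a short sketch. The pairwise AUC formula is the standard rank-based definition; the function names, the model names, and the AUC values in the example are placeholders, not results from the study.

```python
def auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def pick_groups(model_auc):
    """Rank 25 models by AUC and return the models ranked 1-5, 11-15, and 21-25."""
    ranked = sorted(model_auc, key=model_auc.get, reverse=True)
    return {"A": ranked[0:5], "B": ranked[10:15], "C": ranked[20:25]}


# Toy check of the AUC helper: three of the four positive/negative pairs are ordered correctly.
toy_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])

# Placeholder AUC values for 25 hypothetical models, strictly decreasing by index.
model_auc = {f"model_{i:02d}": 0.95 - 0.01 * i for i in range(25)}
groups = pick_groups(model_auc)
```

Skipping ranks 6-10 and 16-20 is what keeps the three groups separated in AUC.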
Individual and combined application of the five models in each of the three groups was performed. Equal weighting and a simple majority vote 33 were used to explore the combined performance of the five models in each group. During this process, each model was regarded as a specialist and given the same weight in the diagnosis. The final diagnosis was made according to the simple majority rule 34,35. For instance, a case was determined to be GBM when at least three of the five models predicted it to be GBM. According to the consistency of the voting results, three agreement patterns were obtained: the 3A pattern, in which exactly three of the five models agreed on the prediction (GBM or SBM); the 4A pattern, in which four models agreed; and the 5A pattern, in which all five models agreed. Accuracy, sensitivity, and specificity were used to evaluate the performance of individual and joint voting prediction.
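The equal-weight majority vote and the agreement patterns can be sketched as follows. This is an illustrative sketch with hypothetical function names and toy votes; treating GBM as the positive class for sensitivity and specificity is an assumption, since the paper does not state which class is positive.

```python
from collections import Counter


def joint_vote(predictions):
    """Return (majority label, agreement pattern) for five binary model predictions."""
    if len(predictions) != 5:
        raise ValueError("expected predictions from exactly five models")
    label, count = Counter(predictions).most_common(1)[0]
    return label, f"{count}A"  # with two classes, count is 3, 4, or 5


def _ratio(numerator, denominator):
    return numerator / denominator if denominator else float("nan")


def evaluate(truth, predicted, positive="GBM"):
    """Accuracy, sensitivity, and specificity, with `positive` as the positive class (assumed)."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }


# Toy votes from five models for four cases; not data from the study.
votes = [
    ["GBM"] * 5,
    ["GBM", "GBM", "GBM", "SBM", "SBM"],
    ["SBM", "SBM", "SBM", "SBM", "GBM"],
    ["SBM"] * 5,
]
truth = ["GBM", "SBM", "SBM", "SBM"]

voted = [joint_vote(v) for v in votes]
overall = evaluate(truth, [label for label, _ in voted])

# Restrict the evaluation to the cases on which all five models agreed (5A pattern).
five_a = [(t, label) for t, (label, pattern) in zip(truth, voted) if pattern == "5A"]
unanimous = evaluate([t for t, _ in five_a], [label for _, label in five_a])
```

Evaluating each pattern separately, as in the last two lines, is how the performance of the 3A, 4A, and 5A subsets can be compared with that of the individual models.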
The entire workflow of our research is shown in Fig. 1.

Statistical analysis.
Pearson's chi-square test was used to compare the sex difference between GBM and SBM in the entire dataset, the training cohort, and the validation cohort. Student's t test was applied to compare the age difference between GBM and SBM. The Mann-Whitney U test was used to compare the distributions of AUC between each pair of the three model sets. These analyses were carried out with SPSS 19.0 statistical software (https://www.ibm.com/products/spss-statistics). DeLong's test was performed with Python 3.8 (https://www.python.org/downloads/release/python-380) to assess the differences in the AUC values of the models. All statistical tests were two-sided, and P values of less than 0.05 were considered statistically significant.
Statement. This study complies with the Declaration of Helsinki, and research approval was granted by the Biomedical Research Ethics Committee of Chongqing Medical University with a waiver of informed consent.

Results
No significant differences in age or sex were found between GBM and SBM in the entire dataset, the training cohort, or the validation cohort used for individual and joint voting prediction. No significant differences were found between GBM and SBM in anatomical characteristics, necrosis appearance, or oedema appearance (see Table 1).

The integration set was finally selected, with an accuracy (ACC) of 0.67-0.89, a sensitivity of 0.66-0.88, and a specificity of 0.65-0.92 (Table 3). The fivefold mean performance indicators in the validation cohort for the 15 models selected from the integration set are shown in Table 3. The results of individual and joint model voting prediction in the training and validation cohorts are shown in Table S1. Compared with individual prediction, joint model voting prediction showed that different agreement patterns had different classification performances (Fig. 5). In Group A, the 5A pattern showed the highest sensitivity, specificity, and accuracy in the training cohort (0.96, 0.97, and 0.97, respectively) and in the validation cohort (all 1.0). In Group B, the 5A and 3A patterns showed the highest sensitivity (both 1.0), the 4A pattern showed the highest specificity (1.0), and the 5A pattern showed the highest accuracy (0.98) in the training cohort; in the validation cohort, the 5A and 3A patterns showed the highest specificity (both 1.0), the 4A pattern showed the highest sensitivity (1.0), and the 5A pattern showed the highest accuracy (0.90). In Group C, the 5A and 4A patterns showed the highest sensitivity (1.0), specificity (1.0), and accuracy (1.0) in the training cohort, the 5A pattern showed the highest sensitivity (1.0), and

The proportions of the agreement patterns also differed among the model groups (Fig. 6).

Discussion
This study extended most previous radiomics studies, which extracted features from enhancing tumour regions and peri-enhancing oedema regions on cMRI sequences to differentiate GBM from SBM, by further incorporating 18F-FDG-PET features that reflect tumour molecular metabolism. MRI-based radiomics has been used to differentiate GBM from SBM in previous studies. In the study of Su et al. 36, T1C-based radiomics analysis yielded AUC values of 0.82 and 0.81 in the training and validation cohorts, respectively. In the study of Ortiz et al. 37, the AUC of the T1WI-based radiomics model was 0.896 ± 0.067. Bae et al. 19 extracted radiomics features from T1C tumour enhancement and T2 peritumoral oedema areas and established a best conventional model with an AUC of 0.89. We combined 18F-FDG-PET and cMRI features to build 25 multimodal radiomics models. The result was satisfactory: the best model achieved an AUC of 0.93, higher than those of the cMRI-only studies mentioned above, although lower than the AUC of 0.98 of the integrated model in Zhang's study 22. In our study, the multimodal models integrating 18F-FDG-PET and cMRI improved AUC values compared with radiomics models derived from 18F-FDG-PET alone or cMRI alone. This is consistent with our hypothesis that multimodal radiomics would better distinguish GBM from SBM. Unlike most previous studies, in which the best model was selected from multiple models, we not only obtained the best model but also found that the combination of multiple models was more beneficial. In the three model sets, almost all the models using the LASSO feature selection method had comparatively high AUC values, suggesting that LASSO is a reliable feature selection method. The SVM classifier can identify the samples most informative for the prediction task (the support vectors) in high-dimensional feature data.
In Group A of the integration set, the top two models both used the SVM classifier, which suggests that SVM also has strong generalization ability with a small sample size. LASSO_SVM was the optimal model in all three model sets in our study (with an AUC of 0.93 in the integration set, 0.89 in the MRI set, and 0.85 in the PET set). This is consistent with the results of Qian 38, in which 84 models were built and LASSO_SVM was selected as the best model, with an AUC of 0.9. Performance may differ among models even when they are based on the same data. Therefore, we explored the value of applying different models in combination instead of only choosing the best model. The combined application of multiple models is similar to multidisciplinary teamwork (MDT) in clinical practice, where collaboration between specialists is important for making comprehensive and correct decisions 39,40. In this study, the 5A pattern of joint voting improved accuracy, sensitivity, and specificity to varying degrees compared with individual prediction. The performance of the 4A and 3A patterns showed a downward trend, and the average level of individual prediction outperformed the 3A pattern, which is similar to the results of Dong 41. In the combined application of multiple models, it was interesting that the five models in Group A, which had higher AUC values than those in Groups B and C, were more likely to reach an agreement (the highest proportion of the 5A pattern and the lowest proportion of the 3A pattern). In Groups B and C, whose models had lower AUC values, applying the 5A pattern improved the prediction performance more markedly. Similar results can also be found in the MRI and PET sets (Tables S3-S4 and Figs. S2-S5). Both this study and a previous one 41 have shown that combining multiple models can be beneficial, and that the benefit is more obvious when the performance of the individual models is poor.
Although the performance of our model is not as good as that of the optimal model established in Zhang's study 22, our study further confirms that adding PET features reflecting tumour metabolism distinguishes SBM from GBM better than a radiomics model based solely on cMRI features. On this basis, our study also provides a practical way to address poor model performance in radiomics studies.
This study has some limitations. First, the image data were obtained with the same MR and PET/CT scanners, so the sample we obtained was relatively small. Although the models performed well, their generalization ability still needs to be verified in a larger sample. In practice, it is difficult to obtain image data with consistent scanning parameters across different medical institutions or even within the same institution. In addition, a simple voting method with equal weights was adopted for the joint application of multiple models; other joint application methods, and comparisons between them, remain to be explored.

Conclusion
Radiomics derived from cMRI and 18F-FDG-PET can help differentiate GBM from SBM preoperatively, which may achieve greater benefits in clinical practice. Multimodal radiomics based on MRI and 18F-FDG-PET is expected to become a powerful research method for the differentiation of intracranial tumours. The combined application of multiple models inspired by MDT can generate extra benefit, especially when the performance of the model is mediocre. The combined application of multiple models can be used as a new method in radiomics research.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.