Abstract
This study tests the generalisability of three Brain Tumor Segmentation (BraTS) challenge models using a multi-center dataset of varying image quality and incomplete MRI datasets. In this retrospective study, DeepMedic, no-new-Unet (nn-Unet), and NVIDIA-net (nv-Net) were trained and tested using manual segmentations from preoperative MRI of glioblastoma (GBM) and low-grade gliomas (LGG) from the BraTS 2021 dataset (1251 in total), in addition to 275 GBM and 205 LGG acquired clinically across 12 hospitals worldwide. Data were split into 80% training, 5% validation, and 15% internal test data. An additional external test set of 158 GBM and 69 LGG was used to assess generalisability to other hospitals’ data. All models’ median Dice similarity coefficients (DSC) for both test sets were within, or higher than, previously reported human inter-rater agreement (range of 0.74–0.85). For both test sets, nn-Unet achieved the highest DSC (internal = 0.86, external = 0.93) and the lowest Hausdorff distances (10.07, 13.87 mm, respectively) for all tumor classes (p < 0.001). With sparsified training applied, missing MRI sequences did not significantly affect performance. nn-Unet achieves accurate segmentations in clinical settings even in the presence of incomplete MRI datasets. This facilitates future clinical adoption of automated glioma segmentation, which could help inform treatment planning and glioma monitoring.
Introduction
Clinically accurate segmentation and longitudinal volumetric analysis of glioma are helpful in treatment planning and response monitoring1,2. Volumetric analyses are not commonly used in clinical practice and are generally limited to crude 2D measurements in clinical trials. While this is the current standard for treatment response evaluation in trials3, poor prognosis and heterogeneous treatment response encourage quantitative analysis of tumors, especially for glioma due to their varied morphometry and infiltrative nature4,5,6,7. It is these two characteristics of glioma, along with heterogenous contrast enhancement, that complicate their manual delineation and further highlight the need for automated segmentation protocols in the clinical setting8,9,10. Indeed, baseline imaging and volumetric measurements are of particular importance to neurosurgeons and radiotherapists because tumor volume and functional anatomy are key factors for both risk and prognostic assessment of patients11,12.
The VASARI features have illustrated the importance of extracting such quantitative measures, but automation of segmentation and subsequent feature extraction is needed to enable widespread application13,14. Automated quantification could provide improvements in reporting time, treatment response monitoring, and overall efficiency across a neuroradiological service, but is dependent upon technical and clinical validation of the methods15,16,17. Deep learning has emerged as the preferred method for automated tumor segmentation6,18,19,20,21. Ideally, the clinical environment requires a fast algorithm that is robust to scanner variation and missing MRI sequences.
Since 2012, the annual Brain Tumor Segmentation (BraTS) Challenge has compared the performance of numerous AI-driven glioma segmentation algorithms18,22. However, these algorithms are trained and assessed on a highly curated dataset optimised for quality: each subject has a complete dataset of high-quality pre- and post-contrast T1-weighted (T1w and T1c, respectively), T2-weighted (T2w), and T2-weighted fluid-attenuated inversion recovery (FLAIR) images, which does not accurately reflect the realities of clinically-acquired MRI data. For example, a recent study using a model (DeepMedic) trained exclusively on BraTS data achieved a median Dice similarity coefficient (DSC) of 0.81 on BraTS test data but only 0.49 on external clinical data23.
The aim of the current study was to determine the performance and generalisability of three of the highest-performing models at recent BraTS challenges24,25,26 on real-world clinical data. Models have been trained with both BraTS data and another multi-centre dataset obtained from 12 different hospitals worldwide: the PICTURE project (www.pictureproject.nl)27,28,29,30,31. An external test set comprised of PICTURE data from hospitals not used in the training and validation phases was employed to assess the clinical applicability and determine the need for retraining models on a hospital’s own data. Furthermore, we use sparsified training, to account for missing sequences23, and assessed performance in patients with incomplete MRI datasets.
Materials and methods
All patients provided informed consent, and data were obtained and anonymized in accordance with the General Data Protection Regulation and the Health Insurance Portability and Accountability Act. Local institutional review board approval was obtained for all primary studies; for the VU medical center Amsterdam, the institutional review board approved the experiments in this study under case nr. 2014.336. Of the patients involved in the current study, 40 were previously studied in an inter-rater agreement study by Visser et al.29, and the 275 glioblastoma patients from the PICTURE dataset were previously used in a study on robust tumor core segmentation in glioblastoma patients by Eijgelaar et al.23. The study was carried out in concordance with the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) guidelines32.
BraTS and PICTURE Datasets
We used manual segmentations of preoperative imaging of 1251 gliomas (unspecified mix of GBM and LGG) from the BraTS 2021 dataset and 275 GBM and 205 LGG (median age, 63.7 IQR [54.3–72.0] years; median survival, 323 [142–609] days; surgery extent: 348 resections, 83 biopsies, 49 unknown) from the PICTURE project. The PICTURE dataset was collected across 12 hospitals worldwide; all patients aged at least 18 years with a newly diagnosed LGG or GBM at first-time surgery between 1/1/2012 and 12/31/2013 were included. Because the PICTURE data were collected in 2012 and 2013, GBM and LGG were classified according to the WHO 2007 criteria. Demographics for the PICTURE dataset are documented in Appendix 1 of the supplementary material.
Missing scans
Both datasets contain pre-operative T1w, T1c, T2w, and FLAIR images. However, in the PICTURE dataset some patients had missing sequences; see Table 1 for a breakdown and Fig. 1 for examples of subjects with a missing FLAIR or T2w. Only patients with at least a T1c and either a T2w or FLAIR image were included, so that all tissue classes could be segmented manually. Out of 1731 total cases, there were 204 missing pre-contrast T1w, 186 missing T2w, and 19 missing FLAIR images; see “Model training and testing” for details of the sparsified training used to account for missing sequences.
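The inclusion rule above can be expressed compactly; `meets_inclusion_criteria` is a hypothetical helper written purely for illustration, not code from the study:

```python
def meets_inclusion_criteria(available):
    """Return True if a patient's available sequences allow manual
    segmentation of all tissue classes: a T1c is required, plus at
    least one of T2w or FLAIR. `available` is a set of sequence names."""
    return "T1c" in available and ("T2w" in available or "FLAIR" in available)
```

A patient with only a T1c, or with T2w and FLAIR but no T1c, would be excluded under this rule.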
Pre-processing
T1w, T2w, and FLAIR images were rigidly registered to the T1c image. Subsequently, the T1c was registered to the SRI24 atlas (https://www.nitrc.org/projects/sri24/)33 using an affine transformation, and the same transform was applied to the other MR sequences (T1w, T2w, FLAIR). All modalities were resampled to 1 mm isotropic voxels in SRI24 atlas space; the rigid and affine registrations were composed and applied in a single interpolation step. All registrations and resampling were conducted using the Advanced Normalization Tools (ANTs)34. N4 bias field correction35 was applied and skull stripping was performed with the HD-BET algorithm (https://github.com/MIC-DKFZ/HD-BET)36.
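Purely as a sketch (not the authors' pipeline code), this chain of steps could be driven via the ANTs and HD-BET command-line tools. The file names and output prefixes below are hypothetical, and in practice the rigid and affine transforms would be composed with `antsApplyTransforms` so that only one interpolation is performed; here we only assemble the command strings:

```python
def build_preprocessing_commands(subject="sub01"):
    """Assemble the per-subject preprocessing commands described above:
    rigid registrations to T1c, affine registration to SRI24, N4 bias
    field correction, and HD-BET skull stripping. Hypothetical names."""
    cmds = [
        # Rigid (-t r) registration of T1w/T2w/FLAIR to the T1c
        *(f"antsRegistrationSyN.sh -d 3 -t r -f {subject}_T1c.nii.gz "
          f"-m {subject}_{seq}.nii.gz -o {subject}_{seq}_toT1c_"
          for seq in ("T1w", "T2w", "FLAIR")),
        # Affine (-t a) registration of the T1c to the SRI24 atlas
        f"antsRegistrationSyN.sh -d 3 -t a -f SRI24.nii.gz "
        f"-m {subject}_T1c.nii.gz -o {subject}_T1c_toSRI24_",
        # N4 bias field correction of the atlas-space T1c
        f"N4BiasFieldCorrection -d 3 -i {subject}_T1c_toSRI24_Warped.nii.gz "
        f"-o {subject}_T1c_n4.nii.gz",
        # Skull stripping with HD-BET
        f"hd-bet -i {subject}_T1c_n4.nii.gz",
    ]
    return cmds
```

Each string could then be passed to a shell or `subprocess.run` on a machine where ANTs and HD-BET are installed.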
Manual segmentations
For the PICTURE data, 275 GBM and 205 LGG cases were manually segmented into 3 classes consistent with the BraTS challenges – whole tumor (WT), tumor core (TC), and enhancing tumor (ET), see Fig. 1. The WT defines the full extent of the tumor, including the tumor core and oedema, indicated by hyperintensity on FLAIR and T2w. The TC is the main body of the tumor and most likely area of resection. The TC includes the enhancing tumor (ET) and necrosis.
Manual segmentations were carried out according to the VASARI Research Project (https://wiki.cancerimagingarchive.net/display/Public/VASARI+Research+Project). One rater (HP) with 9 years of brain MRI manual segmentation experience performed segmentations under the supervision and approval of an expert neuroradiologist (FB), using the semiautomatic SmartBrush tool (BrainLab, Feldkirchen, Germany). The rater’s performance was in line with experts29. All segmentations were exported on the T1c image. The segmentation was resampled to SRI24 atlas space using the same transform from the T1c to SRI24 registration.
Quality control
Visual quality control checks were carried out for incomplete coverage, skull stripping, registration errors, and incomplete segmentations. Overview images were generated to facilitate quality control. The images show the same axial, sagittal, and coronal view for all patients to assess the registration quality, as well as an axial view of the center of the tumor to verify the segmentation. Seven scans were not included due to poor image quality and five due to severe registration errors (as illustrated in Appendix 2).
Deep learning segmentation models
Three algorithms were selected for this study based on high performance in recent BraTS challenges18,22,37, availability of a user-friendly and reproducible implementation online, and the uniqueness of the algorithm, see Table 2.
Model training and testing
Models were trained with three-class segmentations (WT, TC, ET) for each tumor. The scans were randomly split in 80% training, 5% validation, and 15% internal test data, see Table 1. Test data was used to assess the performance of each model. Alongside the 15% internal test data, models were further assessed using an external test set of 158 GBM and 69 LGG patients from PICTURE hospitals not included in the training data, herein referred to as the external test set. This helped to gauge the generalisability and determine the future need for retraining algorithms on a new hospital’s unseen data.
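As a purely illustrative sketch (not the authors' code), the random 80/5/15 split described above could be implemented as:

```python
import numpy as np

def split_cases(case_ids, seed=0):
    """Shuffle case IDs and slice them into 80% training, 5% validation,
    and 15% internal test partitions. The seed is an assumption added
    here for reproducibility of the sketch."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(case_ids)
    n = len(ids)
    n_train, n_val = int(0.80 * n), int(0.05 * n)
    return (ids[:n_train],                 # training
            ids[n_train:n_train + n_val],  # validation
            ids[n_train + n_val:])         # internal test
```

Note that the external test set is not produced by this split: it is held out entirely, by hospital, before training.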
In order to address missing sequences in the training data (Table 1), sparsified training was applied for all algorithms23. That study showed that performance drops substantially when not all sequences are available, which can be addressed by inserting empty (zero-filled) scans in place of missing sequences; see the first column of Fig. 1. During training, the T1w, T2w, and FLAIR were additionally set to zero with independent probabilities of 20%, in line with the estimated frequency of missing sequences in the clinical setting23. We used the validation data to confirm convergence; the hyperparameters of all models were kept at the default values, as reported in the associated papers or as used in the published code repositories (Table 2). All model training and testing was carried out on a machine equipped with an AMD Ryzen 9 3900X 12-core processor, 64 GB RAM, and one NVIDIA RTX 3090 (24 GB) graphics processing unit (GPU).
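The zero-filling augmentation at the heart of sparsified training can be sketched as follows; the channel ordering and function name are assumptions made for illustration, not taken from the published implementations:

```python
import numpy as np

# Assumed channel order for a (4, D, H, W) multi-sequence input volume.
CHANNELS = ("T1c", "T1w", "T2w", "FLAIR")

def sparsify(volume, p=0.2, rng=None):
    """During training, zero-fill each non-T1c channel with independent
    probability `p` (20% in the study), mimicking missing sequences.
    The T1c is always kept, matching the inclusion criteria."""
    rng = rng or np.random.default_rng()
    out = volume.copy()
    for i, name in enumerate(CHANNELS):
        if name != "T1c" and rng.random() < p:
            out[i] = 0.0
    return out
```

At inference time, genuinely missing sequences are replaced by the same zero-filled volumes, so the network sees inputs it was trained to handle.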
Model performance assessment
In line with the BraTS challenges, tumor segmentations for each algorithm were assessed using median and inter-quartile range Dice similarity coefficient (DSC)38 and Hausdorff distance (HD)39 for all experiments. Results were generated using methods described by Taha and Hanbury40 and associated software.
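For illustration, the two metrics can be computed as below. BraTS-style evaluations often report a percentile HD on tumor surfaces; this sketch, which is not the evaluation code used in the study, computes the classic maximum HD over all foreground voxels:

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient between two boolean masks:
    2|A∩B| / (|A| + |B|); defined as 1.0 when both masks are empty."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def hausdorff(a, b):
    """Symmetric Hausdorff distance (in voxels) between the foreground
    voxel sets of two masks, via a brute-force pairwise distance matrix."""
    pa = np.argwhere(a).astype(float)
    pb = np.argwhere(b).astype(float)
    d = np.sqrt(((pa[:, None, :] - pb[None, :, :]) ** 2).sum(-1))
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

The brute-force distance matrix is fine for small examples; production tools (e.g. the software of Taha and Hanbury40) use more efficient surface-based implementations.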
Experiments and statistical analyses
Four separate experiments were performed. In experiment 1, DSC and HD from the internal and external test sets were analyzed separately for each model/tumor class using a paired two-tailed t-test to assess differences between each model. DSC and HD were also compared in the following experiments using independent samples (Welch’s) t-tests: experiment 2—GBMs vs LGGs on the internal and external test set to assess differences in performance on the differing tumor grades; experiment 3—GBMs and LGGs from the internal vs external test sets to assess the change in performance when segmenting external hospital data not previously seen by the models; and experiment 4—GBM and LGG patients with incomplete vs complete imaging datasets in the external test set to assess the change in performance when segmenting patients with incomplete imaging datasets from external hospital data not yet seen by the models. A single Bonferroni correction was applied for each experiment41. Outliers in box plots and overall outlier rates for each model and segmentation class were calculated using the IQR × 1.5 rule, i.e. outside [Q1 − 1.5 × IQR; Q3 + 1.5 × IQR]42,43.
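The IQR × 1.5 outlier rule used for the box plots and outlier rates can be sketched as follows (illustrative only):

```python
import numpy as np

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers.
    Returns a boolean mask and the overall outlier rate."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mask = (x < lo) | (x > hi)
    return mask, mask.mean()
```

For example, applied to per-case DSC values, the returned rate corresponds to the percentage outlier figures reported in “Outlier rates” below.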
Results
PICTURE dataset segmentations
See Fig. 2 for ground truth manual segmentation and automated segmentation examples from all three models. See Appendix 3 and 4 for GBM and LGG segmentation contours on a 4 T T1c scan along with two human experts’ manual segmentation, for all tumor classes.
Experiment 1—Segmentation performance on both test sets—which model achieved the best metrics?
Box plots showing median DSC and HD for all models and tumor classes on the internal and external test sets are presented in Fig. 3. nn-Unet achieved significantly higher DSCs than nvNet and lower HDs than both nvNet and DeepMedic for all tumor classes on both the internal and external test sets (p values < 0.0027). The raw metrics are displayed in Table 3.
Experiment 2—Segmentation performance on GBM vs LGG
Comparing performance between GBM and LGG, nn-Unet continued to provide the best quality results for both tumor grades; statistical comparisons are reported in Table 4. However, overall segmentation performance on LGG was notably weaker than on GBM across all models (see Fig. 4 for box plots). DeepMedic showed the largest decrease in performance across all tumor classes.
Experiment 3—GBM segmentation performance on an external test set—do models need to be retrained for new hospital data?
As shown in experiment 1, nn-Unet produced the most favourable results when compared to the other models on both test sets. Table 3 shows the DSC and HD results for the internal test set (15 GBM, 23 LGG) and the external test set (158 GBM, 69 LGG) comprised of cases from hospitals not included in the training data; see Fig. 5 for box plots. nn-Unet showed the smallest absolute decrease in DSC and increase in HD from the internal to the external test set for GBM WT (DSC internal: 0.97, external: 0.95, p < 0.001*; HD internal: 7.34, external: 9.11, p = 0.958). All models’ DSCs were slightly reduced on WT and TC for both HGG and LGG but remained within the clinically-acceptable range18,22,44. However, the segmentation performance on ET improved in the external dataset for all models.
Experiment 4—Effect of missing MRI sequences on segmentation performance
Box plots showing DSC and HD of GBM and LGG patients with incomplete (44 GBM, 55 LGG) versus complete (114 GBM, 14 LGG) scans in the external test set are presented in Fig. 6, with the corresponding values in Table 5. For GBM, nn-Unet achieved the highest DSCs and lowest HDs for all tumor classes on both incomplete and complete scans, with the exception of nvNet reaching a slightly lower HD on TC for incomplete scans. There were no statistically significant differences between the two groups for any model.
Outlier rates
For outliers in DSC, based on the IQR × 1.5 rule42,43, nn-Unet had the lowest outlier rate of all models on the external test set (158 GBM and 69 LGG) at 3.65% of segmentations, versus 8.75% for DeepMedic and 10.28% for nvNet. Outlier rates across the tumor classes were consistently low for nn-Unet (4.78% WT, 2.53% TC, and 3.48% ET); DeepMedic recorded 7.88% WT, 7.24% TC, and 8.45% ET; and nvNet 11.23% WT, 9.13% TC, and 10.21% ET. A similar pattern was observed for outliers in HD: 2.65% for nn-Unet, 36.21% for DeepMedic, and 6.78% for nvNet overall. Across the tumor classes, nn-Unet recorded 3.58% WT, 2.56% TC, and 1.37% ET; DeepMedic 36.22% WT, 39.98% TC, and 21.25% ET; and nvNet 8.98% WT, 8.02% TC, and 5.58% ET.
Discussion
In this study, we compared the performance of three of the top-performing BraTS challenge deep learning models for automated brain tumor segmentation in an external multi-centre hospital dataset (https://www.pictureproject.nl). We extended the valuable work of the BraTS challenge by increasing the number of training cases and using a less strictly curated, and therefore more clinically-relevant, dataset27,28,29,30,31. Subsequently, we tested the generalisability of the three models on an external test set comprised of data from hospitals not used in model training. Akin to the realities of clinical assessment, we further show the utility of these models when segmenting incomplete MRI datasets, which arise from differing acquisition protocols or patient-specific circumstances; sparsified training was applied to account for missing pulse sequences23. Our results demonstrate that nn-Unet, when supplemented with sparsified training, produces high DSC and low HD for glioma segmentations in real-world hospital data.
Clinical implications
Manual segmentations are the current gold standard in clinical practice, where an inter-rater variability of 0.74–0.85 DSC has been previously reported in the BraTS challenge18,22,44. All models’ median DSCs for both test sets were within this “clinically acceptable” inter-rater agreement range. However, manual segmentation is not a time-efficient process. Semi-automatic multi-class glioma segmentation using BrainVoyager™ QX, ITK-Snap, and 3D Slicer is reported to take an average of 18–41 min per patient45. On the whole, automated inference times in the current study were considerably lower than these reported semi-automated segmentation times; see Appendix 5 for all results. nn-Unet takes approximately 37 min of computer time to produce a segmentation using a CPU, or only 4.5 min when a GPU is available, versus 18–41 min of human rater time.
The majority of median DSCs were within this clinically-acceptable range of 0.74–0.8518,22,44 when testing on an external test set with missing pulse sequences, although DSC decreased for all models on the WT and TC, but not on the ET. The TC yielded the most accurate segmentations for both DSC and HD across models. Since the TC is the main body of the tumor and the most likely area of resection, our findings suggest that nn-Unet with sparsified training may be an optimal combination for pre-surgical applications, with acceptable results in 97.47% of patients (based on nn-Unet’s TC outlier rate of 2.53%).
nn-Unet yielded the fewest outliers in all categories across all models. Furthermore, it showed the smallest reductions in segmentation performance on the external test set. There were also no statistically significant changes in segmentation quality when comparing complete versus incomplete imaging datasets. In line with other recent work, this suggests that not all MRI sequences are necessary when models are augmented using sparsified training or similar methods23,46,47. However, the lower WT DSCs indicate a heavier reliance on a full set of MRI sequences for WT segmentation, which is plausible given the hyperintensity of oedema on FLAIR and T2w. Previous studies have also used generative adversarial networks (GAN) to synthesise missing sequences with very promising results48; direct comparison of that approach with the sparsified training used in the current study is therefore encouraged.
Interestingly, the original nn-Unet model used in this study came third and second in the 2017 and 2018 BraTS challenges, respectively. NvNet won the 2018 BraTS challenge but generated the lowest DSCs in the current study. NvNet’s underwhelming results on incomplete datasets (Table 5) could be due to reduced effectiveness of the auto-encoder regularization in combination with sparsified training. DeepMedic won the 2017 BraTS challenge but generated the weakest HDs in the current study, especially when predicting the LGG scans. The discrepancy in these findings demonstrates the value and relevance of testing models on unseen hospital-quality data with missing sequences, as we have in the current study.
Limitations
We performed segmentations in line with the definitions of the BraTS challenge in order to facilitate comparison, see Fig. 1. However, these definitions may not be those used in the clinical setting. In the BraTS challenge, the WT includes oedema and associated infiltrations, but in reality neurosurgeons and neuroradiologists would more often classify the edge of the “tumor core” as the clinical definition of the “whole tumor”, i.e. the enhancing and non-enhancing parts of the core and its associated necrosis, excluding oedema. While this definition might be a better representation of the truth, current MRI techniques make it very difficult to distinguish between oedema and non-enhancing infiltrative tumor, and further research is needed to do so accurately. Depending on the intended use case for automated glioma segmentations, a less subjective, more consistent measurement may better represent true tumor infiltration while avoiding the associated increase in (inter-rater) variability. The WHO glioma classification was updated in 2021: WHO CNS5 further advances the role of molecular diagnostics in the classification of CNS tumors, but remains rooted in the established methods of histology and immunohistochemistry for tumor characterisation49. The classification of GBM and LGG is very relevant to a model trained on combined LGG and GBM data, especially when it is applied to all gliomas.
Furthermore, we did not target hyperparameter optimisation for the sparsified training, nor did we make specific architecture optimisations for training and testing these models on a much larger dataset. Peak performance may be improved by doing so, but we chose not to tweak hyperparameters in order to promote generalisability.
Future work
In our study, we have only used pre-operative scans, while post-operative and longitudinal scans are also clinically relevant for radiotherapy planning, quantitative follow-up, and automatic growth detection; however, pre-operative baseline measurements are required for these assessments. Future work should follow the latest aims of the BraTS challenge and include disease progression monitoring and overall survival prediction. Furthermore, the current approach relied on having at least the T1c scan available. While this is a safe assumption for most retrospective cohorts, this may be different for future cohorts due to ongoing efforts to reduce gadolinium use50. To support such protocols, new models would have to be trained; however, we have shown that sparsified training provides a simple way to train models that are flexible with respect to the available sequences.
The tested networks all use very different implementations, making it difficult to pinpoint which differences between the models best explain the observed performance differences. To gain a better understanding of which properties most affect performance, future development should focus on consolidating different models within a single framework and applying and testing changes gradually.
We have shown that sparsified training offers a simple solution to missing sequences that is easy to implement for different network architectures and frameworks. While dealing with missing sequences is important, and allows for the inclusion of larger (retrospective) cohorts, improving the availability of all sequences for future patients would tackle the problem at the root.
Conclusions
In this study, we have shown the feasibility of using sparsified training alongside three top-performing BraTS challenge models to produce high-quality glioma segmentations of real-world hospital data with missing sequences. When segmenting scans with incomplete MRI sequences there was no statistically significant decrease in performance. While performance was slightly reduced in an external test set, the segmentations remained within clinically acceptable ranges. nn-Unet was the most consistent performer with highest DSCs, lowest HDs, tightest IQRs, and smallest outlier rates across the vast majority of experiments.
Data availability
The BraTS data used in this study is available through http://braintumorsegmentation.org. The PICTURE data is available from the corresponding author, upon reasonable request.
Code availability
See Table 2 for links to all software used and https://gitlab.com/picture-production/picture-qni-robust-glioma-segmentation/ for the code used in this study. The nn-Unet model has also been integrated into the picture-nnunet Python package (https://gitlab.com/picture-production/picture-nnunet-package).
Abbreviations
- BraTS: Multimodal brain tumor segmentation
- DM: DeepMedic
- DSC: Dice similarity coefficient
- ET: Enhancing tumor
- FLAIR: Fluid-attenuated inversion recovery
- GBM: Glioblastoma
- HD: Hausdorff distance
- IQR: Interquartile range
- LGG: Low grade glioma
- TC: Tumor core
- VASARI: Visually AccesSAble Rembrandt Images (VASARI Research Project, https://wiki.cancerimagingarchive.net/display/Public/VASARI+Research+Project)
- WT: Whole tumor
References
Brindle, K. M., Izquierdo-García, J. L., Lewis, D. Y., Mair, R. J. & Wright, A. J. Brain tumor imaging. J. Clin. Oncol. 35, 2432–2438 (2017).
Verduin, M. et al. Noninvasive glioblastoma testing: Multimodal approach to monitoring and predicting treatment response. Dis. Markers 2018, 2908609 (2018).
Wen, P. Y. et al. Updated response assessment criteria for high-grade gliomas: Response assessment in neuro-oncology working group. J. Clin. Oncol. 28, 1963–1972 (2010).
Ellingson, B. M. et al. Consensus recommendations for a standardized Brain Tumor Imaging Protocol in clinical trials. Neuro-Oncology 17, 1188–1198. https://doi.org/10.1093/neuonc/nov095 (2015).
Gillies, R. J., Kinahan, P. E. & Hricak, H. Radiomics: Images are more than pictures, they are data. Radiology 278, 563–577 (2016).
Chang, K. et al. Automatic assessment of glioma burden: A deep learning algorithm for fully automated volumetric and bidimensional measurement. Neuro. Oncol. 21, 1412–1422 (2019).
Ellingson, B. M., Wen, P. Y. & Cloughesy, T. F. Modified criteria for radiographic response assessment in glioblastoma clinical trials. Neurotherapeutics 14, 307–320 (2017).
Bakas, S. et al. Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 4, 1–13 (2017).
Deeley, M. A. et al. Comparison of manual and automatic segmentation methods for brain structures in the presence of space-occupying lesions: A multi-expert study. Phys. Med. Biol. 56, 4557–4577 (2011).
Vos, M. J. et al. Interobserver variability in the radiological assessment of response to chemotherapy in glioma. Neurology 60, 826–830 (2003).
Lanese, A., Franceschi, E. & Brandes, A. A. The risk assessment in low-grade gliomas: An analysis of the european organization for research and treatment of cancer (EORTC) and the radiation therapy oncology group (RTOG) criteria. Oncol. Ther. 6, 105–108 (2018).
Bennett, E. E. et al. The prognostic role of tumor volume in the outcome of patients with single brain metastasis after stereotactic radiosurgery. World Neurosurg. 104, 229–238 (2017).
Zhou, H. et al. MRI features predict survival and molecular markers in diffuse lower-grade gliomas. Neuro. Oncol. 19, 862–870 (2017).
Rios Velazquez, E. et al. Fully automatic GBM segmentation in the TCGA-GBM dataset: Prognosis and correlation with VASARI features. Sci. Rep. 5, 16822 (2015).
Goodkin, O. et al. The quantitative neuroradiology initiative framework: Application to dementia. Br. J. Radiol. 92, 20190365 (2019).
Grossmann, P. et al. Quantitative imaging biomarkers for risk stratification of patients with recurrent glioblastoma treated with bevacizumab. Neuro. Oncol. 19, 1688–1697 (2017).
Smits, M. & Van Den Bent, M. J. Imaging correlates of adult glioma genotypes. Radiology 284, 316–331 (2017).
Bakas, S. et al. Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. Preprint at https://arxiv.org/abs/1811.02629 (2018).
Kickingereder, P. et al. Automated quantitative tumour response assessment of MRI in neuro-oncology with artificial neural networks: a multicentre, retrospective study. Lancet Oncol. 20, 728–740 (2019).
Shaver, M. M. et al. Optimizing neuro-oncology imaging: A review of deep learning approaches for glioma imaging. Cancers (Basel) 11, 829 (2019).
Wang, G., Li, W., Ourselin, S. & Vercauteren, T. Automatic brain tumor segmentation using cascaded anisotropic convolutional neural networks. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) Vol. 10670 LNCS 178–190 (Springer Verlag, 2018).
Menze, B. H. et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 34, 1993–2024 (2015).
Eijgelaar, R. S. et al. Robust deep learning–based segmentation of glioblastoma on routine clinical MRI scans using sparsified training. Radiol. Artif. Intell. 2, e190103 (2020).
Isensee, F. et al. nnU-Net: Self-adapting framework for U-net-based medical image segmentation. Informatik aktuell https://doi.org/10.1007/978-3-658-25326-4_7 (2019).
Kamnitsas, K. et al. DeepMedic for brain tumor segmentation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) Vol. 10154 LNCS 138–149 (Springer Verlag, 2016).
Myronenko, A. 3D MRI brain tumor segmentation using autoencoder regularization. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics) Vol. 11384 LNCS, 311–320 (2019).
Eijgelaar, R. S. et al. Earliest radiological progression in glioblastoma by multidisciplinary consensus review. J. Neurooncol. 139, 591–598 (2018).
Eijgelaar, R. et al. Voxelwise statistical methods to localize practice variation in brain tumor surgery. PLoS One 14, 1–12 (2019).
Visser, M. et al. Inter-rater agreement in glioma segmentations on longitudinal MRI. NeuroImage Clin. 22, 101727 (2019).
Müller, D. M. J. et al. Comparing Glioblastoma Surgery Decisions Between Teams Using Brain Maps of Tumor Locations, Biopsies, and Resections. JCO Clin. Cancer Inform. 2, 1–12. https://doi.org/10.1200/cci.18.00089 (2019).
Müller, D. M. J. et al. Quantifying eloquent locations for glioblastoma surgery using resection probability maps. J. Neurosurg. JNS 134, 1091–1101 (2020).
Mongan, J., Moy, L. & Kahn, C. E. Checklist for artificial intelligence in medical imaging (CLAIM): A guide for authors and reviewers. Radiol. Artif. Intell. 2, e200029 (2020).
Rohlfing, T., Zahr, N. M., Sullivan, E. V. & Pfefferbaum, A. The SRI24 multichannel atlas of normal adult human brain structure. Hum. Brain Mapp. 31, 798–819 (2010).
Avants, B. B., Tustison, N. & Song, G. Advanced Normalization Tools (ANTs): V1.0. Insight Journal (ISSN 2327-770X) (2009).
Tustison, N. J. et al. N4ITK: Improved N3 bias correction. IEEE Trans. Med. Imaging 29, 1310–1320 (2010).
Isensee, F. et al. Automated brain extraction of multisequence MRI using artificial neural networks. Hum. Brain Mapp. 40, 4952–4964 (2019).
Baid, U. et al. The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. Preprint at https://doi.org/10.48550/arXiv.2107.02314 (2021).
Vinh, N. X., Epps, J. & Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–2854 (2010).
Igual, L. et al. Supervised brain segmentation and classification in diagnostic of attention-deficit/hyperactivity disorder. In Proceedings of the 2012 International Conference on High Performance Computing and Simulation, HPCS 2012 182–187 (2012). doi:https://doi.org/10.1109/HPCSim.2012.6266909.
Taha, A. A. & Hanbury, A. Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool. BMC Med. Imaging 15, 1–28 (2015).
Armstrong, R. A. When to use the Bonferroni correction. Ophthalmic Physiol. Opt. 34, 502–508 (2014).
Hubert, M. & Van Der Veeken, S. Outlier detection for skewed data. J. Chemom. 22, 235–246 (2008).
Rousseeuw, P. J. & Hubert, M. Robust statistics for outlier detection. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 1, 73–79 (2011).
Perkuhn, M. et al. Clinical evaluation of a multiparametric deep learning model for glioblastoma segmentation using heterogeneous magnetic resonance imaging data from clinical routine. Invest. Radiol. 53, 1 (2018).
Fyllingen, E. H., Stensjøen, A. L., Berntsen, E. M., Solheim, O. & Reinertsen, I. Glioblastoma segmentation: Comparison of three different software packages. PLoS One 11, e0164891 (2016).
Di Ieva, A. et al. Application of deep learning for automatic segmentation of brain tumors on magnetic resonance imaging: A heuristic approach in the clinical scenario. Neuroradiology 63, 1253–1262 (2021).
Shen, Y. & Gao, M. Brain tumor segmentation on MRI with missing modalities. In Lecture Notes in Computer Science vol. 11492, 417–428 (Springer, 2019).
Conte, G. M. et al. Generative adversarial networks to synthesize missing T1 and FLAIR MRI sequences for use in a multisequence brain tumor segmentation model. Radiology 299, 313–323 (2021).
Louis, D. N. et al. The 2021 WHO classification of tumors of the central nervous system: A summary. Neuro. Oncol. 23, 1231–1251 (2021).
Falk Delgado, A. et al. Diagnostic value of alternative techniques to gadolinium-based contrast agents in MR neuroimaging—a comprehensive overview. Insights Imaging 10, 1–15 (2019).
Wu, Y. & He, K. Group normalization. Int. J. Comput. Vis. 128, 742–755 (2020).
Kamnitsas, K. et al. Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med. Image Anal. 36, 61–78 (2017).
Acknowledgements
The authors would like to thank all patients whose data were used in this study. HP is a full-time employee of Deloitte. FB, SB, FP and JW are supported by the National Institute for Health Research (NIHR) biomedical research centre at UCLH. FP received a Guarantors of Brain fellowship 2017–2020 and is also supported by the Biomedical Research Centre initiative at University College London Hospitals (UCLH). The PICTURE project is sponsored by an unrestricted grant of Stichting Hanarth fonds, “Machine learning for better neurosurgical decisions in patients with glioblastoma”; a grant for public-private partnerships (Amsterdam UMC PPP-grant) sponsored by the Dutch government (Ministry of Economic Affairs) through the Rijksdienst voor Ondernemend Nederland (RVO) and Topsector Life Sciences and Health (LSH), “Picturing predictions for patients with brain tumors”; a grant from the Innovative Medical Devices Initiative program, project number 10-10400-96-14003; The Netherlands Organisation for Scientific Research (NWO), 2020.027; a grant from the Dutch Cancer Society, VU2014-7113 and the Anita Veldman foundation, CCA2018-2-17.
Author information
Authors and Affiliations
Contributions
Conceptualization: H.G.P., J.W., Y.H., F.P., P.d.W.H., F.B., R.S.E.; Data curation: H.G.P., J.W., I.K., D.M.J.M., O.G., S.B.V., S.B., P.A.R., H.A., L.B., M.R., T.S., M.C.N., M.S.B., S.L.H.J., W.B., W.A.V.d.B., J.F., S.J.H., A.J.S.I., B.K., G.W., A.K., M.W., A.H.Z., S.M.K., E.M., P.d.W.H., F.B., R.S.E.; Methodology, investigation, formal analysis, and validation: H.G.P., J.W., R.S.E.; Funding acquisition and project administration: Y.H., F.P., P.d.W.H., F.B., R.S.E.; Software: J.W., R.S.E., Y.H.; Writing—original draft: H.G.P., J.W., Y.H., F.P., P.d.W.H., F.B., R.S.E.; Writing—review & editing: all authors.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Pemberton, H.G., Wu, J., Kommers, I. et al. Multi-class glioma segmentation on real-world data with missing MRI sequences: comparison of three deep learning algorithms. Sci Rep 13, 18911 (2023). https://doi.org/10.1038/s41598-023-44794-0