Introduction

Clinically accurate segmentation and longitudinal volumetric analysis of glioma are helpful in treatment planning and response monitoring1,2. Volumetric analyses are not commonly used in clinical practice, and measurements in clinical trials are generally limited to crude 2D assessments. While this is the current standard for treatment response evaluation in trials3, poor prognosis and heterogeneous treatment response encourage quantitative analysis of tumors, especially for glioma due to their varied morphometry and infiltrative nature4,5,6,7. It is these two characteristics of glioma, along with heterogeneous contrast enhancement, that complicate their manual delineation and further highlight the need for automated segmentation protocols in the clinical setting8,9,10. Indeed, baseline imaging and volumetric measurements are of particular importance to neurosurgeons and radiotherapists because tumor volume and functional anatomy are key factors for both risk and prognostic assessment of patients11,12.

The VASARI features have illustrated the importance of extracting such quantitative measures, but automation of segmentation and subsequent feature extraction is needed to enable widespread application13,14. Automated quantification could provide improvements in reporting time, treatment response monitoring, and overall efficiency across a neuroradiological service, but is dependent upon technical and clinical validation of the methods15,16,17. Deep learning has emerged as the preferred method for automated tumor segmentation6,18,19,20,21. Ideally, the clinical environment requires a fast algorithm that is robust to scanner variation and missing MRI sequences.

Since 2012, the annual Brain Tumor Segmentation (BraTS) Challenge has compared the performance of numerous AI-driven glioma segmentation algorithms18,22. However, these algorithms are trained and assessed on a highly curated dataset optimised for quality: each subject has a complete dataset of high-quality pre- and post-contrast T1-weighted (T1w and T1c, respectively), T2-weighted (T2w), and T2-weighted fluid-attenuated inversion recovery (FLAIR) images, which does not accurately reflect the realities of clinically-acquired MRI data. For example, a recent study using a model (DeepMedic) trained exclusively on BraTS data achieved a median Dice similarity coefficient (DSC) of 0.81 on BraTS test data but only 0.49 on external clinical data23.

The aim of the current study was to determine the performance and generalisability of three of the highest-performing models at recent BraTS challenges24,25,26 on real-world clinical data. Models were trained with both BraTS data and another multi-centre dataset obtained from 12 different hospitals worldwide: the PICTURE project (www.pictureproject.nl)27,28,29,30,31. An external test set comprising PICTURE data from hospitals not used in the training and validation phases was employed to assess the clinical applicability and determine the need for retraining models on a hospital’s own data. Furthermore, we used sparsified training to account for missing sequences23 and assessed performance in patients with incomplete MRI datasets.

Materials and methods

All patients provided informed consent and data were obtained and anonymized according to the General Data Protection Regulation and Health Insurance Portability and Accountability Act. Local Institutional Review Board approval was obtained for all primary studies. For the VU medical center Amsterdam, the institutional review board approved the experiments in this study under case number 2014.336. Of the patients involved in the current study, 40 were previously studied in an inter-rater agreement study by Visser et al.29. The 275 glioblastoma (GBM) patients from the PICTURE dataset were previously used in a study by Eijgelaar et al. focused on robust tumor core segmentation in glioblastoma patients23. The study was carried out in concordance with the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) guidelines32.

BraTS and PICTURE datasets

We used manual segmentations of preoperative imaging of 1251 gliomas (an unspecified mix of GBM and low-grade glioma (LGG)) from the BraTS 2021 dataset and 275 GBM and 205 LGG (median age, 63.7 IQR [54.3–72.0] years; median survival, 323 [142–609] days; surgery extent: 348 resections, 83 biopsies, 49 unknown) from the PICTURE project. The PICTURE dataset was collected across 12 hospitals worldwide; all patients aged at least 18 years with a newly diagnosed LGG or GBM at first-time surgery between 1/1/2012 and 12/31/2013 were included. Since the PICTURE data were collected in 2012 and 2013, the classification of GBM and LGG was in line with WHO 2007 criteria. Demographics for the PICTURE dataset are documented in Appendix 1 of the supplementary material.

Missing scans

Both datasets contain pre-operative T1w, T1c, T2w, and FLAIR images. However, in the PICTURE dataset some patients had missing sequences; see Table 1 for a breakdown and Fig. 1 for examples of subjects with a missing FLAIR or T2w. Only patients with at least T1c and either T2w or FLAIR were included, so that all tissue classes could be manually segmented. Out of 1731 total cases, there were 204 missing pre-contrast T1w, 186 missing T2w, and 19 missing FLAIR; see “Model training and testing” for details of the sparsified training used to account for missing sequences.
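As a concrete illustration of this inclusion rule, the check below (a hypothetical helper, not part of the original pipeline) accepts a case only when T1c and at least one of T2w or FLAIR are available:

```python
def meets_inclusion_criteria(available_sequences: set) -> bool:
    """Include a case only if T1c and at least one of T2w/FLAIR are present,
    so that all tissue classes can be manually segmented."""
    return "T1c" in available_sequences and bool({"T2w", "FLAIR"} & available_sequences)

# Example: a case with T1c and FLAIR but no T1w or T2w is still included.
print(meets_inclusion_criteria({"T1c", "FLAIR"}))  # True
print(meets_inclusion_criteria({"T1w", "T2w"}))    # False (no T1c)
```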

Table 1 Breakdown of data used in this study from the BraTS and PICTURE datasets (https://www.pictureproject.nl), and missing data totals from each hospital, as well as a breakdown of the train, validation, test, and external test sets.
Figure 1
figure 1

Sample images of the PICTURE dataset. Ground truth manual segmentation for a GBM patient with a missing FLAIR scan (top row) and one with a missing T2w (bottom row); see “Model training and testing” for details of the sparsified training used to account for missing sequences. Whole Tumor (WT) in green. The WT defines the full extent of the tumor, including the tumor core and oedema, indicated by hyperintensity on FLAIR and T2w. Tumor Core (TC) in red. The TC is the main body of the tumor and the most likely area of resection. The TC includes the enhancing tumor (ET) and necrosis. The ET is shown in yellow.

Pre-processing

T1w, T2w, and FLAIR images were rigidly registered to the T1c image. Subsequently, the T1c was registered to the SRI24 atlas (https://www.nitrc.org/projects/sri24/)33 using an affine transformation. The same transform was applied to the other MR sequences (T2w, FLAIR, T1w). All modalities were resampled to 1 mm isotropic voxels in the SRI24 atlas space; the rigid and affine registrations were applied using a single interpolation step. All registrations and resampling were conducted using the Advanced Normalization Tools (ANTs)34. N4 bias field correction35 was applied and skull stripping was performed with the HD-BET algorithm (https://github.com/MIC-DKFZ/HD-BET)36.
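A minimal sketch of this pre-processing chain is given below using the ANTsPy interface to ANTs; the file names, the placement of the N4 correction, and the parameter choices are illustrative assumptions rather than the original pipeline, and skull stripping is shown only as the HD-BET command-line call.

```python
import ants  # ANTsPy: Python interface to the Advanced Normalization Tools

# Hypothetical file names for one patient; the original pipeline's exact I/O layout is not specified.
t1c = ants.image_read("patient01_T1c.nii.gz")
t1w = ants.image_read("patient01_T1w.nii.gz")   # T2w and FLAIR are handled identically
atlas = ants.image_read("sri24_T1.nii.gz")      # SRI24 atlas template (1 mm isotropic)

# Rigid registration of T1w (and, analogously, T2w/FLAIR) to the T1c.
rigid = ants.registration(fixed=t1c, moving=t1w, type_of_transform="Rigid")

# Affine registration of the T1c to the SRI24 atlas.
affine = ants.registration(fixed=atlas, moving=t1c, type_of_transform="Affine")

# Compose both transforms so each image is interpolated only once into 1 mm SRI24 space.
t1c_sri = ants.apply_transforms(fixed=atlas, moving=t1c,
                                transformlist=affine["fwdtransforms"])
t1w_sri = ants.apply_transforms(fixed=atlas, moving=t1w,
                                transformlist=affine["fwdtransforms"] + rigid["fwdtransforms"],
                                interpolator="linear")

# N4 bias field correction (shown here after registration; the original order is an assumption).
t1c_sri = ants.n4_bias_field_correction(t1c_sri)
t1w_sri = ants.n4_bias_field_correction(t1w_sri)
ants.image_write(t1c_sri, "patient01_T1c_sri24.nii.gz")

# Skull stripping with HD-BET is then run on the command line, e.g.:
#   hd-bet -i patient01_T1c_sri24.nii.gz -o patient01_T1c_brain.nii.gz
```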

Manual segmentations

For the PICTURE data, 275 GBM and 205 LGG cases were manually segmented into 3 classes consistent with the BraTS challenges – whole tumor (WT), tumor core (TC), and enhancing tumor (ET), see Fig. 1. The WT defines the full extent of the tumor, including the tumor core and oedema, indicated by hyperintensity on FLAIR and T2w. The TC is the main body of the tumor and most likely area of resection. The TC includes the enhancing tumor (ET) and necrosis.

Manual segmentations were carried out according to the VASARI Research Project (https://wiki.cancerimagingarchive.net/display/Public/VASARI+Research+Project). One rater (HP) with 9 years of brain MRI manual segmentation experience performed segmentations under the supervision and approval of an expert neuroradiologist (FB), using the semiautomatic SmartBrush tool (BrainLab, Feldkirchen, Germany). The rater’s performance was in line with experts29. All segmentations were exported in the space of the T1c image and then resampled to SRI24 atlas space using the same transform as the T1c-to-SRI24 registration.

Quality control

Visual quality control checks were carried out for incomplete coverage, skull-stripping failures, registration errors, and incomplete segmentations. Overview images were generated to facilitate quality control. These images show the same axial, sagittal, and coronal view for all patients to assess the registration quality, as well as an axial view of the center of the tumor to verify the segmentation. Seven scans were excluded due to poor image quality and five due to severe registration errors (as illustrated in Appendix 2).

Deep learning segmentation models

Three algorithms were selected for this study based on high performance in recent BraTS challenges18,22,37, the availability of a user-friendly and reproducible implementation online, and the distinctiveness of each algorithm's approach, see Table 2.

Table 2 Summary of the deep learning algorithms tested in this study.

Model training and testing

Models were trained with three-class segmentations (WT, TC, ET) for each tumor. The scans were randomly split into 80% training, 5% validation, and 15% internal test data, see Table 1. The test data were used to assess the performance of each model. Alongside the 15% internal test data, models were further assessed using an external test set of 158 GBM and 69 LGG patients from PICTURE hospitals not included in the training data. This helped to gauge the generalisability of the models and determine the future need for retraining algorithms on a new hospital’s unseen data.

In order to address missing sequences in the training data (Table 1), sparsified training was applied for all algorithms23. Eijgelaar et al.23 showed that performance drops substantially if not all sequences are available, and that this can be mitigated by inserting empty (zero-filled) scans in place of missing sequences, see the first column of Fig. 1. During training, the T1w, T2w, and FLAIR were additionally set to zero with independent probabilities of 20%, in line with the estimated frequency of missing sequences in the clinical setting23. We used the validation data to confirm convergence; the hyperparameters of all models were kept at the default values, as reported in the associated papers, or as used in the published code repositories (Table 2). All model training and testing was carried out using a machine equipped with an AMD Ryzen 9 3900X 12-core processor, 64 GB RAM, and one NVIDIA RTX 3090 (24 GB) graphics processing unit (GPU).
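A minimal sketch of this sparsified-training augmentation is shown below, assuming a channel-first volume with the four sequences stacked in a fixed order; the function name, channel ordering, and integration into each model's data loader are illustrative assumptions.

```python
import numpy as np

SEQUENCES = ("T1c", "T1w", "T2w", "FLAIR")
# T1c is always kept; T1w, T2w, and FLAIR are each dropped with an independent 20% probability.
DROP_PROB = {"T1c": 0.0, "T1w": 0.2, "T2w": 0.2, "FLAIR": 0.2}

def sparsify(volume: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """volume: array of shape (4, X, Y, Z) with channels ordered as in SEQUENCES.
    Sequences that are already missing are assumed to be zero-filled beforehand."""
    out = volume.copy()
    for idx, name in enumerate(SEQUENCES):
        if rng.random() < DROP_PROB[name]:
            out[idx] = 0.0  # replace the present sequence with an empty scan
    return out

# Example usage inside a training loop (dummy data for illustration):
rng = np.random.default_rng(seed=42)
case = np.random.rand(4, 240, 240, 155).astype(np.float32)
augmented = sparsify(case, rng)
```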

Model performance assessment

In line with the BraTS challenges, tumor segmentations for each algorithm were assessed using the median and inter-quartile range (IQR) of the Dice similarity coefficient (DSC)38 and Hausdorff distance (HD)39 for all experiments. Results were generated using the methods described by Taha and Hanbury40 and the associated software.
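For illustration only, the snippet below computes the two metrics for a single tumor class from binary ground-truth and predicted masks; the study itself used the evaluation software of Taha and Hanbury, and this simplified stand-in computes the symmetric Hausdorff distance over all foreground voxel coordinates rather than over extracted surfaces.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(gt: np.ndarray, pred: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    intersection = np.logical_and(gt, pred).sum()
    return 2.0 * intersection / (gt.sum() + pred.sum())

def hausdorff(gt: np.ndarray, pred: np.ndarray) -> float:
    """Symmetric Hausdorff distance (in voxels) between two binary masks."""
    gt_pts = np.argwhere(gt)      # coordinates of ground-truth foreground voxels
    pred_pts = np.argwhere(pred)  # coordinates of predicted foreground voxels
    return max(directed_hausdorff(gt_pts, pred_pts)[0],
               directed_hausdorff(pred_pts, gt_pts)[0])
```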

Experiments and statistical analyses

Four separate experiments were performed. In experiment 1, DSC and HD from the internal and external test sets were analyzed separately for each model/tumor class using a paired two-tailed t-test to assess differences between models. DSC and HD were also compared in the following experiments using independent samples (Welch’s) t-tests: experiment 2—GBMs vs LGGs on the internal and external test set to assess differences in performance on the differing tumor grades; experiment 3—GBMs and LGGs from the internal vs external test sets to assess the change in performance when segmenting external hospital data not previously seen by the models; and experiment 4—GBM and LGG patients with incomplete vs complete imaging datasets in the external test set to assess the change in performance when segmenting patients with incomplete imaging datasets from external hospital data not yet seen by the models. A single Bonferroni correction was applied for each experiment41. Outliers in box plots and overall outlier rates for each model and segmentation class were calculated using the IQR × 1.5 rule, i.e. outside [Q1 − 1.5 × IQR; Q3 + 1.5 × IQR]42,43.
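The sketch below illustrates these comparisons with SciPy; the per-case metric arrays, the number of comparisons entering the Bonferroni correction, and the significance level are placeholder assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder per-case DSCs; in practice these come from the evaluation described above.
dsc_model_a = rng.random(100)
dsc_model_b = rng.random(100)
dsc_gbm = rng.random(173)
dsc_lgg = rng.random(92)

# Experiment 1: paired two-tailed t-test between two models on the same cases.
t_paired, p_paired = stats.ttest_rel(dsc_model_a, dsc_model_b)

# Experiments 2-4: Welch's (unequal-variance) independent-samples t-test between groups,
# e.g. GBM vs LGG, internal vs external test set, complete vs incomplete datasets.
t_welch, p_welch = stats.ttest_ind(dsc_gbm, dsc_lgg, equal_var=False)

# Single Bonferroni correction per experiment (the number of comparisons is an assumption).
n_comparisons = 18
significant = p_welch < 0.05 / n_comparisons

# IQR x 1.5 outlier rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
def outlier_rate(values: np.ndarray) -> float:
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return float(mask.mean())
```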

Results

PICTURE dataset segmentations

See Fig. 2 for a ground truth manual segmentation and automated segmentation examples from all three models. See Appendices 3 and 4 for GBM and LGG segmentation contours on a 4 T T1c scan along with two human experts’ manual segmentations, for all tumor classes.

Figure 2
figure 2

GBM patient from the PICTURE dataset with missing FLAIR scan. Whole Tumor (WT—green) is the full extent of the tumor, including the tumor core, non-enhancing tumor and oedema, indicated by hyperintensity on FLAIR and T2w. Tumor Core (TC—red) is the main body of the tumor and most likely area of resection. The TC includes the enhancing tumor (ET—yellow) and necrosis. DSCs in this case for nn-Unet were WT = 0.93, TC = 0.94, ET = 0.83; nvNet WT = 0.89, TC = 0.92, ET = 0.80; and DeepMedic WT = 0.81, TC = 0.85, ET = 0.81.

Experiment 1—Segmentation performance on both test sets—which model achieved the best metrics?

Box plots showing median DSC and HD for all models and tumor classes on the internal and external test sets are presented in Fig. 3. nn-Unet achieved significantly higher DSCs than nvNet and lower HDs than both nvNet and DeepMedic for all tumor classes on both the internal and external test sets (p values < 0.0027). The raw metrics are displayed in Table 3.

Figure 3
figure 3

Box plots showing DSC and HD in the internal and external test sets for all models and tumor classes. Left plots show the internal test set performance (n = 226) and the right plots show the performance in the external test set (n = 277).

Table 3 Median DSC and HD for all models and tumor classes on the internal test set GBMs (n = 15) and the external test set GBMs (n = 158).

Experiment 2—Segmentation performance on GBM vs LGG

Comparing performance between GBM and LGG, nn-Unet continued to provide the best quality results for both tumor grades; statistical comparisons are reported in Table 4. However, overall segmentation performance on LGGs was notably weaker than on GBMs across all models, see Fig. 4 for box plots. DeepMedic showed the largest decrease in performance across all tumor classes.

Table 4 DSC and HD for all models and tumor classes on all GBMs (n = 173) and all LGGs (n = 92) in the combined internal and external test sets.
Figure 4
figure 4

Box plots showing DSC and HD in LGG and GBM patients for all models and tumor classes. The upper row shows DSC and the lower row shows HD for all models and tumor classes on GBM (n = 15 internal test + 158 external test, on the right) vs LGG (n = 23 internal test + 69 external test, on the left) patients.

Experiment 3—GBM segmentation performance on an external test set—do models need to be retrained for new hospital data?

As shown in experiment 1, nn-Unet produced the most favourable results when compared to the other models on both test sets. Table 3 shows the DSC and HD results for the internal test set (15 GBM, 23 LGG) and the external test set (158 GBM, 69 LGG) comprised of cases from hospitals not included in the training data, see Fig. 5 for box plots. nn-Unet showed the smallest absolute decrease in DSC and increase in HD from the internal to the external test set for GBM WT (DSC internal: 0.97, external: 0.95, p < 0.001*; HD internal: 7.34, external: 9.11, p = 0.958). All models’ DSCs were slightly reduced on WT and TC for both GBM and LGG but remained within the clinically acceptable range18,22,44. However, the segmentation performance on ET improved in the external dataset for all models.

Figure 5
figure 5

Box plots showing DSC and HD for all models and tumor classes on the internal and external test sets. (a) shows internal test set GBMs (n = 15 internal test set cases plus 23 BraTS test set GBM cases, upper left plots) and external test set GBMs (n = 158, upper right plots). (b) shows LGG (n = 23 internal test, lower left plots + 69 external test, lower right plots).

Experiment 4—Effect of missing MRI sequences on segmentation performance

Box plots showing DSC and HD of GBM and LGG patients with incomplete (44 GBM, 55 LGG) versus complete (114 GBM, 14 LGG) scans in the external test set are presented in Fig. 6 and Table 5. For GBM, nn-Unet achieved the highest DSCs and lowest HD for all tumor classes on both incomplete and complete scans, with the exception of nvNet reaching a slightly lower HD on TC for incomplete scans. There were no statistically significant differences between the two groups for any model.

Figure 6
figure 6

Box plots showing DSC and HD for patients with missing pulse sequences and patients with complete scans, for all models and tumor classes (GBM in panel a, LGG in panel b) in the external test set: missing pulse sequences in orange (n = 44 GBM + 55 LGG) and complete scans in blue (n = 114 GBM + 14 LGG).

Table 5 Median DSC and HD for all models and tumor classes on patients in the external test set with missing sequences (n = 44 GBM + 55 LGG) and subjects with complete scans (n = 114 GBM + 14 LGG).

Outlier rates

For outliers according to DSC, based on the IQR × 1.5 rule42,43, nn-Unet had the lowest outlier rate of all the models on the external test set (158 GBMs and 69 LGGs) at 3.65% of segmentations, compared with 8.75% for DeepMedic and 10.28% for nvNet. Outlier rates across the tumor classes were similarly low for nn-Unet at 4.78% WT, 2.53% TC and 3.48% ET; DeepMedic recorded outliers at 7.88% WT, 7.24% TC, and 8.45% ET; and nvNet at 11.23% WT, 9.13% TC and 10.21% ET. A similar pattern was observed for outliers according to HD: 2.65% for nn-Unet, 36.21% for DeepMedic, and 6.78% for nvNet. Outlier rates across the tumor classes for nn-Unet were 3.58% WT, 2.56% TC and 1.37% ET. Rates for DeepMedic were 36.22% WT, 39.98% TC and 21.25% ET; and for nvNet were 8.98% WT, 8.02% TC and 5.58% ET.

Discussion

In this study, we compared the performance of three of the top-performing BraTS challenge deep learning models for automated brain tumor segmentation in an external multi-centre hospital dataset (https://www.pictureproject.nl). We extended the valuable work of the BraTS challenge by increasing the number of training cases and using a less strictly curated, and therefore more clinically-relevant, dataset27,28,29,30,31. Subsequently, we tested the generalisability of the three models on an external test set comprised of data from hospitals not used in model training. Akin to the realities of clinical assessment, we further showed the utility of these models when segmenting incomplete MRI datasets, which can arise from acquisition protocols or patient-specific circumstances; sparsified training was applied to account for missing pulse sequences23. Our results demonstrate that nn-Unet, when supplemented with sparsified training, produces high DSC and low HD for glioma segmentations in real-world hospital data.

Clinical implications

Manual segmentations are the current gold standard in clinical practice, where an inter-rater variability of 0.74–0.85 DSC has been previously reported in the BraTS challenge18,22,44. All models’ median DSCs for both test sets were within this “clinically acceptable” inter-rater agreement range. However, manual segmentation is not a time-efficient process. Semi-automatic multi-class glioma segmentation using BrainVoyager™ QX, ITK-Snap and 3D Slicer is reported to take an average of 18–41 min per patient45. On the whole, automated inference times in the current study were considerably lower than these reported semi-automated segmentation times, see Appendix 5 for all results. nn-Unet takes approximately 37 min of computer time to produce a segmentation using a CPU or only 4.5 min when a GPU is available, versus 18–41 min of human rater time.

The majority of median DSCs were within this clinically acceptable range of 0.74–0.8518,22,44 when testing on an external test set with missing pulse sequences, although there was a decrease in DSC for all models on the WT and TC, but not for the ET. The TC yielded the most accurate segmentations for both DSC and HD across models. Since the TC is the main body of the tumor and the most likely area of resection, our findings suggest that using nn-Unet with sparsified training may be an optimal combination for pre-surgical applications, with acceptable results in 97.47% of patients, based on the outlier rate of 2.53% for nn-Unet.

nn-Unet yielded the fewest outliers in all categories across all models. Furthermore, it showed the smallest reductions in segmentation performance on the external test set. There were also no statistically significant changes in segmentation quality when comparing complete versus incomplete imaging datasets. In line with other recent work, this suggests that not all MRI sequences are necessary when models are augmented using sparsified training or similar methods23,46,47. However, the lower WT DSCs indicate a heavier reliance on a full set of MRI sequences for WT segmentation, which is plausible given the hyperintensity of oedema on FLAIR and T2w. Previous studies have also used generative adversarial networks (GANs) to synthesise missing sequences with very promising results48; therefore, direct comparison of that approach with the sparsified training used in the current study is encouraged.

Interestingly, the original nn-Unet model used in this study came third and second in the 2017 and 2018 BraTS challenges, respectively. NvNet won the 2018 BraTS challenge but generated the lowest DSCs in the current study. NvNet’s underwhelming results on incomplete datasets (Table 5) could be due to reduced effectiveness of the auto-encoder regularization in combination with sparsified training. DeepMedic won the 2017 BraTS challenge but generated the weakest HDs in the current study, especially on LGG scans. The discrepancy in these findings demonstrates the value and relevance of testing models on unseen hospital-quality data with missing sequences, as we have in the current study.

Limitations

We performed segmentations in line with the definitions of the BraTS challenge in order to facilitate comparison, see Fig. 1. However, these definitions may not be those used in the clinical setting. In the BraTS challenge, the WT includes oedema and associated infiltrations, but in reality neurosurgeons and neuroradiologists would more often take the edge of the “tumor core” as the clinical definition of the “whole tumor”, i.e. the enhancing and non-enhancing part of the core and its associated necrosis, not including oedema. While this definition might be a better representation of the truth, current MRI techniques make it very difficult to distinguish between oedema and non-enhancing infiltrative tumor. Further research is needed to accurately distinguish between non-enhancing tumor and oedema. Depending on the intended use case for automated glioma segmentation, a less subjective, more consistent measurement may provide a more accurate representation of true tumor infiltration while avoiding the associated increase in (inter-rater) variability. The WHO glioma classification was updated in 2021: WHO CNS5 further advances the role of molecular diagnostics in the classification of CNS tumors, but remains rooted in the established methods of histology and immunohistochemistry for tumor characterisation49. The classification of GBM and LGG is very relevant to a model trained on combined LGG and GBM data, especially when it is applied to all gliomas.

Furthermore, we did not target hyperparameter optimisation for the sparsified training, nor did we make specific architecture optimisations for training and testing these models on a much larger dataset. Peak performance may be improved by doing so, but we chose not to tweak hyperparameters in order to promote generalisability.

Future work

In our study, we have only used pre-operative scans, while post-operative and longitudinal scans are also clinically relevant for radiotherapy planning, quantitative follow-up, and automatic growth detection; however, pre-operative baseline measurements are required for these assessments. Future work should follow the latest aims of the BraTS challenge and include disease progression monitoring and overall survival prediction. Furthermore, the current approach relied on having at least the T1c scan available. While this is a safe assumption for most retrospective cohorts, this may be different for future cohorts due to ongoing efforts to reduce gadolinium use50. To support such acquisition protocols, new models would have to be trained; however, we have shown that sparsified training provides a simple way to train models that are flexible with respect to the available sequences.

The tested networks all use very different implementations, making it difficult to pinpoint which differences between the models best explain the observed performance differences. To gain a better understanding of which properties most affect performance, future development should focus on consolidating different models within a single framework and applying and testing changes gradually.

We have shown that sparsified training offers a simple solution to missing sequences that is easy to implement for different network architectures and frameworks. While dealing with missing sequences is important, and allows for the inclusion of larger (retrospective) cohorts, improving the availability of all sequences for future patients would tackle the problem at the root.

Conclusions

In this study, we have shown the feasibility of using sparsified training alongside three top-performing BraTS challenge models to produce high-quality glioma segmentations of real-world hospital data with missing sequences. When segmenting scans with incomplete MRI sequences, there was no statistically significant decrease in performance. While performance was slightly reduced on an external test set, the segmentations remained within clinically acceptable ranges. nn-Unet was the most consistent performer, with the highest DSCs, lowest HDs, tightest IQRs, and smallest outlier rates across the vast majority of experiments.