Segmentation of glioblastomas in early post-operative multi-modal MRI with deep neural networks

Extent of resection after surgery is one of the main prognostic factors for patients diagnosed with glioblastoma. Estimating it requires accurate segmentation and classification of residual tumor in post-operative MR images. The current standard method for this estimation is subject to high inter- and intra-rater variability, and an automated method for segmentation of residual tumor in early post-operative MRI could lead to a more accurate estimation of extent of resection. In this study, two state-of-the-art neural network architectures for pre-operative segmentation were trained for the task. The models were extensively validated on a multicenter dataset with nearly 1000 patients from 12 hospitals in Europe and the United States. The best segmentation performance achieved was a 61% Dice score, and the best classification performance was about 80% balanced accuracy, with a demonstrated ability to generalize across hospitals. In addition, the segmentation performance of the best models was on par with that of human expert raters. The predicted segmentations can be used to accurately classify patients into those with residual tumor and those with gross total resection.


Introduction
Glioblastoma, the most common malignant primary brain cancer, requires a multidisciplinary treatment approach comprising maximum safe surgical resection, followed by concurrent radiation and chemotherapy 1 . Even so, median survival in unselected patients is only 12 months 2 . Due to the high invasiveness, a complete resection of all tumor cells is not possible. Still, extensive surgical resections are associated with longer survival 3 , but as surgically induced neurological impairment is associated with shorter survival 4 , extent of resection (EOR) and surgical strategy, for example resection or biopsy only, need to be weighed against risks in individual patients.
The EOR is calculated as the ratio between the surgically removed and pre-operative tumor volume, which relies on an accurate segmentation of the tumor tissue in both pre- and post-operative MR scans. In recent years, a large body of work has focused exclusively on automated segmentation of pre-operative glioblastoma, yet the task of residual tumor segmentation from early post-operative MRI (EPMR) has received less attention from the research community. In current practice, the residual tumor size is estimated manually through eye-balling 5 , or using crude measures such as the bi-dimensional product of the largest axial diameters of the contrast-enhancing residual tumor, according to the Response Assessment in Neuro-Oncology (RANO) criteria 6 . Manual volume segmentations are more sensitive but expertise-dependent and time-consuming, with high inter- and intra-rater variability 5,7 . An automated method for post-operative tumor volume segmentation from EPMR would therefore be beneficial.
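The EOR computation described above amounts to a one-line formula; a minimal sketch follows (the function and variable names are ours, not from the paper):

```python
def extent_of_resection(pre_op_volume_ml: float, post_op_volume_ml: float) -> float:
    """EOR: fraction of the pre-operative tumor volume removed by surgery."""
    return (pre_op_volume_ml - post_op_volume_ml) / pre_op_volume_ml
```

For example, a 30 ml pre-operative tumor reduced to a 1.5 ml remnant corresponds to an EOR of 0.95, i.e., 95%.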
Glioblastoma segmentation from pre-operative MR scans has received a lot of attention in the literature in recent years. Many contributions were motivated by the MICCAI Brain Tumor Segmentation (BraTS) Challenge 8 . With the emergence of fully convolutional neural networks (CNNs) 9 , deep learning-based approaches have nearly completely replaced more conventional methods in medical image segmentation 10 . Variants of the U-Net architecture 11 have formed the basis of the majority of auto-segmentation algorithms, including DeepMedic 12 , Attention U-Net 13 , and the recently established nnU-Net 14 , with state-of-the-art results in several medical image segmentation benchmarks. The winning submissions in the BraTS challenge in 2021 and 2022 were an extension of the nnU-Net architecture 15 and an ensemble of three state-of-the-art architectures for medical image segmentation, comprising nnU-Net, DeepSeg, and DeepSCAN 16 , respectively. In the absence of a publicly available dataset for residual tumor segmentation from EPMR, the literature on this problem is sparse compared to the pre-operative segmentation task. Semi-automatic methods, combining one or several voxel- or shape-based image segmentation algorithms, have been proposed based on intensity thresholding (e.g., Otsu and relative entropy) [17][18][19] , fuzzy algorithms 18 , Gaussian mixture models 20 , morphological operations 19 , region-based active contours 21,22 , the level set approach [21][22][23] , and CNNs 24 . Unfortunately, these methods relied on user input, either through manual initialisation or through interactive refinement of the resulting segmentation. They are therefore challenging to use in clinical practice and on large datasets. In addition, all validation studies were performed solely on single-center local datasets, consisting of 15 to 37 patients, making it difficult to demonstrate the generalizability of the proposed methods.
Regarding fully automated approaches, Meier et al. 25 presented an automated method based on decision forests for residual tumor segmentation using EPMR from 19 patients. A more recent work by Ghaffari et al. 26 proposed to fine-tune a 3D densely connected U-Net, pre-trained on the BraTS20 dataset, on a local dataset of 15 post-operative glioblastomas. However, the MR scans were all acquired for radiation therapy planning and not within the recommended time frame for EPMR scans, i.e., within 72 hours after surgical resection 6 . Deep learning approaches have recently been shown to outperform more traditional algorithms on most image segmentation tasks, including segmentation of pre-operative glioblastomas 15,16 . Key requirements are the number of included patients and the quality of the MR images comprising a study dataset. Preferably, the data should originate from different locations, to evaluate the ability of the trained models to generalize across different hospitals, scanners, and clinical practices.
In this work, we determine the performance of two CNN architectures for segmenting residual enhancing glioblastoma on early post-operative scans. The selected architectures are the nnU-Net, state-of-the-art for pre-operative glioblastoma segmentation, and AGU-Net, an architecture developed for pre-operative segmentation of brain tumors. Both architectures have demonstrated excellent performance in previous studies on pre-operative brain tumor segmentation [27][28][29] , and they exhibit different strengths and weaknesses. The automatic results are compared with manual segmentations, using different combinations of MRI scans in a large dataset consisting of paired pre- and early post-operative MRI scans from 956 patients in 12 medical centers in Europe and the United States. Extensive validation studies are presented to identify the best architecture configuration, quantify the performances and ability to generalize, and highlight potential relevance for use in clinical practice. Finally, the best performing models are made publicly available and integrated into the open software Raidionics 29 . The cohorts are subsets of a broader dataset, thoroughly described previously for its pre-operative content 30 , restricted to patients with available EPMR data. All EPMR scans were acquired within 72 hours after surgery, with the exception of the UTR center where the limit used was up to one week post-surgery. The recommended time frame for acquiring EPMR scans is stated in the National Comprehensive Cancer Network (NCCN) recommendations 31 , in order to maximize differences between residual enhancing tumor and enhancement due to post-surgical changes in the tissue 32,33 . For each patient in the dataset, the following post-operative MRI sequences were acquired: T1-weighted (T1w), gadolinium-enhanced T1-weighted (T1w-CE), and T2-weighted fluid attenuated inversion recovery (FLAIR).

Materials & Method
The residual tumor tissue was manually segmented in 3D in T1w-CE MR scans by trained annotators, supervised by expert neuroradiologists and neurosurgeons. The manual segmentation was performed using all available standard MR sequences, and residual tumor tissue was defined as tissue enhancing in the T1w-CE scan but darker in the T1w scan. Hence, blood was distinguished from residual tumor by a hyperintense signal on T1w scans. For each patient, a further post-operative distinction can be made between cases showcasing residual tumor (RT) in EPMR scans and cases presenting a gross total resection (GTR), defined as a residual tumor volume of less than 0.175 ml 34 . The cut-off was chosen to reduce the risk of interpretation problems when distinguishing between tumor enhancement and non-specific enhancement, such as small vessels or enhancing pia mater. Under this paradigm, 352 patients (35%) in our dataset had a GTR, whereas the remaining 604 patients had residual tumor. An overview of the data from the 12 hospitals is shown in Table 1, and some examples are illustrated in Figure 1. In addition, 20 patients out of the 73 in the AMS cohort have been annotated eight times in total, by four novice raters and four expert raters. This cohort has been used in a previous study to evaluate the inter-rater variability of tumor annotation from annotators with different levels of experience 7 , and will be referred to as the inter-rater variability dataset in the remainder of the document.

Segmentation process
Figure 2. Overall residual tumor segmentation pipeline from EPMR scans, and classification between gross total resection and residual tumor. The registration is performed using the SyN approach from ANTs, multiple input configurations using different combinations of MR sequences were considered (noted A to E), and two architectures were evaluated: nnU-Net and AGU-Net.
Similar to our previous work on pre-operative glioblastoma segmentation 27 , the following two competitive CNN architectures were selected for the task of voxel-wise segmentation of residual tumor tissue: patch-wise nnU-Net 14 and full-volume AGU-Net 28 .
Multiple combinations of MR sequences can be considered as input for the CNN architectures. In an attempt to minimize the essential inputs, and following the typical incremental assessment by neuroradiologists, the following combinations of input sequences were considered for the automated segmentation of post-operative tumor: (A) the EPMR T1w-CE scan only; (B) the EPMR T1w-CE and EPMR T1w scans, to potentially distinguish between blood and residual tumor; (C) all standard EPMR sequences: T1w-CE, T1w, and FLAIR; (D) the EPMR T1w-CE and EPMR T1w scans, plus the pre-operative T1w-CE MR scan and corresponding tumor segmentation mask; and (E) all standard EPMR sequences (T1w-CE, T1w, and FLAIR), plus the pre-operative T1w-CE MR scan and corresponding tumor segmentation mask. An overview of the whole segmentation pipeline with the different input designs and subsequent steps is presented in the following sections, and illustrated in Fig. 2.
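The five configurations can be encoded as a simple mapping from configuration letter to input channels; a minimal sketch (the dictionary and channel names are ours, chosen for illustration):

```python
# Hypothetical encoding of the five input configurations (A-E) described above.
INPUT_CONFIGS = {
    "A": ["epmr_t1ce"],
    "B": ["epmr_t1ce", "epmr_t1w"],
    "C": ["epmr_t1ce", "epmr_t1w", "epmr_flair"],
    "D": ["epmr_t1ce", "epmr_t1w", "preop_t1ce", "preop_tumor_mask"],
    "E": ["epmr_t1ce", "epmr_t1w", "epmr_flair", "preop_t1ce", "preop_tumor_mask"],
}

def num_input_channels(config: str) -> int:
    """Number of channels S stacked as network input for a given configuration."""
    return len(INPUT_CONFIGS[config])
```

The channel count corresponds to the S dimension of the AGU-Net input described below, ranging from one (configuration A) to five (configuration E).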

Pre-processing
For proper anatomical consistency across the different input sequences, an initial image-to-image registration procedure was performed. The EPMR T1w-CE scan was elected as the reference space, and all subsequent volumes were registered to it using the SyN diffeomorphic method 35 from the Advanced Normalization Tools (ANTs) framework 36 . Skull-stripping was subsequently performed on all input MR scans, based on the brain mask from the EPMR T1w-CE scan. All brain masks were automatically generated using a pre-trained slab-wise AGU-Net model with an input shape of 256 × 192 × 32 voxels. For the nnU-Net architecture, the pre-processing was automatically decided by the framework based on the dataset, and all inputs were resampled to 0.5 × 0.5 × 1.0 mm 3 spacing and zero-mean normalized. For the AGU-Net architecture, the full-volume analysis required a lower resolution, and therefore all inputs were resampled to an isotropic 1.0 mm 3 spacing, resized to 128 × 128 × 144 voxels, and zero-mean normalized.
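The zero-mean normalization step used by both pipelines can be sketched as follows; this is a minimal illustration assuming statistics are computed over brain voxels only (the exact normalization region used by each framework may differ):

```python
import numpy as np

def zero_mean_normalize(volume: np.ndarray, brain_mask: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance intensity normalization computed over the
    brain voxels only; voxels outside the skull-stripping mask are zeroed."""
    inside = brain_mask > 0
    voxels = volume[inside]
    normalized = (volume - voxels.mean()) / (voxels.std() + 1e-8)
    return np.where(inside, normalized, 0.0)
```

Applying this after registration and skull-stripping makes intensity distributions comparable across scanners and hospitals.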

Training specifications for the nnU-Net architecture
Architecture design From the nnU-Net framework analysis of the dataset, the 3D full-resolution U-Net with the following parameters was recommended, using an input patch size of 192 × 128 × 80 voxels. The network used five levels, downsampling using strided convolution layers and upsampling using transposed convolution layers. A kernel size of 1 × 3 × 3 voxels was used for the first level and 3 × 3 × 3 for the remaining four levels, with filter sizes of {32, 64, 128, 256, 320} for each level, respectively. The loss function was a combination of the Dice score and cross-entropy. A stride of one was used for the convolution layers.
Network training All models were trained from scratch for 1000 epochs using stochastic gradient descent with Nesterov momentum (momentum = 0.99). One epoch was defined as 250 batch iterations with a batch size of two. On-the-fly data augmentation was performed, comprising rotation, scaling, additive Gaussian noise, Gaussian blur, brightness and contrast augmentation, and gamma augmentation.

Training specifications for the AGU-Net architecture
Architecture design The AGU-Net, as described by Bouget et al. 28 , is a 3D U-Net architecture with an integrated attention-gated mechanism, with five block levels using filter sizes of {16, 32, 128, 256, 256}, respectively. The input size of the network was set to 128 × 128 × 144 × S, with S being the number of sequences used as input. The architecture also uses multi-scale input and deep supervision. The class-averaged Dice loss, excluding the background, was used for training the different models.
Network training All models were initialized using pre-trained weights from the best pre-operative glioblastoma segmentation model 27 , and only the input layer was adapted to account for the different input combinations considered. The Adam optimizer was used with an initial learning rate of 1 × 10 −3 , and the training was stopped after 30 consecutive epochs without validation loss improvement. Gradient accumulation 37 was performed to increase the batch size from 2 samples to 32, tackling graphics processing unit (GPU) memory limitations for large batch training. Data augmentation techniques were leveraged, including horizontal and vertical flipping, random rotations in the range [−20°, 20°], and translations of up to 10% of the axis dimension. Each augmentation was performed with a probability of 50% for each training sample.
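Gradient accumulation can be sketched framework-agnostically: gradients from several micro-batches are summed, and a single averaged optimizer update is applied, emulating a larger batch (here, 16 micro-batches of 2 samples emulate a batch of 32). The function and callback names below are ours, not from the AGU-Net code:

```python
def accumulated_update(micro_batches, compute_gradients, apply_gradients):
    """Sum gradients over several micro-batches, then apply the averaged
    gradient once, emulating a larger effective batch size."""
    accumulated = None
    for batch in micro_batches:
        grads = compute_gradients(batch)
        if accumulated is None:
            accumulated = list(grads)
        else:
            accumulated = [a + g for a, g in zip(accumulated, grads)]
    # Average so the update matches a true large-batch gradient.
    apply_gradients([a / len(micro_batches) for a in accumulated])
```

The memory cost per step stays that of one micro-batch, while the update statistics approach those of the larger batch.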

Post-processing and GTR classification
During inference, residual tumor tissue was predicted by each trained model, resulting in a probability map of the same resolution as the EPMR T1w-CE scan. A binary mask was then generated from the probability map, using the best threshold determined from the validation studies. The binary mask was further refined by filtering out potential noise, inherent to the voxel-wise segmentation task, by applying a connected components analysis and removing any identified object smaller than 20 voxels. Finally, the refined binary mask was used to assess whether gross total resection had been achieved for the patient.
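The post-processing steps above can be sketched as follows. This is a minimal illustration using a plain breadth-first search for 6-connectivity (a production pipeline would typically use an optimized connected-components routine); the function names and the assumption that volumes are given in ml are ours:

```python
import numpy as np
from collections import deque

def filter_small_components(mask: np.ndarray, min_voxels: int = 20) -> np.ndarray:
    """Remove connected components (6-connectivity) smaller than min_voxels."""
    out = np.zeros_like(mask)
    visited = np.zeros(mask.shape, dtype=bool)
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for seed in zip(*np.nonzero(mask)):
        if visited[seed]:
            continue
        queue, component = deque([seed]), [seed]
        visited[seed] = True
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in offsets:
                p = (x + dx, y + dy, z + dz)
                if all(0 <= p[i] < mask.shape[i] for i in range(3)) \
                        and mask[p] and not visited[p]:
                    visited[p] = True
                    queue.append(p)
                    component.append(p)
        if len(component) >= min_voxels:
            for p in component:
                out[p] = 1
    return out

def is_gross_total_resection(mask: np.ndarray, voxel_volume_ml: float,
                             cutoff_ml: float = 0.175) -> bool:
    """GTR if the residual tumor volume falls below the clinical cut-off."""
    return bool(mask.sum() * voxel_volume_ml < cutoff_ml)
```

With 1 mm isotropic voxels (0.001 ml each), the 0.175 ml cut-off corresponds to 175 voxels of residual tumor.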

Validation studies
In this work, the trained models were assessed based on their ability to segment the residual tumor and to classify patients into those with gross total resection and those with residual tumor. For the segmentation task, only two classes are considered, whereby a positive voxel exhibits tumor tissue and a negative voxel represents either background or normal tissue. For the classification task, a residual tumor volume threshold was selected to serve as cut-off value.

Protocols
The validation studies presented in this work were conducted following a five-fold cross-validation, summarized in Table 2. First, all patients from 11 of the 12 hospitals in our dataset, excluding the AMS cohort, were split into five hospital-stratified folds, with an approximately balanced number of patients in each fold. The remaining 73 patients from the AMS hospital were kept as a hold-out test set. For each iteration of the cross-validation, four folds were used for training, the remaining fifth fold was used for validation, and the hold-out set was used for testing.
This approach presents similar benefits to the leave-one-hospital-out strategy used in previous work 27 , with the advantage of a reduced training time. Finally, predictions over the test set were generated by ensembling the predictions obtained by each of the five trained models. An average pooling voting scheme was applied to the model predictions, to produce a single softmax prediction.
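The average pooling voting scheme reduces to a voxel-wise mean of the per-model probability maps, followed by binarization; a minimal sketch (function name ours):

```python
import numpy as np

def ensemble_predict(probability_maps, threshold=0.5):
    """Average pooling vote: mean of the per-model probability maps,
    then binarization with the chosen threshold."""
    mean_prob = np.mean(np.stack(probability_maps, axis=0), axis=0)
    return (mean_prob >= threshold).astype(np.uint8)
```

In this study the maps come from the five cross-validation models, and the threshold is the best one determined from the validation studies.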

Metrics
To evaluate the models' voxel-wise performance on the task of residual tumor segmentation, Dice scores were computed between the ground truth annotation and the post-processed binary prediction mask. The Dice scores are reported for the subgroup of patients with residual tumor tissue according to the ground truth annotation, labelled as the 'positive' (P) group. The Dice scores for the subgroup of patients with residual tumor according to both the ground truth annotation and the network predictions, labelled as the 'true positive' (TP) group, are also reported. Pooled estimates, computed from each fold's results, are reported for each measurement as mean and standard deviation (indicated by ±) in the tables.
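The Dice score on binary masks can be computed as follows; a minimal sketch (function name ours):

```python
import numpy as np

def dice_score(prediction, ground_truth):
    """Dice = 2|P ∧ G| / (|P| + |G|) on binary masks; defined here as 1.0
    when both masks are empty, which is one reason GTR cases are handled
    in a separate classification experiment."""
    intersection = np.logical_and(prediction, ground_truth).sum()
    denominator = prediction.sum() + ground_truth.sum()
    return 2.0 * intersection / denominator if denominator else 1.0
```

For instance, two masks of four voxels each agreeing on half of their positives yield a Dice of 0.5.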
For the patient-wise classification task of distinguishing patients with gross total resection from patients with residual tumor, a standard sensitivity and specificity approach was conducted, represented by the balanced accuracy score (noted bAcc). A residual tumor volume below the clinical volume threshold was thus counted as a negative (i.e., GTR) and as a positive otherwise (i.e., RT). Following this consideration, a patient was considered a true positive (TP) if both the ground truth and detected residual tumor volumes were ≥ 0.175 ml, for any non-zero Dice score (i.e., ≥ 0.01). Conversely, if both volumes were < 0.175 ml, the patient was labelled as a true negative (TN). Patients where the ground truth volume was above the threshold and the predicted volume below were marked as false negatives (FN), and as false positives (FP) in the opposite case.
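The patient-wise labelling and the balanced accuracy can be sketched as follows. Note that this simplified version classifies from volumes only; the paper additionally requires a minimal Dice overlap (≥ 0.01) for a true positive, which is omitted here. Function names are ours:

```python
def classify_patient(gt_volume_ml, pred_volume_ml, cutoff_ml=0.175):
    """Patient-wise label from ground-truth and predicted residual tumor
    volumes, using the clinical cut-off (overlap criterion omitted)."""
    gt_rt = gt_volume_ml >= cutoff_ml
    pred_rt = pred_volume_ml >= cutoff_ml
    if gt_rt and pred_rt:
        return "TP"
    if not gt_rt and not pred_rt:
        return "TN"
    return "FN" if gt_rt else "FP"

def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity (recall on RT) and specificity (recall on GTR)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2.0
```

Balanced accuracy is preferred over plain accuracy here because the RT and GTR groups are imbalanced (604 versus 352 patients).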
In the case of inter-rater variability, the Jaccard score, closely related to the Dice score by J = D / (2 − D), was used to compare the models' performance. The Jaccard score was chosen for easy comparison with a previously published work on the same dataset 7 .
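The conversion between the two overlap measures is a one-liner:

```python
def jaccard_from_dice(dice: float) -> float:
    """J = D / (2 - D); e.g., a Dice of 2/3 corresponds to a Jaccard of 0.5."""
    return dice / (2.0 - dice)
```

The two scores agree at the extremes (0 and 1), with the Jaccard index always at or below the Dice score in between.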

Experiments
The following three experiments were conducted in this study: (i) Residual tumor segmentation performance study: using the five-fold cross-validation protocol and segmentation metrics, the segmentation performances of the nnU-Net and AGU-Net architectures were compared for the five combinations of input sequences.
(ii) Gross total resection classification performance study: using the five-fold cross-validation protocol, classification metrics, and the best input combination identified in the first experiment, both architectures were compared in terms of their ability to classify between gross total resection and residual tumor patients.
(iii) Inter-rater variability study: the best model from each architecture was benchmarked in terms of segmentation performance against the performance of novice and expert annotators, using the inter-rater variability dataset. For each patient, a consensus agreement annotation was created using a majority voting approach. Using all eight annotations from both experts and novices, a voxel was defined as belonging to a tumor if annotated by more than half of the annotators. The models' binary predictions and the eight inter-rater annotations were then compared against the ground truth annotations (as used in the hold-out test set) and the consensus annotations.
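The majority-vote consensus described in (iii) can be sketched as follows (function name ours):

```python
import numpy as np

def consensus_annotation(annotations):
    """Majority vote over rater masks: a voxel belongs to the tumor if
    more than half of the annotators marked it."""
    votes = np.stack(annotations, axis=0).sum(axis=0)
    return (votes > len(annotations) / 2).astype(np.uint8)
```

With the eight raters of this study, a voxel thus needs at least five votes to enter the consensus mask.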

Results
The studies were performed using multiple machines with the two following specifications: (i) an Intel Core Processor (Broadwell, no TSX, IBRS) central processing unit (CPU) with 16 cores, 64GB of RAM, a Tesla V100S (32GB) dedicated GPU, and a regular hard-drive; and (ii) a GPU server with a total of 256 CPU cores, 2TB of RAM, and six NVIDIA A100-SXM4 (80GB) cards. The AGU-Net architecture was implemented in Python 3.6 with the TensorFlow v1.13.1 library 38 . For the nnU-Net architecture, Python 3.8, PyTorch v1.13.1 39 , and the nnU-Net framework v1.7.0 14 were used.

Residual tumor segmentation performance study
Segmentation performances across both architectures, for all input sequence combinations, and only for patients with residual tumor, are summarized in Table 3. For both architectures, the lowest average Dice score over the external test set was obtained with configuration A, indicating that solely using T1w-CE MR scans is insufficient for identifying post-operative residual tumor. The addition of the T1w scan as input (i.e., configuration B) provides at least a 5% improvement in Dice scores over the test set for both architectures. This illustrates the additional value of the T1w sequence, presumably due to better distinction between blood and tumor. The inclusion of the FLAIR scan in input configuration C slightly degraded the Dice score compared to input configuration B. Finally, the inclusion of pre-operative data does not seem to improve performance for either architecture, as the Dice scores for input configuration D are again slightly lower than for configuration B. Further addition of the FLAIR scan in input configuration E leads to a minor decrease in Dice scores compared to configuration D.
For both architectures, input configuration B yielded the highest Dice scores on the test set. The highest Dice and true positive Dice scores were obtained with the nnU-Net architecture trained on input configuration B, with 59% and 61% Dice on the validation and test sets, respectively. Overall, performances obtained across the test set are stable, in support of generalizability. Likewise, performances over the validation sets from the cross-validation protocol are consistent for input configurations B to E. The same trends can be observed across both architectures for the true positive Dice, although slightly higher for the positive Dice using the nnU-Net architecture.
Looking at patient-wise performances, models trained with the nnU-Net architecture achieve perfect recall across all configurations for both the validation and test sets. While the patch-wise strategy allows for segmenting smaller structures, the loose criterion for considering a network prediction a true positive further strengthens this aspect. Indeed, only a few correctly overlapping voxels between the prediction and the ground truth are needed for residual tumor to be considered satisfactorily identified patient-wise. Due to the full volume approach, models trained with AGU-Net generally struggle to identify small elements, as indicated by an overall recall of around 80% across the board. Conversely, the opposite trend can be noticed with regard to patient-wise precision. Models trained with nnU-Net tend to make more erroneous predictions, as indicated by average precision scores below 70%, whereas AGU-Net models tend to be more precise, with precision scores up to 95%. Ultimately, F1-scores are very similar between models from both architectures across the different input configurations, as they combine recall and precision.
From the segmentation performance analysis, the best results were obtained with the nnU-Net architecture using input configuration D. Visual comparisons between the two architectures using the best input configuration are provided in Figure 3 for some patients from the test set, one featured per row. In the top row, both models achieved excellent segmentation with a Dice score above 90%. The second row features a multifocal post-operative residual tumor case where the AGU-Net model produced one false positive component, shown in red in the 3D representation. The third row displays a challenging multifocal and fragmented residual tumor case where both models failed to segment the largest component. Finally, in the last row, both models oversegmented, leading to Dice scores below 40%.

Figure 3.
Segmentation comparison between the manual ground truth (in blue) and the binary predictions (in red) for the two architectures using configuration D, over the test set. One patient is featured per row, the patient-wise Dice score is reported in white, and a 3D representation of the overlap is included (best viewed digitally and in color).

Gross total resection classification performance study
Classification performances between patients with residual tumor and gross total resections, across both architectures and for all input configurations, are reported in Table 4. The first noticeable result is the overall tendency of the nnU-Net architecture to oversegment, resulting in a perfect recall over both the test set and validation set, but a very poor specificity, often below 30%. Overall, nnU-Net achieves balanced accuracy scores barely above 0.5 for all input configurations, meaning that the classification performance is only slightly better than the average score of random guessing (i.e., 0.5). Conversely, models trained with the AGU-Net architecture are more conservative, leading to higher specificity scores, up to 90% for input configuration C, and reasonably high recall/sensitivity values above 80%. In contrast to the segmentation performances, the successive addition of MR scans to the input configuration led to improved classification performances for both architectures. One apparent difference is the added value of the FLAIR sequence with the AGU-Net architecture, further increasing the specificity and balanced accuracy, unlike with the nnU-Net architecture.
From the classification performance analysis, the best results on the test set according to the balanced accuracy were obtained with the AGU-Net architecture using input configuration C. However, the best results on the validation sets were achieved with input configuration E. In a clinical scenario, a high sensitivity has higher priority than a high specificity, as long as the trade-off is reasonable. AGU-Net trained with input configuration E is therefore the preferred model for classification. This configuration achieves the highest sensitivity of all input configurations while still achieving a reasonable specificity.

Inter-rater variability study
For the 20 patients constituting the inter-rater variability dataset, a comparison of the Jaccard scores obtained between each rater and the best model from each architecture is reported in Figure 4. As configurations B to E all yielded very similar segmentation performance scores, the selected best configuration for each architecture was the model that produced the best trade-off between sensitivity and specificity for the classification task. For the nnU-Net, configuration D was selected, as this model achieved the highest specificity of all the trained nnU-Net models; for the AGU-Net, configuration E was selected, as this model yielded the highest sensitivity while maintaining a reasonable specificity. Using the ground truth annotation from the test dataset as a reference segmentation, both architectures achieved Jaccard scores within a variability range very similar to that of the novice and expert annotators. With the consensus agreement annotation as a reference segmentation, the AGU-Net model achieved slightly poorer Jaccard scores than the majority of the expert human raters but remained within the variability range of the novice annotators. The nnU-Net model achieved scores similar to the variability range of the expert raters, also when compared to the consensus agreement annotation.

Discussion
In this multicenter study, the feasibility of post-operative residual tumor segmentation with deep neural networks was assessed. Two state-of-the-art architectures for pre-operative glioblastoma segmentation were compared: nnU-Net and AGU-Net. Both architectures were trained on five different combinations of early post-operative and pre-operative MR scans as input, and benchmarked in terms of segmentation and classification performances against manual rating. The main finding is that automatic segmentation performances are comparable to human rater performance on real-world MRI scans, requiring only early post-operative T1w-CE and T1w MRI scans. In addition, the trained models have shown a promising ability to distinguish patients who underwent gross total resection from patients exhibiting post-operative residual tumor. The multimodal and multicentric dataset in this study is the largest cohort used for the task of early post-operative glioblastoma segmentation, with a total of 956 patients. Regarding the dataset curation, our strict inclusion criteria required availability of all four MR scans used as input (i.e., post-operative T1w-CE, T1w, FLAIR, and pre-operative T1w-CE) for each patient. While this decision was motivated by a simpler method design, approximately 150 patients were excluded because one or more MR scans were missing. A relaxation of the inclusion criteria would increase the size of the dataset and open the possibility of generating a more diverse set of input MR scans, including for example T2-weighted images. Ideally, the trained methods should be able to deal with a sparse set of inputs, where one or more MR scans are missing. The trained models could then be used off the shelf, by replacing missing sequences with empty volumes or synthetically generated sequences, or by allowing missing inputs using sparsified learning techniques 40 .
In their ability to segment post-operative tumor, nnU-Net and AGU-Net exhibit strengths and weaknesses inherent to their design. Through a patch-wise approach, nnU-Net models are able to segment relatively small structures, having access to more fine-grained details from the use of MR scans close to their initial resolution. Considering the relatively small volumes and fragmented state of residual tumors, nnU-Net models are able to achieve higher Dice score and recall performances. On the other hand, models trained using the AGU-Net approach follow a full volume approach, largely downsampling the input MR scans, which hinders the detection of smaller structures. However, such models appear to be more conservative in their predictions, heavily reducing the number of false positives and enabling high precision performances. Regarding the different input configurations, the biggest impact on segmentation performances comes from combining EPMR T1w-CE and T1w scans, which also corresponds to the favored approach in clinical practice. The inclusion of additional MR sequences seems to add little to segmentation performances. Adapting the convolution blocks, filter sizes, or other elements of the architectures might be needed to let the number of trainable parameters scale with the number of inputs, instead of remaining fixed.
The validation studies described in this article served the two purposes of investigating the predictive ability and the generalization capacity of the trained models. This was obtained through the use of a unique test set and equally distributed hospital-stratified validation sets. Our selection of a specific hold-out hospital as a test set was based on the availability of manual annotations from multiple raters, additionally allowing an inter-rater variability study. Regarding the computation of the reported metrics, the rationale for only including the true positive patients in the segmentation performances lies in the Dice score computation itself. Indeed, cases with a GTR preclude the calculation of a Dice score. Therefore, the validation studies include a separate experiment to classify patients into those with a GTR and those with residual tumor.
The inter-rater variability study demonstrated that the residual tumor segmentation performance is on par with the average human expert annotator, when evaluated against an independent ground truth segmentation. Even when evaluated against the consensus agreement annotation, which is by definition biased towards each of the human annotators included in the study, the best segmentation model achieves scores similar to the individual expert annotations, and still outperforms the novice annotators. The consensus agreement annotation, based on a majority voting scheme over the annotations from the eight different annotators, should be considered the gold standard for defining the residual tumor. However, this is not achievable in a real-world clinical scenario, where even an exact delineation of the tumor remnant by a single human annotator is rarely performed. The proposed automatic method for residual tumor segmentation should thus be considered an acceptable alternative to the current standard practice for evaluating the tumor remnant after surgery, as the average performance of the method lies within the variability range of individual expert annotators. This segmentation performance is achieved with post-operative MR sequences as the only model inputs (T1w-CE, T1w, and FLAIR), while adding pre-operative information (pre-operative T1w-CE scan and tumor label) keeps the model performance at a similar level. Thus, in clinical practice, our trained models could be deployed even in the absence of pre-operative scans, as long as at least the post-operative T1w-CE and T1w sequences are available, establishing an automated and relatively fast method for the segmentation task. In addition, the output segmentation masks can be used to differentiate between patients with remnant tumor after surgery and patients with gross total resection, with balanced accuracy increasing as more sequences are added to the model inputs. Our early post-operative glioblastoma segmentation models have been made freely available in the Raidionics environment 1 .
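The majority voting scheme used for the consensus agreement annotation can be sketched as follows (a toy illustration with three raters, not the study's implementation): a voxel is retained in the consensus mask when more than half of the raters labeled it as tumor.

```python
import numpy as np

def consensus_annotation(masks: list[np.ndarray]) -> np.ndarray:
    """Majority-vote consensus: a voxel belongs to the consensus mask
    when more than half of the raters labeled it as tumor."""
    stacked = np.stack([m.astype(bool) for m in masks])
    return stacked.sum(axis=0) > (len(masks) / 2)

# Three toy rater masks over a 3x3 slice.
r1 = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0]], dtype=bool)
r2 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=bool)
r3 = np.array([[0, 1, 0], [0, 1, 1], [0, 0, 0]], dtype=bool)
consensus = consensus_annotation([r1, r2, r3])
# Only voxels labeled by at least 2 of the 3 raters survive:
# [[True, True, False], [False, True, False], [False, False, False]]
```

With the eight annotators of the inter-rater study, the same rule would require at least five agreeing raters per voxel.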
In spite of the promising reported performances, the task of early post-operative glioblastoma segmentation is far from solved. The full extent of the residual tumor, often heavily fragmented around the resection cavity, is never wholly captured. In future work, the pre-operative MR scans and tumor location should be better leveraged, as the residual tumor is bound to lie in the vicinity of the original tumor. Focusing the search solely within a region of interest might help retain a higher image resolution, for better segmentation of small structures. Nevertheless, competitive pre- and post-operative glioblastoma segmentation models are now publicly available, opening the door to clinically oriented validation studies. Assuming a positive outcome, the use of automatic models and methods would be highly beneficial in a clinical setting, replacing parameters currently obtained through eyeballing or diameter estimation with reproducible and deterministic measurements.
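The patient-level GTR versus residual-tumor classification discussed above can be derived directly from the predicted masks, and scored with balanced accuracy. The sketch below is illustrative: the voxel threshold `min_voxels` is a hypothetical parameter, not the value used in this study.

```python
import numpy as np

def classify_resection(pred_mask: np.ndarray, min_voxels: int = 1) -> str:
    """Patient-level label from a predicted mask: residual tumor (RT)
    if enough voxels are predicted, gross total resection (GTR) otherwise.
    The min_voxels threshold is a hypothetical parameter."""
    return "RT" if pred_mask.sum() >= min_voxels else "GTR"

def balanced_accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Mean of the per-class recalls over the two classes,
    robust to the RT/GTR class imbalance."""
    recalls = []
    for cls in ("RT", "GTR"):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Toy cohort: 3 RT patients and 2 GTR patients.
y_true = ["RT", "RT", "RT", "GTR", "GTR"]
y_pred = ["RT", "RT", "GTR", "GTR", "RT"]
print(balanced_accuracy(y_true, y_pred))  # (2/3 + 1/2) / 2 ~ 0.583
```

Balanced accuracy is the natural choice here because the RT and GTR groups are unevenly represented across hospitals, and plain accuracy would be dominated by the majority class.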

Conclusion
In this study, two state-of-the-art neural network architectures for glioblastoma segmentation were trained and thoroughly validated on a large cohort of 956 patients. Automatic segmentation performance is on par with human rater performance on real-world MRI scans, while requiring only the early post-operative T1w-CE and T1w MRI scans. In addition, the presented models have shown promise for automatically distinguishing between patients who underwent gross total resection and patients with residual tumor. The prognostic value of the automated method should be assessed in future studies.

Figure 1. Dataset examples for four patients, separated by white dashed lines. For each patient, an axial view of the EPMR T1w-CE, EPMR T1w, EPMR FLAIR, and pre-operative T1w-CE is displayed. Outlines of the manually annotated tumors are shown in green.

Figure 4. Inter-rater Jaccard score variability over a subset of the AMS cohort. On the left, the ground truth annotation used for training served as the reference segmentation. On the right, the reference segmentation was a consensus agreement between annotations from all raters.

Table 1. Dataset distributions and statistics across the twelve hospitals, represented by their acronyms. RT: residual tumor; GTR: gross total resection.

Table 2. Distribution of hospitals and patient samples featured in the 5-fold validation sets and hold-out test set.

Table 3. Segmentation performances for patients with residual tumor, for both architectures, all input configurations, and over the validation and test sets.

Table 4. Gross total resection versus residual tumor classification performances for both architectures, all input configurations, and over the validation and test sets.