The Medical Segmentation Decathlon

International challenges have become the de facto standard for comparative assessment of image analysis algorithms. Although segmentation is the most widely investigated medical image processing task, the various challenges have been organized to focus only on specific clinical tasks. We organized the Medical Segmentation Decathlon (MSD), a biomedical image analysis challenge in which algorithms compete in a multitude of both tasks and modalities, to investigate the hypothesis that a method capable of performing well on multiple tasks will generalize well to a previously unseen task and potentially outperform a custom-designed solution. MSD results confirmed this hypothesis; moreover, the MSD winner continued generalizing well to a wide range of other clinical problems for the next two years. Three main conclusions can be drawn from this study: (1) state-of-the-art image segmentation algorithms generalize well when retrained on unseen tasks; (2) consistent algorithmic performance across multiple tasks is a strong surrogate of algorithmic generalizability; (3) the training of accurate AI segmentation models is now commoditized to scientists who are not versed in AI model training.


Introduction
Machine learning is beginning to revolutionize many fields of medicine, with success stories ranging from the accurate diagnosis and staging of diseases [1], to the early prediction of adverse events [2] and the automatic discovery of antibiotics [3]. In this context, a large amount of literature has been dedicated to the automatic analysis of medical images [4]. Semantic segmentation refers to the process of transforming raw medical images into clinically relevant, spatially structured information, such as outlining tumor boundaries, and is an essential prerequisite for a number of clinical applications, such as radiotherapy planning [5] and treatment response monitoring [6]. It is so far the most widely investigated medical image processing task, with about 70% of all biomedical image analysis challenges dedicated to it [7]. With thousands of algorithms published in the field of biomedical image segmentation per year [8], however, it has become challenging to decide on a baseline architecture as a starting point when designing an algorithm for a new given clinical problem.
International challenges have become the de facto standard for comparative assessment of image analysis algorithms given a specific task [7]. Yet, a deep learning architecture well suited for a certain clinical problem (e.g., segmentation of brain tumors) may not necessarily generalize well to different, unseen tasks (e.g., vessel segmentation in the liver). A "generalizable learner", which in this setting would represent a fully automated method that can learn any segmentation task given some training data and without the need for human intervention, would provide the missing technical scalability to allow many new applications in computer-aided diagnosis, biomarker extraction, surgical intervention planning, disease prognosis, etc. To address this gap in the literature, we proposed the concept of the Medical Segmentation Decathlon (MSD), an international challenge dedicated to identifying a general-purpose algorithm for medical image segmentation. The competition comprised ten different data sets with various challenging characteristics, as shown in Fig. 1. The participants were allowed to submit only one solution, able to solve all problems without changing the architecture or hyperparameters.
The contribution of this paper is three-fold: (1) We are the first to organize a biomedical image analysis challenge in which algorithms compete in a multitude of both tasks and modalities. More specifically, the underlying data set has been designed to feature some of the representative difficulties typically encountered when dealing with medical images, such as small data sets, unbalanced labels, multi-site data and small objects. (2) Based on the MSD, we released the first open framework for benchmarking medical segmentation algorithms with a specific focus on generalizability. (3) By monitoring the winning algorithm, we show that generalization across various clinical applications is possible with one single framework.
In the following, we will first describe the challenge design, including the organization, mission, data sets and assessment method, in Section 2, followed by the presentation of the results in Section 3, in which we present the submitted methods and rankings as well as the results for the live challenge. We conclude with a discussion in Section 4.

Methods
This section is organized according to the EQUATOR guideline BIAS (Biomedical Image Analysis ChallengeS) [9], a recently published guideline specifically designed for the reporting of biomedical image analysis challenges. It comprises information on challenge organization and mission, as well as the data sets and assessment methods used to evaluate the submitted results.

Challenge organization
The Decathlon challenge was organized at the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2018, held in Granada, Spain. After the main challenge event at MICCAI, a live challenge was opened for submissions, which is still open and regularly receives new submissions (more than 500 as of May 15th 2021).
The MSD challenge aimed to test the ability of machine learning algorithms to accurately segment a large collection of prescribed regions of interest, as defined by ten different data sets, each corresponding to a different anatomical structure (see Fig. 1) and to at least one medical imaging task [10]. The challenge itself consisted of two phases: In the first phase, named the development phase, the training cases (comprising images and labels) for seven data sets were released, namely for brain, liver, heart, hippocampus, prostate, lung, and pancreas. Participants were expected to download the data, develop a general-purpose learning algorithm, train the algorithm on each task's training data independently and without human interaction (no task-specific manual parameter settings), run the learned model on each task's test data, and submit the segmentation results. Each team was only allowed to make one submission per day to avoid overfitting, and the results were presented in the form of a live leaderboard on the challenge website, visible to the public. Note that participants were only able to see the average performance obtained by their algorithm on the test data of the seven development tasks.
The purpose of the second phase of the challenge, named the mystery phase, was to investigate whether algorithms were able to generalize to unseen segmentation tasks. Teams that submitted to the first phase and completed all necessary steps were invited to download three more data sets (images and labels), i.e., hepatic vessels, colon, and spleen. They were allowed to train their previously developed algorithm on the new data, without any modifications to the method itself. Segmentation results of the mystery phase could only be submitted once. A detailed description of the challenge organization is summarized in Appendix A, following the form introduced in [9].

The Decathlon mission
Medical image segmentation, i.e., the act of labeling or contouring structures of interest in medical imaging data, is a task of crucial importance, both clinically and scientifically, as it allows the quantitative characterization of regions of interest. When performed by human raters, image segmentation is very time-consuming, thus limiting its clinical usage. Algorithms can be used to automate this segmentation process, but, classically, a different algorithm had to be developed for each segmentation task. The goal of the MSD challenge was to find a single algorithm, or learning system, that would be able to generalize and work accurately across multiple different medical segmentation tasks, without the need for any human interaction.
The tasks of the Decathlon challenge were chosen as a representative sample of real-world applications, so as to test for algorithmic generalizability to these. Different axes of complexity were explicitly explored: the type and number of input modalities, the number of regions of interest, their shape and size, and the complexity of the surrounding tissue environment (see Fig. 1). Detailed information on each data set is provided in Section 2.3 and Table 1.

Challenge data sets
The Decathlon challenge made ten data sets available online [11], where each data set had between one and three region-of-interest (ROI) targets (17 targets in total). Importantly, all data sets have been released with a permissive copyright license (CC-BY-SA 4.0), thus allowing for data sharing, redistribution, and commercial usage, and subsequently promoting the data set as a standard test-bed for all users. The images (2,633 in total) were acquired across multiple institutions, anatomies and modalities during real-world clinical applications. All images were de-identified and reformatted to the Neuroimaging Informatics Technology Initiative (NIfTI) format (https://nifti.nimh.nih.gov). All images were transposed (without resampling) to the most approximate right-anterior-superior coordinate frame, ensuring the data matrix x-y-z direction was consistent. Lastly, non-quantitative modalities (e.g., MRI) were robust min-max scaled to the same range. For each segmentation task, a pixel-level label annotation was provided depending on the definition of each specific task. For 8 out of 10 data sets, two thirds of the data were released as training sets (images and labels) and one third as test set (images without labels). As the remaining two tasks (brain tumor and liver) consisted of data from two well-known challenges, the original training/test split was preserved.
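The exact scaling procedure is not detailed in this section. As a minimal sketch of what robust min-max scaling can look like, the snippet below clips intensities to two percentiles before rescaling. The function name `robust_min_max`, the percentiles (0.5 and 99.5) and the target range [0, 1] are illustrative assumptions, not the values used to prepare the MSD data.

```python
import numpy as np


def robust_min_max(volume, lower_pct=0.5, upper_pct=99.5, out_range=(0.0, 1.0)):
    """Rescale intensities to out_range after clipping to two percentiles.

    The percentiles and the output range are assumptions made for illustration.
    """
    volume = np.asarray(volume, dtype=np.float64)
    lo, hi = np.percentile(volume, [lower_pct, upper_pct])
    if hi <= lo:
        # Constant image: map everything to the lower bound of the range.
        return np.full_like(volume, out_range[0])
    # Clipping makes the scaling insensitive to a few extreme voxels.
    scaled = (np.clip(volume, lo, hi) - lo) / (hi - lo)
    return scaled * (out_range[1] - out_range[0]) + out_range[0]


# Synthetic 3D volume with one extreme outlier voxel.
vol = np.arange(1000, dtype=float).reshape(10, 10, 10)
vol[0, 0, 0] = 1e6
scaled = robust_min_max(vol)
```

Compared with plain min-max scaling, the single outlier voxel no longer compresses the remaining intensities into a narrow band near zero.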
Table 1 presents a summary of the ten data sets, including the modality, image series, ROI targets and data set size. A brief description of each data set is provided below.
• Development Phase (1st) contained seven data sets with thirteen target regions-of-interest in total:
- Brain: The data set consists of 750 multiparametric magnetic resonance images (mp-MRI) from patients diagnosed with either glioblastoma or lower-grade glioma. The sequences used were native T1-weighted (T1), post-Gadolinium (Gd) contrast T1-weighted (T1-Gd), native T2-weighted (T2), and T2 Fluid-Attenuated Inversion Recovery (FLAIR). The corresponding target ROIs were the three tumor sub-regions, namely edema, enhancing, and non-enhancing tumor. This data set was selected due to the challenge of locating these complex and heterogeneously-located targets. The data was acquired from 19 different institutions and contained a subset of the data used in the 2016 and 2017 Brain Tumor Segmentation (BraTS) challenges [12,13,14].
- Heart: The data set consists of 30 mono-modal MRI scans of the entire heart acquired during a single cardiac phase (free breathing with respiratory and electrocardiogram (ECG) gating). The corresponding target ROI was the left atrium. This data set was selected due to the combination of a small training data set with large anatomical variability. The data was acquired as part of the 2013 Left Atrial Segmentation Challenge (LASC) [15].
- Hippocampus: The data set consists of 195 MRI images acquired from 90 healthy adults and 105 adults with a non-affective psychotic disorder. T1-weighted MPRAGE was used as the imaging sequence. The corresponding target ROIs were the anterior and posterior of the hippocampus, defined as the hippocampus proper and parts of the subiculum. This data set was selected due to the precision needed to segment such a small object in the presence of a complex surrounding environment. The data was acquired at the Vanderbilt University Medical Center, Nashville, US.
- Liver: The data set consists of 201 contrast-enhanced CT images from patients with primary cancers and metastatic liver disease, as a consequence of colorectal, breast, and lung primary cancers. The corresponding target ROIs were the segmentation of the liver and tumors inside the liver. This data set was selected due to the challenging nature of having significant label unbalance between large (liver) and small (tumor) target regions of interest (ROIs). The data was acquired in the IRCAD Hôpitaux Universitaires, Strasbourg, France and contained a subset of patients from the 2017 Liver Tumor Segmentation (LiTS) challenge [16].
- Lung: The data set consists of preoperative thin-section CT scans from 96 patients with non-small cell lung cancer. The corresponding target ROI was the tumors within the lung. This data set was selected due to the challenge of segmenting small regions (tumor) in an image with a large field-of-view. The data was acquired via The Cancer Imaging Archive.
- Prostate: The data set consists of 48 prostate multiparametric MRI (mpMRI) studies comprising T2-weighted, diffusion-weighted and T1-weighted contrast-enhanced series. A subset of two series, transverse T2-weighted and the apparent diffusion coefficient (ADC), was selected. The corresponding target ROIs were the prostate peripheral zone (PZ) and the transition zone (TZ). This data set was selected due to the challenge of segmenting two adjoined regions with very large inter-subject variability. The data was acquired at Radboud University Medical Center, Nijmegen Medical Centre, Nijmegen, The Netherlands.
- Pancreas: The data set consists of 421 portal-venous phase CT scans of patients undergoing resection of pancreatic masses. The corresponding target ROIs were the pancreatic parenchyma and pancreatic mass (cyst or tumor). This data set was selected due to label unbalance between large (background), medium (pancreas) and small (tumor) structures. The data was acquired in the Memorial Sloan Kettering Cancer Center, New York, US.
• Mystery Phase (2nd) contained three (hidden) data sets with four target regions-of-interest in total:
- Colon: The data set consists of 190 portal-venous phase CT scans of patients undergoing resection of primary colon cancer. The corresponding target ROI was colon cancer primaries. This data set was selected due to the challenge of the heterogeneous appearance, and the annotation difficulties. The data was acquired in the Memorial Sloan Kettering Cancer Center, New York, US.
- Hepatic Vessels: The data set consists of 443 portal-venous phase CT scans obtained from patients with a variety of primary and metastatic liver tumors. The corresponding target ROIs were the vessels and tumors within the liver. This data set was selected due to the tubular and connected nature of hepatic vessels neighboring heterogeneous tumors. The data was acquired in the Memorial Sloan Kettering Cancer Center, New York, US.
- Spleen: The data set consists of 61 portal-venous phase CT scans from patients undergoing chemotherapy treatment for liver metastases. The corresponding target ROI was the spleen. This data set was selected due to the large variations in the field-of-view. The data was acquired in the Memorial Sloan Kettering Cancer Center, New York, US.
Table 1: Summary of the ten data sets of the Medical Segmentation Decathlon. Abbreviations used: mp-MRI: multiparametric magnetic resonance imaging; FLAIR: fluid-attenuated inversion recovery; T1w: T1-weighted image; T1 \w Gd: post-Gadolinium (Gd) contrast T1-weighted image; T2w: T2-weighted image; CT: computed tomography; PZ: peripheral zone; TZ: transition zone.
Two widely known semantic segmentation metrics were used to evaluate the submitted approaches, namely the Dice Similarity Coefficient (DSC) [17] and the Normalised Surface Distance (NSD) [18], both computed on 3D volumes. The implementation of both metrics can be downloaded in the form of a Jupyter notebook from the challenge website. The metrics DSC and NSD were chosen due to their popularity, rank stability [10], and smooth, well-understood and well-defined behavior when ROIs do not overlap. Having simple and rank-stable metrics also allows the statistical comparison between methods. It is important to note that the proposed metrics are neither task-specific nor task-optimal, and thus, they do not fulfill the necessary criteria for clinical algorithmic validation of each task, as discussed in Section 4.3.
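For reference, the DSC of a predicted binary mask P and a reference mask R is 2|P ∩ R| / (|P| + |R|). The snippet below is a minimal NumPy sketch of this formula on 3D volumes; it is not the challenge's official notebook implementation. The function name `dice_coefficient` is hypothetical, and returning 0 when both masks are empty is an assumption that mirrors the rule used in the ranking procedure (undefined values are set to 0).

```python
import numpy as np


def dice_coefficient(pred, ref):
    """Dice Similarity Coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        # Both masks empty: the DSC is undefined; 0 is used here by assumption.
        return 0.0
    return 2.0 * np.logical_and(pred, ref).sum() / denom


# Two 2x2x2 cubes inside a 4x4x4 volume, shifted by one voxel along the first axis.
pred = np.zeros((4, 4, 4), dtype=bool)
ref = np.zeros((4, 4, 4), dtype=bool)
pred[0:2, 0:2, 0:2] = True
ref[1:3, 0:2, 0:2] = True
dsc = dice_coefficient(pred, ref)  # 4 shared voxels, 8 voxels per mask
```

The NSD additionally requires extracting the surfaces of both masks and a task-specific distance tolerance, which is why it is not sketched here.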
A so-called significance score was determined for each algorithm $a$, separately for each task/target ROI $c_i$ and metric $m_j \in \{\mathrm{DSC}, \mathrm{NSD}\}$, and referred to as $s_{i,j}(a)$. Similarly to what was used to infer the ranking across the different BraTS tasks [19], the significance score was computed according to the following four-step process:
1. Performance assessment per case: Determine the performance $m_j(a_l, t_{ik})$ of all algorithms $a_l$, with $l \in \{1, \ldots, N_A\}$, for all test cases $t_{ik}$, with $k \in \{1, \ldots, N_i\}$, where $N_A$ is the number of competing algorithms and $N_i$ is the number of test cases in competition $c_i$. Set $m_j(a_l, t_{ik})$ to 0 if its value is undefined.
2. Statistical tests: Perform a Wilcoxon signed-rank pairwise statistical test between algorithms $(a_l, a_{l'})$, with values $m_j(a_l, t_{ik}) - m_j(a_{l'}, t_{ik})$, $\forall k \in \{1, \ldots, N_i\}$.
3. Significance scoring: $s_{i,j}(a_l)$ then equals the number of algorithms performing significantly worse than $a_l$, according to the statistical test (per comparison $\alpha = 0.05$, not adjusted for multiplicity).
4. Significance ranking: The ranking is computed from the scores $s_{i,j}(a_l)$, with the highest score (rank 1) corresponding to the best algorithm. Note that shared scores/ranks are possible.
If a task has multiple target ROIs, the ranking scheme is applied to each ROI separately, and the final ranking per task is computed as the mean significance rank.
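The four steps above can be sketched in a few lines of Python using `scipy.stats.wilcoxon` and `scipy.stats.rankdata`. This is an illustrative sketch for one metric and one target ROI, not the released validation code. The function names are hypothetical, and two details are assumptions: the test is run one-sided (`alternative="greater"`), and tied scores share the smallest rank (`method="min"`).

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon


def significance_scores(perf, alpha=0.05):
    """perf[l, k]: metric value of algorithm l on test case k (one metric, one ROI).

    Returns, for each algorithm, the number of algorithms that perform
    significantly worse according to a pairwise Wilcoxon signed-rank test.
    """
    # Step 1: undefined metric values (NaN) are set to 0.
    perf = np.nan_to_num(np.asarray(perf, dtype=float), nan=0.0)
    n_alg = perf.shape[0]
    scores = np.zeros(n_alg, dtype=int)
    for l in range(n_alg):
        for other in range(n_alg):
            if l == other:
                continue
            diff = perf[l] - perf[other]
            if not np.any(diff != 0):
                continue  # identical results: treated as not significant
            # Step 2: one-sided test on the per-case differences (assumption).
            p_value = wilcoxon(diff, alternative="greater").pvalue
            # Step 3: count the algorithms that are significantly worse.
            if p_value < alpha:
                scores[l] += 1
    return scores


def significance_ranks(scores):
    """Step 4: the highest score gets rank 1; tied scores share a rank."""
    return rankdata(-np.asarray(scores), method="min").astype(int)


def task_rank(roi_ranks):
    """Mean significance rank over the target ROIs of one task."""
    return np.mean(np.asarray(roi_ranks, dtype=float), axis=0)


# Three hypothetical algorithms on ten test cases, clearly ordered A > B > C.
base = np.linspace(0.50, 0.80, 10)
offsets = np.arange(1, 11) * 0.001  # keeps the pairwise differences distinct
perf = np.stack([base + 0.10 + offsets, base + 0.05, base - offsets])
scores = significance_scores(perf)
ranks = significance_ranks(scores)
```

Because the score only counts significant pairwise wins, it does not depend on the absolute scale of the metric, which makes ranks comparable across tasks with different difficulty and sample sizes.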
The final score for each algorithm over all tasks of the development phase (the seven development tasks) and over all tasks of the mystery phase (the three mystery tasks) was computed as the average of the respective task's significance ranks. The full validation algorithm was defined and released prior to the start of the challenge, and was available on the Decathlon website. To investigate ranking uncertainty and stability, bootstrapping methods were applied with 1,000 bootstrap samples as described in [10]. The statistical analysis was performed using the open-source R toolkit challengeR, version 1.0.1 [20], for analyzing and visualizing challenge results. The original rankings computed for the development and mystery phases were compared to the ranking lists based on the individual bootstrap samples. The correlation of pairwise rankings was determined via Kendall's τ [21], which provides values between −1 (for reverse ranking order) and 1 (for identical ranking order).
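The bootstrap stability analysis can be illustrated with a small NumPy sketch. It is not the challengeR implementation: `kendall_tau_a` is a plain tau-a without tie correction, and `rank_by_mean` is a deliberately simple stand-in for the significance ranking. Both names, and the default of resampling test cases with replacement, are assumptions made for illustration.

```python
import numpy as np


def kendall_tau_a(rank_a, rank_b):
    """Kendall's tau-a: 1 for identical order, -1 for reversed order."""
    rank_a = np.asarray(rank_a, dtype=float)
    rank_b = np.asarray(rank_b, dtype=float)
    n = len(rank_a)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie.
            total += np.sign(rank_a[i] - rank_a[j]) * np.sign(rank_b[i] - rank_b[j])
    return total / (n * (n - 1) / 2)


def rank_by_mean(perf):
    """Stand-in ranking (assumption): rank 1 for the highest mean metric value."""
    order = np.argsort(-perf.mean(axis=1), kind="stable")
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks


def bootstrap_taus(perf, rank_fn=rank_by_mean, n_boot=1000, seed=0):
    """Compare the original ranking with rankings on resampled test cases."""
    perf = np.asarray(perf, dtype=float)
    rng = np.random.default_rng(seed)
    original = rank_fn(perf)
    n_cases = perf.shape[1]
    taus = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_cases, size=n_cases)  # sample with replacement
        taus[b] = kendall_tau_a(original, rank_fn(perf[:, idx]))
    return taus


# Three hypothetical, well-separated algorithms on four test cases.
perf = np.array([
    [0.90, 0.92, 0.88, 0.91],
    [0.70, 0.72, 0.69, 0.71],
    [0.50, 0.52, 0.49, 0.51],
])
taus = bootstrap_taus(perf, n_boot=200)
```

The median and interquartile range of `taus` then summarize how stable a ranking is against resampling of the test cases.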

Monitoring of the challenge winner and algorithmic progress
To investigate our hypothesis that a method capable of performing well on multiple tasks will generalize its performance to an unseen task, and potentially even outperform a custom-designed task-specific solution, we monitored the winner of the challenge for a period of two years. Specifically, we reviewed the rank analysis and leaderboards presented in the corresponding article [8], as well as the leaderboards of challenges organized in 2020 on the grand-challenge.org website. We also reviewed further articles mentioning the new state-of-the-art method nnU-Net [22]. Finally, as the MSD challenge submission was reopened after the challenge event (denoted the "MSD Live Challenge"), we monitored submissions for new algorithmic approaches which achieve state-of-the-art performance, in order to probe new areas of scientific interest and development.

Challenge submissions
In total, 180 teams registered for the challenge, of which 31 submitted fully valid and complete results for the development phase. From these, a subset of 19 teams submitted final and valid results for the mystery phase. All methods that fulfilled the criteria to move to the mystery phase were based on convolutional neural networks (CNNs), with the U-Net [23] being the most frequently used base architecture, employed by more than half of the teams (64%). The most commonly used loss function was the DSC loss (29%), followed by the cross entropy loss (21%). Fig. 2 provides a complete list of both network architectures and loss functions used in the challenge. 61% of the teams used the adaptive moment estimation (Adam) optimizer [24], while stochastic gradient descent (SGD) [25] was used by 33% of the teams.
Table 2: Details of the participating teams' methods.

Method description of top three algorithms
In the following, the top three methods are briefly described. Table 2 provides an overview of all methods that were submitted for the mystery phase. For more details, see Appendix B.

nnU-Net
The key idea of nnU-Net's method was to use a fully automated dynamic adaptation of the segmentation pipeline, done independently for each task in the MSD, based on an analysis of the respective training data set. Image pre-processing, network topologies and post-processing were determined fully automatically and considered more important than the actual architecture [8]. nnU-Net was based on the U-Net architecture [23] with the following modifications: the use of leaky ReLU, instance normalization and strided convolutions for downsampling [8]. It further applied a combination of augmentation strategies, namely affine transformation, non-linear deformation, intensity transformation (similar to gamma correction), mirroring along all axes and random crop. The sum of the DSC and cross entropy loss was used, while utilizing the Adam optimizer. The method applied a purposely defined ensembling strategy in which four different architectures were used. The selection of the task-specific optimal combination was found automatically via cross-validation on the training set.
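To make the combined loss concrete, the snippet below sketches one common way of summing a soft DSC loss and a cross entropy loss on class probabilities. It is written in plain NumPy for readability and is not nnU-Net's implementation; the function names, the `1 - DSC` formulation, the smoothing constant `eps` and the equal weighting of the two terms are assumptions.

```python
import numpy as np


def soft_dice_loss(probs, onehot, eps=1e-5):
    """1 - mean soft DSC over classes; inputs have shape (classes, D, H, W)."""
    spatial_axes = tuple(range(1, probs.ndim))
    intersection = (probs * onehot).sum(axis=spatial_axes)
    denominator = probs.sum(axis=spatial_axes) + onehot.sum(axis=spatial_axes)
    dice_per_class = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice_per_class.mean()


def cross_entropy_loss(probs, onehot, eps=1e-12):
    """Mean voxel-wise cross entropy between probabilities and one-hot labels."""
    log_probs = np.log(np.clip(probs, eps, 1.0))
    return -(onehot * log_probs).sum(axis=0).mean()


def dice_ce_loss(probs, onehot):
    """Unweighted sum of the two terms (the weighting is an assumption)."""
    return soft_dice_loss(probs, onehot) + cross_entropy_loss(probs, onehot)


# Two-class toy volume: a 2x2x2 foreground cube in a 4x4x4 background.
labels = np.zeros((4, 4, 4), dtype=int)
labels[1:3, 1:3, 1:3] = 1
onehot = np.stack([labels == 0, labels == 1]).astype(float)

perfect = dice_ce_loss(onehot, onehot)                    # prediction equals the labels
uniform = dice_ce_loss(np.full_like(onehot, 0.5), onehot)  # uninformative prediction
```

Combining the two terms is often motivated by their complementary behavior: the DSC term is less sensitive to class imbalance, while the cross entropy term provides a per-voxel training signal.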

NVDLMED
The key idea of NVDLMED's method was to use a fully-supervised uncertainty-aware multi-view co-training strategy [26]. They achieved robustness and generalization by initializing the model from 2D pre-trained models and using three views of networks to gain more 3D information through the multi-view co-training process. They further used a resampling strategy to cope with the differences among the ten tasks. The NVDLMED team utilized a 3D version of the ResNet [27] with anisotropic 3D kernels [26]. The team further applied a combination of augmentation strategies, namely affine transformation, geometric left-right flip and random crop. The DSC loss and the SGD optimizer were employed. NVDLMED ensembled three models, each trained on a different view (coronal, sagittal and axial).

K.A.V.athlon
The key idea of K.A.V.athlon's method was a generalization strategy in the spirit of AutoML [28]. The process was designed to train and predict automatically using given image data and description without any parameter change or intervention by a human. K.A.V.athlon's method was based on a combination of the V-Net and U-Net architectures with the addition of a Squeeze-and-Excitation (SE) block and a residual block. The team further applied different types of augmentation, namely affine transformation, noise application, geometric left-right flip, random crop, and blurring. The DSC loss with a thresholded ReLU (threshold 0.5) and the Adam optimizer were employed. No ensembling strategy was used.

Individual performances and rankings
The DSC values for all participants for the development phase and the mystery phase are provided as box-plots in Figs. 3 and 4, respectively. For tasks with multiple target ROIs (e.g., edema, non-enhancing tumor and enhancing tumor segmentation for the brain data set), the box-plots were color-coded according to the ROI. The distribution of the NSD metric values was comparable to that of the DSC values.
It can be seen that the performance of the algorithms as well as their robustness depends crucially on the task and target ROI. The median of the mean DSC computed considering all test cases of a single task over all participants ranged from 0.16 (colon cancer segmentation (the mystery phase), cf.
The rankings for the challenge are shown in Table 3. The winning method (nnU-Net) was extremely robust with respect to the different tasks and target regions for both phases (cf. Figs. 3 and 4). K.A.V.athlon and NVDLMED, at ranks 2 and 3, switched places between the development and mystery phases. Fig. 5 further shows the ranks of all algorithms for all thirteen target regions of the development phase (red) and all four target regions of the mystery phase in the form of a box-plot. Many teams show a large variation in their ranks across target ROIs. The lowest rank difference of three ranks was achieved by team nnU-Net (minimum rank: 1, maximum rank: 4; the development phase) and the largest rank difference of sixteen ranks was obtained by team Whale (minimum rank: 2, maximum rank: 18; the development phase).
To investigate ranking robustness for different ranking methods, line plots [20] are provided in Appendix C for all individual target regions. Furthermore, a comparison of the achieved ranks of algorithms for 1,000 bootstrapped samples is provided in the form of a stacked frequency plot [20] in Fig. D.16. For each participant, the frequency of the achieved ranks is provided for every task individually. It can be easily seen from both uncertainty analyses that team nnU-Net implemented an extremely successful method that was at rank 1 for nearly every task and bootstrap sample.
The agreement of the original rankings computed for the development phase and the mystery phase and the ranking lists based on the individual bootstrap samples was determined via Kendall's τ. The median (interquartile range (IQR)) Kendall's τ was 0.94 (0.91, 0.95) for the colon task, 0.99 (0.98, 0.99) for the hepatic vessel task and 0.92 (0.89, 0.94) for the spleen task. This shows that the rankings for the mystery phase were stable against small perturbations.

Impact of the challenge winner
In the two years after the challenge, the winning algorithm, nnU-Net (sometimes with minor modifications), competed in a total of 53 further segmentation tasks. The method won 33 out of 53 tasks with a median rank of 1 (interquartile range (IQR) of (1;2)) in the 53 tasks [8], for example being the winning method of the famous BraTS challenge in 2020. This confirmed our hypothesis that a method capable of performing well on multiple tasks will generalize well to a previously unseen task and potentially outperform a custom-designed solution. The method further became the new state-of-the-art method and was used in several segmentation challenges by other researchers. For instance, eight nnU-Net derivatives were ranked in the top 15 algorithms of the 2019 Kidney and Kidney Tumor Segmentation Challenge (KiTS) [8], the MICCAI challenge with the most participants in the year 2019. Nine out of the top ten algorithms in the COVID-19 Lung CT Lesion Segmentation Challenge 2020 (COVID-19-20) built their solutions on top of nnU-Net (98 participants in total). As demonstrated in [22], nine out of ten challenge winners in 2020 built solutions on top of nnU-Net.

Discussion
We organized the first biomedical image segmentation challenge in which algorithms competed in ten different disciplines. We showed that it is indeed possible for one single algorithm to generalize over various different applications without human-based adjustments. This was further demonstrated by monitoring the winning method for two years to show the continuation of the generalizability to other segmentation tasks.
In the following sections, we will discuss specific aspects of the MSD challenge, namely the challenge infrastructure, data set, assessment method and outcome.

Challenge infrastructure
The participating teams were asked to submit their results in the form of a compressed archive to the grand-challenge.org platform. For the development phase, a fully automated validation script was run for each submission and the leaderboard was updated accordingly. Each team was allowed to submit one solution per day. In contrast, for the mystery phase, only one valid submission per algorithm could be submitted to prevent overfitting.
Despite the above-mentioned policies, there were attempts to create multiple accounts so that a team could test their method beyond the allowed limit, a problem which was detected through the similarity of results between certain accounts. Teams who were found to be evading the rules were disqualified. Identity verification and fraud detection tooling has now been added to grand-challenge.org to help organisers mitigate this problem in the future.
Possibly, a better way of controlling overfitting, or possible forms of cheating (e.g., manual labeling of results [29]), would have been to containerize the algorithms using Docker containers and for inference to be run by the organisers. This approach was unfortunately not possible at the time of the organization of MSD due to the lack of computational resources to run inference on all data for all participants. Thanks to a partnership with Amazon Web Services (AWS), the grand-challenge.org platform now offers the possibility to upload Docker container images that can participate in challenges and be made available to researchers for processing new scans. With the recent announcement of a partnership between NVIDIA and the MICCAI 2020 and 2021 conferences, and the increased standardization of containers, such a solution should be adopted for further iterations of the MSD challenge.

Challenge data set
In the MSD, we presented a unique data set, including ten heterogeneous tasks from various body parts and regions-of-interest, numerous modalities and challenging characteristics. MSD is the largest and most comprehensive medical image segmentation data set available to date. The MSD data set was downloaded more than 2,000 times in its first year alone, via the main challenge website. The data set has recently been accepted into the AWS Open Data registry, allowing for unlimited download and availability.
The data set is also publicly available under a Creative Commons license (CC-BY-SA 4.0), allowing broad (including commercial) use. Due to data set heterogeneity, and usage in generalizability and domain adaptation research, it is likely to be very valuable for the biomedical image analysis community in the long term.
Regarding limitations, the MSD data set was gathered from retrospectively acquired and labeled data from many different sources, resulting in heterogeneous imaging protocols and differences in annotation procedures, with annotations limited to a single human rater. While the introduction of additional annotators would have benefited the challenge by allowing inter-rater reliability estimates and possibly improving the reliability of annotations, this was not possible due to restricted resources and the scale of the data. As shown in [30], several annotators are often necessary to overcome issues related to inter-observer variability. Furthermore, the data set only consists of radiological data; we can therefore only draw conclusions for this application. Other areas like dermatology, pathology or ophthalmology were not covered.

Challenge assessment
Two common segmentation metrics have been used to evaluate the participant's methods, namely the DSC, an overlap measure, and the NSD, a distance-based metric.The choice of the right metrics was heavily discussed, as it is extremely important for the challenge outcome and interpretation.Some metrics are more suitable for specific clinical use-cases than others [31].For instance, the DSC metric is a good proxy for comparing large structures but should not be used intensively for very small objects, as singlepixel differences may already lead to substantial changes in the metric scores.However, to ensure that the results are comparable across all ten tasks, a decision was taken to focus on the two above-mentioned metrics, rather than using clinically-driven task-specific metrics.
Comparability was another issue for the ranking, as the number of samples varied heavily across tasks and target ROIs, which made a statistical comparison difficult. We therefore decided to use a ranking approach similar to the evaluation of the popular BraTS challenge,¹⁴ which was based on pairwise Wilcoxon signed-rank statistical tests between algorithms. The rank of each algorithm was determined (independently per task and ROI) by counting the number of competing algorithms with a significantly worse performance. This strategy avoided the need for similar sample sizes across tasks and reduced the need for task-specific weighting and score normalisation.
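The significance ranking described above can be sketched as follows for a single target ROI. This is an illustration under stated assumptions, not the challenge's statistical code: the Wilcoxon signed-rank test is approximated here by a one-sided normal approximation without tie or continuity correction, the significance level of 0.05 is assumed, and the function names and DSC values are hypothetical. In practice a statistics library would be preferable to the hand-written test.

```python
import math


def wilcoxon_greater_p(x, y):
    """Approximate one-sided p-value that paired scores x exceed paired scores y.

    Normal approximation of the Wilcoxon signed-rank statistic (assumption:
    no tie or continuity correction; zero differences are dropped).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal


def significance_counts(scores, alpha=0.05):
    """For each algorithm, count the competitors it significantly outperforms."""
    return {
        a: sum(
            1
            for b in scores
            if b != a and wilcoxon_greater_p(scores[a], scores[b]) < alpha
        )
        for a in scores
    }


# Hypothetical per-case DSC values of three algorithms on the same ten test cases.
base = [0.60, 0.65, 0.70, 0.72, 0.75, 0.78, 0.80, 0.82, 0.85, 0.88]
scores = {
    "A": [v + 0.10 for v in base],
    "B": [v + 0.05 for v in base],
    "C": base,
}

counts = significance_counts(scores)
# A higher count means a better rank for this ROI.
ranking = sorted(counts, key=lambda name: -counts[name])
print(counts, ranking)
```

Because each algorithm is compared only on the cases of one ROI at a time, the resulting counts do not depend on how many cases other tasks contain, which is the property the text relies on.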
Identifying an appropriate ranking scheme is a non-trivial challenge. It is important to note that each task of the MSD data set comprised one to three different target ROIs, introducing a hierarchical structure within the data set. Starting from a significance ranking for each target ROI, we considered two different aggregation schemes: (1) averaging the significance ranks across all target ROIs; (2) averaging the significance ranks per task (data set) and then averaging those per-task ranks for the final rank. The drawback of (1) is that a bias between tasks might be introduced, as tasks with multiple target ROIs (e.g. the brain task with three target ROIs) would be over-weighted. We therefore chose ranking scheme (2) to avoid this issue. This decision was made prior to the start of the challenge, as per the challenge statistical analysis protocol. A post-challenge analysis was performed to test this decision, and it found that the overall ranking structure remained unchanged: the first three ranks were preserved, and only minor changes (1 to 2 ranks) were seen in a couple of cases in the middle and at the end of the rank list. As shown in Appendix C, changing the ranking scheme will typically lead to different final rankings, but we observed the first three ranks to be robust across various ranking variations. More complex ranking schemes were discussed among the organisers, such as modeling the variations across tasks and target ROIs with a linear mixed model [32]. As explainability and a clear articulation of the ranking procedure were found to be important, it was ultimately decided to use significance ranking. The performance of algorithms varied dramatically across the different tasks, as shown in Figs. 3, 4 and Appendix C.
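The difference between the two aggregation schemes considered in this paragraph can be made concrete with a small sketch. The rank values, the generic ROI labels for the brain task and the function names are hypothetical; only the structure (per-ROI ranks nested within tasks) follows the text.

```python
from statistics import mean

# Hypothetical significance ranks of one algorithm, grouped by task and target ROI.
ranks = {
    "brain": {"roi_1": 1, "roi_2": 1, "roi_3": 1},
    "heart": {"left atrium": 4},
    "hippocampus": {"anterior": 2, "posterior": 2},
}


def aggregate_over_rois(ranks):
    """Scheme (1): average over all target ROIs, ignoring task membership."""
    return mean(r for rois in ranks.values() for r in rois.values())


def aggregate_per_task(ranks):
    """Scheme (2): average within each task first, then across tasks."""
    return mean(mean(rois.values()) for rois in ranks.values())


scheme_1 = aggregate_over_rois(ranks)  # (1 + 1 + 1 + 4 + 2 + 2) / 6
scheme_2 = aggregate_per_task(ranks)   # (1 + 4 + 2) / 3

print(f"scheme (1): {scheme_1:.2f}, scheme (2): {scheme_2:.2f}")
```

In this toy example the three brain ROIs pull the scheme (1) average towards the brain result, whereas scheme (2) gives each task the same weight regardless of its number of ROIs.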
For the development phase, the median algorithmic performance, defined as the median of the mean DSC, varied widely across tasks, with the lowest median obtained for the tumor mass segmentation of the pancreas data set (0.21, Table C.9) and the highest for the liver segmentation (0.94, Table C.7). The performance drop was much more modest for the best performing method, nnU-Net (0.52 and 0.93 median DSC for the pancreas mass and liver ROI, respectively), demonstrating that methods have varying degrees of learning resiliency to the challenges posed by each task. The largest difference within one task was also obtained for the pancreas data set, with a median of the mean DSC of 0.69 for the pancreas ROI and 0.21 for the pancreas tumor mass, which is likely explained by the very small relative intensity difference between the pancreas and its tumor mass. In the mystery phase, colon cancer segmentation received the lowest median DSC (0.16, Table C.11), and the spleen segmentation the highest median DSC (0.94, Table C.13). Similarly to the development phase, a much smaller drop in performance (0.56 and 0.96 for colon and spleen, respectively) was observed for the top-ranking method. Most of the observed task-specific performances reflect the natural difficulty and expected inter-rater variability of the tasks: liver and spleen are large organs that are easy to detect and outline [33], whereas pancreas and colon cancers are much harder to segment, as annotation experts themselves often do not agree on the correct outlines [34,35].
We also observed that the challenging characteristics of each task (presented in Fig. 1) had some non-trivial effect on algorithmic performance, a problem which was exacerbated in lower-ranking methods. For example, some methods struggled to segment regions such as the lung cancer mass, pancreas mass, and colon cancer primaries, achieving a mean DSC below 0.1. These regions, characterized by small, non-obvious and heterogeneous masses, appear to represent a particularly challenging axis of algorithmic learning complexity. The number of subjects in the training data set (only 30 subjects for the heart task), the size and resolution of the images (large liver images and small hippocampus images), and complex region shapes (e.g. brain tumours) were not found to introduce significant inter-team performance differences.
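The summary statistic used in this discussion, the median of the mean DSC, is computed in two steps: the per-case DSC values of each team are first averaged, and the median is then taken over the resulting per-team means. A minimal sketch with hypothetical team names and values:

```python
from statistics import mean, median

# Hypothetical per-case DSC values of three teams on one target ROI.
per_case_dsc = {
    "team_a": [0.50, 0.60, 0.70],
    "team_b": [0.10, 0.20, 0.30],
    "team_c": [0.00, 0.05, 0.10],
}

# Step 1: mean DSC per team over all test cases.
mean_dsc = {team: mean(values) for team, values in per_case_dsc.items()}

# Step 2: median over the per-team means.
median_of_mean = median(mean_dsc.values())

print(f"median of the mean DSC: {median_of_mean:.2f}")
```

The statistic therefore describes how a typical participating method performed on a region, rather than how the best method performed.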

Methods
As summarised in Fig. D.16, nnU-Net was ranked first in both the development and mystery phases. Under the proposed definition of a "generalizable learner", the winning method was found to be the most generalizable approach across all MSD tasks given the comparison methodology, with a significant performance margin. The K.A.V.athlon and NVDLMED teams were ranked second and third during the development phase, respectively; their ranks were swapped (third and second, respectively) during the mystery phase. We observed small changes in team rankings between the development and mystery phases for top-ranking teams; within the top 8 teams, no team changed its ranking by more than 2 positions from the development to the mystery phase. This correlation between development and mystery rankings suggests a limited amount of methodological overfitting to the development phase, and that the proposed ranking approach is a good surrogate of expected task performance.
We observed some algorithmic commonalities between top methods, such as the use of ensembles, intensity and spatial normalization, augmentation, the use of Dice loss, the use of Adam as an optimiser, and some degree of post-processing (e.g. region removal). While none of these findings are surprising, they provide evidence towards a reasonable choice of initial parameters for new methodological developments. We also observed that the most commonly applied architecture across participants was the U-Net, used by 64% of teams. Some evidence was found that architectural adjustments to the baseline U-Net approach are less important than other relevant algorithmic design decisions, such as data augmentation and data set split/cross-validation methodology, as demonstrated by the winning methodology. Note that similar findings, albeit in a different context and applied to ResNet, have recently been observed [36].

The years after the challenge
Following the challenge event at MICCAI 2018, the competition was opened again for rolling submissions. This time, participants were asked to submit results for all ten data sets¹⁵ in a single phase. In total, 742 users signed up. To restrict the exploitation of the submission system for other purposes, only submissions with per-task metric values different from zero were accepted as valid, resulting in only 17 complete and valid submissions. In order to avoid overfitting but still allow for some degree of methodological development, each team was allowed to submit their results 15 times. The winner of the 2018 MSD challenge (nnU-Net, denoted as Isensee on the live challenge) submitted to the live challenge leaderboard on the 6th of December 2019, and held the first position for almost one year, until the 30th of October 2020.
Since teams were allowed to tune their method on all ten data sets for the live challenge, the minimum value of the data set specific median DSC improved quite substantially from the 2018 MSD challenge, as shown in Fig. E.17. The two hardest tasks during the 2018 MSD challenge were the segmentation of the tumor inside the pancreas, with an overall median of the mean DSC of 0.21 over all participants (0.37 for the top five teams), and the segmentation of the colon cancer primaries, with an overall median of the mean DSC of 0.16 over all participants (0.41 for the top five teams). The worst task for the rolling challenge was the segmentation of the non-enhancing tumor inside the brain, with a median DSC of 0.47.
At the other end of the spectrum was the spleen segmentation task, where the median task DSC over all participants was 0.94 during the 2018 challenge, and improved to 0.97 for the rolling challenge. These observations suggest that the ability of multiple methods to solve the task has improved, with methods performing slightly better on harder tasks and significantly better on easy tasks.
In 2019 and 2020, the rolling challenge resulted in three methods that superseded the winning results of the 2018 MSD challenge. Within these two follow-up years, two main trends were observed: the first is the continuous and gradual improvement of "well performing" algorithms, such as the heuristics and task fingerprinting of the nnU-Net method; the second is the rise of Neural Architecture Search (NAS) [37] among the top teams. More specifically, both the third-ranked method and the current leader [38] (as of April 2021) of the rolling challenge used this approach. NAS optimizes the network architecture itself for each task in a fully automated manner. Such an approach uses a network-configuration fitness function that is optimised independently for each task, thus providing an empirical approach to network architectural optimisation. When compared to heuristic methods (e.g. nnU-Net), NAS appears to result in improved algorithmic performance at the expense of increased computational cost.

Conclusion
Machine-learning based semantic segmentation algorithms are becoming increasingly general purpose and accurate, but have historically required significant field-specific expertise to use. The MSD challenge was set up to investigate how accurate fully-automated image segmentation learning methods can be on a plethora of tasks with different types of task complexity. Results from the MSD challenge have demonstrated that fully automated methods can now achieve state-of-the-art performance without the need for manual parameter optimisation, even when applied to previously unseen tasks. A central hypothesis of the MSD challenge, namely that an algorithm which works well and automatically on several tasks should also work well on other unseen tasks, has been validated among the challenge participants and across tasks. This hypothesis was further corroborated by monitoring the generalizability of the winning method in the two years following the challenge, where we found that nnU-Net achieved state-of-the-art performance on many tasks, including against task-optimized networks. While it is important to note that many classic semantic segmentation problems (e.g. domain shift and label accuracy) remain, and that methodological progress (e.g. NAS and better heuristics) will continue pushing the boundaries of algorithmic performance and generalizability, the MSD challenge has demonstrated that the training of accurate semantic segmentation networks can now be fully automated. This commoditization of semantic segmentation methods has the potential to allow non-machine-learning experts (e.g. clinicians, medical physicists, and applied scientists) to better contribute to, and possibly even independently develop, these techniques.

Acknowledgements
Part of this work was funded by the Helmholtz Imaging Platform (HIP), a platform of the Helmholtz Incubator on Information and Data Science. We would like to thank Minu D. Tizabi for proof-reading the manuscript. This research was supported by the Bavarian State Ministry of Science and the Arts and coordinated by the Bavarian Research Institute for Digital Transformation. Team CerebriuDIKU gratefully acknowledges support from the Independent Research Fund Denmark through the project U-Sleep (project number 9131-00099B). Ronald M. Summers is supported by the Intramural Research Program of the National Institutes of Health Clinical Center. Research reported in this publication was partly supported by the National Institutes of Health (NIH) under award numbers NCI:U01CA242871, NCI:U24CA189523, NINDS:R01NS042645. The content of this publication is solely the responsibility of the authors and does not represent the official views of the NIH. The method presented by BCVUniandes was made in collaboration with Silvana Castillo, from Universidad de los Andes. James Meakin received grant funding from Amazon Web Services.
[...] and the runner up of the mystery phase all received an NVIDIA Titan V prize. Ranking and results on each data set for each method were announced publicly at the challenge event at MICCAI, and the post-challenge leaderboard is also publicly available. Finally, we asked the participating teams to fill out a form with details about their methods. All team members replying to the survey were listed as co-authors of the paper. Participants were allowed to publish their methods independently from the challenge paper.
All teams were asked to submit the results of the development phase as a compressed archive to the grand-challenge.org platform. A fully automated validation script was run for each submission immediately after submission, and results were published on the development phase leaderboard. Each team was allowed one submission per day to partially mitigate overfitting. The last submission of each team by the development phase deadline was used for validation. Teams were then asked to submit details of their methods prior to being given access to the mystery phase data. The submission deadline for the mystery phase results, also in compressed archive form, was set to two weeks after the mystery phase data release. Only a single valid submission was accepted for each mystery phase participant, and results for the mystery phase were only revealed at the public challenge event during MICCAI 2018, in Granada, Spain.
Implementations of the metrics used in the challenge, namely the DSC and NSD, were provided as a Python notebook [39] by the challenge organisers, prior to the challenge deadlines. The algorithms for statistical validation were also provided.
As some of the participating teams were working under intellectual property restrictions, it was decided that public code availability was not mandatory, so as to maximise participation; participants were, however, encouraged to make their code available to the public.
The challenge was organised without specific funding, and mostly via in-kind time contributions of its organisers. The challenge was sponsored by NVIDIA, who provided the graphics processing unit (GPU) cards as awards (approximate value of $7,500), Google DeepMind, who provided an in-kind implementation of the NSD metric, and by RSIP Vision, who provided media support and challenge dissemination. None of the sponsors had any influence on the organisation of the challenge, nor were they given any form of privileged access to either the data or any other type of information.
Only two of the organisers, both from KCL, had access to all test cases; namely M. Jorge Cardoso and Michela Antonelli. The KCL organisers committed not to participate in the challenge. Only two copies of the full test data currently exist (beyond the data providers of each independent task), one at KCL's servers as a backup, and one on the grand-challenge.org validation server.

Figure 1: Overview of the ten different tasks of the Medical Segmentation Decathlon (MSD). The challenge comprised different target regions, modalities and challenging characteristics and was separated into seven known tasks (blue; the development phase) and three mystery tasks (gray; the mystery phase). Used abbreviations: MRI: magnetic resonance imaging, mp-MRI: multiparametric magnetic resonance imaging, CT: computed tomography.

Figure 2: Base network architectures (left) and loss functions (right) used by the participants of the 2018 Decathlon challenge who provided full algorithmic information.

Figure 3: Dot- and box-plots of the DSC values of all participating algorithms for the seven tasks of the development phase, color-coded by the target regions. Box-plots represent descriptive statistics over all test cases. The median value is shown by the black horizontal line within the box, the first and third quartiles as the lower and upper border of the box, respectively, and the 1.5 interquartile range by the vertical black lines. Outliers are shown as black circles. The raw DSC values are provided as gray circles. Used abbreviations: PZ: peripheral zone, TZ: transition zone.

Figure 4: Dot- and box-plots of the DSC values of all participating algorithms for the three tasks of the mystery phase, color-coded by the target regions. Box-plots represent descriptive statistics over all test cases. The median value is shown by the black horizontal line within the box, the first and third quartiles as the lower and upper border of the box, respectively, and the 1.5 interquartile range by the vertical black lines. Outliers are shown as black circles. The raw DSC values are provided as gray circles.

Figure 5: Box-plots of ranks for all participating algorithms over all seven tasks and thirteen target regions of the development phase (red) and all three tasks and four target regions of the mystery phase (blue). The median value is shown by the black vertical line within the box, the first and third quartiles as the lower and upper border of the box, respectively, and the 1.5 interquartile range by the horizontal black lines. Individual ranks are shown as gray circles.
[...] Table C.11) to 0.94 (liver (the development phase), cf. Table C.7, and spleen segmentations (the mystery phase), cf. Table C.13). The full list of values is provided in Appendix C.

Table 3: Rankings for the development phase and the mystery phase, median and interquartile range (IQR) of the Dice Similarity Coefficient (DSC) values of all teams. The ranking was computed as described in Section 2.4.1.

Table C.5: Mean Dice Similarity Coefficient (DSC) values for all participating teams for all tasks (left atrium) of the heart data set (the development phase).
Table C.6: Mean Dice Similarity Coefficient (DSC) values for all participating teams for all tasks (anterior, posterior) of the hippocampus data set (the development phase).
Table C.7: Mean Dice Similarity Coefficient (DSC) values for all participating teams for all tasks (liver, cancer) of the liver data set (the development phase).
Table C.8: Mean Dice Similarity Coefficient (DSC) values for all participating teams for all tasks (cancer) of the lung data set (the development phase).
Table C.9: Mean Dice Similarity Coefficient (DSC) values for all participating teams for all tasks (pancreas, mass) of the pancreas data set (the development phase).
Table C.10: Mean Dice Similarity Coefficient (DSC) values for all participating teams for all tasks (PZ, TZ) of the prostate data set (the development phase).
Table C.12: Mean Dice Similarity Coefficient (DSC) values for all participating teams for all tasks (vessel, tumor) of the hepatic vessel data set (the mystery phase).