Nucleus segmentation across imaging experiments: the 2018 Data Science Bowl

Segmenting the nuclei of cells in microscopy images is often the first step in the quantitative analysis of imaging data for biological and biomedical applications. Many bioimage analysis tools can segment nuclei in images but need to be selected and configured for every experiment. The 2018 Data Science Bowl attracted 3,891 teams worldwide to make the first attempt to build a segmentation method that could be applied to any two-dimensional light microscopy image of stained nuclei across experiments, with no human interaction. Top participants in the challenge succeeded in this task, developing deep-learning-based models that identified cell nuclei across many image types and experimental conditions without the need to manually adjust segmentation parameters. This represents an important step toward configuration-free bioimage analysis software tools.

M icroscopy is a central technology of biomedical research, mentioned in nearly one million PubMed-indexed scientific papers to date (PubMed search 'microscopy OR microscope OR microscopic' , accessed 7 October 2018). Increasingly, the images produced are analyzed quantitatively [1][2][3] . Various microscopy techniques allow capturing structural and functional properties of biological model systems, including cultured cells, tissues and organoids. As microscopy makes progress to capture such systems in greater detail and throughput and as the development of novel assays reveals more complex properties of living organisms, the need for robust and easy to use microscopy image analysis methods becomes critical to answer a wider variety of biological questions.
Many image analysis workflows involve the identification (segmentation) of cell nuclei as a first step to extract meaningful biological signals. Research studies may involve counting cells, tracking moving populations, localizing proteins and classifying phenotypes or profiling treatments; in all of these and more, the nucleus is a reliable compartment of reference for identifying single cells in microscopy images.
However, selecting strategies to segment nuclei is not an easy task for nonexpert users in regular biology labs. Most existing user-friendly bioimage analysis tools 4-6 identify nuclei using classical segmentation algorithms such as thresholding 7 , watershed 8 or active contours 9 . These need to be configured for each study to account for different microscopy modalities, scales and experimental conditions, often requiring great expertise to select the algorithm that suits the problem and to adjust its parameters. For advanced users, the choice can also be daunting, considering that hundreds of papers are published every year presenting new methods for cell and nucleus segmentation. And even under controlled experimental conditions, no single parameter choice can segment all images correctly, because classical algorithms can fail to adapt to the heterogeneity of biological samples or can be sensitive to technical artifacts [10][11][12] . Altogether, this situation slows down the pace of research and hinders biological laboratories from adopting imaging technologies owing to the time and expertise required.
Here, we explore the idea of creating a segmentation model that can identify the nucleus of cells automatically in a diverse set of stained two-dimensional (2D) light microscopy images without human interaction. Such a model could power future robotic microscopes to facilitate a wide range of biological applications, by finding and counting nuclei in images in real time across cell types, staining types, magnification and in spite of experimental variations. A single trained model effective across instruments, stains and cell types would improve the experience of biologists and speed their research. Classical algorithms for identifying nuclei in microscopy images follow very similar computational strategies, with varying parameters or configurations (Methods, Supplementary Note 3). Our goal was to investigate whether any modern solutions, such as large capacity deep-learning models, could provide a single unifying solution without requiring manual configuration.
Biological image segmentation on the basis of machine learning already exists in user-friendly software, such as Ilastik 13 and ImageJ 14 , and recent studies confirm the usefulness of this approach 15 . Deep learning has shown great potential to solve difficult problems in cellular image analysis 16 , and neural network models for image segmentation also exist [17][18][19][20] . However, existing solutions require users to create models that are customized for each experiment, taking time to prepare annotations, train models and/or configure algorithms. We instead aimed to create a generic, reusable model that is trained once and can be shared and run on a variety of fluorescence microscopy experiments without additional user intervention. We envisioned software tools for nucleus segmentation that can be used with the same ease and robustness as face detectors in natural images; they just work, without users having to train models or to configure settings and under varying lighting and scenery conditions. This paper reports the results of the 2018 Data Science Bowl, which challenged participants to segment nuclei in a variety of 2D light microscopy images without the need for any manual interaction or adjustment. The competition provided participants with a training set of images comprising problems (images containing nuclei) along with the corresponding solutions (segmentation masks for the nuclei) and test sets of images for which they had to generate the segmentations using a two-stage evaluation protocol. Importantly, the holdout set was comprised of 15 diverse image sets from biological experiments that were not present in the training set, to realistically evaluate how well the algorithms perform across different experimental conditions. This is the first time that nucleus segmentation methods have been challenged to generalize by operating blindly on unseen biological experiments without user interaction or additional annotation/training.

Results
The 2018 Data Science Bowl. For the competition, we created a dataset with 37,333 manually annotated nuclei in 841 2D images from more than 30 experiments across different samples, cell lines, microscopy instruments, imaging conditions, operators, research facilities and staining protocols. The annotations were manually made by a team of expert biologists that followed a collaborative workflow (Methods), and we call these 'target masks' instead of ground truth, given that each annotation was created by a single expert and reviewed by the rest. Researchers around the world freely contributed the images and agreed to a Creative Commons 0 license (public domain); our team's annotations are similarly freely available. This dataset is publicly accessible in the Broad Bioimage Benchmark Collection with accession number BBBC038.
The challenge was run for a total of 3 months in which participants had access to the training set (with target masks) and the first-stage test sets (with target masks withheld). The evaluation of competitors' predictions on the withheld masks of the first-stage test set powered the leaderboard, and in the final week of the competition, a second-stage test set (with target masks withheld) was released to determine the challenge winners. This second-stage evaluation aimed to assess the robustness of models to segment new images from new experiments and to also evaluate the ability of the models to run completely autonomous segmentation without user interaction. To deter manual intervention on the images, the second-stage holdout set had 3,200 images with approximately 100,000 single nuclei that had to be segmented in <7 d, only a small fraction of which had accompanying manually defined target masks and were actually used for scoring (Methods). Participants uploaded their segmentation masks to the Kaggle server (https://www.kaggle. com/c/data-science-bowl-2018), which validated against the real masks hidden from the public, using a quantitative score to rank participants (Methods).
A total of 17,929 competitors signed up for the competition in 3,891 teams during the first stage and 739 teams made successful entries during the second stage to compete for US$170,000 in cash and prizes. Overall, participants submitted a total of 68,017 submissions throughout the duration of the competition. This contest fostered the development of new methods with contributions of data scientists around the world that usually do not work on microscopy images, bringing state-of-the-art innovations. The top three participants, among many other competitors in the challenge, made their solutions open source, which will facilitate their adoption and extension by the wider scientific community.
Top solutions improve usability and accuracy of nucleus segmentation. In the second-stage evaluation, competitors were not permitted to use any nonautomated, image-specific configuration. As a result, the winning models yielded a major improvement in usability compared to current practices for microscopy image segmentation, which need either algorithm selection and tuning (for classical methods) or manual annotations (for machine-learning methods) for different image sets.
When compared to a reference segmentation obtained with classical image processing techniques adapted for the holdout image sets using minimal user intervention (Methods), we found that 85 candidate algorithms from the challenge yielded higher accuracy (Fig. 1a). In particular, the top three solutions outperformed these minimally tuned classical algorithms by a large margin, producing better segmentations over all coverage thresholds (Fig. 1b). We evaluated the accuracy of the segmentations using metrics common in computer vision research for object segmentation (Methods).
Importantly, the segmentations obtained with classical image processing algorithms, unlike the methods in the challenge, required manual configuration; there exists no classical algorithm that could claim to produce reasonable results on 15 diverse image sets with no user intervention. In contrast, a novice with no prior exposure to bioimage analysis performed substantially worse than the expert and the top-scoring deep-learning models (Fig. 1c). The novice and expert invested 5 h and 3 h of work to achieve their corresponding segmentation results. Embedded in a user-friendly interface, as has already been prototyped in the NucleAIzer system 21 , the top models would require no configuration time.
The classical methods were tested by first organizing images into five groups using visual inspection, then analysis pipelines were created in the open source software, CellProfiler 5 (Methods). These pipelines are representative of widely adopted techniques surveyed recently in the literature of microscopy image segmentation 12,22,23 , though they are likely suboptimal solutions because the techniques were not fully optimized for each of the 15 image sets independently to prevent overfitting and in keeping with the no-or low-configuration mission of the data challenge.
In addition to reference segmentations using classical techniques, we also evaluated the performance of the top-scoring models to deep-learning models trained separately for each type of images. We chose U-Net 17 , a popular deep-learning-based method to solve microscopy image segmentation problems, including nucleus segmentation 20 . The learning capacity of a single regular U-Net was not sufficient to capture the experimental variation in the challenge (such solutions entered the competition); therefore, we trained five models to reduce variance, as in the case of classical algorithms, and applied image pre and post-processing routines specific to each group (Supplementary Note 4). The results show that even spending ~20 h of hands-on time (development and training time not included in our estimates), these models did not reach competitive performance compared to the top solution (Fig. 1c). Several factors contributed to this result: limited learning capacity of the evaluated U-Net relative to the top models, reduced number of training examples in the five groups after splitting and experimental variability of the test sets.
Finally, we asked an additional annotator to create target masks for a subset of images in the test set and we observed interobserver variability (Supplementary Fig. 3) a well-known problem 24 . Interestingly, the top performing model agreed on boundary annotations more often with each annotator than they agreed with each other (Supplementary Fig. 3). This suggests that the model fitted smooth boundaries that were close to the edge of nuclei, whereas manual annotations may be biased by subjective noise ( Supplementary Fig. 5). The performance of the top model was also more similar to humans than to classical algorithms in terms of segmentation accuracy ( Supplementary Fig. 3). Although we did not investigate this result extensively, it suggests that the top models may reach human-annotator-like performance with similar error rates.
Best-performing solutions segment a diversity of microscopy images. The most challenging aspect of the competition was that the holdout set included microscopy images from 15 different biological experiments, including various 2D light microscopy types, acquisition equipment and biological conditions. This is in contrast to previous research studies that optimize nucleus segmentation methods individually for each image set or type 12,18,22,[25][26][27][28] . From a visual standpoint, we identified five groups of images comprising nuclei of very different appearances, including two major types of light microscopy ( Top participants stood out by making models that generalized well across diverse image types and experimental variation (Fig. 2a), and despite a heavily unbalanced dataset (Fig. 2b). Dataset biases can mislead the performance of machine-learning models 29,30 ; lessrepresented image types were indeed challenging to segment for the average participant (Fig. 2a). With the largest group of images, containing 80% of the training examples, top participants reached a maximum accuracy of 0.90 in the test set, and with the smallest group, containing 0.6% of the examples, they reached a maximum accuracy of 0.55. In all cases, their performance surpassed the reference CellProfiler segmentations, as well as the average participant, by a large margin.
Best-performing solutions reduce segmentation errors. The solutions of the top three teams were significantly better than the reference segmentations according to the competition score and other metrics that we used to analyze the results (Figs. 1-3 and Supplementary Figs. 3 and 4). The competition score was an aggregated metric that considered multiple factors of segmentation quality, including precision, recall and object coverage (Methods) and could be deconstructed in multiple ways to understand performance and error modes.
First, we assessed performance at a single-object coverage threshold to interpret the differences in accuracy between the models. When a threshold equal to 0.5 was chosen (common in previous works 12,31 ), the top performing model got an F1 accuracy of 0.889, compared to 0.819 for the CellProfiler reference (Supplementary Table 3). Note that these results were the average across all images from 15 different experiments in the second-stage evaluation, 12 cell lines and five image types. When we considered fluorescent images with only small nuclei, the F1 accuracy of the top performing model was 0.932, whereas CellProfiler obtained an F1 score of 0.844 (Supplementary Table 2). Using a threshold of 0.7 (Fig. 2a) challenges methods by requiring a larger minimum object coverage. We observed that the top three models all surpassed the CellProfiler reference for three image types (small fluorescent, purple and pink and purple tissues). For the other two image types (big fluorescent and grayscale tissue), all but one model performed worse than   In general, the top three models reached similar aggregated performance, but exhibited different behavior and error modes. The accuracy results can be disaggregated by image type across multiple object coverage thresholds (Fig. 2c), allowing us to identify the strengths and weaknesses of each strategy. For instance, the second place solution was the best for large fluorescent nuclei (blue line) but poorer for grayscale tissue (green line).
We conducted other performance analyses on the top three models relative to the CellProfiler reference (Supplementary Note 6) and observed reduced error rates from the models, which missed fewer objects, successfully separated merged nuclei (Supplementary Fig. 6) and improved precision and recall ( Supplementary Fig. 7). All these observations support the idea that the top three solutions can make configuration-free segmentation of 2D stained nuclei a reality.
Top algorithms were on the basis of deep convolutional neural networks. The majority of participants used deep convolutional neural networks (CNNs), a popular technique to solve computer vision tasks 32 , as well as various microscopy image and pathology problems 16,26 . A wide variety of CNN architectures can be used for image segmentation and participants designed creative solutions to improve segmentation accuracy. Interestingly, the top three participants used very different solutions: an ensemble of U-Nets, a fully convolutional feature pyramid network (FPN) and a Mask-RCNN (region-based CNN) model. Their performance is summarized in Table 1 and the main characteristics of each model are described below and in the Methods. Figure 3 presents example segmentations obtained by these models together with a reference segmentation obtained by CellProfiler.
Best-performing solution. A. Buslaev, V. Durnov and S. Seferbekov formed the [ods.ai] topcoders team, and introduced a highly optimized, multinetwork (ensemble) model with sophisticated data augmentation and data post-processing. Coordinating all these elements in a successful solution was a major achievement because models with larger learning capacity may overfit and fail to perform well with new images. Instead, this solution generalized well to the holdout of 15 image sets. In terms of computational requirements, this was the most demanding solution, as a single image needs to be processed by 32 different neural networks using graphics processing units (GPUs). In addition, the post-processing steps need to check and combine the predicted objects from all 32 outputs. Altogether, this system was the most accurate, albeit at a high computational cost and complexity. More details in the Methods section.
Second-best-performing solution. M. Jiang (team name: Jacobkie) presented a solution with a good balance of accuracy and speed; only a single neural network was used to process new images. Her solution introduced several innovations that can be adopted in other models, such as a loss function that penalizes errors taking into account object size (small objects have as equal weight as large objects), the use of distance maps instead of binary masks as a target for learning and pretraining with natural-image, object detection datasets. More details in the Methods section.

NaTure MeThoDS
Third-best-performing solution. A. Lopez-Urrutia (team name: Deep Retina) presented a solution on the basis of a single neural network that processed regions with candidate objects instead of using a fully convolutional approach. The base model is known as Mask-RCNN 33 , which is a popular architecture for object detection and instance segmentation in natural images. The simplicity of the solution was attractive, as various implementations of this solution existed and could be adapted to this problem by retraining the output layers with the right data. In addition, the Mask-RCNN model was actively investigated in the computer vision community, making innovations readily available to the nucleus segmentation problem. More details in the Methods section.
Other solutions, participants and strategies. Apart from the top three methods, the fourth place solution 34,35 by the team 'Nuclear Vision' fell 0.04 points behind third place and combined classical watershed transform with modern deep learning 36 . A stage-one U-Net (direction net) was used to predict the direction vector of a pixel inside nuclei and to the nearest nucleus boundary. Another stage-two U-Net (water transform net) estimated the watershed levels and output the masks, eroded masks and mask centers. Such methods may be useful for automatic or interactive segmentation, whereby the traditional watershed transform energy landscape is replaced by the output of learned deep networks.

Fig. 3 | Example segmentation maps for various images obtained by the top three participants and the CellProfiler reference. The segmentation maps
show pixel-wise alignments between target segmentation masks and predicted segmentations. If the masks align correctly, pixels in the boundaries are colored white. If the target mask or part of it is missing, pixels in the boundaries are colored blue. If the predicted segmentation is introduced in a region without real object, the boundary pixels are red. rows show information about each method and columns show performance metrics. Core model, type of machine-learning algorithm used to solve the task, with the number indicating how many neural networks were used in the solution. The names of neural networks are explained in the main text. Competition score, metric used during the competition to rank participants in the scoreboard (https:// www.kaggle.com/c/data-science-bowl-2018#evaluation). The rest of the performance metrics were computed offline after the competition ended for analysis purposes only. Average F1 is the accuracy metric closely related to the official score, which treats the segmentation problem as a binary decision problem (correctly segmented or not) for each object. The average F1 score was computed at different IoU thresholds between target masks and estimated segmentations and then averaged across all thresholds. By setting a single IoU threshold, we could count how many objects were correctly segmented (true positives or recall at 0.7 IoU), how many were missed (false negatives or Missed at 0.7 IoU) and how many false objects were introduced (false positives or extra at 0.7 IoU). a Note that the CellProfiler reference segmentations were generated with a different experimental protocol involving manual adjustment of pipelines for five image types in the test set. More details are provided in the Methods section.

FOCUS | AnAlysis
AnAlysis | FOCUS

NaTure MeThoDS
The team 'Creepy ReLU' 37 generated synthetic images using CycleGAN 38 . They showed that color stains could be transferred from one image to another. However they did not have time to train on synthetic images; this approach might have improved the top solutions in the competition.
On a sociological note, the competition brought experts from different domains together to share software and cloud resources and to develop ideas. In particular, team 'minerva.ml' 39 open sourced their code and development process at the start of the competition and provided a cloud-based platform that Kagglers could use. The author of the open source Matterport Mask-RCNN implementation (https://github.com/matterport/Mask_RCNN) also participated in the competition 40 and provided a complete software pipeline tool for training, visualization and submission. We noted that the third and fifth-place teams in the competition based their solutions on this Matterport implementation. This shows that the quality of open source implementation was high, included suitable options and parameters and could be used off the shelf.

Discussion
The 2018 Data Science Bowl presented the challenge of automatically finding nuclei in a large variety of unseen microscopy images, with no configuration step. This was the first documented attempt to produce a model that could segment the stained nuclei of cells in 15 biological experiments, across experimental conditions, acquisition equipment and source laboratory. The main goal of the challenge was to investigate generic segmentation strategies that could be automatically applied to many imaging experiments with no further user intervention. This approach may reduce the time to quantify images, empowering future generations of biologists to adopt and run more quantitative imaging experiments for research and clinical practice.
Training automated nucleus segmentation tools using modern machine-learning approaches requires collecting annotated examples. The 2018 Data Science Bowl created a resource of diverse images contributed by numerous biological laboratories and manually annotated by a team of expert biologists at the Broad Institute. All those data are now publicly available with public domain licenses to facilitate future scientific research as well as industrial development. We hope others in the wider bioimaging community will contribute more images and annotations to grow this resource with additional experimental variations, including unstained brightfield and electron microscopy, as well as many other common image modalities that were not included in our study.
The challenge attracted participation from different teams in the data science community, who made all types of contributions, learned together and collaborated to understand the problem better and make progress toward the proposed goal. Solutions presented by several participants achieved the goal of a single model able to segment various microscopy images with no intervention. The experimental results indicate that nucleus segmentation could be fully automated, requiring no manual settings or image processing expertise from users, while still providing improved accuracy versus the evaluated tools. Higher accuracy may be possible through a larger, more diverse training set and by incorporating the latest advances in machine learning and computer vision research.
The top participants presented solutions on the basis of fully convolutional networks (U-Nets and FPNs) or Mask-RCNN. These two approaches were widespread during the competition; what distinguishes the winners was a combination of pre-processing and post-processing techniques, as well as the application of best practices during training (mostly data balancing and data augmentation). A common theme among the top competitors was the use of data augmentation during training and testing, including color shifts to make networks color invariant, and scaling methods to address object size challenges. Interestingly, all top three solutions used a ranking strategy to select the best segmentation masks from several candidates predicted by the base models. While this is common practice for RCNN-like models (third-best solution), the top two models also created their own strategies to achieve a similar effect with fully convolutional networks.
The results present a successful proof of concept that deep learning is indeed capable of delivering accurate results without user interaction. However, even though the top models are publicly available, they still require computational expertise to be applied to images. A user-friendly tool is needed to bridge the gap between these cutting-edge solutions and everyday biomedical practice, similarly to what the NucleiAIzer system proposes 21 . We also found that data availability is a limitation to reach top performance for various image types, thus, additional efforts are needed to collect and annotate more data to expand the applicability of future systems. The generalization ability of models may also be evaluated in other datasets not used during the Data Science Bowl challenge, such as the Cell Tracking Challenge and others. Other aspects of usability remain to be addressed. For instance, if there are mistakes in the segmentation, how can these models efficiently and easily take feedback from humans to correct segmentation errors? The results of the 2018 Data Science Bowl are a first step toward creating a generic system for segmenting the nucleus of cells in every microscopy image. Future work could expand the dataset to cover missing major microscopy imaging types, such as unstained brightfield images and three-dimensional images. Following the strategy laid out here, models could also be constructed to segment cell structures in addition to the nucleus, such as cell borders and organelles.

online content
Any methods, additional references, Nature Research reporting summaries, source data, statements of code and data availability and associated accession codes are available at https://doi.org/10.1038/ s41592-019-0612-7.

Methods
Dataset. The image sets were donated by multiple laboratories studying different aspects of cell biology. The names and credits are listed in Supplementary File 1. A total of 841 images were collected, representing a wide variety of nuclei observed under different experimental conditions and imaged with various staining protocols. Our goal was to collect as many independent biological experiments as possible to create a resource that contains enough technical and biological variability to train generic nucleus segmentation models.
In total, the dataset contained images from more than 30 different biological experiments, which were split into 16 experiments for training (670 images) and first-stage evaluation (65 images) and exactly 15 experiments for the second-stage evaluation (106 images). The number of experiments represented in the training and first-stage evaluation is approximate because these include images from public or anonymous sources without metadata to confirm the exact number. Holding 15 experiments for the second-stage evaluation allowed us to simulate the realistic evaluation case of bringing newly acquired images for segmenting their nuclei. See Supplementary Note 1 for more statistics and details about the dataset.
Annotation strategy. Overall, the image set was annotated with 29,464 individual nuclei in the training set, 4,152 in the first-stage test set and 3,717 in the secondstage test set, for a total of 37,333. The annotations were created by expert biologists who manually delineated each object in the images using one of two tools: (1) an assisted annotation tool that precomputed superpixel segmentations to facilitate the selection of regions in the foreground or background; and (2) the GIMP image editing software to create annotation masks by coloring individual pixels outlining each nucleus.
The assisted annotation tool made an initial over-segmentation of the image using the simple linear iterative clustering superpixels algorithm 41 . The annotators could then color each superpixel with one of four colors to indicate what regions correspond to objects and what others to background. Objects were required to have different colors if they were touching each other. Superpixels are very helpful to reduce the amount of annotation time, but also may contain systematic noise because their boundaries are not necessarily perfectly aligned with the real object boundary. This strategy was used for training images only to facilitate large scale annotation; for test images, we used the per-pixel annotation strategy using GIMP to score participants with respect to masks drawn 100% manually. Human annotations were not post-processed to avoid introducing unintended artifacts. See Supplementary Note 5 for more details.
Image modalities. In this dataset, we included 2D light microscopy images of stained nuclei. The majority of the images in this dataset came from fluorescent images with cells of different sizes and various types, primarily stained with DAPI or Hoechst. The dataset also included tissue samples stained with hematoxylin and eosin, displaying structures from a diversity of organs and animal models. The image collection was organized to include different technical settings and a variety of biologically different experiments. We excluded phase-contrast, differential interference contrast and other image modalities because during the data collection period we did not find image sets or donating laboratories that could make these images available in the public domain.
Evaluation. Performance metrics. The evaluation strategy was on the basis of identifying object-level errors. This was accomplished by matching target object masks with predicted objects submitted by participants and then computing true positives and false positives. In order to match target masks and predicted objects, the intersection-over-union (IoU) score was computed for all pairs of objects using IoU ¼ jA\Bj jA ∪ Bj I , where A and B are two objects, and the operator | | measures area. A minimum IoU threshold t was selected to identify correctly segmented objects and any other predicted segmentation mask below the threshold was considered an error. With all true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), we created a confusion matrix and computed precision (P), recall (R) and F1 scores, using a fixed IoU threshold t as follows: We used increasing IoU thresholds to estimate shape-matching accuracy. When a segmentation covered the target mask perfectly, the IoU score was 1 and the object was correctly detected no matter which threshold was used. In practice, segmentations can only approximate the real shape of the object, so at certain coverage threshold the object was missed. This test estimated how well segmentations matched the shape of manually defined target masks. Then, the official competition score S, was defined in terms of type I and II errors, using multiple IoU thresholds as follows: where T ¼ 0:10; 0:15; :::; 0:95 f g For some of the results reported in this paper, we also computed F1, precision and recall in an aggregated manner, similar to the official score. The competition evaluation score is described in detail in Supplementary Note 2.
Two-stage evaluation protocol. The data challenge was on the basis of a two-stage evaluation protocol with one training set and two holdout sets. The training and first-stage holdout sets were available to competitors during a period of 2.5 months for calibrating the algorithms. The first-stage holdout target masks were not directly accessible to the competitors but instead used to test their submitted segmented images against, yielding a numerical score. Participants were allowed to make a maximum of five submissions per day to obtain feedback about their performance, and they could select two submissions for evaluation and ranking. Depending on the scores obtained from these submissions, they decided to tune their methods and submit again later during the competition.
The second-stage holdout set was released during the final period of the competition, giving participants only 1 week to process 3,200 images. These images had an estimated 100,000 single nuclei, only 106 (3%) images had manually defined target masks useful for scoring, and the competitors did not know which images were going to be evaluated. This approach was enforced to prevent competitors from using extremely slow solutions, hand-outlining results or choosing among a large number of algorithms or settings by visually verifying the results. The same submission rules applied during the second-stage evaluation, allowing participants to submit at most five times per day and select only two submissions for final scoring.
The segmented images produced by the top solutions were manually screened and presented naturally occurring errors produced by automated solutions; none of them had signs of hacking or cheating. This experimental procedure was as rigorous as those used in other initiatives organized by scientists for other scientists. Kaggle has a long history of experience working with scientists to define these experimental settings and has a good pool of best practices for data science that were enforced to assure the validity of the results and identify hacking.
Grouping of images. The guiding principles for creating the five groups of images in the dataset was visual similarity on the basis of image colors and object sizes. The goal was to facilitate the application of classical image segmentation algorithms, which most heavily relied on (1) the nuclei and background colors (white versus black, purple versus white, purple versus pink, back versus gray); and (2) the approximate size of nuclei in the image. The number of groups constructed was kept as low as possible while maintaining the ability to create a robust analysis workflow for each group.
This organization in five groups may reflect staining protocols and microscopy techniques used in the experiments. However, that information was not explicitly used for determining the assignment of images to groups, it was all on the basis of visual inspection. In fact, the analyst that created the groups and designed classical segmentation workflows had no knowledge of experimental details of the test sets. This was intentional, with the purpose of making generic segmentation solutions with minimal assumptions, similarly to the conditions presented to participants of the competition.
The five groups of images were also used to conduct data analysis of segmentation performance of competitors. Most of the results presented in this paper have been organized around these five generic groups of images. However, this information was not provided to participants of the competition, for them, the entire dataset contained varied example microscopy images with a single object of interest, the nucleus.
Reference segmentations. Our goal was to evaluate the contribution of nucleus segmentation methods proposed by participants of the challenge, considering that these methods worked across a variety of experiments with no user intervention. The most appropriate baseline was an existing strategy that could be applied to any image (within the constraints laid out, 2D images stained for nuclei) and produce accurate results with no human interaction. Given that there was no such method in existence, we approximated it by taking an approach that used as little human time as possible. As an approximation for quantitative reference, we do not call the approach baseline segmentations, but rather reference segmentations.
We chose CellProfiler v.3.1.5 as the tool to create reference segmentations given its flexibility to configure robust pipelines on the basis of well-established algorithms, while investing a minimum amount of time. CellProfiler is a powerful open source tool for microscopy image analysis that includes a variety of fundamental image processing algorithms in a modular way. The algorithms can be organized in a computational graph (pipeline) that takes images as inputs and produces various types of outputs. Importantly, a pipeline is defined using a user-friendly interface and can run complex operations without the need to write a single line of code. These properties made CellProfiler a good reference for models in the competition, because the goal of the challenge was to investigate generic nucleus segmentation methods with minimal user interaction.
Five custom pipelines were designed, one for each of the image types in the test set, with the goal of evaluating the classical algorithms implemented in CellProfiler. A single classical segmentation pipeline was unlikely to work well in the variety of images represented in the test set, thus, we adapted the best practices reported in

NaTure MeThoDS
AnAlysis | FOCUS