Background & Summary

Microscopy image recognition has seen vast advances in recent years, fostered by the availability of high-quality datasets as well as by the application of sophisticated deep learning pipelines. One of the most important topics in the field of microscopy imaging is the classification of cells in tissue sections, typically stained with hematoxylin and eosin (H&E) dye. In this area, one particularly challenging task is the detection of mitotic figures, i.e. cells undergoing division, in tumor tissue. It is commonly accepted that the quantity of mitotic figures is one of the most powerful prognosticators of biological behavior for many tumor types, both in humans1,2 and animals3,4,5. For the automatic detection of these mitotic figures, a number of competitions have been held in recent years, e.g. the TUPAC16 challenge6, the ICPR MITOS-2012 challenge7 and the ICPR MITOS-ATYPIA-2014 challenge8.

Mitotic figures are defined histologically by the lack of a nuclear membrane and the presence of hairy projections of the chromosomes (nuclear material)9. A common method for quantification is the mitotic count (MC), i.e. counting mitotic figures in a standard-sized area placed where the tumor is assumed to have the highest mitotic density. The method is widely used, as it can be obtained easily on standard H&E-stained sections without additional costs10. However, reproducibility is currently hampered by high inter- and intra-rater variability11,12 due to the difficulty of identifying mitotic figures and their variable distribution throughout the tumor section13. Identification of individual mitotic figures shows only moderate agreement between trained pathologists, as mitotic figures comprise a wide range of morphological variants, depending on the phase of cell division and tissue properties, as well as atypical morphologies. Previous studies have identified inter-rater disagreement of 17.0–34.0% in distinguishing individual mitotic figures from other cell structures in canine cutaneous mast cell tumors (CCMCT) and human breast cancer12,13,14. Although algorithmic approaches typically yield more stable results, they have not yet reached human performance in this task. Identifying the area with the highest mitotic density – as required for the MC – is complicated by a patchy mitotic distribution13. In contrast to human observers, machine learning-based algorithms can quickly evaluate entire whole slide images (WSI) and propose the area with the highest density. A previous study has shown that algorithms can outperform human observers in this task and thus represent a very promising method to overcome this challenge15.

CCMCT are among the most common skin tumors in dogs16. Tumors are composed of round to polygonal neoplastic mast cells with variable amounts of faintly stained intracytoplasmic granules, which contain different substances such as eosinophilic chemotactic factors. Due to these substances, aggregation of non-neoplastic eosinophilic granulocytes – small immune cells containing eosinophilic granules – is additionally found in most CCMCT17. Biological behavior is highly variable, and CCMCT are considered potentially malignant: whereas the majority of cases show benign behavior, others may develop fatal metastatic disease. Therefore, accurate prognostication of the biological behavior, such as by quantification of mitotic figures, is essential for selecting an appropriate therapeutic approach16. It has been determined that the MC has good prognostic value for CCMCT as a solitary parameter3,4 and as part of a grading system18.

Given the importance of quantifying mitotic figures in various tumor types of animals and humans, it is at first glance surprising that none of the available datasets provide labels for complete WSI. Manual annotation of such large areas, however, is a labor-intensive and tedious task. In this work, we present a dataset consisting of 32 fully annotated WSI of CCMCT with a total of 44,880 mitotic figure annotations. Potential mitotic figures were identified by one veterinary pathologist [CB] and subsequently by a deep learning-based pipeline. Two experts [CB, RK] classified the annotations in a blinded manner and reviewed the annotations with disagreeing labels to reach a common consensus on the label class. This collection19 represents the currently largest dataset in terms of the number of annotated mitotic figures and annotated tumor area, and therefore provides researchers with new opportunities for the development and refinement of data-driven algorithms for mitotic figure identification on entire whole slide images.

Methods

Selection and preparation of specimen

Histological specimens of CCMCT cases were obtained from the diagnostic archive of the authors' institute. Thirty-two cases with high tissue quality were selected retrospectively such that the dataset includes cases with variable mitotic figure density, ranging from low to very high MCs. One representative tissue block (formalin-fixed and paraffin-embedded) was chosen per case. New tissue sections were produced at a thickness of approximately 1 μm and stained with H&E by a tissue stainer (ST5010 Autostainer XL, Leica, Germany). Whole slide scanning was performed with a linear scanner (ScanScope CS2, Leica, Germany) in one focal plane with default settings at a magnification of 400× (image resolution: 0.25 μm/pixel), using an Olympus UPlanSAPO 20x lens (field number = 26.5, numerical aperture = 0.75).

Manually expert labelled (MEL) dataset

Primary annotations were carried out by two experts trained in the field of veterinary pathology [CB, RK]. For this, we used an open-source software solution made available by our research group20. This software provides two modes specifically designed for this task. In the first mode, an expert can screen a WSI for specific structures (in this case mitotic cells) at the highest magnification: the software detects tissue presence in the image and shows partially overlapping tissue segments to the expert for assessment, ensuring that no region of the WSI is left out. The second mode, used for blinded annotation of existing cell positions, is described below. The first expert classified cells on each slide into the following groups (see also Fig. 1):

  1. Mitotic figure.

  2. Non-mitotic, neoplastic mast cells.

  3. Non-mitotic, ambiguous cells.

  4. Eosinophilic granulocytes.

Fig. 1

Examples of the various cell types annotated in the dataset. Ambiguous cells are not shown. Due to the number of cells, a complete list is provided only for the class of mitotic figures.

The group of ambiguous cells plays a special role here, as it is not disjoint from the other groups apart from the mitotic figures. This group was initially used to account for cells that are not mitotic figures but also cannot clearly be attributed to one of the other cell types.

The first assessment of each WSI was always carried out twice by the first expert (see Fig. 2). The second expert was blinded to the cell class decisions of the first expert, but not to the positions where cells were annotated. We followed this procedure because we assumed that the risk of missing rare mitotic events on the WSIs was greater than the potential bias introduced by having to judge an already available cell annotation of unknown class. The annotation software20 provides a mode for this blinded annotation, in which one or multiple unassigned annotations are presented without any class labels. After the respective classes have been selected, the next random annotation(s) are presented.

Fig. 2

Creation of the manually expert labelled (MEL) dataset variant, which is the basis for all other dataset variants. Every WSI was screened for mitotic figures by the first expert. The second expert was able to see the annotations but not the class labels, and was additionally able to add new annotations if needed. Cells with disagreeing labels were re-assigned to both experts to reach a common consensus.

It is well known that the concordance of different experts with respect to mitotic figure assessment is not perfect. All cases in which both experts did not agree on the same class, as well as a number of doubtful candidates found by the first reviewer, were re-evaluated by both experts in order to reach agreement on the label class, resulting in the manually expert labelled (MEL) dataset variant. Naturally, manual screening of large images introduces the risk of missing annotation candidates, which we perceive as one of the main risks to data quality. For this reason, we employed an algorithm-aided pipeline.

Augmented dataset for mitotic figures

In order to improve the quality of our dataset, we made use of deep learning techniques trained on the manually expert labelled (MEL) dataset. We derived two additional dataset variants:

Hard-example augmented expert labelled dataset variant (HEAEL)

In this dataset variant, our primary aim was to split the groups of non-mitotic and ambiguous cells into mitotic figure look-alikes and other cells. It has been shown that the determination of hard examples is helpful for faster convergence of classification approaches21.

For cell classification, we used a standard CNN architecture based on ResNet-1822 as the backbone. We trained this network using image crops of 128 px × 128 px centered on annotated cells of the dataset. Cells for which this classifier predicted a mitotic figure with high certainty were reviewed again by both experts to account for potentially misclassified cells (see Fig. 3).
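A minimal sketch of such a patch classifier is given below, assuming the annotated 128 px × 128 px crops have already been exported into one folder per class; the directory layout, hyperparameters and training loop are illustrative and do not reproduce the released implementation.

```python
# Minimal sketch of the second-stage cell classifier (assumed directory layout).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# 128 x 128 px crops centred on annotated cells, sorted into one folder per class
# (e.g. "mitotic_figure", "nonmitotic_cell", ...) -- hypothetical export, not part of the release.
train_ds = datasets.ImageFolder(
    "crops/train",
    transform=transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]),
)
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4)

# ResNet-18 backbone, pre-trained on ImageNet, with a new classification head.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, len(train_ds.classes))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(10):
    for images, labels in train_dl:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```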

Fig. 3

Algorithm-aided division of the ambiguous and non-mitotic cell classes, resulting in the hard-example augmented expert labelled (HEAEL) dataset variant. A ResNet-18-based classifier22 was used to sort ambiguous cells into more or less likely mitotic figure candidates, which were subsequently presented to both experts.

Object-detection augmented expert labelled dataset variant (ODAEL)

In order to counteract bias introduced by one or both experts missing candidates of the (relatively rare) mitotic figures, we shifted towards an augmented dataset generation technique. In this approach, a deep network proposes additional potential mitotic figure candidates, which the human experts then rate and assign to the different groups of our dataset (see Fig. 4). In addition to the previously missed mitotic figures, this mechanism also generated a list of hard negative samples, i.e. examples that a model or even a human expert could potentially mistake for true mitotic figures. By definition, hard negative mitotic figure look-alikes were cells for which the model predicted a mitotic figure, but the consensus of the human experts rejected this as the correct label.

Fig. 4

Algorithm-aided labelling of potentially missed mitotic cells, resulting in the object-detection augmented expert labelled (ODAEL) dataset variant. We used a customized RetinaNet23 object detector for mitotic figure candidate extraction from the WSI, subsequently filtered out known cells and performed a refining classification, the results of which were presented to both experts to extend the database with potentially missed mitotic figures.

First, based on a three-fold split, a custom RetinaNet23 model was trained for each fold. We used an input size of 512 × 512 px and fed images that would typically contain at least one mitotic figure to the model. RetinaNet uses the focal loss to account for class imbalance, which is especially important in our case because the foreground (mitotic figure) class is far less prevalent than the background class. As the network backbone, we used a ResNet-1822 topology pre-trained on ImageNet24. We trained the model for 6 cycles, each with 50,000 random image crops.
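The following sketch illustrates how such training crops could be assembled: whole slides are assigned to one of three folds (so that no slide contributes to more than one fold), and 512 × 512 px patches are read around annotated mitotic figures using OpenSlide. The function names and the jitter value are assumptions for illustration only.

```python
# Sketch: slide-level three-fold split and extraction of 512 x 512 px training crops
# around annotated mitotic figures (file names and annotation handling are assumed).
import random
import openslide

PATCH = 512

def make_folds(slide_names, n_folds=3, seed=42):
    """Assign whole slides (not patches) to folds to avoid leakage between folds."""
    names = sorted(slide_names)
    random.Random(seed).shuffle(names)
    return [names[i::n_folds] for i in range(n_folds)]

def crop_around(slide_path, x, y, jitter=128):
    """Read a 512 x 512 px crop whose centre lies near the annotated cell at (x, y)."""
    cx = x + random.randint(-jitter, jitter)
    cy = y + random.randint(-jitter, jitter)
    slide = openslide.OpenSlide(slide_path)  # for real training, keep the handle open
    region = slide.read_region((cx - PATCH // 2, cy - PATCH // 2), 0, (PATCH, PATCH))
    return region.convert("RGB")
```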

Li et al.25 have shown that a dual-stage approach improves performance significantly over a single-stage object detection approach. Motivated by this, we introduced a second-stage cell classifier after the initial object detection/cell localization stage. For this purpose, we used the network previously trained for hard-example classification.
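A simplified sketch of this dual-stage idea is shown below: candidate detections from the first stage are re-scored by the patch classifier, and only candidates that the second stage also rates as mitotic are kept. The helper names, class index and threshold are placeholders, not the released code.

```python
# Sketch of the dual-stage idea: detector proposals are re-scored by the patch classifier
# trained for hard-example classification. Helper names are illustrative only.
import torch
from torchvision.transforms.functional import to_tensor

MITOSIS_CLASS = 0  # assumed index of the mitotic figure class in the classifier output

def rescore_candidates(candidates, classifier, read_patch, threshold=0.5):
    """candidates: list of (x, y, det_score) proposals from the RetinaNet stage.
    read_patch(x, y): returns a 128 x 128 px PIL image centred on (x, y)."""
    accepted = []
    classifier.eval()
    with torch.no_grad():
        for x, y, det_score in candidates:
            patch = to_tensor(read_patch(x, y)).unsqueeze(0)
            p_mitosis = torch.softmax(classifier(patch), dim=1)[0, MITOSIS_CLASS].item()
            # Keep a candidate only if the second stage also considers it mitotic.
            if p_mitosis >= threshold:
                accepted.append((x, y, det_score * p_mitosis))
    return accepted
```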

Data Records

We provide the 32 original WSI in the Aperio SVS format on figshare19. All slides have been fully anonymized and label images have been removed. Each described variant of the dataset is made available as a database file (SQLite3 format). For each annotation, the database provides the following information (a minimal query sketch is given after the list):

  • The slide on which the cell was annotated.

  • The coordinates (x, y) of the cell.

  • The agreed class (by all experts) of the cell.

  • Two or more individual class labels. For each label, it is known who assigned it: expert 1, expert 2, both experts (consensus vote), or, for the augmented dataset variants, the object detection algorithm. The unique numeric identifier of each label also represents the order in which the labels were assigned to the annotation.
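The following minimal sketch shows how such a database file could be queried from Python; the file, table and column names below are placeholders, as the actual schema follows the SlideRunner database format20.

```python
# Minimal sketch of reading annotations from one of the SQLite database files.
# Table and column names are placeholders -- consult the SlideRunner database
# documentation for the actual schema.
import sqlite3

con = sqlite3.connect("MITOS_WSI_CCMCT_ODAEL.sqlite")  # hypothetical file name
cur = con.cursor()
for slide, x, y, agreed_class in cur.execute(
    "SELECT slide, coordinateX, coordinateY, agreedClass FROM Annotations"
):
    print(slide, x, y, agreed_class)
con.close()
```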

Table 1 gives an overview of all three dataset variants. Slides are sorted by the number of annotated mitotic figures. There is a large spread in the total count, which also reflects differences in tumor proliferation. To ease comparison of results on the dataset, we randomly assigned slides to the training or test set. The number of mitotic figure look-alikes increased greatly from the hard-example-augmented dataset to the object-detection-augmented dataset. The reason is that all non-mitotic cells that were assigned a probability of mitosis above 0.5 by the dual-stage classifier were added to this class.

Table 1 Overview of the dataset and all its variants: numbers given per cell category are for the variants where expert labels were given after object detection (ODAEL), after hard-example classification (HEAEL), or after manual observation only (MEL).

Getting started

To reconstruct the experiments, the first step is to clone the GitHub repository (an overview is given in Table 2). It includes a Jupyter notebook (Setup.ipynb) that downloads all individual slides and the database file from figshare. Once this initial setup has been run, all required data are available to run the other notebooks. Training of the networks is conducted in the notebooks RetinaNet-CCMCT-<variant>.ipynb, where <variant> is one of the dataset variants (MEL, HEAEL, ODAEL). Trained networks are stored as RetinaNet-<variant>-export.pth in the main folder. The main folder also contains a script to run the models on the test set (Inference-Retinanet.py) and the evaluation scripts to calculate the F1 score. The subfolder 2nd_stage provides all scripts and notebooks to train and evaluate the second-stage ResNet-18 classifier. First, patches need to be extracted (exportDataset_<variant>.py), then the classifier is trained (CellClassification-<variant>.ipynb). For inference, a third script (Inference-CellClassifier.py) is available. Evaluation of both stages and all variants is performed in the notebook Evaluation.ipynb in the root folder.

Table 2 Excerpt from the GitHub file list.

Technical Validation

Our technical validation of the dataset is twofold: first, we assessed the quality of the assigned labels by conducting a classification experiment of mitotic figures versus other cells; second, we performed a detection task on the complete WSIs of the test set. Both are informative for distinct questions: while the first test yields information about how well the classes can be separated and thus indirectly assesses label class quality, the latter also assesses the coverage of mitotic figures on the WSI.

Classification of preselected cells

For this validation task, 128 × 128 px patches with single cells of all classes except ambiguous cells at their respective centers (mitotic figure, mitotic figure look-alike, neoplastic mast cell and granulocyte) were extracted from the ODAEL variant of the dataset. We used a standard state-of-the-art classification CNN based on a ResNet-18 stem22 pre-trained on ImageNet24. The network was trained for one cycle of 10 epochs using the super-convergence scheme26 with a maximum learning rate of 10−2 and the Adam optimizer27. With this approach, we reach an accuracy of 91.390% on the test set. As shown in Table 3, the main confusion is between mitotic figures and mitotic figure look-alikes, while all other cell types were separated well by the classifier. This result is consistent with the high intra- and inter-rater variability of human experts in this task.
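A sketch of this training setup, assuming a data loader of labelled cell patches (train_dl) and a four-class ResNet-18 model as in the earlier classifier sketch, could use PyTorch's one-cycle learning rate schedule as an approximation of the super-convergence scheme:

```python
# Sketch of the one-cycle ("super-convergence") schedule used for this experiment:
# Adam with a maximum learning rate of 1e-2 over a single cycle of 10 epochs.
# `model` and `train_dl` are assumed to be defined as in the earlier classifier sketch.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, epochs=10, steps_per_epoch=len(train_dl)
)

model.train()
for epoch in range(10):
    for images, labels in train_dl:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()  # learning rate rises and falls within the single cycle
```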

Table 3 Confusion matrix: Classification results of a ResNet-18-based CNN classifier on patches with a certain cell type in the center (accuracy on test set is 91.390%).

Detection of mitotic figures on WSI

This task was performed to provide a baseline for mitotic figure detection on our dataset. We trained one model for each of the dataset variants. For this, we chose RetinaNet23 as a state-of-the-art object detection approach, because implementations are available for all major machine learning frameworks currently used in the scientific community. A similar approach was also followed by Li et al. in their DeepMitosis framework25. RetinaNet introduced the focal loss, which is well suited for mitotic figure detection because it assigns greater weight to decisions that are hard for the network, so that explicit hard-example mining as a training strategy can be avoided.
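For illustration, a minimal binary focal loss could be written as follows; the modulating factor (1 − p_t)^γ reduces the loss contribution of well-classified examples, so training concentrates on hard candidates. This is a generic sketch of the loss formulation, not the exact implementation used in our pipeline.

```python
# Sketch of the focal loss idea: easy, well-classified examples are down-weighted
# so that training focuses on hard mitotic figure candidates.
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: tensors of the same shape; targets in {0, 1}."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)        # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()  # (1 - p_t)^gamma down-weights easy examples
```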

We feed 256 × 256 px image patches to our model, which is built on a ResNet-1822 stem pre-trained on ImageNet24 with spatial pyramid features, and two customized heads, one for bounding box detection and one for mitotic figure/background classification. The heads are based on the lowest feature pyramid layer at the highest (16 × 16) spatial resolution.

We used a customized sampling scheme to ensure and speed up model convergence. For each training batch, 50% of the images contained at least one mitotic figure, 40% contained a mitotic figure look-alike (hard example), and 10% of the images were picked completely at random from the slide. In the MEL dataset variant, where no hard examples were available, we used the ambiguous cells instead. For training, only the upper half of each WSI was used; for validation, we used the lower half. The test set was never used during training or algorithmic optimization.
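A sketch of this sampling logic for a single slide is given below; the data structures (per-slide annotation lists) are assumptions for illustration.

```python
# Sketch of the patch sampling scheme (per slide): 50% of training patches are centred
# near a mitotic figure, 40% near a hard-negative look-alike, 10% at a random position.
# `mitotic` and `lookalikes` are assumed lists of (x, y) annotation positions on the
# training half of the slide; `slide_size` is the slide's (width, height) in pixels.
import random

def sample_patch_centre(mitotic, lookalikes, slide_size):
    r = random.random()
    if r < 0.5 and mitotic:
        return random.choice(mitotic)        # patch guaranteed to contain a mitotic figure
    if r < 0.9 and lookalikes:
        return random.choice(lookalikes)     # patch containing a hard example
    w, h = slide_size
    return (random.randint(0, w), random.randint(0, h))  # completely random position
```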

Due to the high number of potential images that can be extracted from the WSI, we consider the classical definition of an epoch in deep learning (i.e. the entire training set being seen in back-propagation at least once) to be no longer sensible. We thus use pseudo-epochs of 5,000 randomly selected images for our training.
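Such a pseudo-epoch can be realised, for example, with a dataset of fixed nominal length that draws a fresh random patch on every access; the sketch below is illustrative and not the released implementation.

```python
# Sketch of a "pseudo-epoch": a dataset with a fixed nominal length of 5,000 items that
# returns a freshly drawn random patch on every access, so one pass of the data loader
# corresponds to 5,000 randomly selected training crops.
from torch.utils.data import Dataset

class PseudoEpochDataset(Dataset):
    def __init__(self, draw_random_patch, length=5000):
        self.draw_random_patch = draw_random_patch  # callable returning (image_tensor, target)
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # The index is ignored: every access yields a new random crop from the WSIs.
        return self.draw_random_patch()
```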

After initial training for a single pseudo-epoch, the heads of the networks were trained using the super-convergence scheme of Smith and Topin26 with the Adam optimizer27 for 3 cycles of 10 pseudo-epochs with a maximum learning rate of 10−4. After this, the complete network was fine-tuned for 2 × 30 pseudo-epochs, for which an early stopping paradigm was applied to retrieve the model with the highest validation performance. Judging by the validation loss, we did not find the model to overfit the data, which is not surprising given the huge amount of image material in the dataset. The sampling scheme we used leads to an overestimation of the likelihood of mitotic figures by the model. For this reason, we optimized the threshold for object detection by processing the complete WSIs of the training and validation set after the model was trained. Again, we used the patch classifier trained in the previous step as the second stage of the mitotic figure detection.
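This threshold optimization can be sketched as a simple sweep that maximizes the F1 score over the detections collected from the training and validation WSIs; the data structures and the threshold range below are assumptions for illustration.

```python
# Sketch of the detection-threshold optimisation: the trained model is run over the
# training/validation WSIs and the score threshold that maximises F1 is kept.
# `scores_and_hits` is an assumed list of (score, is_true_positive) per detection,
# and `n_ground_truth` the number of annotated mitotic figures in those regions.
import numpy as np

def best_threshold(scores_and_hits, n_ground_truth):
    best_f1, best_t = 0.0, 0.5
    for t in np.arange(0.3, 0.95, 0.01):
        kept = [hit for score, hit in scores_and_hits if score >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = n_ground_truth - tp
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0  # F1 = 2TP / (2TP + FP + FN)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_t, best_f1
```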

Not surprisingly, we find an influence of the dataset variant on the F1 score (see Table 4). Since the ODAEL variant is expected to be the most thorough in the identification of all present mitotic figures, it is in line with expectations that this variant achieved the highest F1 scores for all models. Overall, the influence of the dataset variant on the F1 score is above 3 percentage points, which underlines that the applied augmentation method is sensible.

Table 4 Performance assessment (F1 score) on the different variants of the dataset.

Ablation study

One of the most interesting questions for a dataset of this size is how strongly it benefits from its increased size compared to previous approaches. The predominant approach in current datasets is to annotate a subset with a size of ten contiguous high power fields (HPF). We follow the definition of Meuten10, who defined the area of a single HPF to be 0.237 mm2. To investigate how a restriction in size impacts the detection results, we derived small subsets with an area of 5, 10 and 50 HPF from our best performing ODAEL dataset variant. We asked a senior pathology expert to determine the most mitotically active part of each tumor, as would be done for manual mitotic counts. This procedure is consistent with the one described by Veta et al. for the TUPAC16 dataset6.

To compare against existing datasets, we focus in the following on the dataset reduced to an area of 10 HPF (see Table 5 for the other cases). Using an aspect ratio of 4:3, the resulting images were 7,017 px in width and 5,263 px in height. The resulting dataset (reduced to an area of 10 HPF) consists of 7,617 cell annotations, including 1,041 mitotic figures. Despite having a slightly higher number of cases, it includes a quite similar number of mitotic figures to the AMIDA13 dataset (cf. Table 6). We trained the same pipeline as for the complete dataset, however for a smaller number of iterations to avoid over-fitting due to the much lower dataset variance: the RetinaNet object detector was trained for a single cycle of 10 pseudo-epochs using super-convergence, and for another 60 iterations with a normal adaptive learning rate based on Adam. During this last period, we used early stopping and chose the model with the highest validation performance. As shown in Fig. 5, the performance of the model increases significantly with the amount of annotated area and the number of available WSI. The data show, however, that a plateau is reached for the number of WSI: doubling the number of training WSI from 12 to all 21 increased performance only slightly.

Table 5 Ablation study dataset subsets.
Table 6 Comparison of our dataset and its variants to other datasets with mitotic figure annotations.
Fig. 5

Results of the ablation study using the dual-stage detector. Panel (a) shows the results of using training areas of varying size around the expert-selected most mitotically active part of the tumor. Panel (b) shows the results of using only a subset of the slides for training.

Usage Notes

Annotations are provided in the SlideRunner database format20, which also allows the WSIs to be viewed together with all annotations in SlideRunner, as well as in the popular MS COCO format. Be aware that the latter does not provide the possibility to annotate an object with multiple expert labels; this format therefore has reduced information content. We encourage viewing and processing the data based on the SlideRunner database format.
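As an illustration, the COCO variant could be loaded with the pycocotools API as sketched below; the annotation file name and the category name are placeholders and should be replaced with those of the released files.

```python
# Sketch of loading the MS COCO variant of the annotations with pycocotools
# (file and category names are assumed; only the agreed class survives the COCO conversion).
from pycocotools.coco import COCO

coco = COCO("CCMCT_ODAEL_train.json")                  # hypothetical file name
cat_ids = coco.getCatIds(catNms=["mitotic figure"])    # assumed category name
ann_ids = coco.getAnnIds(catIds=cat_ids)
for ann in coco.loadAnns(ann_ids):
    x, y, w, h = ann["bbox"]                           # COCO boxes are [x, y, width, height]
    print(ann["image_id"], x, y, w, h)
```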