A completely annotated whole slide image dataset of canine breast cancer to aid human breast cancer research

Astract Canine mammary carcinoma (CMC) has been used as a model to investigate the pathogenesis of human breast cancer and the same grading scheme is commonly used to assess tumor malignancy in both. One key component of this grading scheme is the density of mitotic figures (MF). Current publicly available datasets on human breast cancer only provide annotations for small subsets of whole slide images (WSIs). We present a novel dataset of 21 WSIs of CMC completely annotated for MF. For this, a pathologist screened all WSIs for potential MF and structures with a similar appearance. A second expert blindly assigned labels, and for non-matching labels, a third expert assigned the final labels. Additionally, we used machine learning to identify previously undetected MF. Finally, we performed representation learning and two-dimensional projection to further increase the consistency of the annotations. Our dataset consists of 13,907 MF and 36,379 hard negatives. We achieved a mean F1-score of 0.791 on the test set and of up to 0.696 on a human breast cancer dataset.


Methods
Selection and preparation of specimen. All specimens were taken retrospectively from the histopathology archive of an author (R.K.) with approval by the local governmental authorities (State Office of Health and Social Affairs of Berlin, approval ID: StN 011/20). Specimens of breast tissue from bitches had been surgically removed by the treating veterinarian for purely diagnostic purposes for cases suspicious of mammary neoplasia. The tissue had been routinely fixed in formalin and embedded in paraffin. For this study new tissue sections were produced from the tissue blocks and staining with hematoxylin and eosin using an automated slide stainer (ST5010 Autostainer XL, Leica, Germany). Case selection was random and specimens with acceptable tissue quality were included. All images were digitized using a linear whole slide scanner (Aperio ScanScope CS2, Leica, Germany) at a resolution of 0.25 microns per pixel (400X).
Manually expert labeled (MeL) dataset. Labeling mitotic figures accurately requires a great deal of expertise in the field of tumor pathology. Additionally, it is a labor-intensive task that requires a high level of concentration throughout the process. To set a baseline, an expert pathologist with some years of experience in mitotic figure detection (C.A.B.) screened each WSI twice for mitotic figures. For this, a specialized software solution 15 was used that provides a screening mode. This mode presents overlapping image patches selected from the WSI in regions where tissue is present. Whenever the expert was done with a given image section, the program would propose the next suitable image patch. This ensures that no portion of the image was left out in this assessment. Besides mitotic figures, the expert annotated similar appearing objects that could be confused with mitotic figures based on their visual appearance, but, do not represent cells in the state of cell division (see Fig. 1). This was the precondition for the next step: It is widely known, that inter-rater discordance can be high for mitotic figures [2][3][4][5] . To reduce the subjective effect of this rater, we asked a second expert (R.K.), who is a senior pathology expert with several years of experience in mitotic figure assessment. The expert was given the task to assess for each of the annotated objects (the labels were blinded for him) a new label (mitotic figure or mitotic figure look-alike). This results in two independent expert opinions for every single object of interest. For disagreed labels (N = 1,268/39,868), agreement was initially obtained by consensus of the first and second pathologist. This preliminary dataset was used for augmented dataset development using machine learning and data analysis www.nature.com/scientificdata www.nature.com/scientificdata/ techniques (see below). Final agreement for ground truth, which was used for technical validation of the dataset, was obtained through majority vote by a third pathologist (T.A.D.). Agreement through majority vote was commonly performed in human mitotic figure breast cancer datasets 8,10 . Object-detection augmented and expert labeled (ODAeL) dataset. While one expert screened the WSIs with the help of a dedicated software solution and with great diligence, we still have to assume that, due to the partially infrequent occurrence of mitotic figures, the expert might have missed a certain percentage of mitotic figures. To take this into account, we employed a machine learning-based pipeline to find candidates for missed mitotic figures to be presented to the experts (see Fig. 2). For this, we first split a preliminary dataset into three parts (each 7 WSIs), which were then subsequently each used once as a test set for a CNN-based object detector, with the WSIs of the other two parts used as training set. We used a customized version 12 of RetinaNet 16 for this purpose. All objects detected by the network were subsequently checked for existence in the MEL dataset variant. Second, we trained a cell classifier on 128 px patches cropped around the cells annotated in the currently used training part of the MEL dataset and ran the inference with all newly identified mitotic figure candidates. The cell patches were grouped into 10 groups according to their model score (where 1.0 represented cells being very likely a mitotic figure, and 0.0 very unlikely, respectively). All mitotic figures were initially shown to the first pathologist and the second pathologist independently, and in case of disagreement, the third expert rendered the final vote, like in the previous dataset variant. Through this procedure, the number of mitotic figures was increased by 6.06% compared to the MEL dataset (see Table 1), which is in line with previous findings 12 . The number of non-mitotic cells was increased much more significantly (+36.22%). The reason for this is that all candidates that were identified by the RetinaNet (even those with low scores) were added to the list of non-mitotic figures. Comparing the individual tumor cases, we can see that the relative increase in mitotic figures is higher for the tumors with less overall mitotic figures. This highlights that missing mitotic figures is more likely for lower grade tumors due to the rareness of the event.
Clustering and object-detection augmented and expert labeled (CODAeL) dataset. While the previous, machine-learning driven approach was aimed at identifying previously missed mitotic figures, we still have to take into account that the dataset suffers from a certain degree of inconsistency due to misclassification. While it is commonly easy to identify a clear mitotic figure and a clearly non-mitotic cell, there are a significant amount of cells where this differentiation is not easy to make (see Fig. 1). In this assessment, the experts have to create a visual and perceptual cutoff that pertains to judging what represents a mitotic figure. This cutoff may, however, not be constant over time. Both sources of inconsistency, i.e., the varying cutoffs over time as well as plain human error, can be counteracted by clustering with subsequent reevaluation.
For this, we cropped out patches of all cells (mitotic figures and hard negatives) in the database and trained a ResNet18 17 -based classifier for 10 epochs using standard binary cross-entropy. In order to enable a better clustering, we used methods from representation learning in the next step: we removed the last layer of the network and trained the network for another 10 epochs using contrastive loss 18 . The resulting feature vector of all images were subsequently transformed into a two dimensional representation using uniform manifold approximation and projection (UMAP), resulting in two coordinates in the 2D representation for each image (see 0).
Next, the image patches were inserted into a new image according to an upscaled version of these coordinates, as described by Marzahl et al. 19 : To avoid interference of image patches, a grid with tile size according to the patch size was constructed. The patch was then pasted into the grid according to a least distance between grid coordinates and coordinates as given by the projection. Whenever a position in the grid was already filled, the tile was placed on the next possible grid position with least distance. This resulted in an 80,000 × 60,000 px image, containing all 50,286 individual cell patches (see center of Fig. 3). In this image, all patches representing a mitotic figure were given a red box as a rectangle, and all patches representing a non-mitotic cell a blue rectangle. Next, the first pathology expert re-assessed all mitotic figures in this graphical representation. While potentially also introducing a bias, the representation facilitated identification of labeling errors as well as comparisons to similar appearing objects, thus increasing consistency. Expert driven classification changes occurred for 621 non-mitotic figures (changed to mitotic figures) and 771 mitotic figures (to non-mitotic figures) with a total of 1392 classification changes.
To reduce the bias introduced by the pipeline and its graphical representation, we presented all changes from the first expert to the second expert, however, 2D mapping was omitted and only patches of 128 px size were assessed. The second expert disagreed only in a minority of cases. In total, decisions were overruled in 109 cases (7.8%), resulting in 42 cells classified as mitotic figures and 67 cells classified as non-mitotic cells/structures. The disagreements were lastly given to a third expert for a majority ruling, as shown in Fig. 3. The final CODAEL datset with majority vote by a third pathologists contains 13,907 mitotic figures, as shown in Table 1. www.nature.com/scientificdata www.nature.com/scientificdata/

Data Records
The dataset, consisting of 21 anonymized WSIs in Aperio SVS file format, is publicly available on figshare 20 . Alongside, we provide cell annotations according to both classes in a SQLite3 database. For each annotation, this database provides: • The WSI of the annotation • The absolute center coordinates (x,y) on the WSI • The class labels, assigned by all experts and the final agreed class. Each annotation label is included in this, resulting in at least two labels (in the case of initial agreement and no further modifications), one by each expert. The unique numeric identifier of each label furthermore represents the order in which the labels were added to the database.  www.nature.com/scientificdata www.nature.com/scientificdata/ We also provide polygon annotations for the tumor area within the WSI. The polygon annotations consist of multiple coordinates linked to a single annotation. The publicly available libraries provided by the SlideRunner 15 python package can be used to conveniently extract those. The WSIs and all annotations can also be viewed using this open source software. Table 1 gives an overview about the database and all its variants, sorted by the number of mitotic figures in the CODAEL dataset variant. The number of mitotic figure look-alike cells is almost three times higher than the number of mitotic figures, and this factor is lower for the manually labeled dataset variants. The table also indicates which WSIs were assigned to the training set and which to the test set.
We also investigated the count of mitotic figures. In most grading schemes 1,21 this is defined as ten consecutive high power fields (10 HPF, representing an area of 2.37 mm 2 for the most commonly used microscope settings 22 ). We chose an area of 10 HPF with an aspect ratio of 4:3 and used a moving window summation over all mitotic figure events to derive the mitotic count. As most grading schemes recommend to perform the mitotic count where there is the highest mitotic activity, we took the absolute maximum of this two-dimensional mitotic count map. The distribution of the mitotic figures per 10 HPF area within the complete tumor area can be seen in Fig. 4: As  www.nature.com/scientificdata www.nature.com/scientificdata/ evident from the examples depicted, there is heterogenous distribution of mitotic figures throughout the tumors. Thus, the mitotic count is highly dependent on the correct determination of the area of maximum mitotic activity. Notably, in almost all cases the higher cutoff value of 10 1 is exceeded. Since finding the optimal threshold is, however, so strongly dependent on the position, we can assume that these would have to be adjusted for an automatic mitotic figure detection-based grading scheme. In total, the experts annotated a tumor area of 4.360.07 mm 2 , greatly exceeding the state-of-the-art in mammary carcinoma datasets, which is given by the TUPAC16 dataset (251.5 mm 2 ) and similar to the canine cutaneous mast cell tumor dataset 12 .
Getting started. We provide a github-repository, including all experiments described in this manuscript.
The repository includes a jupyter notebook (Setup.ipynb) that will download the dataset from figshare automatically, setting up the environment for all experiments. The repository contains jupyter notebooks using a fast.ai 23 implementation of RetinaNet. All results used to generate the plots and tables in this paper are provided alongside. Besides the network training notebooks, there is a python script to run inference on the complete dataset, and a script to separate training from test at inference (inference on training is required to properly determine the cutoff thresholds on the WSIs).
In the PatchClassifier subfolder, we provided the implementation of the mitotic figure/non-mitotic figure patch classifier, also used as second stage in our experiments. Alongside with a patch extraction script (to create 128px patches centered around the annotated cell from the WSI), there are jupyter notebooks used for the training of the second stages for each dataset variant and an inference script that we used to yield the final results.
All database variants described in this paper have been placed in the databases folder.

technical Validation
In order to set a baseline for our novel mammary carcinoma dataset, we performed three experiments: First, we performed a cell classification experiment on cropped-out patches centered around the annotated pattern. Second, we conducted an object detection experiment with the complete WSIs. Lastly, to investigate a domain transfer from canine tissue to human tissue, we repeated this same experiment on the largest publicly available dataset from human mammary carcinoma, TUPAC16 10 , and subparts thereof. We repeated all experiments five times independently to be able to report mean and standard deviation of the metrics.
Classification of centered patches. For the cell classification experiment, we utilized a standard CNN classification pipeline based on a ResNet-18 17 stem, pre-trained on ImageNet 24 . We used the implementation available in fast.ai 23 for this and employed the standard image transforms with their default parameters. Due to the high class imbalance, we used minority class oversampling and image augmentation. We initially trained only the randomly initialized classification head for a single epoch, while the rest of the network was frozen. Subsequently, we unfroze the network stem and trained the network for 10 epochs using the cyclic hyper-convergence learning-rate scheme of Smith 25 . During this, we already employed an early stopping paradigm to prevent overfitting. In this, the model with the highest accuracy score was chosen. As shown in Table 2, we find a steady increase of the area under the ROC curve (ROC AUC) with each curation step of the dataset: While the manually labeled dataset has a mean ROC AUC value of 0.926, the clustering-and object-detection supported set already results in a mean value of 0.944.

Detection of mitotic figures on WSIs. For the detection of mitotic figures on whole slide images (WSIs),
we employed two state-of-the-art methods: As the primary stage, we used a customized 26 RetinaNet 16 approach. We used RetinaNet, since it represents a good performance to complexity trade-off, and is available for many machine learning frameworks. As the second stage, we assessed candidates with the patch classifier described in the previous section. We chose this dual stage setup over approaches like Faster RCNN (which integrate two stages into a single network) because this approach offers a higher flexibility for sampling during training, and has been shown to be successful in mitotic figure detection 27 . For the RetinaNet, we used only a single aspect ratio (since mitotic figures and mitotic figure look-alikes can loosely be approximated by a quadratic bounding box), and three scales. The network was based on a ResNet-18 17 stem, pre-trained on ImageNet 24 . Besides standard augmentation methods, we took random crops from the whole slide images based on a sampling scheme 12 that allows it to drive network convergence by selecting patches with true mitotic figures as well as hard negative examples. Since the WSIs already show a quite high diversity in staining, no further color augmentation was used.
In the first step, the pre-trained network stem was frozen (learning rate = 0), to encourage a fast adaptation of the randomly initialized object detection head. Under these conditions, the network was trained using focal loss  Table 2. Cell classification experiment, based on cropped-out patches of the manually expert labeled (MEL), object detection-augmented and expert labeled (ODAEL) and the clustering + object detection-augmented and expert labeled (CODAEL) dataset variants. ROC AUC indicates the area under the receiver operating characteristic curve. Values represent mean ± standard deviation of five independent training and inference runs.
www.nature.com/scientificdata www.nature.com/scientificdata/ as loss function and Adam as optimizer for one epoch (consisting of 5,000 images, randomly selected according to the sampling scheme). Note that we use the term epoch in this context differently: as random crops from vastly big images introduce a great deal of randomness, it is complex to assess which parts have been used within the training already -which renders the usual definition of epoch useless. Thus, we arbitrarily define an epoch to contain 5,000 randomly sampled images. As the next step, the network stem was unfrozen and all weights were adapted for two cycles of 10 epochs using the super-convergence scheme 25 . Next, we trained the network for another 30 epochs, and performed a selection of the model with the lowest validation loss. Towards the end of these 30 epochs, the model would commonly be on the verge of overfitting, thus the model selection process was required to aid generalization. The train-and validation split was performed on the upper (training) vs. lower (validation) part of the WSIs of the training set. While this does not represent a truly statistically independent sample, it was the best compromise to ensure training stability while still keeping the amount of WSIs that are available for training high.
In the scope of detecting mitotic figures on whole slide images, we define the detection of a true positive (TP) when the detection and the mitotic figure annotation lie within 25 px distance of one another. If no detection (with a model score above threshold) is present within this distance, it is counted as a false negative (FN). On the contrary, if a detection is found outside the vicinity of an annotation, it is classified as a false positive (FP). Using the respective sum of the aforementioned counts over all slides, we define the F 1 score as: In the same way, we define precision (Pr) and recall (Re) as: Test performance improved when using the dual stage algorithm compared to the single stage algorithm, and higher degrees of algorithmic augmentation for dataset development(MEL to CODEAL dataset variant) further improved the F 1 metric (see Table 3). Standard deviations of five independent training and testing runs were small overall, proving that this dataset may be used to train algorithms reproducibly.
Methods of label agreement. Previous datasets of mitotic figures in human mammary carcinoma used a majority vote by a third pathologist for disagreed labels 8,10 . Therefore we also used this approach to establish the ground truth for the technical validation study. In a previous canine dataset of mitotic figures, a consensus by the same pathologists was used for ground truth 12 and we used this approach for the training of the networks used in the augmented datasets. This section aims to briefly compare these two methods. For the CODEAL datset variant, the consensus contained 30 more mitotic figures, thus there is a negligible difference in the total number of mitotic figures (0.22%). Training and testing with the consensus variant yielded an F 1 score of 0.790, which is comparable to the majority vote variant with the third expert (F 1 of 0.791 ± 0.012).
Both methods for label agreement have potential biases in regard to label consistency (reproducible decision criteria) and label accuracy (true label class). While the majority vote method may potentially be more representative of the general expert opinion, introducing another expert with variable decision criteria might potentially reduce label consistency. Although future studies need to examine the influence of these two different labeling methods, we could not find a notable difference in the total number of labels or algorithmic performance for our datasets.
Inter-species transfer: applicability on human breast tissue. Human mammary carcinoma is a cancer of high prevalence worldwide. Due to the similarity of canine mammary carcinoma and human mammary carcinoma, we aimed to investigate how applicable a system trained on the dataset proposed in this work would be on human mammary carcinoma. Between datasets of different origins, we can expect to observe a domain shift, effectively reducing the performance in cross-domain detection results. This domain shift might be caused by biological differences in tumor or normal tissue between humans and dogs and variable tissue processing workflows between different laboratories including variable WSI acquisition. Finally, we can also expect a difference in expert opinions between different ground truth datasets which might be reflected in lower recognition performance.
To investigate this, let us have a brief look at the TUPAC16 10 dataset: According to Veta et al., the dataset was acquired using two scanners of different types at three clinical environments. The first 23 cases, previously  Table 3. Performance assessment (F1 score) for mitotic figure detection on the test set of the three different dataset variants, mean and standard deviation for five independent training and inference runs.
www.nature.com/scientificdata www.nature.com/scientificdata/ released as the MICCAI AMIDA-13 dataset, were acquired using the Aperio ScanScope XT, while the remaining 84 cases were acquired using the Leica CS400 scanner. Both scanners have very different color representation (see Fig. 5), and thus cause a severe domain shift, that shall, however, not be the scope of this work, as we only wanted to investigate the general transferability of models trained on canine tissue. The images used for the canine specimens (scanned with Aperio ScanScope CS2) seem to have similar color representation as compared to human images scanned with the Aperio ScanScope XT.
To undertake the question of reduced generalization caused by different opinions of the annotators, we re-labeled the TUPAC16 set with very similar methods, using experts from the present study 28 .
Looking at Table 4, we can see that the domain transfer task yields mean F1 scores of 0.544 on the training set of TUPAC16. At first glance, this seems like a significant reduction in performance, compared to what the system trained on the same dataset achieved on the test set presented in this work. As shown in Table 4, the performance is already better if we limit the task to the AMIDA13 dataset, or the test set thereof, which can likely be explained by the similar color representation between the human and canine images. In contrast, testing on the TUPAC-test set, which comprises exclusively of images from the Leica SCN400 scanner, has a lower F 1 score. As the second stage of our algorithm is specifically trained to distinguish difficult patterns, it is likely to show a stronger dependency on the domain. Thus a change in color representation might lead to inferior performance, compared to the single stage approach as shown in the present study.
Performance was further improved when testing on the re-labeled ground truth, which likely has higher annotation consistency to the canine dataset. This underscores the importance of consistent labeling methods when algorithms are to be tested on independent datasets. We further performed model selection (MS) and optimization of the threshold (TO) on the AMIDA training set as a step towards further domain adaptation, resulting in a higher F1 score of 0.696.
A common approach when a big and more representative dataset and a smaller target-domain dataset are available is transfer learning. For this, a model that was trained on the original, large dataset is fine-tuned on the   www.nature.com/scientificdata www.nature.com/scientificdata/ smaller dataset. In order to estimate the remaining domain shift of the Aperio ScanScope-scanned slides of the TUPAC dataset (i.e., the AMIDA-13 dataset), we thus additionally performed model fine-tuning using the training part of the AMIDA-13 dataset, and evaluated it on the respective test set. Our results in Table 4 indicate that there is a residual domain shift, likely caused by variability in tissue processing or quality, or by the fact that only hot spot regions were annotated for the AMIDA-13 dataset, and not the whole WSI.
In summary, we can observe a strong domain shift that can be largely attributed to the scanner that was used for the digitization of images, and a much lesser domain shift caused by a combination of processing differences (such as staining, section thickness, etc.) or species. Additionally, we found a decrease in performance that can be attributed to label inconsistency, either caused by differences in expert opinion or by using a different labeling workflow. Results of this domain transfer experiment suggest that this dataset may be valuable for training or testing mitotic figure algorithms for human breast cancer as we provide annotations for entire WSIs.

Usage Notes
Annotations are provided in the SlideRunner database format 15 , which can be also used to view the WSIs with all annotations, but also in the popular MS COCO format. The latter, however, only contains the final annotation class and not the annotation history (i.e., the multiple labels given by multiple experts).

Code availability
All code used in the experiments described in the manuscript was written in Python 3 and is available through our GitHub repository (https://github.com/DeepPathology/MITOS_WSI_CMC/). We provide all necessary libraries as well as Jupyter Notebooks allowing tracing of our results. The code is based on fast.ai 23 and OpenSlide 29 and provides some custom data loaders for use of the dataset.