Cell segmentation for immunofluorescence multiplexed images using two-stage domain adaptation and weakly labeled data for pre-training

Cellular profiling with multiplexed immunofluorescence (MxIF) images can contribute to a more accurate patient stratification for immunotherapy. Accurate cell segmentation of the MxIF images is an essential step. We propose a deep learning pipeline to train a Mask R-CNN model (deep network) for cell segmentation using nuclear (DAPI) and membrane (Na+K+ATPase) stained images. We used two-stage domain adaptation by first using a weakly labeled dataset followed by fine-tuning with a manually annotated dataset. We validated our method against manual annotations on three different datasets. Our method yields comparable results to the multi-observer agreement on an ovarian cancer dataset and improves on state-of-the-art performance on a publicly available dataset of mouse pancreatic tissues. Our proposed method, using a weakly labeled dataset for pre-training, showed superior performance in all of our experiments. When using smaller training sample sizes for fine-tuning, the proposed method provided comparable performance to that obtained using much larger training sample sizes. Our results demonstrate that using two-stage domain adaptation with a weakly labeled dataset can effectively boost system performance, especially when using a small training sample size. We deployed the model as a plug-in to CellProfiler, a widely used software platform for cellular image analysis.

Immuno-oncology profiling requires a detailed assessment of the tumor microenvironment 1 . This includes identifying and quantifying different immune cell subsets, their spatial arrangement, and the expression of immune checkpoint markers on these cells 2 . Simultaneously characterizing both immune and tumor-related pathways can empower a more accurate patient stratification for immunotherapy [3][4][5] . Advances in imaging and automatic analysis (including artificial intelligence) can dramatically impact the ability to perform such characterization 1 . Immunofluorescent multiplexing, one of several multiplexing technologies that have recently become available, allows the labeling of different protein markers with immunofluorescent (IF) conjugated antibodies on the same tissue section simultaneously. Accurate cell segmentation on the multiplexed immunofluorescence (MxIF) images is an essential step when generating profiling information for downstream analysis.
Meijering 6 published a comprehensive review of the literature on methodologies for nuclear/cell segmentation, covering many conventional (i.e. non-deep learning) algorithms, including thresholding, filtering, morphological operations, region accumulation, and model fitting. Used alone, these methods are seldom able to achieve satisfactory results; more commonly, combinations of methods are used for specific nuclear/cell segmentation tasks. For example, Veta et al. 7 proposed a pipeline that used marker-controlled watersheds followed by postprocessing steps to segment cell nuclei on haematoxylin and eosin (H&E) stained images. They used color unmixing and morphological operations followed by a fast radial symmetry transform 8 to extract foreground and background markers for the marker-controlled watershed algorithm 9 . Software platforms have been developed to make it easier to adapt existing methods to new tasks 10,11 . For example, CellProfiler 11 […] of interest (ROI). Even if one could collect a very large number of image samples containing large variabilities and afford to hire a large group to annotate images, it is impossible to include images that represent all possible conditions, or to define annotation standards that fit all purposes. Domain adaptation/transfer learning adapts a model trained for one task to a target task by fine-tuning. For a CNN, this allows many feature filters (e.g. edge, intensity, etc.) to be learned in pre-training before fine-tuning. This approach has been effective in improving performance in many tasks, especially when the training sample size is not large 28 .
In this article, we present a deep learning-based pipeline for instance cell segmentation on MxIF images of DAPI (a nuclear stain) and Na + K + ATPase (a membrane stain, referred to here as MEM) stained tissue, using two-stage domain adaptation to train a Mask R-CNN model. We performed first-stage domain adaptation by fine-tuning the model using a weakly labeled dataset. The second-stage domain adaptation was performed using a dataset with manually annotated fine labels. We validated our pipeline on two different in-house datasets and one public dataset. The primary contributions of this paper are: (1) description of a two-stage domain adaptation method for whole-cell segmentation on MxIF images, which allows the model to achieve human-level performance and better than state-of-the-art performance when validated on different datasets; (2) demonstration that the use of a weakly labeled dataset (using samples from a similar but different domain) for first-stage domain adaptation substantially improves the final model performance overall and allows the model to achieve satisfactory results on datasets from different domains using only a few manually annotated samples for second-stage domain adaptation; (3) presentation of a recursive pipeline for generating high-quality weak labels of the cell membrane boundary, which requires a minimal amount of human labor; (4) deployment of our pipeline in CellProfiler as a plug-in for easy access by both developers and end users. Figure 1 is a schematic outline of the workflow for our method. A Mask R-CNN model, pre-trained using natural image data, was trained using two-stage domain adaptation by first fine-tuning with a weakly labeled dataset (see the workflow for generating weak labels in B and C) followed by fine-tuning with a manually annotated dataset (see the branch resulting in pre-train weak). For comparison purposes, the model was also fine-tuned using the manually labeled data only (see the branch of pre-train COCO).
The trained system was deployed as a plug-in to the CellProfiler platform. The plug-in deployment packages all the dependencies in an executable for easy access and use (refer to "System development" and "Results" for deployment details).

Data. This study was approved by our institutional Health Sciences Research Ethics Board, and all methods
were performed in accordance with the relevant guidelines and regulations. For the human tissues used in this study, informed consent was obtained from all participants and/or their legal guardian(s). For system validation, a total of 223 ROIs were used (see the detailed breakdown in Table 1). Three different datasets were used: (1) ovarian cancer, (2) breast cancer, and (3) mouse pancreatic tissue samples (a public dataset used for algorithm comparison purposes). Samples were split 50:50 for training and testing such that samples from the same case/mouse were never used for both training and testing (see the training and testing datasets listed separately in Table 1). For generating the weak labels for first-stage domain adaptation, a separate dataset (i.e. not overlapping with any of the 223 ROIs listed in Table 1) comprising a total of 210 ROIs (referred to as the weak-label-set) was created by sampling from one ovarian cancer tissue section of a training case. When sampling, we excluded tissue regions that had been sampled to create O-train. The weak-label-set of 210 ROIs was separated into three groups for weak label generation (see details in the weakly labeled data section below); the groups contain 26 (weak set-1), 16 (weak set-2), and 168 (weak set-3) ROIs respectively.
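The case-level 50:50 split described above can be sketched with scikit-learn's grouped splitting; the function and variable names here are illustrative assumptions, not our actual codebase.

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_case(roi_ids, case_ids, seed=0):
    """50:50 train/test split that keeps every ROI from a given
    case/mouse on the same side of the split."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    train_idx, test_idx = next(splitter.split(roi_ids, groups=case_ids))
    train = [roi_ids[i] for i in train_idx]
    test = [roi_ids[i] for i in test_idx]
    return train, test
```

Splitting on the case identifier (rather than the ROI) is what prevents ROIs from one patient or mouse appearing on both sides of the split.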
We performed multiplexing for both the ovarian and breast cancer samples at our laboratory. Ovarian cancer samples were supplied by Dr. Pamela Ohashi's laboratory (University Health Network, Toronto, ON). A breast cancer tissue microarray (TMA) was purchased from Pantomics (CA, USA). The mouse pancreatic data and expert annotations were downloaded from a publicly available dataset: https://warwick.ac.uk/fac/sci/dcs/research/tia/data/micronet.

Tissue imaging for in-house data
In this study, we used six ovarian cancer tissue sections from 6 patients and 40 breast cancer cores from a TMA created with samples from 20 patients. The tissues were imaged using a prototype immunofluorescence protein multiplexer (GE Research, Niskayuna NY, USA). The antibodies used for sequential staining in this work included: CD3, CD4, CD8, Ki67, CD68, PD1, and PDL1. Na + K + ATPase (MEM) and Ribosomal S6 were used as membranous and cytoplasmic markers respectively, and 4′,6-diamidino-2-phenylindole (DAPI) was used as a nuclear marker. The formalin-fixed paraffin-embedded tissues were imaged on glass slides at 0.293 µm/pixel using MxIF and registered using the Layers software (GE Research, Niskayuna NY, USA).

Manually annotated data
All ROIs (in Table 1) were manually contoured by human observers. The in-house datasets, including the ovarian and breast cancer tissue samples, were annotated by a researcher (Han) trained by a biologist (Cheung) and a pathologist (Liu). O-test2 (a subset of six ROIs from one section of the testing dataset) was annotated by three observers (i.e. Han, Cheung, and Martel) independently. The annotation contoured the cell boundary, defined by the cell membrane edge seen on the MEM channel (see the white annotation in A-B as an example in Fig. 6). For each ROI, annotation was performed in the Sedeen viewer (www.pathcore.com/sedeen). After conversion of the single-channel DAPI and MEM images from 16-bit to 8-bit, the images were stacked in the R-G-B color space in the order DAPI-MEM-DAPI. The annotator could switch between the DAPI channel, the MEM channel, and the stacked color images (SCI) during the process. A polygon tool was used to contour each individual cell. Cells were excluded if they met any of the following conditions: (1) overlapping nuclei; (2) membrane crossing over the nuclei; (3) out-of-focus nuclei; (4) faintly stained nuclei. The manual annotation process took approximately 2.5 hours per ROI for each observer.
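The conversion and stacking step can be sketched as follows (a minimal NumPy sketch; the min-max rescaling is an illustrative assumption, as the exact 16-to-8-bit mapping is not specified above):

```python
import numpy as np

def to_8bit(img16):
    """Linearly rescale a 16-bit channel image to 8-bit (min-max assumption)."""
    img = img16.astype(np.float32)
    lo, hi = img.min(), img.max()
    scaled = (img - lo) / (hi - lo + 1e-9) * 255.0
    return scaled.astype(np.uint8)

def stack_dapi_mem(dapi16, mem16):
    """Stack DAPI and MEM channels into an R-G-B image in DAPI-MEM-DAPI order."""
    dapi8, mem8 = to_8bit(dapi16), to_8bit(mem16)
    return np.dstack([dapi8, mem8, dapi8])
```

Placing DAPI in both the red and blue channels makes nuclei appear magenta against green membranes, which helps the annotator distinguish the two stains in the stacked view.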

External mouse pancreatic data
The publicly available dataset of mouse pancreatic tissue sections was acquired from https://warwick.ac.uk/fac/sci/dcs/research/tia/data/micronet. The images were generated using a different stain (i.e. E-cadherin) for the membrane. These tissues were manually contoured by expert biologists 19 . The annotations were carried out using different criteria to those we adopted; each cell was required to be separated from its neighbor by a gap, and all cells, including those that were faintly stained or out of focus, were included.

Figure 1. […] R-CNN model using domain adaptation with and without pre-training using weakly labeled ovarian data. The corresponding final models are referred to as pre-train weak and pre-train COCO respectively. The dotted and dashed boxes (i.e. B and C) demonstrate the process of generating weak annotations for the cell boundaries: (B) describes using the Na + K + ATPase image to generate weak (i.e. rough estimation) annotations for cell boundaries by seeded watershed using nuclear labels as seeds; (C) demonstrates the semiautomatic nuclear segmentation using a bootstrapped U-Net after CellProfiler labeling and manual editing of the nuclear annotations. This figure was created using diagrams.net (https://www.diagrams.net/).

(1) Method overview. Our method for generating weak labels of whole cell boundaries includes (1) nuclear segmentation on the DAPI channel images and (2) cell membrane segmentation on the MEM channel images. The workflow can be found in B and C in Fig. 1. In the nuclear segmentation stage (see the process in C in Fig. 1), we created a pipeline using conventional algorithms in CellProfiler to segment nuclei on a small set of samples (i.e. weak set-1). To improve the label quality for training a U-Net, a human observer participated in editing/correcting the segmentation results (e.g. splitting touching/overlapped objects (Fig. 2b)). We then trained a U-Net using weak set-1.
The trained model was used to generate nuclear labels for weak set-2. A second round of human editing corrected the U-Net-generated labels (Fig. 2d-f). Finally, we trained the U-Net using both weak set-1 and weak set-2 and used the trained U-Net to label nuclei for weak set-3. Given the nuclear labels from the weak-label-set, we then produced cell boundary weak labels using marker-controlled watershed (B in Fig. 1). With this recursive cycle of human editing and U-Net training, on successive iterations the human observer only needs to correct a small number of nuclei that were incorrectly segmented/labeled. The details of the methods are described in the following sections.
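The recursive bootstrapping described above can be summarized in schematic code; every callable (the CellProfiler labeler, U-Net training, and manual editing) is a placeholder standing in for the corresponding step, not a real API.

```python
def bootstrap_labels(sets, initial_labeler, train_unet, manual_edit):
    """Recursive weak-label generation: label a small set, correct it
    manually, train a model on everything labeled so far, and use the
    model to label the next (larger) set. The last set is labeled
    automatically without manual correction."""
    labeled = []
    # First set: conventional pipeline (e.g. CellProfiler) + manual editing.
    labels = manual_edit(initial_labeler(sets[0]))
    labeled.append((sets[0], labels))
    # Intermediate sets: model prediction + manual correction.
    for images in sets[1:-1]:
        model = train_unet(labeled)
        labeled.append((images, manual_edit(model(images))))
    # Final (largest) set: fully automatic labeling.
    final_model = train_unet(labeled)
    return labeled + [(sets[-1], final_model(sets[-1]))]
```

With three sets of 26, 16, and 168 ROIs, manual effort is spent only on the two small sets, while the large third set is labeled entirely by the second-round model.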
(2) Semi-automatic nucleus segmentation. For weak set-1, nuclei were segmented on the DAPI channel images using Otsu's method 29 for thresholding followed by seeded watershed, performed in CellProfiler (see an example result in Fig. 2a). Segmentation results were then reviewed and manually edited using the Sedeen viewer 30 (Fig. 2b). The DAPI channel images of weak set-1 and the nuclear segmentation results (Fig. 2c) were used to train a U-Net 14 for nuclear segmentation. Segmentation was then performed on the weak set-2 DAPI channel images using the trained U-Net. These segmentation results were reviewed and edited manually (Fig. 2d-f). Both weak set-1 and weak set-2 were used to train another U-Net. Nuclear segmentation was then conducted on weak set-3 using that trained model.
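A rough scikit-image equivalent of this CellProfiler pipeline (Otsu thresholding followed by a distance-transform seeded watershed) might look like the sketch below; the smoothing sigma and peak separation are illustrative assumptions, not the pipeline's actual settings.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import threshold_otsu, gaussian
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def segment_nuclei(dapi, min_distance=5):
    """Rough nuclear segmentation: Otsu foreground mask, then a
    distance-transform seeded watershed to split touching nuclei."""
    smoothed = gaussian(dapi.astype(np.float32), sigma=1.0)
    mask = smoothed > threshold_otsu(smoothed)
    distance = ndi.distance_transform_edt(mask)
    peaks = peak_local_max(distance, min_distance=min_distance, labels=mask)
    seeds = np.zeros(mask.shape, dtype=np.int32)
    for i, (r, c) in enumerate(peaks, start=1):
        seeds[r, c] = i
    # Flood the inverted distance map from the seeds, restricted to the mask.
    return watershed(-distance, seeds, mask=mask)
```

Results like these still require the manual splitting/merging pass described above before they are good enough to train the U-Net.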
Our U-Net 14 implementation is based on the implementation (https://github.com/carpenterlab/unet4nuclei/) described by Caicedo et al. 31 for nuclear segmentation. We adapted the code to PyTorch. We used DAPI channel images as input, and three-class label maps (i.e. color images in which red labels the nuclear boundary pixels, blue labels the pixels within the nucleus, and green labels the background pixels) as ground truth for training (see Fig. 2c,f).
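The three-class encoding can be produced from an instance label image as follows (a sketch assuming scikit-image; the color assignment mirrors the description above):

```python
import numpy as np
from skimage.segmentation import find_boundaries

def three_class_map(instance_labels):
    """Encode an instance label image as a three-class RGB target:
    boundary (red), inner nucleus (blue), background (green)."""
    boundary = find_boundaries(instance_labels, mode='inner')
    interior = (instance_labels > 0) & ~boundary
    background = instance_labels == 0
    rgb = np.zeros(instance_labels.shape + (3,), dtype=np.uint8)
    rgb[boundary, 0] = 255    # red: nuclear boundary pixels
    rgb[background, 1] = 255  # green: background pixels
    rgb[interior, 2] = 255    # blue: pixels within the nucleus
    return rgb
```

Predicting an explicit boundary class is what lets the network keep touching nuclei apart as separate instances at inference time.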

(3) Automatic cell boundary segmentation using marker-controlled watershed
For the weak-label-set, nuclear segmentation was performed as described above. Next, cell segmentation (see results in Fig. 3b) was performed on the MEM channel images (Fig. 3c) using seeded watershed 9 (see B in Fig. 1). The nuclear segmentation results (see Fig. 3a) include three labels: (1) inner nucleus; (2) nuclear boundary; (3) background. A morphological erosion was performed on the inner nucleus using a 3 × 3 disk to reduce the size of the marker. The resulting regions were used as seeds to perform watershed segmentation of the cells. Some segmented regions were found to correspond to large background regions; therefore, a size filter was used to remove any regions ≥ 1400 pixels (see grey regions in Fig. 3b).
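A minimal sketch of this step using scikit-image is given below (our actual implementation used Matlab's seeded watershed; `disk(1)` is a 3 × 3 structuring element approximating the disk described above):

```python
import numpy as np
from skimage.morphology import erosion, disk
from skimage.segmentation import watershed

def segment_cells(mem, inner_nucleus_labels, max_region_size=1400):
    """Weak cell-boundary labels: erode the nuclear markers, run a
    seeded watershed on the membrane image, and drop oversized regions
    (>= max_region_size pixels) that correspond to background."""
    eroded = erosion(inner_nucleus_labels > 0, disk(1))  # shrink the markers
    seeds = np.zeros_like(inner_nucleus_labels)
    seeds[eroded] = inner_nucleus_labels[eroded]
    cells = watershed(mem, seeds)  # membrane intensity acts as the ridge map
    # Size filter: regions too large to be cells are treated as background.
    for lab in np.unique(cells):
        if lab and (cells == lab).sum() >= max_region_size:
            cells[cells == lab] = 0
    return cells
```

The membrane image itself serves as the watershed topography, so the flooding fronts from neighboring nuclear seeds meet along the brightly stained membrane ridges.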

System development.
Our proposed method (pre-train weak) used the following steps (see workflow in Fig. 1). Initially, the Mask R-CNN model was pre-trained for instance segmentation using a large dataset of natural images (i.e. the MS COCO dataset 32 ). A two-stage domain adaptation was then performed: (1) fine-tuning with the weak-label-set; (2) fine-tuning with the manually annotated datasets. Weak cell boundary labels were generated using the semi-automatic method described above (see steps B and C in Fig. 1). For comparison purposes, experiments were also performed on a model without pre-training using the weakly labeled dataset (pre-train COCO) (see Fig. 1). In the experiment using the public dataset (see details in the experimental design below), a model (pre-train breast) was fine-tuned from the final model trained for the breast cancer dataset.

Table 1. Dataset information (the notation of each dataset is given in brackets).

Mask R-CNN 21 is a deep learning architecture for instance segmentation (see workflow in Fig. 4). There are two stages: (1) generating proposals for the regions where objects might exist using a region proposal network (RPN) described by Lin et al. 33 ; (2) based on the proposals, performing classification (i.e. classifying each proposal into object labels), refining the bounding boxes of the proposals, and segmenting the objects. The […] (Fig. 4), and these were extracted from the cell labels as described in the previous section.
During training, the model was initialized with the pre-trained weights generated using the MS COCO dataset 32 (https://github.com/matterport/Mask_RCNN/releases/download/v1.0/mask_rcnn_coco.h5). ResNet-101 was used as the backbone. To perform cell segmentation, the network heads were first trained for 20 epochs, and then all layers were trained for 40 epochs. Binary cross entropy was used in the multi-task loss function, and stochastic gradient descent (SGD) as the optimizer. The training parameters were: batch size = 6, learning rate = 0.0001, weight decay = 0.0001, momentum = 0.9. Gradients were clipped to 5.0 to prevent gradient explosion. The input images were scaled up by a factor of two using nearest-neighbor interpolation and normalized across channels. The input images were augmented randomly using the following methods: (1) flipping vertically/horizontally for 50% of all images, (2) blurring images using a Gaussian kernel with a sigma of 5.0, (3) multiplying by a random value between 0.8 and 1.5 to darken/brighten the input images, (4) randomly rotating the images by 90°, 180° or 270° using an affine transform. The model was fine-tuned using a setup identical to the training stage.
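The augmentation scheme can be sketched as below (a NumPy/SciPy sketch for the image only; in real training the geometric transforms must also be applied to the corresponding instance masks, and the 50% blur probability here is an assumption, as the fraction of blurred images is not specified above):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def augment(image, rng=None):
    """Random augmentation approximating the training setup described
    above: 50% flips, Gaussian blur (sigma 5), brightness scaling in
    [0.8, 1.5], and right-angle rotations."""
    if rng is None:
        rng = np.random.default_rng()
    img = image.astype(np.float32)
    if rng.random() < 0.5:
        img = img[:, ::-1]                 # horizontal flip
    if rng.random() < 0.5:
        img = img[::-1, :]                 # vertical flip
    if rng.random() < 0.5:                 # blur probability: assumption
        img = gaussian_filter(img, sigma=5.0)
    img = img * rng.uniform(0.8, 1.5)      # darken/brighten
    k = rng.integers(0, 4)                 # 0, 90, 180 or 270 degrees
    return np.rot90(img, k)
```

Restricting rotations to multiples of 90° avoids interpolation artifacts and keeps the mask targets exactly aligned with the rotated image.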
The method was deployed as a CellProfiler plug-in for cell segmentation for MxIF images. As analysis pipelines also require nucleus segmentation, a separate nuclear segmentation model was trained with the 2018 Data Science Bowl nucleus dataset 35 and fine-tuned with the dataset with nuclear segmentation labels that were used for generating weak cell boundary labels. The same setup was used for training and fine-tuning as described above.
Our implementation of seeded watershed used Matlab 2018b (The MathWorks, Natick, MA). The seeded watershed algorithm and manual annotation were performed on a PC with an AMD Ryzen 5 2600 CPU, 16 GB memory, and a Gigabyte GeForce RTX 2070 8 GB GPU. Our U-Net implementation and the training and fine-tuning of Mask R-CNN were performed on our server using a single Nvidia Titan XP GPU with dual Intel Xeon E5-2660 CPUs and 128 GB memory. Experimental validations were performed using Compute Canada (www.computecanada.ca) with one node using 48 CPU cores.
Our system was deployed using the Flask framework 36 , which included the plug-in and the executable components. The plug-in is a Python script written in the CellProfiler plug-in format. The executable was compiled using Pyinstaller-4.1 with our source code and run-time environment dependencies.

Experimental design. Experiment validating against multi-observer annotations. The system was trained using O-train and validated on O-test2 against each observer's annotation respectively, and the results were averaged. To evaluate the annotation agreement between different observers as a baseline for comparison, we validated the annotations between all pairs of observers and averaged the pairwise results.
In order to evaluate the potential impact of observer annotation, we also conducted a comparative analysis on O-test2. We compared the system results when validated against the researcher's annotation and against the expert's annotation. We also compared those results to the inter-observer agreement between the researcher (Han) and the expert (Cheung). […] 19 used in their study.
Error metrics. Object-Dice (OD) and object-Hausdorff (OH) were computed as metrics of system performance 37 . The Dice coefficient 38 measures how well the segmented region matches the ground truth. The Hausdorff distance 39 measures the longest distance between the segmentation boundary and the ground truth boundary, giving the largest misalignment. Computing these metrics at the object level takes object identification into consideration. For example, touching cells that are well segmented as a whole but not properly separated can have a high Dice value but a lower Object-Dice value, which is weighted by cell size.
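Object-Dice, for instance, can be sketched as follows (each object is matched to its largest-overlap counterpart, and per-object Dice scores are size-weighted and averaged symmetrically over ground truth and segmentation; this illustrates the idea rather than reproducing the exact reference implementation 37 ):

```python
import numpy as np

def dice(a, b):
    """Plain Dice coefficient between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum() + 1e-9)

def object_dice(gt, seg):
    """Object-level Dice between two instance label images
    (0 = background); symmetric and size-weighted."""
    def one_way(src, dst):
        ids = [i for i in np.unique(src) if i != 0]
        total = sum((src == i).sum() for i in ids)
        score = 0.0
        for i in ids:
            mask = src == i
            overlap = dst[mask]
            overlap = overlap[overlap != 0]
            if overlap.size:  # match to the largest-overlap object in dst
                match = np.bincount(overlap).argmax()
                score += (mask.sum() / total) * dice(mask, dst == match)
        return score
    return 0.5 * (one_way(gt, seg) + one_way(seg, gt))
```

An unmatched object contributes zero, so under-segmented or merged cells pull the score down even when the pixel-level Dice is high.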

Approval for human experiments. This study was approved by our institutional Health Sciences
Research Ethics Board (Sunnybrook Research Institute Health Sciences Research Ethics Board, Toronto, Ontario, Canada).

Results
Experiment validating against multi-observer annotations. The results are shown in Table 2. Pre-train weak yielded comparable results to the multi-observer agreement in both OD and OH, while pre-train COCO showed inferior results, with slightly lower OD and higher OH.
Comparative results are shown in Table 3. The annotation agreement between the researcher (Han) and the expert (Cheung) is close to the multi-observer agreement (Table 2). For pre-train weak, the results validated against Cheung are close to the multi-observer agreement (Table 2). When validated against Han's annotation, which was used for second-stage domain adaptation, the results show higher agreement (OD of 0.80 and OH of 7.86). This result is similar to the results on other datasets with single-observer annotations (see the results reported in Fig. 5). In Table 3, the differences between the two pairwise results are within the range of the standard deviations of the two. Based on our observations, including those of the two expert observers (Liu and Cheung), the primary disagreements, both between the human observers and between the model and each observer, come from borderline cases where the exclusion rules are open to interpretation.
Experiments validating on three different datasets using different training sample sizes. The results are shown in Fig. 5. In general, pre-train weak (blue plots) yields higher OD and lower OH than the pre-train COCO (orange plots) in all the experiments, suggesting that it provides superior performance. The performance differences between the two methods are larger when the training sample sizes are smaller for the experiments using ovarian cancer and external datasets. Compared to pre-train weak, for both metrics, pre-train COCO has higher standard deviations and shows larger variation (i.e. changes between the points) when using different training sample sizes.
For all methods, we observed increased OD and decreased OH as the training sample size increases, and the two-stage domain adaptation methods (i.e. pre-train weak and pre-train breast) are less sensitive to training sample size, with better performance than pre-train COCO. The pre-train weak method achieves close to optimal OD and OH when using training sample sizes of more than 600 cells in all experiments. This also applies to the pre-train breast method applied to the public dataset. In contrast, the pre-train COCO method requires a larger training sample size and shows large variations in OD and OH. In the experiments using the ovarian and breast cancer datasets, pre-train weak using a training sample of one ROI achieves higher OD and lower OH than pre-train COCO using all the available training data (A-D in Fig. 5). Similarly, pre-train weak used three ROIs for training to achieve better performance than pre-train COCO using all the training samples in the experiment on the public dataset (E and F in Fig. 5). We note that the available training set in the public dataset includes 90 ROIs, containing 8426 annotated cells.
For the experiment using the publicly available data, pre-train breast and pre-train weak have similar ODs and OHs for all training sample sizes (E-F in Fig. 5). When using a training sample size of 72 ROIs for direct comparison, both methods have superior performance compared to Micro-Net (Table 4). The performance superiority remains when using a sample size as small as one ROI for training (Fig. 5 and Table 4). Pre-train COCO also has a slightly higher OD and a much lower OH than Micro-Net at the training sample size of 72. All methods using Mask R-CNN, regardless of training sample size, have lower OHs than Micro-Net using 72 ROIs for training (Fig. 5 and Table 4). In addition, all methods, regardless of training sample size, show substantially higher OD and lower OH than the U-Net reported by Shan et al. 19 (OD of 0.67; OH of 40.39) using a training sample size of 72 ROIs (Fig. 5 and Table 4). Figure 6 shows an example of our experimental results.

Example visual results from the experiments.
We observe that most system outputs are closely aligned with the manual annotations, with pre-train weak showing slightly better alignment than pre-train COCO (see cyan and white contours in A-B). Each cell is contoured by the system independently without broken contours (see color masks in C-D). Cells with very tight boundaries are also contoured by both pre-train weak and pre-train COCO (see cells in white dashed box 1 in A-B). Pre-train weak yields fewer FNs and FPs than pre-train COCO (FNs shown as red contours and FPs as yellow contours in A-B), especially in regions where the cells are tightly packed with weakly stained boundaries (see cell boundaries indicated by yellow arrows in Region 2 and result maps A-B). Figure 7 shows an example of our experiments on the public dataset. In general, all outputs are closely aligned with the manual annotations. Pre-train weak and pre-train breast show similar performance, with the closest alignment to the manual annotations. In contrast, Micro-Net produces over-segmented cells whose masks have irregular and fragmented shapes (see Regions 1 and 2 in the result maps, indicated by yellow arrows), which results in a large Hausdorff distance. Pre-train COCO shows a lower degree of alignment to the manual annotation than pre-train breast and pre-train weak (see the example indicated by the green arrow in the result maps). It also over-segments a cell (see Region 3 and the cell indicated by the orange arrow in the pre-train COCO result map).

Implementation deployment. The workflow is shown in Fig. 8. We set up the inputs and outputs in the CellProfiler interface (D). The input images are the DAPI (a nuclear stain) and Na + K + ATPase (a membrane stain) channel images for nuclear and cell segmentation. The pan-cytokeratin channel image (C) is an example image for cellular profiling. Once the executable is successfully launched (E), the "Analyze image" button is clicked to run the algorithm. The segmentation results are overlaid on the color image (F). These results were applied to the pan-cytokeratin channel image for cellular profiling, and the resulting image is presented (G). Users can profile images from multiple channels simultaneously by adding more images as input. A detailed demo is available at: https://youtu.be/gpLDjPQJF8Q. The setup tutorial is available at: https://www.youtube.com/watch?v=sirRJc-A4tc. The package is publicly available at: https://drive.google.com/drive/folders/1WBYFH9bf89s-xjQNZHKSGdFov08h0iFG?usp=sharing. The source code is available at: https://github.com/WenchaoHanSRI/DeepCSeg.

The source code and testing data for the three experiments are available at: https://github.com/WenchaoHanSRI/Cell-Segmentation/tree/main/Mask_RCNN-master.

Discussion
We have described a two-stage domain adaptation pipeline to train a Mask R-CNN for instance cell segmentation of MxIF images, using a weakly labeled dataset for pre-training. The trained model provides end-to-end instance segmentation without the need for any pre- or post-processing steps. The segmentation results were validated against three different manually annotated datasets, with a performance that matches multi-observer agreement on the ovarian cancer dataset (Table 2) and exceeds the state-of-the-art on the public mouse pancreatic dataset (E-F in Fig. 5 and Table 4). We deployed our model to a widely used software platform, CellProfiler, as a plug-in for easy access. The plug-in runs using an executable backend without the need to install any software dependencies. It performs instance nuclear and cell segmentation followed by cellular profiling for the target image channel(s). The plug-in demonstrates the potential for supporting more efficient immunoprofiling using MxIF images (Fig. 8).
Our method using two-stage domain adaptation boosted the model performance, and the advantage is more obvious when the training sample size is small. This is demonstrated in Fig. 5 where the pre-train weak network (blue plots) shows higher OD and lower OH than the pre-train COCO network (orange plots) in all experiments; both models used the same network (i.e. Mask R-CNN) but were trained differently. In the experiment using the public dataset, pre-train COCO shows a similar OD to the method of Shan et al. 19 regardless of the training sample size. In contrast, the two-stage domain adaptation methods showed a better performance than the state-of-the-art even when only a few ROIs were used for fine-tuning (E-F in Fig. 5 and Table 4). In addition, the two-stage domain adaptation approach showed more robust and consistent performance than the single domain adaptation method (i.e. pre-train COCO). The volatility of the pre-train COCO results may be due to the uneven distribution of different training samples, and it is possible that our method makes the model less sensitive to the selection of training samples by using a large number of weakly labeled samples for pre-training.
Our method of generating weak labels is efficient for the first-stage domain adaptation, creating a large number of weakly labeled data of sufficient quality for pre-training. First, compared to existing methods (including conventional methods and a deep learning-based approach), we used a combination of conventional algorithms and a deep learning model (i.e. U-Net) with a recursive manual label correction process. Our method takes advantage of the low computational cost of conventional methods to create a set of coarse labels for a small number of samples. We also take advantage of the batch-processing capacity and accurate segmentation of U-Net to generate more labels. The recursive process of manual correction and U-Net training improved the label quality (see example images in Figs. 2 and 3) and increased the training sample size with minimal labor cost. Generating these weak labels is also much easier than manual annotation: one ROI containing approximately 200 cells takes about 3 hours to annotate, whereas the weak labels take about 3 minutes of manual editing. Importantly, the mouse pancreatic data experiment shows that pre-training with these weak labels performs as well as pre-training with manual labels (E and F in Fig. 5, pre-train weak vs. pre-train breast, which was pre-trained with 2000 manually annotated cells). We speculate that the first-stage fine-tuning effectively trained the network to identify cell objects. This can also explain our observation that the primary difference in results between pre-train weak and pre-train COCO is the number of FNs: pre-train COCO was not pre-trained with these additional samples and therefore produces more FNs (A-B in Fig. 6). Finally, pre-training the network with weakly labeled data from one dataset allows the model to be fine-tuned with a very small amount of labeled data (i.e. approximately 600 cells) from the target dataset to achieve a performance close to the optimum (obtained when fine-tuning used more than 8000 cells), as shown in Fig. 5, even when the annotation criteria differ slightly from those of the initial dataset (i.e. mouse annotations that do not touch 19 vs. our in-house annotations that touch). It is extremely impractical to annotate thousands of cells for each task, and our method enables easier adaptation using a very small sample of manually annotated data for fine-tuning.
Our method and experimental results should be interpreted with the following limitations in mind. First, although our study used a large sample from three different datasets, the sample size is still limited, and further improvements in performance may be achievable when a larger sample is available for hyper-parameter tuning. Second, although we used three different datasets, with annotations by four different observers, we cannot rule out annotation bias, as each dataset was primarily annotated by a single observer. We expect the model may be biased toward the annotator whose annotation is used for second-stage fine-tuning (Table 3 and Fig. 5). Our comparative analysis indicated that this bias may not substantially impact results even when validating against a different annotation: for example, comparing the OD and OH between the two observers, and between the model and one of the observers, in Table 3, the differences between the two pairwise results are within the range of the standard deviations of the two. In addition, based on observation of the samples from the comparative analysis, the disagreement primarily arises from the "include/exclude" rules. In application, since users may need to annotate a few ROIs for second-stage domain adaptation for their specific task, our reported results should still validly reflect the model performance in application. We have deployed our method in an open-source platform and the source code is publicly available; we recommend that this tool be validated by a broader group of users in the near future. Third, in comparing our method to the most relevant works 12,19 , it is not immediately clear whether Mask R-CNN is superior to Micro-Net for the experimental task, because Micro-Net was pre-trained with neither the MS COCO dataset nor the weakly labeled data. However, Mask R-CNN is suitable for our need for end-to-end instance-level cell segmentation.
Finally, in order to keep the model generic, our results did not include any post-processing steps, which usually require manually set hyper-parameters. In practice, post-processing steps may help to further improve performance for specific tasks.