A generalized deep learning framework for whole-slide image segmentation and analysis

Histopathology tissue analysis is considered the gold standard in cancer diagnosis and prognosis. Whole-slide imaging (WSI), i.e., the scanning and digitization of entire histology slides, is now being adopted in pathology labs across the world. Trained histopathologists can provide an accurate diagnosis of biopsy specimens based on WSI data. However, given the dimensionality of WSIs and the increasing number of potential cancer cases, analyzing these images is a time-consuming process. Automated segmentation of tumorous tissue helps improve the precision, speed, and reproducibility of research. In the recent past, deep learning-based techniques have provided state-of-the-art results in a wide variety of image analysis tasks, including the analysis of digitized slides. However, deep learning-based solutions pose many technical challenges, including the large size of WSI data, heterogeneity in images, and complexity of features. In this study, we propose a generalized deep learning-based framework for histopathology tissue analysis to address these challenges. Our framework is, in essence, a sequence of individual techniques in the preprocessing-training-inference pipeline which, in conjunction, improve the efficiency and generalizability of the analysis. The combination of techniques we introduce includes an ensemble segmentation model, division of the WSI into smaller overlapping patches while addressing class imbalances, efficient inference techniques, and an efficient, patch-based uncertainty estimation framework. Our ensemble consists of DenseNet-121, Inception-ResNet-V2, and DeeplabV3Plus, where all the networks were trained end to end for every task. We demonstrate the efficacy and improved generalizability of our framework by evaluating it on a variety of histopathology tasks, including breast cancer metastases (CAMELYON), colon cancer (DigestPath), and liver cancer (PAIP).
Our proposed framework achieves state-of-the-art performance across all these tasks and is currently ranked within the top 5 in the challenges based on these datasets. The entire framework, along with the trained models and related documentation, is made freely available on GitHub and PyPI. Our framework is expected to aid histopathologists in accurate and efficient initial diagnosis. Moreover, the estimated uncertainty maps will help clinicians make informed decisions and plan further treatment or analysis.

This motivates the development of computer-aided methods for histopathology analysis to assist pathologists. With the increasing prevalence of whole-slide imaging (WSI) scanners that can scan the entire tissue sample, in-silico methods of conducting pathology analysis can now be explored (Madabhushi and Lee, 2016).
Broadly, histopathology analysis is of two types, namely (1) classification analysis and (2) segmentation analysis. Classification analysis helps in classifying a WSI as benign or malignant (Nanthagopal and Rajamony, 2013; Guray and Sahin, 2006). Classification can also be used to study cancer subtypes directly (Malhotra et al., 2010). Segmentation analysis helps in detecting and separating tumor cells from normal cells (Wählby et al., 2004; Xu et al., 2016). This analysis can also be used for the aforementioned classification task and for other analyses like pN-staging and tumor burden estimation.
Automatic histopathology analysis is plagued by a myriad of challenges (Tizhoosh and Pantanowitz, 2018): 1. Insufficient training samples: as the data extraction process for medical images is expensive and infrequent, class imbalance issues arise, adding an inherent skewness to the dataset. This makes it hard to adapt algorithms developed for natural images to medical data. 2. Complexity of features: deep networks are effective at feature extraction, but understanding these extracted features and extracting meaningful information from them is challenging.
The organization of the paper is as follows. Prior work on histopathology tissue analysis using deep learning methods is discussed in section 1.1, followed by our contributions in section 1.2. In section 2.2, we introduce the datasets used in this work, and all the preprocessing techniques are discussed in section 2.3. In section 2.4, we discuss the multiple deep convolutional networks used, followed by the training and inference pipelines in sections 2.7 and 2.8, respectively. In sections 2.9, 2.10, and 2.6, we describe pN-staging, tumor burden, and uncertainty analysis. Experimental analysis and comprehensive results are presented in sections 3 and 4. In section 5, we describe our open-source project (DigiPath AI), and in section 6 we discuss our results and conclude with future directions of work.

Deep learning methods for histopathology image analysis
Deep learning methods have shown great success in various tasks involving image, graph, and text data. In the field of medical imaging and diagnosis, deep learning models have achieved human-like results on many specific problems (Kermany et al., 2018; Bakas et al., 2018; Weng et al., 2017; Rajpurkar et al., 2017). One medical application where deep learning could have a very high impact, if leveraged accurately, is the automation of histopathology analysis. In recent years, grand challenges (Li et al., 2019; CAM, 2017a,b; PAIP, 2019; ICAIR, 2019) have been encouraging researchers to collectively work on histopathological data using deep learning-based solutions by providing labeled data.
Histopathology slides provide a comprehensive view of disease on the tissue and are still considered the gold standard for cancer diagnosis (Gurcan et al., 2009). In the recent past, many deep learning-based methods have approached problems of nuclei segmentation (Sirinukunwattana et al., 2016), liver tumor segmentation (Kaluva et al., 2018), epithelial tumor tissue segmentation (Shapcott et al., 2019), Gleason grading (Arvaniti et al., 2018), signet ring cell detection (Li et al., 2019), and many more. Deep learning-based methods are increasingly used in the context of histopathology analysis because of their ability to automatically discover the representations needed for feature detection from raw data.
Deep learning has also helped in merging multi-domain information for diagnosis (Bagari et al., 2018): since models learn to associate important features from each domain, this has been shown to provide better results.

Lymph node metastases of breast cancer
Breast cancer is one of the most common cancers among women in the world.
The prognosis of breast cancer patients is mainly based on the extent of metastases (Sites, 2014). Metastasis refers to the spreading of cancer from where it originally started to different parts of the body. This usually occurs when cancer cells break away from the main tumor and enter the bloodstream or lymphatic system. A formally accepted way to classify the extent of cancer is the TNM staging criteria (Amin and Edge, 2017; Sobin et al., 2011). The TNM staging system takes into account the size of the tumor (T-stage), cancer spread to regional lymph nodes (N-stage), and metastasis of the tumor to other parts of the body (M-stage). In the case of breast cancer, the lymph nodes are usually the first location to be metastasized. With the help of the sentinel lymph node procedure (Giuliano et al., 2011, 2017), the most likely metastasized lymph nodes are excised and taken for further histopathological processing and examination by a pathologist. The excised node is preserved by fixing in formalin and embedding in a paraffin wax block, enabling the cutting of micrometer-thin slices of the tissue. These tissue slices are placed on glass slides and stained with hematoxylin and eosin (H&E). This staining enables the pathologist to differentiate between the nuclear (blue) and cytoplasmic (pink) parts of the cell, thereby providing a general overview of the tissue's structure.
Clinically, metastases are divided into one of three categories, namely: isolated tumor cells (ITC), micro-metastases, and macro-metastases. The categorization is based on the diameter of the largest tumor cell cluster (Sobin et al., 2011). A simplified version of the pN-staging scheme used in the CAMELYON17 Challenge (Bejnordi et al., 2017) is provided in Table 2. However, this diagnostic procedure for assessing lymph node status is challenging for pathologists, as a large tissue area has to be examined in detail under a microscope at several magnification levels to identify metastases. This process is also tedious, prone to missing small metastases, and requires extensive time from the pathologist for a thorough examination.
The advent of whole-slide imaging scanners has enabled digitization of glass slides at very high resolution. Typical whole-slide images (WSIs) are on the order of gigapixels and are usually stored in a multi-resolution pyramidal format. These WSIs are suitable for developing computer-aided diagnosis systems that automate the pathologist's workflow, and the availability of large amounts of data makes WSIs amenable to analysis with machine learning algorithms. Some of the earliest works (Wolberg et al., 1994; Diamond et al., 2004; Petushi et al., 2006) applying machine learning algorithms to digital pathology and WSIs addressed cancer classification.
Recently, with the advent of convolutional neural networks (CNNs), considerable improvements have been shown in various computer vision tasks like detection, segmentation, and classification. CNNs have also been proposed for lymph node metastasis detection in recent years (Litjens et al., 2016; Paeng et al., 2017; Liu et al., 2017; Lee and Paeng, 2018; Li and Ping, 2018; Wang et al., 2018).

Table 2 (excerpt). pN1mi: micro-metastases found, but no macro-metastases found. pN1: metastases found in 1-3 lymph nodes, of which at least one is a macro-metastasis. pN2: metastases found in 4-9 lymph nodes, of which at least one is a macro-metastasis.

Colorectal carcinoma
Colorectal carcinoma is the third most common cancer in the United States (Fleming et al., 2012). The majority of colorectal carcinomas are adenocarcinomas originating from epithelial cells (Hamilton, 2000). Shapcott et al. (2019) used a recurrent (LSTM-based) model (Greff et al., 2016) for estimating the patient risk score using spatial sequential memory.

Liver cancer
Three out of every four cases of liver cancer result in death, making it one of the deadliest forms of cancer (Stewart et al., 2019). Amongst the different forms of liver cancer, hepatocellular carcinoma is the most common (Chuang et al., 2009). Computer-aided methods for detecting liver cancer from various modalities such as CT (Li et al., 2015; Yasaka et al., 2017), ultrasound (Wu et al., 2014), and laparoscopy (Gibson et al., 2017) have been explored.
Methods utilizing multi-omics data such as RNA sequencing have also been successful (Chaudhary et al., 2018). For histopathology, however, the lack of annotated datasets comprising liver tissue images has left this area largely unexplored. The new grand challenge PAIP (2019) on liver cancer segmentation motivates research in this area.

Contributions
In this paper, we develop an end-to-end framework for histopathology analysis from WSIs that generalizes well across various cancer sites. The framework comprises segmentation at its core, along with novel algorithms that utilize the segmentation to perform pathological analyses such as metastasis classification, viable tumor burden estimation, and pN-staging. As discussed in section 1, challenges in WSI analysis are mainly due to the extremely large image size, variability in staining, and the limited amount of data. In this work, we propose a few key contributions that help address some of the aforementioned problems.
• Extremely large size of WSI: we propose an efficient patch-based approach for training and inference, where sampling is done on a generated tissue mask, thereby reducing computations on unnecessary patches.
• Insufficient data: To address the problems of limited data and class imbalance, we made use of overlapping and oversampling based strategies in patch extraction along with multiple data augmentations.
• Inference Pipeline: To remove the edge artifacts characteristic of patch-based segmentation, we propose an overlapping strategy, which significantly reduces them.
• pN Staging: We combined and experimented with various oversampling and under-sampling strategies to address the class imbalance problem in the dataset for metastases classification.
• Whole tumor segmentation: We propose an empirical non-deep learning based approach for estimating the whole tumor region from the viable tumor region that is computationally efficient yet performs on par with the deep learning based methods.
• Uncertainty Estimation: We have proposed an efficient patch-based uncertainty estimation framework, to estimate both data specific and model (parameter) specific uncertainties.
• Transfer Learning: Based on our study, we show that transfer learning using CAMELYON dataset helps in faster convergence for other histopathology datasets.
• Open-source Packaging: Finally, we packaged the framework into an open-source GUI application for the benefit of researchers.

Overview
Figure 1 illustrates the proposed overall framework. The framework predicts a segmentation mask along with uncertainty maps for a given input image (liver, breast, or colon tissue). Within this framework, we also define a pipeline for pN-staging, metastases classification, and tumor burden estimation.

DigestPath
DigestPath dataset consists of a total of 750 tissue slices taken from 450 different patients. On average, each tissue image was of size 5000x5000. The dataset also included fine pixel-level annotations of lesion done by experienced pathologists. Additionally, 250 tissue slices taken from 150 patients were provided as a test set. The data was collected from multiple medical centers, especially from several small centers in developing countries. All the tissues were stained by hematoxylin and eosin (H&E) and scanned at 20x resolution.
Out of the 750 tissue slices, 725 were used for training, and the remaining 25 were kept as a held-out test set.

PAIP
The PAIP dataset contains a total of 100 whole slide images scanned from liver tissue samples. Each image had an average dimension of 50,000x50,000 pixels. All the images were stained by hematoxylin and eosin, scanned at 20x magnification, and prepared from a single center, Seoul National University Hospital. The dataset included pixel-level annotation of the viable tumor and whole tumor regions. It also provided the viable tumor burden metric for each image. Tumor burden is defined as the ratio of the viable tumor region to the whole tumor region. The viable tumor region is defined as the cancerous region.
The whole tumor area is defined as the outermost boundary enclosing all the dispersed viable tumor cell nests, tumor necrosis, and tumor capsule (Fig 2).
Each tissue sample contains only one whole tumor region. This metric has applications in assessing the response rates of patients to cancer treatment.
Out of the 100 images, 50 formed the publicly available training set, 10 were reserved for a validation set that was made public after the challenge was completed, and the remaining 40 formed the test set, whose ground truth was not publicly available.

Tissue mask generation
Tissue mask generation was done to prevent unnecessary computations on blank spaces in the tissue slice. In this step, the entire tissue region was separated from the background glass region of the slide. Here, we make use of Otsu's adaptive thresholding (Otsu, 1979). The generated tissue mask also ensures that, during patch-based inference of the WSI, the neighboring context is appropriately provided in the extracted patches at small tissue regions and tissue borders.
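As a rough sketch of this step (illustrative helper names, not the authors' released code), the mask can be obtained by Otsu-thresholding a low-resolution thumbnail and dilating the result so that border patches retain context; in practice the thumbnail would be read from a low pyramid level, e.g. with OpenSlide:

```python
# Sketch of tissue-mask generation on a low-resolution WSI thumbnail.
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu
from skimage.morphology import binary_dilation, disk

def tissue_mask(thumbnail: np.ndarray, dilation_radius: int = 2) -> np.ndarray:
    """Separate tissue from the background glass region via Otsu's threshold."""
    gray = rgb2gray(thumbnail)      # glass background is bright, tissue is darker
    t = threshold_otsu(gray)
    mask = gray < t                 # tissue pixels fall below the threshold
    # Dilate so patches at tissue borders still receive surrounding context.
    return binary_dilation(mask, disk(dilation_radius))

# Toy example: bright "glass" with a darker square of "tissue".
thumb = np.ones((64, 64, 3)) * 0.95
thumb[20:40, 20:40] = 0.4
mask = tissue_mask(thumb)
```

The dilation radius is a tunable assumption; the paper does not specify the exact morphological post-processing.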

Patch coordinate extraction
The WSIs are gigapixel images, with typical sizes on the order of 100k × 100k pixels. We propose to train our neural networks on small patches of fixed size extracted from the WSI. We extracted coordinates for the patch centers from the low-resolution tissue mask of each WSI and rescaled them to correspond to the level-0 WSI. The randomly sampled coordinates corresponded to regions from the tumor, tumor boundary, and non-tumor tissue. From each WSI, a fixed number of patch coordinates corresponding to these various regions were extracted.
Further, these extracted patch coordinates were split into three cross-validation folds.
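The coordinate-extraction step above can be sketched as follows (a minimal numpy illustration; the function name and the uniform sampling strategy are our assumptions):

```python
# Sketch: sample patch-center coordinates from a low-resolution tissue mask
# and rescale them to level-0 WSI coordinates.
import numpy as np

def sample_patch_centers(mask: np.ndarray, n: int, downsample: int, seed: int = 0):
    """Pick n random tissue-mask pixels and map them to level-0 centers."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)                       # candidate tissue pixels
    idx = rng.choice(len(ys), size=n, replace=True)
    # Rescale low-resolution coordinates to the highest-resolution level.
    return np.stack([ys[idx], xs[idx]], axis=1) * downsample

mask = np.zeros((100, 100), dtype=bool)
mask[10:20, 30:50] = True                           # toy tissue region
centers = sample_patch_centers(mask, n=8, downsample=64)
```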

Data augmentation, normalization and data loading
To increase the number of data points and generalization of models to various staining and acquisition conditions, we made use of data augmentations.
Augmentations such as left-right flipping, 90-degree rotations, and Gaussian blurring, along with color augmentation, were performed. Color augmentation included random changes to brightness, contrast, hue, and saturation, with ranges of 64.0/255, 0.75, 0.25, and 0.04, respectively. Additionally, to introduce diversity in the patches extracted from the WSI during every training cycle, we incorporated random coordinate perturbation, which offsets the center of the patch coordinates within a specified radius prior to extraction from the WSI. We fixed the radius at 128 pixels and extracted patches of size 256x256 from the level-0 image.
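A minimal numpy sketch of these augmentations follows (hue/saturation jitter and Gaussian blur are omitted for brevity, and interpreting the 0.75 contrast range as a multiplicative factor is our assumption):

```python
# Sketch of flip/rotation augmentation, brightness/contrast jitter, and
# random coordinate perturbation within a 128-pixel radius.
import numpy as np

rng = np.random.default_rng(0)

def augment(patch: np.ndarray) -> np.ndarray:
    """Apply a random flip/rotation plus brightness and contrast jitter."""
    if rng.random() < 0.5:
        patch = patch[:, ::-1]                        # left-right flip
    patch = np.rot90(patch, k=rng.integers(0, 4))     # 0/90/180/270 rotation
    patch = patch + rng.uniform(-64 / 255, 64 / 255)  # brightness shift
    patch = patch * rng.uniform(1 - 0.75, 1 + 0.75)   # contrast scale (assumed form)
    return np.clip(patch, 0.0, 1.0)

def perturb_center(cy: int, cx: int, radius: int = 128):
    """Offset a patch center uniformly within the given radius."""
    dy, dx = rng.integers(-radius, radius + 1, size=2)
    return cy + dy, cx + dx

patch = augment(rng.random((256, 256, 3)))
cy, cx = perturb_center(1000, 2000)
```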
After augmentation, the images were normalized using equation 1, transforming the data to approximately zero mean and unit standard deviation. Since the image dimensions were huge, computing the image statistics during training was expensive; this was circumvented by assuming both the mean and the standard deviation of the image to be 128, which maps 8-bit pixel values approximately onto the range [-1, 1].
The data-loader prepared batches of data comprising an equal number of tumorous and non-tumorous patches, thereby facilitating a balanced dataset for training the neural network.
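The fixed normalization and the balanced batching can be sketched as follows (Eq. 1 is not reproduced in this text; the form below assumes mean and standard deviation both fixed at 128, as described above, and the sampler names are illustrative):

```python
# Sketch of the fixed normalization and a balanced tumor/normal batch sampler.
import numpy as np

def normalize(patch_u8: np.ndarray) -> np.ndarray:
    """Map 8-bit pixel values to roughly zero mean and unit deviation."""
    return (patch_u8.astype(np.float32) - 128.0) / 128.0

def balanced_batch(tumor_idx, normal_idx, batch_size, rng):
    """Draw half of each batch from tumor patches and half from normal ones."""
    half = batch_size // 2
    return np.concatenate([rng.choice(tumor_idx, half),
                           rng.choice(normal_idx, half)])

rng = np.random.default_rng(0)
x = normalize(np.array([[0, 128, 255]], dtype=np.uint8))
batch = balanced_batch(np.arange(0, 10), np.arange(10, 100), 8, rng)
```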

Network architecture
For the task of segmenting tumor regions from patches of WSIs, we propose to utilize encoder-decoder based fully convolutional network (FCN) architectures (Long et al., 2015). An encoder comprises a series of operations (such as convolution and pooling) that reduce the input to a low-dimensional feature vector. The decoder upsamples the feature vectors back to a larger dimension. The entire encoder-decoder is trained end-to-end. A fully convolutional architecture substitutes fully connected layers with convolutional layers, thereby allowing inputs of arbitrary size. We propose an ensemble comprising FCN architectures that show state-of-the-art results on natural images, namely DenseNet, InceptionResNetV2, and DeepLabV3Plus.

Figure 3: DenseNet architecture
We used a U-Net (Ronneberger et al., 2015) based encoder-decoder architecture. The main feature of U-Net is the presence of skip connections between layers of the encoder and decoder. These connections allow low-level information to flow directly from the encoder to the decoder, bypassing the bottleneck. The skip connection was made by concatenating the l-th layer with the (n − l)-th layer.
For the encoder, we use the DenseNet (Iandola et al., 2014) architecture. In DenseNet, the input to a particular layer is the concatenation of the outputs of all the previous layers; this prevents the loss of information through the forward pass in deep networks. The decoder was composed of bilinear up-sampling modules and convolutional layers. In the Inception-ResNet architecture, the inception blocks have skip connections that sum the input of the block to its output.
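The skip-connection mechanism can be illustrated with a toy numpy example (nearest-neighbour upsampling stands in for the bilinear module; shapes are illustrative):

```python
# Toy illustration of a U-Net style skip connection: the encoder feature map
# is concatenated channel-wise with the upsampled decoder feature map.
import numpy as np

def upsample2x(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x upsampling (stand-in for bilinear upsampling)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

encoder_feat = np.zeros((64, 64, 32))   # from the l-th encoder layer
decoder_feat = np.zeros((32, 32, 64))   # bottleneck output at half resolution
skip = np.concatenate([encoder_feat, upsample2x(decoder_feat)], axis=-1)
```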

Ensemble
Our ensemble comprised three FCN models, each trained on one of the 3-fold cross-validation splits. During inference, the predicted posterior probabilities of the segmentation maps were averaged to generate the ensemble prediction. The models were trained and run on image patches of size 256x256 at the highest resolution of the WSI.
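The ensembling step reduces to averaging probability maps, as in this sketch (the function name is ours):

```python
# Sketch of the ensemble prediction: average the posterior probability maps
# produced by the three cross-validation models.
import numpy as np

def ensemble_predict(prob_maps):
    """Average per-model tumor probability maps into one ensemble map."""
    return np.mean(np.stack(prob_maps, axis=0), axis=0)

maps = [np.full((256, 256), p) for p in (0.2, 0.4, 0.9)]
ens = ensemble_predict(maps)
```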

Loss function
Tumor regions are represented by a minuscule proportion of pixels in WSIs, leading to class imbalance. This issue was circumvented by training the network to minimize a hybrid loss function comprising a cross-entropy loss and a loss based on the Dice overlap coefficient.
The Dice coefficient is an overlap metric used to assess the quality of segmentation maps. The Dice loss is a differentiable function that approximates the Dice coefficient and is defined using the predicted posterior probability map and the ground truth binary image, as in Eq. 2. The cross-entropy loss is defined in Eq. 3. In the equations, p_i is the predicted posterior probability map and g_i is the ground truth image.
The total loss is a linear combination of the two loss components, as defined in Eq. 4, and the neural networks are trained by minimizing it. The weights α, β, γ are empirically assigned to the individual loss components. FG and BG represent the foreground pixels corresponding to tumor regions and the background pixels corresponding to non-tumor regions, respectively.
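A framework-agnostic numpy sketch of the hybrid loss follows. Eqs. 2-4 are not reproduced in this text, so the exact weighting (α, β, γ) is not shown; a two-term weighted version illustrates the idea:

```python
# Sketch of the hybrid loss: weighted soft Dice loss plus cross-entropy.
import numpy as np

def dice_loss(p, g, eps=1e-7):
    """Soft Dice loss over predicted probabilities p and binary ground truth g."""
    inter = np.sum(p * g)
    return 1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(g) + eps)

def cross_entropy(p, g, eps=1e-7):
    """Pixel-wise binary cross-entropy."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(g * np.log(p) + (1 - g) * np.log(1 - p))))

def total_loss(p, g, alpha=1.0, beta=1.0):
    """Linear combination of the two components (weights are assumptions)."""
    return alpha * dice_loss(p, g) + beta * cross_entropy(p, g)

g = np.zeros((4, 4))
g[:2] = 1.0
perfect = total_loss(g.copy(), g)   # near-zero loss for a perfect prediction
```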

Uncertainty analysis
Uncertainty estimation is essential in assessing unclear diagnostic cases predicted by deep learning models. It helps pathologists/doctors to concentrate more on the uncertain regions for their analysis. The need and challenges that we might face in the context of uncertainty are discussed in (Begoli et al., 2019).
There are two main sources of uncertainty, namely (1) aleatoric uncertainty and (2) epistemic uncertainty. Aleatoric uncertainty is due to the data generation process itself. In contrast, the uncertainty induced by the model parameters, resulting from not estimating the ideal model architecture or weights for the given data, is known as epistemic uncertainty (Kendall and Gal, 2017). Epistemic uncertainty can be approximated using test-time Bayesian dropout, as discussed in (Leibig et al., 2017), which estimates uncertainty through Monte Carlo simulations with Bayesian dropout.
In the proposed pipeline, we estimate aleatoric uncertainty for each model using test-time augmentations, as introduced in (Gal and Ghahramani, 2016). For epistemic uncertainty, we make use of the diversity of model architectures and calculate uncertainty as described in equation 5.
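The two estimates can be sketched in numpy as follows. Eq. 5 is not reproduced in this text, so variance is used here purely as an illustrative dispersion measure, and the helper names are ours:

```python
# Sketch of patch-based uncertainty: aleatoric from the spread of test-time-
# augmented predictions, epistemic from disagreement between ensemble members.
import numpy as np

def tta_predict(model, patch):
    """Run a model on the identity view and three augmentations (rotation,
    vertical flip, horizontal flip), undoing each transform before stacking."""
    views = [patch, np.rot90(patch), patch[::-1], patch[:, ::-1]]
    undo = [lambda m: m, lambda m: np.rot90(m, -1),
            lambda m: m[::-1], lambda m: m[:, ::-1]]
    return np.stack([u(model(v)) for v, u in zip(views, undo)])

def uncertainties(models, patch):
    per_model = [tta_predict(m, patch) for m in models]        # (TTA, H, W) each
    aleatoric = np.mean([p.var(axis=0) for p in per_model], axis=0)
    means = np.stack([p.mean(axis=0) for p in per_model])
    epistemic = means.var(axis=0)                              # model disagreement
    return aleatoric, epistemic

# Toy "models": constant predictors that ignore the input.
models = [lambda x, c=c: np.full(x.shape[:2], c) for c in (0.2, 0.5, 0.8)]
alea, epis = uncertainties(models, np.zeros((8, 8)))
```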
and TTA denotes the set of possible test-time data augmentations allowed. In our case, TTA ∈ {rotation, vertical flip, horizontal flip}.

Training pipeline
We made use of transfer learning by using ImageNet (Deng et al., 2009) pre-trained weights for the encoders based on DenseNet and Inception-ResNetV2. In the case of DeeplabV3Plus, we used models pre-trained on PascalVOC (Everingham et al., 2010). For the first two epochs, the encoder section of the models was frozen, and only the decoder weights were trainable. For the remaining epochs, both the encoder and decoder were trained, and the learning rate was reduced gradually: it was decayed every few epochs in a deterministic manner to allow the model to converge gradually.

Inference pipeline
Figure 7 provides an overview of the inference pipeline used to generate tumor probability maps and model uncertainty maps. In a typical WSI, the tissue region occupies a small fraction compared to the background glass slide; hence, for faster inference, the pre-processing step involved segmentation of the tissue region. This tissue segmentation mask was generated as discussed in section 2.3.1. To facilitate extraction of patches from the WSI within the tissue mask region, a uniform patch-coordinate sampling grid was generated at lower resolution, as shown in Figure 8. Each point in the patch sampling grid was rescaled by a factor to map to the coordinate space of the WSI at its highest resolution. With these scaled coordinate points as centers, fixed-size high-resolution image patches were extracted from the WSI. We define the sampling stride as the spacing between consecutive points in the patch sampling grid. The high-resolution patch size and the sampling stride controlled the overlap between consecutively extracted patches.
The main drawback of patch-based segmentation of WSIs was that smaller patches could not capture a wider neighborhood context. Moreover, stitching the patch segmentations introduced boundary artifacts (a blockish appearance) in the tumor probability heat-maps. We observed that the generated heat-maps were smooth when inference was done on overlapping patches with a larger patch size, averaging the prediction probabilities in overlapping regions. Our experiments suggested that a 50% overlap between neighbouring patches was the optimal choice, as it ensured that a particular pixel in the WSI was seen at most 4 times during heat-map generation. However, this approach increased the overall computation time.

pN-Staging Pipeline
Figure 9 illustrates the overall pipeline for pN-staging.
The pipeline comprises four blocks, as described below:
• Pre-processing: The tissue regions in the WSIs were detected for patch extraction.
• Heatmap generation: The extracted patches from the WSIs were passed through the inference pipeline to generate the down-scaled version of the tumor probability heatmaps.
• Feature extraction: The heatmaps were binarized by thresholding at 0.5 and 0.9 probabilities; at each threshold, the connected components were extracted and region properties were measured using the scikit-image (van der Walt et al., 2014) library. We computed 32 geometric and morphological features of the probable metastases regions. Table 4 provides a description of all the features computed.
• Classification: The pN-stage was assigned to the patient based on all the available lymph-node WSIs, taking into account their individual metastases types (Table 1). We employed a simple count-based rule for pN-staging, as defined in Table 2. For predicting the metastases type, we built an ensemble of Random Forest classifiers (Liaw et al., 2002) using the extracted features.
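The feature-extraction step above can be sketched with scikit-image (only a few of the 32 features are shown, and the helper names are illustrative):

```python
# Sketch: binarize a tumor heatmap at a probability threshold, label the
# connected components, and measure geometric properties of each region.
import numpy as np
from skimage.measure import label, regionprops

def heatmap_features(heatmap: np.ndarray, threshold: float):
    regions = regionprops(label(heatmap >= threshold))
    return [{"area": r.area,
             "major_axis": r.major_axis_length,
             "eccentricity": r.eccentricity} for r in regions]

heatmap = np.zeros((64, 64))
heatmap[10:20, 10:20] = 0.95   # one confident metastasis-like blob
heatmap[40:44, 40:44] = 0.6    # one weaker blob
feats = heatmap_features(heatmap, threshold=0.5)
```

Thresholding at 0.9 would keep only the confident blob, which is why features are computed at both thresholds.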

Tumor burden estimation
We obtained the viable tumor region from the segmentation pipeline discussed above. For the whole tumor region, we initially attempted to use the same segmentation pipeline, modeling it as a binary classification problem. However, the model provided results with an accuracy of less than 10%. Therefore, we adopted a heuristic method to calculate the tumor burden. First, the viable tumor region was predicted, and morphological operations were applied to remove some of the false positives and fill holes. Next, the smallest convex hull containing the entire viable tumor region was calculated. Simultaneously, a tissue mask was obtained using Otsu thresholding. The convex hull was then multiplied with the tissue mask to obtain the whole tumor segmentation. The viable tumor burden was calculated as the ratio of the area of the viable tumor to that of the whole tumor. Figure 10 shows an overview of the viable tumor burden estimation, and the results are shown in Figure 18.
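The heuristic can be sketched as follows (the preliminary morphological clean-up is omitted for brevity; the function name is ours):

```python
# Sketch of the heuristic whole-tumor estimate: convex hull of the predicted
# viable-tumor mask intersected with the tissue mask, then the burden ratio.
import numpy as np
from skimage.morphology import convex_hull_image

def tumor_burden(viable_mask: np.ndarray, tissue_mask: np.ndarray) -> float:
    whole = convex_hull_image(viable_mask) & tissue_mask
    return viable_mask.sum() / max(whole.sum(), 1)

tissue = np.ones((50, 50), dtype=bool)
viable = np.zeros((50, 50), dtype=bool)
viable[10:20, 10:20] = True   # a single 10x10 viable-tumor nest
burden = tumor_burden(viable, tissue)
```

With a single convex nest the hull equals the nest itself, so the burden is 1; dispersed nests enlarge the hull and lower the ratio.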

Experimental analysis
In this section, we experimentally analyze the effectiveness of our proposed methodologies for the segmentation and classification models. We implemented our neural networks using TensorFlow (Abadi et al., 2016). We ran our experiments on multiple desktop computers with an NVIDIA Titan-V GPU with 12 GB RAM, an Intel Core i7-4930K 12-core CPU @ 3.40GHz, and 48 GB RAM.

Lesion detection performance on CAMELYON
In this section, we detail some of the techniques specific to CAMELYON dataset pre-processing and discuss the performance of various FCN architectures and ensemble configurations for lesion detection on the CAMELYON16 test dataset (n=139).

Tissue mask generation
In some of the CM17 cases, Otsu's thresholding failed because of black regions in the WSI. Hence, prior to the thresholding operation, the pre-processing involved replacing black pixel regions in the WSI background with white pixels and median blurring the entire image with a kernel of size 7x7. The median blur aided in smoothing the tissue regions and removing noise where the tissue borders the glass-slide region, while preserving the edges of the tissue. Figure 11 illustrates the pipeline for tissue mask generation with an example.
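This pre-processing fix can be sketched as follows (the near-black cutoff value is our assumption; the paper only states that black background pixels were replaced with white):

```python
# Sketch of the CAMELYON17-specific clean-up: replace near-black scanner
# regions with white before thresholding, then apply a 7x7 median blur.
import numpy as np
from scipy.ndimage import median_filter

def clean_thumbnail(rgb: np.ndarray, black_cutoff: int = 20) -> np.ndarray:
    rgb = rgb.copy()
    black = (rgb < black_cutoff).all(axis=-1)   # near-black background pixels
    rgb[black] = 255                            # treat them as glass (white)
    return median_filter(rgb, size=(7, 7, 1))   # 7x7 median blur per channel

thumb = np.full((32, 32, 3), 230, dtype=np.uint8)
thumb[:8, :8] = 0                               # black scanner artifact
cleaned = clean_thumbnail(thumb)
```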

Dataset preparation
For training our model for lesion segmentation, we used training sets of both CAMELYON16 and CAMELYON17 dataset, which had pixel-level annotations.
As noted by the challenge organizers, some of the WSIs in the CAMELYON16 training set were not exhaustively annotated; we excluded them from our training set. We performed a stratified 3-fold cross-validation split of the training set to maximize the utilization of the limited number of WSIs. The stratification ensured that the ratio of negative to metastases cases was maintained in all three folds. From 628 WSIs, we randomly sampled 571,029 coordinates whose patches included regions from tumor and non-tumor tissue. A patch extracted from a WSI was labeled as a tumor patch if it had non-zero pixels labeled as tumor in the pathologist's manual annotation. Further, these extracted patch coordinates were split into three cross-validation folds. Table 5 shows the distribution of the split in each of the folds.

Training and inference configuration of ensemble FCN models
We experimented with the following two types of ensemble configuration:
• Ensemble-A: comprised the three different FCN architectures described in section 2.7. The inference pipeline used a patch size of 256 and extracted non-overlapping patches, as illustrated in Figure 9.
• Ensemble-B: comprised three replicated versions of a single FCN architecture. The inference pipeline used a patch size of 1024 with a 50% overlap between neighboring patches, as illustrated in Figure 12.
In both ensemble configurations, each model in the ensemble was trained on one of the 3-fold cross-validation splits. All the models used pre-trained weights with the fine-tuning procedure described in section 2.7. The models were trained for ten epochs with a batch size of 32.
The significant boost in the performance of Ensemble-A could be attributed to averaging the heatmaps from multiple FCN models, which lowers the probabilities of uncertain or less confident regions and hence eliminates false positives. Figure 13 illustrates the heatmaps generated by Ensemble-A and its constituent FCN models on a CAMELYON16 test case.
For Ensemble-B, we observed that the performance of the individual models in the ensemble was similar (in terms of FROC score). This contrasted with Ensemble-A, where the models showed significant differences in their FROC scores, which helped us ascertain that the performance differences among the individual FCN models of Ensemble-A were not due to data variation between cross-validation folds. Figure 14 illustrates the heatmaps generated by Ensemble-B and its constituent FCN models on a CAMELYON16 test case. Table 7 summarizes the FROC scores on the CAMELYON16 test set (n=139) for various configurations of patch size and sampling stride. We observed that running inference on a larger patch size generated smoother heatmaps, and patch extraction with an overlapping stride gave better context at patch borders, thereby minimizing the blockish artifacts seen in the heatmaps.
In Figure 14, we observed that when running inference on a larger patch size, the model slightly struggled at tissue borders, which was evident from the confidence probability values. This was primarily because we had trained the models on a patch size of 256, and our training samples lacked regions from tissue borders and glass-slide regions. Since there was only a marginal difference in FROC scores between the various ensemble configurations (Table 7), in the interest of minimizing computation time we preferred Ensemble-A with patch size and sampling stride set to 256 (non-overlapping patch acquisition) for running inference on the CAMELYON17 test dataset (n=500).

Lymph-node metastases type classification
In this section we detail and discuss the dataset preparation and training methodology of Random Forest classifiers for metastases type classification.

Dataset preparation
We split the CM17 training dataset (Table 3) into train and validation sets. Table 9 shows the distribution of WSIs in terms of metastases type between the train and validation sets; our split strategy ensured that the distribution of metastases types between the two splits was similar.

Training methodology
We generated tumor probability heatmaps for all 500 WSIs using the Ensemble-A configuration (section 3.1.3) and extracted all the features listed in Table 4. We trained classifiers on the train set and validated them on the held-out validation set (n=285). Our experimentation with various classifiers showed that the optimal performance in terms of classification accuracy (90.18%) and Cohen's kappa score (0.9164) on the held-out validation set was achieved with a Random Forest classifier with 100 trees. Figure 15 shows the confusion matrix and the feature importance ranking of the Random Forest classifier. From Table 9, we observed that the data distribution was highly class imbalanced, with negative cases being the majority class and ITC cases being the minority class. This led to misclassifications between ITC and negative cases, as evident in the confusion matrix.
We further experimented by training another Random Forest classifier on the complete training set (n=500) in order to maximize the utilization of training points. The 5-fold cross-validation showed an average accuracy of 90%, and its performance was similar to that of the model trained on the train set (n=215).
We refer to the above two trained models as RF-PI and RF-CI (Random Forest classifiers trained on the partial and complete training set with imbalanced class data, respectively).

Data balancing
One technique proposed in the literature to handle the class imbalance problem is oversampling by synthetically generating minority class samples using the SMOTE algorithm (Chawla et al., 2002). However, this method can introduce noisy samples when the interpolated new samples lie between marginal outliers and inliers. This problem is usually addressed by removing noisy samples with under-sampling techniques such as Tomek's links (Tomek, 1976) or nearest neighbors. We balanced our training data using the SMOTETomek algorithm (Batista et al., 2004), a combination of SMOTE and Tomek's links performed consecutively. We separately balanced the train set (n=215) split and the complete training set (n=500), and independently trained two Random Forest classifiers on these two balanced datasets. We refer to these trained models as RF-PB and RF-CB (Random Forest classifiers trained on the partial and complete training sets with class-balanced data, respectively). Table 10 summarizes the results for all the models; 5-fold cross-validation accuracy was estimated on their respective training sets, and the values are reported as mean (standard deviation).
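The interpolation step at the heart of SMOTE can be illustrated with a minimal numpy sketch. In practice we used the SMOTETomek implementation from the imbalanced-learn library; the function below is a simplified stand-in for the oversampling step only, with an illustrative name, and omits the Tomek's-links cleaning:

```python
import numpy as np

def smote_interpolate(minority, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between a
    randomly chosen sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    n = len(minority)
    k = min(k, n - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # distances from sample i to every other minority sample
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # skip the sample itself
        j = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)
```

Because each synthetic point is a convex combination of two real minority samples, interpolation between an outlier and an inlier can land in the majority region, which is exactly the noise that the subsequent Tomek's-links under-sampling removes.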

Ensembling classifiers
We proposed an ensemble strategy for combining the predictions of all four trained Random Forest classifiers. The ensembling was based on the majority voting principle; in case of a tie, the higher metastases category was selected. We refer to this model as RF-Ensemble.
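A minimal sketch of this voting rule, assuming the metastases categories are ordered by increasing severity (the label strings and function name below are illustrative, not from our code):

```python
from collections import Counter

# Metastases categories in assumed order of increasing severity
CATEGORIES = ["negative", "itc", "micro", "macro"]

def rf_ensemble_vote(predictions):
    """Majority vote over the per-classifier predictions; ties are
    resolved in favor of the higher metastases category."""
    counts = Counter(predictions)
    best = max(counts.values())
    tied = [c for c, n in counts.items() if n == best]
    return max(tied, key=CATEGORIES.index)
```

With four voters a 2-2 tie is possible, so the tie-break toward the higher category biases the ensemble toward the clinically safer (more severe) call.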

Segmentation performance on DigestPath
In this section we discuss specific details of the training and inference strategies on the DigestPath dataset. We split the training data (n=725) into 3-fold cross-validation sets; each fold was used to train one of the individual models in the ensemble. We extracted a total of 80,000 patches from the entire training data.
We trained each model with a patch size of 256 and a batch size of 32, with Adam optimizer.
For inference, we used patches of dimension 256 x 256 with 50% overlap and a batch size of 32. We used 0.5 as the threshold value to generate a binary segmentation mask from the predicted probability map. Figure 16 illustrates an example of the segmentation map generated by our ensemble model. We tested our trained models on a subset of the training set (n=25) reserved as an internal test set, and the corresponding results are tabulated in Table 11.
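The thresholding step, together with the Dice score we report, can be sketched as follows (a simplified illustration; the function names are ours):

```python
import numpy as np

def binarize(prob_map, threshold=0.5):
    """Binary segmentation mask from a predicted probability map."""
    return (prob_map >= threshold).astype(np.uint8)

def dice_score(pred, target, eps=1e-7):
    """Dice similarity coefficient between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```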

Segmentation performance on PAIP
In this section we elaborate on the training and inference details for segmentation on the PAIP dataset. For the tissue mask, we first performed Otsu thresholding on the HSV version of the slide image. We then performed the closing morphological operation with a kernel size of 20, followed by an opening operation with a kernel size of 5 and finally a dilation operation with a kernel size of 20.
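The Otsu thresholding step of the tissue-mask pipeline can be sketched in pure numpy (a simplified illustration with our own function name; in practice libraries such as OpenCV or scikit-image provide Otsu thresholding and the subsequent morphological operations directly):

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold maximizing between-class variance for a
    2-D uint8 image (pixels <= threshold form class 0)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    total = hist.sum()
    w0 = np.cumsum(hist)                       # class-0 pixel counts
    m0 = np.cumsum(hist * np.arange(256))      # class-0 intensity sums
    w1 = total - w0
    valid = (w0 > 0) & (w1 > 0)
    mu0 = np.where(valid, m0 / np.maximum(w0, 1), 0.0)
    mu1 = np.where(valid, (m0[-1] - m0) / np.maximum(w1, 1), 0.0)
    between = w0 * w1 * (mu0 - mu1) ** 2       # between-class variance
    return int(np.argmax(between))
```

Applied to the saturation channel of the HSV image, this separates stained tissue from the bright glass background; closing, opening, and dilation then fill holes and smooth the mask as described above.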
We split the training data (n=50) into five cross-validation folds and used three of those folds for training the models of the ensemble. The data was split into five folds as opposed to three folds to ensure that each training set had at least 40 samples. We extracted a total of 200,000 patches from the entire training data with equal contributions from each training sample. We trained with a patch size of 256 and a batch size of 32.
For inference, we used patches of 1024 and a batch size of 16. For the threshold, we experimented with a range of values and chose 0.5, as it gave optimal segmentation performance on the validation set (n=10); lower thresholds resulted in false positives, and a higher threshold (0.9) led to under-segmentation. Figure 17 illustrates an example of the segmentation map generated by our ensemble model, and the corresponding results are tabulated in Table 12. Figure 18 shows the whole tumor region predictions obtained by the methodology described in section 2.10. The first image shows that this pipeline gives a good approximation of the whole tumor region; most of the samples fit into this category. Our pipeline failed in two scenarios. The first is when the segmentation pipeline predicted viable tumors in discrete, dispersed regions. This is illustrated in the second example in Figure 18, where there is a spurious prediction apart from the main tumor prediction. The other failure case is when the actual whole tumor region is larger than the convex hull boundary of the actual viable tumor region. The third sample in Figure 18 demonstrates this failure.
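The convex-hull approximation of the whole tumor region can be illustrated with Andrew's monotone chain algorithm applied to the foreground coordinates of the predicted viable tumor mask. This is a pure-Python sketch; in practice a library implementation such as scipy.spatial.ConvexHull would be used:

```python
def convex_hull(points):
    """Andrew's monotone chain convex hull of 2-D points (x, y),
    returned in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); <= 0 means a clockwise turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # drop the duplicated endpoints where the two chains meet
    return lower[:-1] + upper[:-1]
```

The two failure modes above follow directly from this construction: spurious dispersed predictions inflate the hull, while the hull can never extend beyond the viable tumor's outermost points.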

Uncertainty analysis
In this section, we demonstrate and interpret the results of our uncertainty analysis on all three digital pathology datasets.
3.6.1. DigestPath uncertainty maps
Figure 19 provides an illustration of tumor probability and uncertainty maps for a held-out test case from the DigestPath dataset. Figures 19(1-5)(b) correspond to the ground truth overlaid on the tissue image, while Figures 19(1-5)(c) correspond to the tumor probability heatmap overlaid on the tissue regions. In Figure 19(4, 5)(c), we observed that the ensemble prediction reduced the number of false positives, thereby increasing the overall Dice score to 0.86, an improvement of about 0.04-0.06 in Dice score compared to the individual models.
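A minimal sketch of how per-pixel ensemble statistics yield the probability and uncertainty maps, here using the standard deviation across models as a simple uncertainty surrogate (the function name is ours):

```python
import numpy as np

def ensemble_maps(prob_maps):
    """Per-pixel ensemble mean (tumor probability map) and standard
    deviation (uncertainty map) over the FCN models' heatmaps."""
    stack = np.stack(prob_maps, axis=0)  # (n_models, H, W)
    return stack.mean(axis=0), stack.std(axis=0)
```

Regions where the models disagree get a high standard deviation, so false positives predicted by only one model are simultaneously down-weighted in the mean and flagged in the uncertainty map.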

PAIP uncertainty maps
Figure 20 provides an illustration of tumor probability and uncertainty maps for a held-out test case from the PAIP dataset. Our observations were similar to those discussed in section 3.6.1.

CAMELYON uncertainty maps
Figure 21 provides an illustration of tumor probability and uncertainty maps for a held-out test case from the CAMELYON dataset. Our observations were similar to those discussed in section 3.6.1.

Transfer learning between cancer sites
To test transfer learning between cancer sites, we first ran inference on the DigestPath dataset with a model trained on the CAMELYON dataset. As can be observed from Figure 22, the model learned inherent tissue concepts common to both sites. Similar staining patterns and nuclei similarity could have contributed to this observation.
Subsequently, we conducted a comparative analysis of model weight initialization. We trained two models on the PAIP dataset: one pretrained on the CAMELYON dataset and one pretrained on the ImageNet dataset (Deng et al., 2009). We observed that the former model converged at a faster rate. We plan to extend this generalizability study by obtaining pathologists' annotations and precisely assessing the segmentation performance.

Hardmining
Our patch extraction scheme for the whole-slide images was based on random uniform sampling. However, this led to some hard examples being excluded from the training set, which resulted in the model performing poorly on such regions of the WSI. We therefore attempted to solve this issue by hardmining the poorly performing regions of the WSI and fine-tuning the trained model with this hardmined set.
For the experimental analysis with the PAIP dataset, we first extracted 80,000 random patches as the initial training set and trained a model on it. We empirically defined patches scoring less than 0.4 on the Jaccard index as regions where the model performed poorly. We then ran inference on the entire dataset and extracted all patches that met this criterion, obtaining a total of 70,000 patches as the hardmined set. We then continued training the model after substituting the training set with the hardmined set. However, at every epoch, we observed that the model's accuracy on the validation set decreased. We theorized that by training only on the failure cases, the model forgets what it learned from the initial training set. To investigate this occurrence, we examined the patches obtained from hardmining and analyzed them. We found groups of patches that were highly similar to each other. This was because the regions where the model failed were larger than the dimensions with which the patches were extracted, so the hardmining algorithm extracted several patches from the same region.
As a result, this set was less representative of the data compared to the first set of randomly extracted patches. We observed a similar pattern in the other datasets as well.
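The hardmining criterion described above can be sketched as follows (an illustration only; the function names are ours):

```python
import numpy as np

def jaccard(pred, target, eps=1e-7):
    """Jaccard index (intersection over union) of two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)

def hardmine(patches, preds, masks, threshold=0.4):
    """Keep the patches on which the trained model's prediction scored
    below the Jaccard threshold against the ground truth mask."""
    return [p for p, pr, m in zip(patches, preds, masks)
            if jaccard(pr, m) < threshold]
```

Since the selection is purely score-based, nothing prevents many near-duplicate patches from one large failing region passing the filter, which is the redundancy we observed.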

Performance on DigestPath Challenge
In the case of the DigestPath challenge, we submitted our ensemble model and obtained a Dice score of 0.78 on the test set. The scores of all the top-performing teams are tabulated in Table 14.

Performance on PAIP 2019 Challenge
The

Open Source Contribution
We developed an open-source application (Haran Rajkumar, 2019) on top of our segmentation pipeline. The application can load WSI images, perform segmentation, and calculate uncertainties. It features an API with which researchers can utilize the segmentation pipeline within their own applications. Conversely, the application's modular structure also allows researchers to test their own segmentation pipelines with the application's GUI. The slide viewer was built using OpenSlide (Goode et al., 2013) and OpenSeadragon (Antoine Vandecreme).

Discussion and Conclusion
We developed an automated end-to-end framework for tumor tissue segmentation and whole-slide image analysis that showed state-of-the-art results on three publicly available histopathology challenges. We approached the segmentation of gigapixel WSIs with a divide-and-conquer strategy: dividing the large image into computationally feasible patch sizes, running segmentation algorithms on the patches, and stitching the segmented patches to generate the segmentation of the entire WSI. We approached patch-level segmentation using fully convolutional neural networks (FCNs), encoder-decoder architectures that generate dense pixel-level classification. We used some of the state-of-the-art CNNs for natural image tasks as encoders for our FCNs, while the decoder was essentially a learnable upsampling module for generating dense predictions. Our segmentation framework is an ensemble comprising multiple FCN architectures, each independently trained on a different subset of the training data. The ensemble generated the tumor probability map by averaging the posterior probability maps of all the FCNs and showed superior segmentation performance compared to its individual constituent FCNs. Patch-based segmentation methods for large images suffer from a loss of neighboring context at patch borders. We addressed this issue during inference by using a larger patch size and averaging the posterior probability maps of overlapping patch regions while stitching the WSI-level tumor probability map. Depending upon the tissue area covered on the glass slide, our inference pipeline takes approximately 30-75 minutes to generate the entire tumor probability map. In addition to the generation of the tumor heatmap, we also incorporated a methodology for generating uncertainty maps based on model and data variability.
These uncertainty maps could assist pathologists in interpretation and guide fine-tuning of the model on uncertain regions. Our proposed hardmining approach showed decreased segmentation performance because of the adopted fine-tuning procedure on the trained model. Hence, one direction of future work would involve developing effective methodologies for the online extraction of good representative patches from WSIs during the training phase, rather than extracting a fixed set of random patches (Lee et al., 2019; Pinchaud, 2019). Another area to explore is the design of efficient, multi-resolution FCN architectures for capturing multi-resolution information from WSIs (Graham et al., 2019). Our experimental results on transfer learning showed that models pretrained on different histopathology datasets could act as good starting points for training where pathology datasets are limited. Post-processing could be another direction for improving the generated WSI segmentation: techniques such as patch-based conditional random fields (Krähenbühl and Koltun, 2011; Li and Ping, 2018) could be employed to refine the generated segmentation masks rather than employing hardcoded threshold values. The segmentation of WSIs is usually the primary step for many analysis tasks, such as metastases classification and estimation of tumor burden. In this regard, we developed an automated pipeline for lymph node metastases classification and pN-staging. We proposed an ensemble of multiple Random Forest classifiers, each trained on a different subset of the training data. The training data was prepared by extracting features that are meaningful from a pathologist's viewpoint from the tumor probability maps. We also demonstrated the efficacy of synthetic samples in addressing class imbalance in such classification tasks.
Our method for tumor burden estimation used an empirical method for estimating the whole tumor region. However, it still performed on par with other deep learning approaches that modeled it as a segmentation problem. This shows that the convex hull is a good approximation and is also computationally inexpensive. This method could be developed further by incorporating learning-based methods; for example, the convex hull output could be used as an initialization for active contour-based models (Kass et al., 1988).

Author Contributions
GK and BS edited the manuscript, supervised, and funded the study.

Appendix A.0.3. FROC score
One of the metrics used in the CM16 challenge for lesion-based evaluation was the free-response receiver operating characteristic (FROC) curve, defined as the plot of sensitivity versus the average number of false positives per image. We used the CM16 challenge testing dataset for evaluating the performance of our algorithms for lesion detection/localization. FROC analysis is similar to ROC analysis, except that the false positive rate on the x-axis is replaced by the average number of false positives per WSI.
In the CM16 challenge, a detection was counted as a true positive if the location of the detected region was within the annotated ground truth lesion.
• If there were multiple findings for a single ground truth region, they were counted as a single true positive finding and none of them were counted as a false positive.
• All detections that were not within a specific distance from the ground truth annotations were counted as false positives.
Appendix A.0.4. Cohen's kappa score
Cohen's kappa (Fleiss and Cohen, 1973) is a statistic that measures inter-rater reliability for categorical variables. In the CM17 challenge, the metric used for evaluating the pN-staging of patients was Cohen's kappa with five classes and quadratic weights. The kappa metric ranges from -1 to +1, where +1 represents perfect agreement between the raters, 0 represents the amount of agreement expected by random chance, and a negative value represents lower-than-chance agreement.
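The quadratic-weighted kappa can be sketched as follows (a minimal numpy illustration; class labels are assumed to be integer pN-stage indices 0-4, and the function name is ours):

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Cohen's kappa with quadratic weights over n_classes categories."""
    conf = np.zeros((n_classes, n_classes), dtype=np.float64)
    for t, p in zip(y_true, y_pred):
        conf[t, p] += 1.0
    idx = np.arange(n_classes)
    # quadratic disagreement weights: 0 on the diagonal, growing with
    # the squared distance between the two categories
    w = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    row = conf.sum(axis=1)
    col = conf.sum(axis=0)
    expected = np.outer(row, col) / conf.sum()  # chance-agreement matrix
    return 1.0 - (w * conf).sum() / (w * expected).sum()
```

The quadratic weights penalize a macro-metastases case staged as negative far more heavily than an off-by-one category error, which matches the clinical ordering of pN stages.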