A Cluster-then-label Semi-supervised Learning Approach for Pathology Image Classification

Completely labeled pathology datasets are often challenging and time-consuming to obtain. Semi-supervised learning (SSL) methods are able to learn from fewer labeled data points with the help of a large number of unlabeled data points. In this paper, we investigated the possibility of using clustering analysis to identify the underlying structure of the data space for SSL. A cluster-then-label method was proposed to identify high-density regions in the data space which were then used to help a supervised SVM in finding the decision boundary. We have compared our method with other supervised and semi-supervised state-of-the-art techniques using two different classification tasks applied to breast pathology datasets. We found that compared with other state-of-the-art supervised and semi-supervised methods, our SSL method is able to improve classification performance when a limited number of labeled data instances are made available. We also showed that it is important to examine the underlying distribution of the data space before applying SSL techniques to ensure semi-supervised learning assumptions are not violated by the data.

In this paper, we present a semi-supervised learning method that analyzes groups of labeled and unlabeled points in multidimensional feature space in order to identify areas of high density and then guides the learning method to place decision boundaries through the regions with low density. We apply this technique to the analysis of digital pathology images of breast cancer.

Related Works
Semi-supervised learning methods are not commonly used in the pathology image analysis field although they have previously been employed in some applications of medical image analysis to improve classification performances on partially labeled datasets [2][3][4][5] . In order to make it possible for semi-supervised learning methods to make the most of the labeled and unlabeled data, some assumptions are made for the underlying structure of data space 1 . Among the assumptions, smoothness and cluster assumption are the basis for most of the state-of-the-art techniques 6 . In the smoothness assumption, it is assumed that points that are located close to each other in data space are more likely to share the same label, and in the cluster assumption, it is assumed that the data points that belong to one class are more likely to form and share a group/cluster of points. Therefore, the core objective of these two assumptions is to ensure that the found decision boundary lies in low density rather than high density regions within data space.
The most basic and easiest SSL method to apply is self-training 7-10 , which involves repeatedly training and retraining a statistical model. First, labeled data is used to train an initial model and then this model is applied to the unlabeled data. The unlabeled points for which the model is most confident in assigning labels to, are then added to the pool of labeled points and a new model is trained. This process is repeated until some convergence criterion is met. Another family of methods is based on generative models [11][12][13] , in which some assumptions are made about the underlying probability distribution of data in feature space. The parameters defining the assumed generative model are then found by fitting the model to the data. Graph-based SSL techniques [14][15][16][17] , attempt to generate an undirected graph on the training data in which every point on the graph is connected by a weighted edge. The weights are assigned to the edges in such a way that closer data points tend to have larger weights and hence they likely share same labels. Labels are assigned to the unlabeled points by propagating labels of labeled points to unlabeled ones through the edges of the graph with the amount dependent on the edge weights. This way unlabeled points can all be labeled even if they are not directly connected to the labeled points.
The support vector machine (SVM) classifier is an efficient and reliable learning method and to date is one of the best classifiers in terms of performance 18 over a wide range of tasks. Semi-supervised SVM techniques expand the idea of traditional SVM to incorporate the ability to use partially labeled datasets to learn reliable models while maintaining accuracy. The idea is to minimize an objective function by examining all possible label combinations of the unlabeled data iteratively in order to find low-density regions in the data space to place the decision boundary through [19][20][21][22] . Many implementations of the objective functions have been reported in the literature however these are often time inefficient. The reader is referred to Chapelle et al. 's work 23 for a review comparing different methods. Kernel tricks which implement the cluster assumption in SSL have also been proposed 24,25 .
Recently, there have been some attempts to replace the lengthy objective function optimization process of semi-supervised SVMs by cluster analysis 6,26,27 . The concept behind these cluster-then-label techniques for semi-supervised learning 28 is to first find point clusters of high density regions in data space and then assign labels to the identified clusters. A supervised learner is then used to find the separating decision boundary that passes through low density regions in data space (i.e. between the clusters). In this study, we propose a novel cluster-then-label semi-supervised learning method and compare its performance with other state-of-the-art techniques for two digital pathology tasks; triaging clinically relevant regions of breast whole mount images 29 and the classification of nuclei figures into lymphocyte, normal epithelial and malignant epithelial objects.  30 , we demonstrated that a semi-supervised cluster-then-label method was able to produce a reliable classification model from small amounts of labeled data. In this study, we propose an improvement of the method proposed in our earlier study 30 and we carry out an extensive experimental comparison with other state-of-the-art semi-supervised techniques on two different pathology image classification tasks.
Clustering Analysis for Semi-supervised learning. Inspired by the work published by Ankerst et al. 31 , we propose a cluster-then-label based SSL method that works by finding the underlying structure of points (clusters of points forming high density regions) in the data space. A standard supervised SVM is then employed to find the decision boundary using knowledge about the underlying structure of the data space provided by the clustering analysis. In the Ankerst et al. 's 31 study, an ordering of points in the data space was found based on how points are spatially located around each other. Therefore, spatially closest points become neighbors in the ordering set. The clustering approach presented by Ankerst et al. 31 is unsupervised and finding the clusters from the ordering set is a challenge.
In this paper, our approach in finding spatially closest points in the data space is somewhat similar to the one proposed by Ankerst et al. 31 , in the sense that points are grouped in such a way that they form clusters of densely populated points separated by regions with sparsely located points (low density). We consider a semi-supervised seeded approach to finding spatially closest points and checking how inclined unlabeled points are toward each of their surrounding labeled points. The algorithm starts by calculating the core radii of the labeled points with respect to all points in the data space. A labeled/unlabeled point q is located at a core radius to a labeled point p if it is within a circle/sphere with p as its center and ε as its radius. The value of ε for every point p is defined as the distance from p to the kth closest point located within the neighborhood of p. Parameter k is the minimum number of points located at the neighboring of p that could form a cluster and is specified by the user. Thus, the core radius is low in high density regions and high in low density regions. For this study, the value of k was set to one tenth of the number of points in the data space. This value of k showed a more consistent performance in our preliminary experiments 30 .
The core radii of all labeled points with respect to the whole data space are calculated. We define a distance matrix D of the size l × u where l is the number of labeled points and u is the number of unlabeled points. The Euclidean distance between the labeled point p i and the unlabeled point q j is compared to the core distance ε i , and d ij is defined as the maximum of these two values. Therefore, this could be written as: where ε i is the core radius of the labeled point p i , l is the number of labeled points, and u is the number of unlabeled points.
Once the matrix D has been populated using expression (1), it is used to find the closest labeled point for each unlabeled point. Figure 2 shows an example of how core radii are useful in assigning unlabeled points to different clusters and highlights differences between conventional clustering methods. In Fig. 2(a) the unlabeled point q is located within the core radii of both the labeled points p 1 and p 2 . Since ε 2 < ε 1 , q is assigned to p 2 despite the fact that, according to the Euclidean distance, it is actually closer to p 1 . In Fig. 2(b) the unlabeled point q is within the core radius of p 1 and lies outside of the core radius of p 2 however, since ε 1 > ||p 2 − q|| 2 , q is again assigned to p 2 . In cases where an unlabeled point q is equidistant between two points with different labels, q is assigned a negative label. Since ε 2 < ε 1 , q is assigned to p 2 despite the fact that, according to the Euclidean distance, it is actually closer to p 1 . (b) An unlabeled point q is within the core radius of the labeled point p 1 but not p 2 . Since ε 1 > ||p 2 − q|| 2 , q is again assigned to p 2 . ε 1 , and ε 2 are the core radii of the labeled points p 1 and p 2 respectively. Please note that the ε 1 and ε 2 are different based on the density of points surrounding them. In this representation, k is set to be 7. This, however, rarely happens in practice as distances are represented as floating point values. Hence, this method finds and groups the points that match both the smoothness and cluster assumptions in SSL and is referred to as Semi-supervised Seeded Density-based (S 3 DB) clustering hereafter.
After the groups of points that form clusters are identified and the underlying structure of the data space is learned (using S 3 DB), this knowledge is given to an SVM classifier with radial basis function (RBF) as kernel to find the maximum margin boundary that passes through the sparse regions.
Datasets. Pathology Triaging Image Dataset. Recently 29 , we addressed the problem of triaging digital pathology images and employed a supervised learning method to distinguish between different relevant or irrelevant breast tissue regions. Here, we consider a more extensive dataset and focus on the statistical learning aspect of the problem. The goal is to achieve a high sensitivity of at least 95% in detecting relevant regions while maintaining the highest possible specificity.
Data Collection: To generate a ground-truth dataset, we have used whole-mount images (WMIs) 32 of 30 breast lumpectomy specimens stained with hematoxylin and eosin (H&E) (n = 150 WMIs). The slides corresponding to 28 of the patients were scanned at 5X magnification (135 WMIs, 2 μm/pixel) and 2 of them were scanned at 10X (15 WMIs, 1 μm/pixel). Patches of 512 × 512 pixels (1 mm 2 for 5X and 0.25 mm 2 for 10X images) were cropped from each WMI at the highest magnification by overlaying a grid of uniformly spaced squares on the previously preprocessed (adaptive thresholding and morphological operations) tissue regions (Fig. 3). The collaborating pathologist then labeled patches from the 2 patients scanned at 10X magnification (15 WMIs, 2849 patches) and 8 patients scanned at 5X magnification(115 WMIs, 2302 patches labeled, 2100 patches unlabeled). For each patch, the pathologist evaluated the presence of diagnostically relevant information corresponding to each tissue type. According to the pathologist's annotations, diagnostically relevant features include cancers, atypias, microcalcifications and lymphovascular invasion, and irrelevant features include fat, stroma, normal ducts and lobules. To assess inter-observer variability when labeling the triaging ground-truth set, a random subset of 1500 patches were evaluated by a second pathologist. The Kappa agreement coefficient between the two pathologists was κ = 0.77. Figure 4 shows a subset of this ground-truth set. We have also added 1500 unlabeled image patches from the remaining 20 patients scanned at 5X (20 WMIs). This set of unlabeled patches was used to improve the generalization performance of the learning models as mentioned in section 3.3.1.
Texture Feature Extraction from Patches: To retrieve texture features from image patches, they were converted from RGB to Lab colorspace and the normalized luminance channel was divided into smaller non-overlapping tiles of size 32 × 32 pixels. Root filter set (RFS) 33 texture filters were used to highlight different textures from image tiles. First order statistical measures (mean, mode, standard deviation, skewness and kurtosis) were calculated from the maximal filter responses along all filter orientations of each scales to combine the texture information. To regroup all extracted information from individual tiles and form one numerical representative per image patch, the bag of words (BoW) 34 technique was used with a previously found optimum dictionary size of 100 29 . The calculated 100-dimensional histograms of words per individual image patch were used to train and evaluate the statistical learning techniques presented in this paper. We used the RBF kernel of the SVM classifier implemented in libsvm library 35 to find the best separating hyperplane between the two classes. Data Collection: H&E stained sections from 92 post-NAT lumpectomy specimens were scanned at 20X magnification (0.5 μm/pixel). The whole slide images (WSIs) were annotated by an expert pathologist using the Sedeen Viewer 37 (Pathcore, Toronto, Canada). A total of n = 166 rectangular regions of interest (ROIs) were defined on the 92 WSIs and, within these ROIs, the centers of nuclei figures were labeled as either lymphocyte, benign epithelial, or malignant epithelial. Nuclei that were out of focus, out of plane, or could not be categorized, were not marked by the pathologist. More than 30,000 nuclei figures (n = 3,868 lymphocyte, n = 10,407 benign epithelial, and n = 16,419 malignant epithelial figures) were marked from all 116 ROI patches.
Nuclei Feature Extraction from Patches: In order to train the proposed SSL method, the nuclei have to be segmented first. We have developed a segmentation method 38 that works by manipulating the original RGB colorspace of the image patches to better identify foreground nuclei figures. Multilevel thresholding and marker controlled watershed algorithms were then used to extract nuclei regions and divide overlapping nuclei figures. The nuclei segmentation method achieved an F1-score of 0.9 when tested against a publicly available dataset of 7931 nuclei from 36 images 39 . The effect of color variation from the image patches was reduced by standardizing their color to a reference image as explained in our recent study 36 . The segmentation method was able to segment more than 72% of the ground-truth nuclei figures (n = 21,779) from the 166 ROI patches. The segmented figures were used to extract 125-dimensional feature vectors from individual nuclei figures based on intensity, morphological, textural, and spatial properties describing their differences among the three classes 36 . Table 1 summarizes the datasets used to validate and compare performances of the supervised and semi-supervised learning methods described in section 3.1.

Experimental Setup
Comparison with State-of-the-art Methods. We compare our proposed SSL method (S 3 DB + SVM) with a range of successful supervised and semi-supervised methods in the literature.
Method for supervised learning. The standard supervised SVM technique implemented in the libsvm 35 library was considered to find the separating decision hyperplane. Here we used the RBF kernel and a similar parameter optimization approach to other methods described in this paper was followed, as explained in the subsequent section. Let X = x 1 , x 2 , …, x l be the set of d-dimensional labeled points with Y = y 1 , y 2 , …, y l be their labels. The SVM technique works by minimizing the optimization function presented in equation (2) to find the maximum margin hyperplane parameters dividing the two classes.    Methods for semi-supervised learning. a) semi-supervised Fuzzy c-mean (ssFCM) clustering + SVM: this method has been previously employed for semi-supervised learning 2,27,40,41 . The idea is to first apply the semi-supervised clustering to both labeled and unlabeled data to find the underlying structure of the space (hard labeling) and then a supervised classifier is trained on the labeled data. The semi-supervised version of the original unsupervised FCM in particular is useful to provide a prior knowledge on the structure of the space in the form of labels 27 . Therefore, in the following optimization problem, the first term is to discover the data space structure of the labeled data and the second term takes care of the unlabeled data. Let X = x 1 , x 2 , …, x l be the set of d-dimensional labeled points with Y = y 1 , y 2 , …, y l be their labels and ⁎ be the set of unlabeled points, then the ssFCM objective function can be written as: where c is the number of classes (c = 2 for binary classification), m is the degree of fuzziness (we set m = 2), V represents the set of prototypes corresponding with each class, U and U * are matrices that define the fuzzy membership values for the labeled and unlabeled data points respectively and u ij is the probability that the jth labeled data point belongs to class i. The maximum number of iterations for the experiments using this method was set to 1000 rounds. We have used the RBF-SVM classifier in conjunction with the semi-supervised FCM method similar to the one presented by Gan et al. 27 . The parameters of the SVM classifier was optimized using a similar strategy to other methods described in this paper as explained in section 3.3.
b) TSVM 19,42 : this SSL method is one of the most successful implementations of the semi-supervised SVM technique in terms of performance 23 . The algorithm starts by learning a partially complete model using the labeled data only and then applies this to the unlabeled data. The method then improves the initial solution by switching the labels assigned to the unlabeled data to decrease the objective function after each iteration. The label switching mechanism is important to ensure the balancing constraints between the two classes are maintained. Therefore, the main objective function to be minimized in this method is as follow: Visualization and Cluster Separability. To visualize the underlying distributions of data spaces used in this study in lower dimensions, t-distributed Stochastic Neighbor Embedding (t-SNE) was used 43 . It is an iterative method which maps data points into lower dimensional space in such a way that the distances between points correspond to their similarity. Also, we have used Fisher Discriminant Ratio (FDR) [44][45][46] , as a measure of cluster separability. FDR measures the cluster separability by calculating the square of the difference between means of points in each cluster divided by the sum of square of their standard deviations: Experimental Design. In order to evaluate the performance of the proposed SSL technique and compare with state-of-the-art methods, the following validation steps were taken.
Triaging Image Dataset. Here, the dataset described in section 2.2.1 was divided into a training set containing the patches scanned at 5X magnification, and a testing set containing the 2 patient datasets scanned at 10X (n = 2849 labeled image patches). The training data was further subdivided into two components; one part contained the labeled and unlabeled patches from the 8 patients reviewed by the pathologist (n = 4402 patches, 307 relevant, 1995 irrelevant, and 2100 unlabeled image patches with the mean and standard deviation of 283 ± 90 labeled patches and 267 ± 243 unlabeled patches per patient) and the second part consisted of the of 1500 unlabeled image patches taken from the remaining patients' WMIs (section 2.2.1). An 8-fold patient-wise cross-validation scheme was used to train and validate the performance of learning methods. The optimum SVM-RBF parameters were chosen by examining a range of possible SVM trade-off parameter (C) and the kernel width (γ) values on the training set. The additional set of 1500 unlabeled image patches (section 2.2.1) was included in all folds of the cross-validation scheme.
Validation step for semi-supervised methods: For every fold in semi-supervised learning methods, one or more of the patient datasets were randomly selected to be the labeled set (unlabeled images of chosen patients were kept unlabeled) and labels of the rest of patients were kept hidden (unlabeled set).
Validation step for supervised method: Similarly, for the supervised learning method, in every fold one or more patient datasets were randomly selected to be the labeled set (unlabeled images of chosen patients were discarded) and rest of the patients data were also discarded.
The randomly selected patients, dictionary of words, and histograms of words were kept the same to form paired labeled sets in every fold of each experiment. To do a fair comparison between different methods, we defined the optimum SVM-RBF parameter set by first identifying all sets that produced a sensitivity of 95%; from which the set with maximum specificity was chosen.
Validation using an unseen set: To compare the generalization performance of the methods, the median of the optimized parameters found in all 8-folds of the cross-validation was considered to train an overall model using all training images. For semi-supervised methods, one or more of patients were randomly chosen to form the labeled set and the labels of the rest of patients were kept hidden. For the supervised method also, one or more of patients data were chosen to form the labeled set and the rest of patients data were discarded. The overall performance of the trained model was assessed using the the two unseen patient cases in the test set. To match our previously trained models on 5X magnified images, the test image patches, which were scanned at 10X magnification, were down-sampled.
Nuclei Figure Classification Dataset. The aim of this experiment was to see whether adding many unlabeled instances to an already large set of labeled instances improved the classification performance when comparing an SSL technique with a supervised learning method. A cascaded learning approach was used to first train a classifier to distinguish between lymphocyte versus epithelial figures (L vs. BM) and then to distinguish between benign versus malignant classes (B vs. M).
The supervised SVM was trained using n = 13,821 labeled nuclei figures (n = 2,260 Lymphocytes, n = 3,157 Benign epithelial, and n = 8404 Malignant epithelial figures). Both the labeled figures and an additional n = 49,000 unlabeled figures were used by the semi-supervised methods.
For both supervised and semi-supervised training, a 5-fold cross-validation was performed to assess the performance of the learning methods. In this experiment, the best parameters were chosen in such a way as to maximize the accuracy.
Once the best parameters had been selected a final model was trained on the whole training dataset using the median of the best parameters in all 5 folds and this was applied to an unseen test set of n = 7,958 nuclei figures to evaluate the generalizability of the trained models.

Results
Comparing Classification Performances. Mean accuracy of the subject-wise cross-validated experiments are shown in Fig. 5 for different number of patients chosen to be the labeled set from the pathology triaging image dataset. As can be seen from Fig. 5, our clustering-based SSL technique (S 3 DB + SVM) achieved a superior performance compared to the other state-of-the-art supervised and semi-supervised methods. Table 2 summarizes the cross-validated performance of different methods on triaging image dataset at an operating point of 95% sensitivity. A pairwise Wilcoxon Signed-Rank test using 8 cross-validated accuracy values was used to compare each method at a given number of labeled patients to that of our proposed SSL technique. For each method tested, we had 6 comparisons, therefore for a two-tailed test with a 5% type I error the Bonferroni adjusted α-value = 0.004. Although none of the comparisons achieved statistical significance against this adjusted threshold, there is an increasing trend in the classification performance of our proposed method compared to other techniques. It is clear by looking at the specificity column that in most cases our method maintained higher specificity values at 95% sensitivity compared to the other methods at all individual experiments (except when number of labeled patients = 1). It is also interesting to note that the average train time of our method is significantly lower than the TSVM technique, which requires a heavy optimization on a 64-bit Intel(R) Xeon(R) CPU (at 3.50 GHz) machine. The supervised SVM method however, took the least amount of time to train a model using the labeled data made available to it. Table 3 summarizes the performance of the four methods using an overall model trained in the first validation phase using number of labeled patients as 3 on a totally unseen test set. The test set consisted of two patient cases scanned at 10X magnification. It is clear that our method consistently performs better in classifying image patches compared to the other supervised and semi-supervised techniques in a totally unseen test set. Table 4 summarizes the mean performance of the 5 fold cross-validation on n = 13,821 labeled nuclei figures combined with n = 49,000 unlabeled objects using our proposed SSL technique compared with supervised SVM method trained on the labeled portion only. No statistically significant difference in performance was observed between the accuracy pairs of the supervised SVM and our proposed SSL method in Table 4 using a pairwise Wilcoxon Signed-Rank test.    Table 5 summarizes the performance of applying the models generated from both supervised and semi-supervised methods on the training part on an independent testing set of n = 7,958 nuclei figures. From Tables 4 and 5 it is clear that the proposed SSL method did not fit well for this dataset and the performance is poor compared to the supervised SVM method.
Comparing Cluster Separability Measures. In order to have a sense of how separable the clusters of each class are with respect to each other, FDR measures are summarized in Table 6. From Table 6 we can see that as the FDR measure increases (relevant vs. Irrelevant and L vs. BM) the classes in each dataset tend to form separable clusters while in case of B vs. M where separable clusters are not formed the FDR measure is low. Furthermore, in order to visually compare the distribution of different class labels in feature spaces of both datasets, their dimensions were reduced using the t-SNE method 43 . Figure 6 shows the dimensionality-reduced data space of the triaging image dataset with every point representing an image patch. Similarly, Fig. 7 shows the data space of the nuclei figure classification dataset with every point representing a nuclei figure. Considering Fig. 6, it can be seen that the relevant and irrelevant classes form separable clusters in the feature space while considering Fig. 7, it can be observed that lymphocyte class is better separated compared to the other two classes. Comparing benign versus malignant epithelial classes in the same figure we see that they do not tend to form separable clusters of points thus violating the cluster assumption of SSL.
There is a slight imbalance between the classes in both triaging image dataset and nuclei figure dataset. To determine whether this affected performance, we repeated the training and testing to check the consistency of the classification performance for both datasets, the weight values of the SVM models for minority class were   Table 4. Mean performance of 5-fold cross-validation on nuclei figure classification dataset using S 3 DB + SVM semi-supervised method (n = 13821 labeled nuclei objects and n = 49000 unlabeled ones) and supervised SVM method (n = 13821 labeled nuclei objects).  Table 5. Performance of applying the generated model from dataset used in Table 4 on n = 7958 nuclei objects from an independent testing set using S 3 DB + SVM and supervised SVM methods.  set to the ratio of the number of data samples in the majority class to that of the minority class. The classification performance was found to be consistent for both the datasets suggesting that the class imbalance did not affect the classification performance.

Discussion and Conclusion
In this study, we proposed a cluster-then-label based semi-supervised technique to find the underlying structure of the data space and provide this knowledge to train a reliable model. We have compared and validated this technique with other state-of-the-art supervised and semi-supervised methods for triaging breast digital pathology image patches and classifying nuclei figures. We found that when the method is used for the appropriate dataset the classification performance is superior and training time is much lower compared to the other semi-supervised methods.  Our proposed method did not perform as well as TSVM when only 1 patient data was made available as the labeled set for triaging image dataset ( Table 2). This is most likely due to the failure in the clustering method because an insufficient number of labeled points were made available to it. The method improved as the number of labeled points increased. Surprisingly, for the triaging image dataset, using the ssFCM method did not add any improvements to the classification performance compared to the supervised SVM method. This may be because of the fact that ssFCM assumes the underlying shape of clusters come from a Gaussian distribution; this leads to incorrect label assignment (in the cases where cluster shapes do not come from a Gaussian distribution) which in turn produces an incorrect decision boundary. Furthermore, one reason for a better performance of our method compared to the other cluster-then-label based techniques is that no assumptions are made about the underlying probability distribution of the clusters and so it can cope with clusters of any shape and form.
Although the improvements in accuracy of our proposed method compared with other techniques in Table 2 were not statistically significant after applying a Bonferonni correction, the effect size was large, with improvements of 20.4% and 25.3% in accuracy over the supervised SVM and the ssFCM methods respectively. The improvement in accuracy compared with the TSVM method was about 4% but the TSVM was very computationally expensive with each model taking more than 4 days to train.
Looking at Tables 4 and 5 for nuclei classification task, our method has a poor performance compared to the supervised SVM. The reason for this poor performance could be because the underlying structure of the data points in these datasets does not form proper clusters. This is supported by the FDR measures reported in Table 6 and t-SNE plot in Fig. 7. As shown in Table 6, the FDR measures for the relevant vs. irrelevant data has a higher value compared to L vs. BM and B vs. M data. Furthermore, from Figs 6 and 7 we can observe that relevant vs. irrelevant and L vs. BM tend to form clusters of points in the dimensionality-reduced t-SNE plots while B vs. M data does not form detectable clusters of points thus violating the cluster assumption of SSL. It is also important to note that semi-supervised learning methods are traditionally suitable for applications where only limited labeled data are available. This means that SSL methods may not work as well as supervised methods when large amounts of labeled data are present 1 .
In our preliminary experiments 30 , we systematically examined the effect of k, which controls the number of points that lie within the neighborhood of a labeled point, on a subset of our dataset. We found that the performance of our method was stable when a sufficiently large value was chosen for k. The best performance was achieved for k = one tenth of the number of points in the dataset.
Although TSVM is one of the top performing implementations of semi-supervised SVM, its performance was not found to be better on small-sized synthetic datasets when compared to the Branch and Bound (BB) technique 21,23 . The BB method seems to find the globally optimal solution for semi-supervised learning since it efficiently looks through all label combinations in the data space. However, due to its growing search tree basis for finding the solution, its train time is reported to be even slower than TSVM making it infeasible to apply on datasets with more than 200 data points 21 .
Our proposed semi-supervised cluster-then-label method showed improved performance over other methods for the triaging task, however, it did not perform well in the nuclei classification task. This suggests that although semi-supervised approaches may be useful in digital pathology where generating sufficiently large labeled datasets is a challenge, additional work is needed to identify whether the clustering assumptions are valid for specific tasks.