Abstract
Completely labeled pathology datasets are often challenging and time-consuming to obtain. Semi-supervised learning (SSL) methods are able to learn from fewer labeled data points with the help of a large number of unlabeled data points. In this paper, we investigated the possibility of using clustering analysis to identify the underlying structure of the data space for SSL. A cluster-then-label method was proposed to identify high-density regions in the data space, which were then used to help a supervised SVM find the decision boundary. We compared our method with other state-of-the-art supervised and semi-supervised techniques using two different classification tasks applied to breast pathology datasets. We found that our SSL method is able to improve classification performance over these techniques when only a limited number of labeled data instances is available. We also showed that it is important to examine the underlying distribution of the data space before applying SSL techniques, to ensure that the semi-supervised learning assumptions are not violated by the data.
Introduction
Traditionally, there have been two fundamentally different tasks in the spectrum of pattern recognition and machine learning methods. On one side is supervised learning, in which the goal is to learn a model from labeled data points. The learned model is then applied to an unseen test set, and the method is validated based on how successfully it assigns the test data to the different classes. The disadvantage of supervised learning techniques is that they are limited to learning from labeled datasets, which are often expensive, time-consuming, or difficult to generate. If the available labeled dataset is too small and does not represent the true variance of the data space, then generalization performance may be poor. This issue is even more acute in the medical image analysis domain, since generating high-quality datasets requires the effort of experienced and trained human observers. On the other side of the spectrum are the unsupervised learning methods, in which unlabeled data points are grouped into clusters that share similar properties. Unlabeled datasets are often easier to acquire and require less human effort to create; however, since the information provided to these techniques is unlabeled, there is no clear way to validate the quality of this approach. In contrast to supervised learning, which only considers labeled data, and unsupervised learning, which works only on unlabeled data, semi-supervised learning (SSL) methods work with both labeled and unlabeled data points. Therefore, by using SSL, it is possible to combine the advantages of working with a small labeled dataset to guide the learning process and a larger unlabeled dataset to increase the generalizability of the found solution, as shown in Fig. 1^{1}.
Pathology images are an important source of diagnostic and prognostic information. With the advent of whole slide scanner technologies, pathology slides are being digitized at microscopic resolution, making it possible to store and analyze digital pathology images using computer systems. This has led to a rapidly growing field of research into machine learning techniques that can be used to classify images and provide quantitative information. A major difficulty facing researchers is the availability of labeled training data. Whole slide pathology images are orders of magnitude larger than other medical images, and they are more complex. Pathologists use a combination of color, texture and morphological information that varies across multiple scales to interpret images, and spend many years learning how to cope with the enormous variability in the appearance of specific tissue and disease types. This means that an expert is required to provide ground-truth labels, and that for every new application additional training and validation data are needed; this makes the use of semi-supervised learning particularly relevant for digital pathology.
In this paper, we present a semisupervised learning method that analyzes groups of labeled and unlabeled points in multidimensional feature space in order to identify areas of high density and then guides the learning method to place decision boundaries through the regions with low density. We apply this technique to the analysis of digital pathology images of breast cancer.
Related Work
Semi-supervised learning methods are not commonly used in the pathology image analysis field, although they have previously been employed in some applications of medical image analysis to improve classification performance on partially labeled datasets^{2,3,4,5}. In order for semi-supervised learning methods to make the most of the labeled and unlabeled data, some assumptions are made about the underlying structure of the data space^{1}. Among these, the smoothness and cluster assumptions are the basis of most state-of-the-art techniques^{6}. The smoothness assumption states that points located close to each other in the data space are more likely to share the same label, and the cluster assumption states that data points belonging to one class are more likely to form a shared group/cluster of points. The core objective of these two assumptions is therefore to ensure that the found decision boundary lies in low-density rather than high-density regions of the data space.
The most basic and easiest SSL method to apply is self-training^{7,8,9,10}, which involves repeatedly training and retraining a statistical model. First, the labeled data is used to train an initial model, and then this model is applied to the unlabeled data. The unlabeled points for which the model is most confident in assigning labels are then added to the pool of labeled points, and a new model is trained. This process is repeated until some convergence criterion is met. Another family of methods is based on generative models^{11,12,13}, in which some assumptions are made about the underlying probability distribution of the data in feature space. The parameters defining the assumed generative model are then found by fitting the model to the data. Graph-based SSL techniques^{14,15,16,17} attempt to generate an undirected graph on the training data in which every point on the graph is connected by a weighted edge. The weights are assigned to the edges in such a way that closer data points tend to have larger weights, and hence are more likely to share the same label. Labels are assigned to the unlabeled points by propagating the labels of the labeled points through the edges of the graph, with the amount propagated dependent on the edge weights. In this way unlabeled points can all be labeled even if they are not directly connected to the labeled points.
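As a concrete illustration of the self-training loop just described, the following minimal Python sketch uses a nearest-centroid base model and an exponential-distance confidence score; both are stand-in choices for whatever classifier and confidence measure an application would actually use:

```python
import numpy as np

def self_train(X_lab, y_lab, X_unlab, conf_threshold=0.9, max_rounds=10):
    """Minimal self-training loop: repeatedly fit a nearest-centroid model,
    pseudo-label the unlabeled points it is most confident about, and refit."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    pool = X_unlab.copy()
    for _ in range(max_rounds):
        if len(pool) == 0:
            break
        # Fit: one centroid per class.
        classes = np.unique(y_lab)
        centroids = np.array([X_lab[y_lab == c].mean(axis=0) for c in classes])
        # Predict: distance to each centroid, softmax-style confidence
        # (a stand-in score; any calibrated probability would do).
        d = np.linalg.norm(pool[:, None, :] - centroids[None, :, :], axis=2)
        p = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
        conf, pred = p.max(axis=1), classes[p.argmax(axis=1)]
        keep = conf >= conf_threshold
        if not keep.any():
            break
        # Move the confident points into the labeled pool and repeat.
        X_lab = np.vstack([X_lab, pool[keep]])
        y_lab = np.concatenate([y_lab, pred[keep]])
        pool = pool[~keep]
    return X_lab, y_lab
```

Each round moves only the points whose confidence clears the threshold, so poorly separated points are pseudo-labeled late (or never), which limits the error reinforcement that plagues naive self-training.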
The support vector machine (SVM) classifier is an efficient and reliable learning method, and to date is one of the best-performing classifiers^{18} over a wide range of tasks. Semi-supervised SVM techniques extend the idea of the traditional SVM with the ability to use partially labeled datasets to learn reliable models while maintaining accuracy. The idea is to minimize an objective function by iteratively examining possible label combinations of the unlabeled data in order to find low-density regions in the data space through which to place the decision boundary^{19,20,21,22}. Many implementations of such objective functions have been reported in the literature; however, these are often computationally inefficient. The reader is referred to Chapelle et al.'s work^{23} for a review comparing different methods. Kernel tricks that implement the cluster assumption in SSL have also been proposed^{24,25}.
Recently, there have been some attempts to replace the lengthy objective function optimization process of semi-supervised SVMs with cluster analysis^{6,26,27}. The concept behind these cluster-then-label techniques for semi-supervised learning^{28} is to first find clusters of points forming high-density regions in the data space and then assign labels to the identified clusters. A supervised learner is then used to find the separating decision boundary that passes through the low-density regions of the data space (i.e. between the clusters). In this study, we propose a novel cluster-then-label semi-supervised learning method and compare its performance with other state-of-the-art techniques on two digital pathology tasks: triaging clinically relevant regions of breast whole mount images^{29}, and the classification of nuclei figures into lymphocyte, normal epithelial and malignant epithelial objects.
Methodology
Proposed Method
In an earlier work^{30}, we demonstrated that a semi-supervised cluster-then-label method was able to produce a reliable classification model from small amounts of labeled data. In this study, we propose an improvement of the method from that earlier study^{30}, and we carry out an extensive experimental comparison with other state-of-the-art semi-supervised techniques on two different pathology image classification tasks.
Clustering Analysis for Semi-supervised Learning
Inspired by the work published by Ankerst et al.^{31}, we propose a cluster-then-label based SSL method that works by finding the underlying structure of the points (clusters of points forming high-density regions) in the data space. A standard supervised SVM is then employed to find the decision boundary, using the knowledge about the underlying structure of the data space provided by the clustering analysis. In Ankerst et al.'s^{31} study, an ordering of the points in the data space was found based on how the points are spatially located around each other; spatially closest points therefore become neighbors in the ordering set. The clustering approach presented by Ankerst et al.^{31} is unsupervised, and finding the clusters from the ordering set is a challenge.
In this paper, our approach to finding spatially closest points in the data space is somewhat similar to the one proposed by Ankerst et al.^{31}, in the sense that points are grouped in such a way that they form clusters of densely populated points separated by regions of sparsely located points (low density). We consider a semi-supervised seeded approach that finds spatially closest points and checks how inclined unlabeled points are toward each of their surrounding labeled points. The algorithm starts by calculating the core radii of the labeled points with respect to all points in the data space. A labeled/unlabeled point q lies within the core radius of a labeled point p if it falls inside a circle/sphere with p as its center and ε as its radius. The value of ε for every point p is defined as the distance from p to the kth closest point in the neighborhood of p. The parameter k is the minimum number of points in the neighborhood of p that could form a cluster, and is specified by the user. Thus, the core radius is low in high-density regions and high in low-density regions. For this study, the value of k was set to one tenth of the number of points in the data space, as this value showed the most consistent performance in our preliminary experiments^{30}.
The core radii of all labeled points with respect to the whole data space are calculated. We define a distance matrix D of size l × u, where l is the number of labeled points and u is the number of unlabeled points. The Euclidean distance between the labeled point p_{i} and the unlabeled point q_{j} is compared to the core radius ε_{i}, and d_{ij} is defined as the maximum of these two values:

\({d}_{ij}=\max ({\varepsilon }_{i},\Vert {p}_{i}-{q}_{j}\Vert ),\quad 1\le i\le l,\,1\le j\le u\) (1)

where ε_{i} is the core radius of the labeled point p_{i}, l is the number of labeled points, and u is the number of unlabeled points.
Once the matrix D has been populated using expression (1), it is used to find the closest labeled point for each unlabeled point. Figure 2 shows an example of how core radii are useful in assigning unlabeled points to different clusters, and highlights the differences from conventional clustering methods. In Fig. 2(a) the unlabeled point q is located within the core radii of both labeled points p_{1} and p_{2}. Since ε_{2} < ε_{1}, q is assigned to p_{2} despite the fact that, according to the Euclidean distance, it is actually closer to p_{1}. In Fig. 2(b) the unlabeled point q is within the core radius of p_{1} and lies outside the core radius of p_{2}; however, since ε_{1} > ‖p_{2} − q‖, q is again assigned to p_{2}. In cases where an unlabeled point q is equidistant between two points with different labels, q is assigned a negative label. This, however, rarely happens in practice, as distances are represented as floating point values. Hence, this method finds and groups the points in a way that matches both the smoothness and cluster assumptions of SSL, and is referred to as Semi-supervised Seeded Density-based (S^{3}DB) clustering hereafter.
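A minimal sketch of the S^{3}DB assignment step in Python, assuming Euclidean distance and numpy arrays (the variable names are ours, not from a released implementation):

```python
import numpy as np

def s3db_assign(X_lab, y_lab, X_unlab, k):
    """Sketch of the S3DB assignment step: core radii for the labeled points,
    the l-by-u distance matrix D of expression (1), and labels for the
    unlabeled points from their nearest labeled point under D."""
    X_all = np.vstack([X_lab, X_unlab])
    eps = np.empty(len(X_lab))
    for i, p in enumerate(X_lab):
        d = np.linalg.norm(X_all - p, axis=1)
        # Core radius: distance from p to its k-th closest point
        # (index k skips p itself, whose distance is 0).
        eps[i] = np.sort(d)[k]
    # d_ij = max(eps_i, ||p_i - q_j||)  -- expression (1) in the text.
    D = np.maximum(eps[:, None],
                   np.linalg.norm(X_lab[:, None, :] - X_unlab[None, :, :], axis=2))
    return y_lab[np.argmin(D, axis=0)], eps, D
```

Because d_{ij} floors the Euclidean distance at ε_{i}, labeled points sitting in dense regions (small ε) win assignments against labeled points in sparse regions (large ε), which is exactly the behavior illustrated in Fig. 2.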
After the groups of points that form clusters are identified and the underlying structure of the data space is learned (using S^{3}DB), this knowledge is given to an SVM classifier with a radial basis function (RBF) kernel to find the maximum margin boundary that passes through the sparse regions.
Datasets
Pathology Triaging Image Dataset
Recently^{29}, we addressed the problem of triaging digital pathology images and employed a supervised learning method to distinguish between different relevant or irrelevant breast tissue regions. Here, we consider a more extensive dataset and focus on the statistical learning aspect of the problem. The goal is to achieve a high sensitivity of at least 95% in detecting relevant regions while maintaining the highest possible specificity.
Data Collection: To generate a ground-truth dataset, we used whole-mount images (WMIs)^{32} of 30 breast lumpectomy specimens stained with hematoxylin and eosin (H&E) (n = 150 WMIs). The slides corresponding to 28 of the patients were scanned at 5X magnification (135 WMIs, 2 μm/pixel) and 2 of them were scanned at 10X (15 WMIs, 1 μm/pixel). Patches of 512 × 512 pixels (1 mm^{2} for 5X and 0.25 mm^{2} for 10X images) were cropped from each WMI at the highest magnification by overlaying a grid of uniformly spaced squares on the previously preprocessed (adaptive thresholding and morphological operations) tissue regions (Fig. 3). The collaborating pathologist then labeled patches from the 2 patients scanned at 10X magnification (15 WMIs, 2849 patches) and 8 patients scanned at 5X magnification (115 WMIs, 2302 patches labeled, 2100 patches unlabeled). For each patch, the pathologist evaluated the presence of diagnostically relevant information corresponding to each tissue type. According to the pathologist's annotations, diagnostically relevant features include cancers, atypias, microcalcifications and lymphovascular invasion, and irrelevant features include fat, stroma, normal ducts and lobules. To assess inter-observer variability when labeling the triaging ground-truth set, a random subset of 1500 patches was evaluated by a second pathologist. The Kappa agreement coefficient between the two pathologists was κ = 0.77.
Figure 4 shows a subset of this ground-truth set. We also added 1500 unlabeled image patches from the remaining 20 patients scanned at 5X (20 WMIs). This set of unlabeled patches was used to improve the generalization performance of the learning models, as mentioned in section 3.3.1.
Texture Feature Extraction from Patches: To retrieve texture features, the image patches were converted from RGB to Lab color space and the normalized luminance channel was divided into smaller non-overlapping tiles of size 32 × 32 pixels. Root filter set (RFS)^{33} texture filters were used to highlight different textures in the image tiles. First-order statistical measures (mean, mode, standard deviation, skewness and kurtosis) were calculated from the maximal filter responses along all filter orientations of each scale to combine the texture information. To regroup all the information extracted from individual tiles into one numerical representation per image patch, the bag of words (BoW)^{34} technique was used with a previously found optimum dictionary size of 100^{29}. The calculated 100-dimensional histograms of words per image patch were used to train and evaluate the statistical learning techniques presented in this paper. We used the RBF kernel of the SVM classifier implemented in the libsvm library^{35} to find the best separating hyperplane between the two classes.
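The per-tile statistics and the final bag-of-words histogram can be sketched as below. The RFS filtering itself is omitted, and `codebook` stands in for the 100-word dictionary that would be learned (e.g. by k-means) on training tiles:

```python
import numpy as np

def tile_stats(tile):
    """First-order statistics (mean, mode, std, skewness, kurtosis) of one
    tile's pre-computed maximal filter response."""
    flat = np.asarray(tile, dtype=float).ravel()
    vals, counts = np.unique(flat, return_counts=True)
    mode = vals[counts.argmax()]
    mu, sd = flat.mean(), flat.std()
    skew = ((flat - mu) ** 3).mean() / sd ** 3 if sd > 0 else 0.0
    kurt = ((flat - mu) ** 4).mean() / sd ** 4 if sd > 0 else 0.0
    return np.array([mu, mode, sd, skew, kurt])

def bow_histogram(tile_features, codebook):
    """Assign each tile to its nearest dictionary word and return the
    normalized histogram of words for the whole patch."""
    d = np.linalg.norm(tile_features[:, None, :] - codebook[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```

With a 100-word codebook, `bow_histogram` yields exactly the 100-dimensional representation per patch described above.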
Nuclei Figure Classification Dataset
Recently^{36}, we developed an automated method to assess cancer cellularity in breast tissue removed after neoadjuvant chemotherapy (NAT). As a part of the pipeline, we developed a method to classify nuclei figures into three classes of lymphocyte (L), benign epithelial (B) and malignant epithelial (M) figures from a dataset of image patches annotated by a pathologist. Here, we used the same dataset to validate the proposed SSL technique.
Data Collection: H&E stained sections from 92 post-NAT lumpectomy specimens were scanned at 20X magnification (0.5 μm/pixel). The whole slide images (WSIs) were annotated by an expert pathologist using the Sedeen Viewer^{37} (Pathcore, Toronto, Canada). A total of n = 166 rectangular regions of interest (ROIs) were defined on the 92 WSIs and, within these ROIs, the centers of nuclei figures were labeled as either lymphocyte, benign epithelial, or malignant epithelial. Nuclei that were out of focus, out of plane, or could not be categorized were not marked by the pathologist. More than 30,000 nuclei figures (n = 3,868 lymphocyte, n = 10,407 benign epithelial, and n = 16,419 malignant epithelial figures) were marked across all 166 ROI patches.
Nuclei Feature Extraction from Patches: In order to train the proposed SSL method, the nuclei have to be segmented first. We have developed a segmentation method^{38} that works by manipulating the original RGB color space of the image patches to better identify foreground nuclei figures. Multi-level thresholding and marker-controlled watershed algorithms were then used to extract nuclei regions and divide overlapping nuclei figures. The nuclei segmentation method achieved an F1-score of 0.9 when tested against a publicly available dataset of 7931 nuclei from 36 images^{39}. The effect of color variation across the image patches was reduced by standardizing their color to a reference image, as explained in our recent study^{36}. The segmentation method was able to segment more than 72% of the ground-truth nuclei figures (n = 21,779) from the 166 ROI patches. The segmented figures were used to extract 125-dimensional feature vectors from individual nuclei figures based on intensity, morphological, textural, and spatial properties describing their differences among the three classes^{36}. Table 1 summarizes the datasets used to validate and compare the performance of the supervised and semi-supervised learning methods described in section 3.1.
Experimental Setup
Comparison with State-of-the-art Methods
We compare our proposed SSL method (S^{3}DB + SVM) with a range of successful supervised and semisupervised methods in the literature.
Method for supervised learning
The standard supervised SVM technique implemented in the libsvm^{35} library was used to find the separating decision hyperplane. Here we used the RBF kernel, and the same parameter optimization approach as for the other methods described in this paper was followed, as explained in the subsequent section. Let X = {x_{1}, x_{2}, …, x_{l}} be the set of d-dimensional labeled points and Y = {y_{1}, y_{2}, …, y_{l}} their labels. The SVM technique works by minimizing the optimization function presented in equation (2) to find the parameters of the maximum margin hyperplane dividing the two classes:

\(\mathop{\min }\limits_{\overrightarrow{w},b,\xi }\,\frac{1}{2}{\Vert \overrightarrow{w}\Vert }^{2}+C\sum _{i=1}^{l}{\xi }_{i}\) (2)

subject to:

\({y}_{i}(\overrightarrow{w}\cdot \Phi ({x}_{i})+b)\ge 1-{\xi }_{i},\quad {\xi }_{i}\ge 0,\quad i=1,\ldots ,l\)

where \(\overrightarrow{w}\) and b are the parameters defining the maximum margin hyperplane, Φ(.) is the kernel function, C is the parameter defining the trade-off between the margin size and misclassified examples, and ξ_{i} are the slack variables.
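For illustration, the primal objective of equation (2) can be minimized by stochastic sub-gradient descent. The sketch below is a linear-kernel simplification; the actual experiments use libsvm's RBF kernel and solver rather than this toy optimizer:

```python
import numpy as np

def linear_svm_sgd(X, y, C=1.0, lr=0.01, epochs=200, seed=0):
    """Minimize 0.5*||w||^2 + C * sum(hinge losses) by sub-gradient descent.
    Labels y must be in {-1, +1}. Linear kernel only, for illustration."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:          # point inside the margin: hinge is active
                w -= lr * (w - C * y[i] * X[i])
                b += lr * C * y[i]
            else:                   # only the regularizer contributes
                w -= lr * w
    return w, b
```

The two branches correspond to the two terms of equation (2): the regularizer \(\frac{1}{2}{\Vert \overrightarrow{w}\Vert }^{2}\) always contributes its gradient w, while the hinge term contributes only when the margin constraint is violated.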
Methods for semisupervised learning
a) semi-supervised Fuzzy c-means (ssFCM) clustering + SVM: this method has previously been employed for semi-supervised learning^{2,27,40,41}. The idea is to first apply semi-supervised clustering to both the labeled and unlabeled data to find the underlying structure of the space (hard labeling), and then train a supervised classifier on the labeled data. The semi-supervised version of the original unsupervised FCM is particularly useful for providing prior knowledge of the structure of the space in the form of labels^{27}. In the following optimization problem, the first term discovers the data space structure of the labeled data and the second term takes care of the unlabeled data. Let X = {x_{1}, x_{2}, …, x_{l}} be the set of d-dimensional labeled points with Y = {y_{1}, y_{2}, …, y_{l}} their labels, and \({X}^{\ast }={x}_{1}^{\ast },{x}_{2}^{\ast },\ldots ,{x}_{u}^{\ast }\) the set of unlabeled points; the ssFCM objective function can then be written as:

\(J(U,{U}^{\ast },V)=\sum _{i=1}^{c}\sum _{j=1}^{l}{u}_{ij}^{m}{\Vert {x}_{j}-{v}_{i}\Vert }^{2}+\sum _{i=1}^{c}\sum _{j=1}^{u}{({u}_{ij}^{\ast })}^{m}{\Vert {x}_{j}^{\ast }-{v}_{i}\Vert }^{2}\)

where c is the number of classes (c = 2 for binary classification), m is the degree of fuzziness (we set m = 2), V represents the set of prototypes corresponding to each class, U and U^{*} are matrices that define the fuzzy membership values for the labeled and unlabeled data points respectively, and u_{ij} is the probability that the jth labeled data point belongs to class i. The maximum number of iterations for the experiments using this method was set to 1000 rounds.
We used the RBF-SVM classifier in conjunction with the semi-supervised FCM method, similar to the approach presented by Gan et al.^{27}. The parameters of the SVM classifier were optimized using the same strategy as for the other methods described in this paper, as explained in section 3.3.
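A compact sketch of the ssFCM alternation, assuming the labeled memberships are clamped crisply to their class (one common seeding choice) and the standard FCM membership update for the unlabeled points:

```python
import numpy as np

def ssfcm(X_lab, y_lab, X_unlab, c=2, m=2.0, iters=100, tol=1e-6):
    """Sketch of seeded/semi-supervised FCM: labeled memberships are clamped
    to their class; unlabeled memberships follow the usual FCM update."""
    U_lab = np.eye(c)[y_lab].T                   # c x l, crisp and fixed
    U = np.full((c, len(X_unlab)), 1.0 / c)      # c x u, to be learned
    X_all = np.vstack([X_lab, X_unlab])
    for _ in range(iters):
        # Prototype update: membership-weighted mean over all points.
        W = np.hstack([U_lab, U]) ** m
        V = (W @ X_all) / W.sum(axis=1, keepdims=True)
        # Membership update for unlabeled points:
        # u_ij = d_ij^(-2/(m-1)) / sum_k d_kj^(-2/(m-1))
        d = np.linalg.norm(X_unlab[None, :, :] - V[:, None, :], axis=2)
        d = np.maximum(d, 1e-12)
        p = 2.0 / (m - 1.0)
        U_new = 1.0 / (d ** p * (1.0 / d ** p).sum(axis=0))
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return V, U
```

The clamped labeled memberships are what anchor each prototype to its seeded class, so the learned U can be hard-thresholded into the "hard labeling" the text describes before training the SVM.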
b) TSVM^{19,42}: this SSL method is one of the most successful implementations of the semi-supervised SVM technique in terms of performance^{23}. The algorithm starts by learning a partially complete model using the labeled data only and then applies it to the unlabeled data. The method then improves the initial solution by switching the labels assigned to the unlabeled data so as to decrease the objective function after each iteration. The label switching mechanism is important to ensure that the balancing constraints between the two classes are maintained. The main objective function to be minimized in this method extends equation (2) with slack variables \({\xi }_{j}^{\ast }\) for the u unlabeled points:

\(\mathop{\min }\limits_{\overrightarrow{w},b,\xi ,{\xi }^{\ast }}\,\frac{1}{2}{\Vert \overrightarrow{w}\Vert }^{2}+C\sum _{i=1}^{l}{\xi }_{i}+{C}^{\ast }\sum _{j=1}^{u}{\xi }_{j}^{\ast }\)

subject to:

\({y}_{i}(\overrightarrow{w}\cdot \Phi ({x}_{i})+b)\ge 1-{\xi }_{i},\,{\xi }_{i}\ge 0\) for the labeled points, and \({y}_{j}^{\ast }(\overrightarrow{w}\cdot \Phi ({x}_{j}^{\ast })+b)\ge 1-{\xi }_{j}^{\ast },\,{\xi }_{j}^{\ast }\ge 0\) for the unlabeled points, where \({y}_{j}^{\ast }\) are the labels currently assigned to the unlabeled points and C^{*} controls their influence.
Visualization and Cluster Separability
To visualize the underlying distributions of the data spaces used in this study in lower dimensions, t-distributed Stochastic Neighbor Embedding (t-SNE) was used^{43}. It is an iterative method which maps data points into a lower-dimensional space in such a way that the distances between points correspond to their similarity. We also used the Fisher Discriminant Ratio (FDR)^{44,45,46} as a measure of cluster separability. The FDR measures cluster separability as the square of the difference between the means of the points in each cluster divided by the sum of the squares of their standard deviations:

\(FDR=\frac{{({\overline{m}}_{1}-{\overline{m}}_{2})}^{2}}{{\overline{s}}_{1}^{2}+{\overline{s}}_{2}^{2}}\)

where \({\overline{m}}_{1}\) and \({\overline{m}}_{2}\) are the means of the points, and \({\overline{s}}_{1}\) and \({\overline{s}}_{2}\) are the standard deviations of the points, in clusters 1 and 2 respectively.
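This can be computed directly from the definition above; averaging the per-feature ratios for multivariate data is an assumption of ours (the one-dimensional case matches the formula exactly):

```python
import numpy as np

def fisher_discriminant_ratio(X1, X2):
    """FDR between two clusters: squared mean difference over the sum of
    squared standard deviations, computed per feature and averaged."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    s1, s2 = X1.std(axis=0), X2.std(axis=0)
    return ((m1 - m2) ** 2 / (s1 ** 2 + s2 ** 2)).mean()
```

Large values indicate well-separated clusters (e.g. relevant vs. irrelevant patches), while values near zero indicate heavily overlapping ones.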
Experimental Design
In order to evaluate the performance of the proposed SSL technique and compare with stateoftheart methods, the following validation steps were taken.
Triaging Image Dataset
Here, the dataset described in section 2.2.1 was divided into a training set containing the patches scanned at 5X magnification, and a testing set containing the 2 patient datasets scanned at 10X (n = 2849 labeled image patches). The training data was further subdivided into two components: one part contained the labeled and unlabeled patches from the 8 patients reviewed by the pathologist (n = 4402 patches: 307 relevant, 1995 irrelevant, and 2100 unlabeled image patches, with a mean and standard deviation of 283 ± 90 labeled patches and 267 ± 243 unlabeled patches per patient), and the second part consisted of the 1500 unlabeled image patches taken from the remaining patients' WMIs (section 2.2.1).
An 8-fold patient-wise cross-validation scheme was used to train and validate the performance of the learning methods. The optimum SVM-RBF parameters were chosen by examining a range of possible SVM trade-off parameter (C) and kernel width (γ) values on the training set. The additional set of 1500 unlabeled image patches (section 2.2.1) was included in all folds of the cross-validation scheme.
Validation step for semi-supervised methods: For every fold, one or more of the patient datasets were randomly selected to be the labeled set (unlabeled images of the chosen patients were kept unlabeled), and the labels of the remaining patients were kept hidden (the unlabeled set).
Validation step for the supervised method: Similarly, for the supervised learning method, in every fold one or more patient datasets were randomly selected to be the labeled set (unlabeled images of the chosen patients were discarded), and the rest of the patients' data were also discarded.
The randomly selected patients, the dictionary of words, and the histograms of words were kept the same to form paired labeled sets in every fold of each experiment. To make a fair comparison between the different methods, we defined the optimum SVM-RBF parameter set by first identifying all sets that produced a sensitivity of at least 95%, from which the set with the maximum specificity was chosen.
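The selection rule can be sketched as: keep every (C, γ) setting whose cross-validated sensitivity reaches the target, then pick the one with the highest specificity. The fallback branch for when no setting reaches the target is our own addition, not from the text:

```python
import numpy as np

def pick_operating_params(param_sets, sens, spec, min_sens=0.95):
    """Among all parameter sets reaching the target sensitivity, return the
    one with the highest specificity."""
    sens, spec = np.asarray(sens), np.asarray(spec)
    ok = sens >= min_sens
    if not ok.any():
        # Fallback (our assumption): return the most sensitive setting.
        return param_sets[int(sens.argmax())]
    idx = np.flatnonzero(ok)
    return param_sets[int(idx[spec[idx].argmax()])]
```

Constraining sensitivity first reflects the triaging goal: missing a relevant region is costlier than flagging an irrelevant one.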
Validation using an unseen set: To compare the generalization performance of the methods, the median of the optimized parameters found over all 8 folds of the cross-validation was used to train an overall model using all training images. For the semi-supervised methods, one or more patients were randomly chosen to form the labeled set and the labels of the rest of the patients were kept hidden. For the supervised method, one or more patients' data were likewise chosen to form the labeled set and the rest of the patients' data were discarded. The overall performance of the trained model was assessed using the two unseen patient cases in the test set. To match our models trained on 5X magnified images, the test image patches, which were scanned at 10X magnification, were downsampled.
Nuclei Figure Classification Dataset
The aim of this experiment was to see whether adding many unlabeled instances to an already large set of labeled instances improved the classification performance when comparing an SSL technique with a supervised learning method. A cascaded learning approach was used to first train a classifier to distinguish between lymphocyte versus epithelial figures (L vs. BM) and then to distinguish between benign versus malignant classes (B vs. M).
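The cascaded scheme reduces the three-class problem to two binary stages; a trivial sketch, with `stage1` and `stage2` standing in for the trained L-vs-BM and B-vs-M classifiers:

```python
def cascade_classify(stage1, stage2, samples):
    """Two-stage cascade: stage1 separates lymphocyte (L) from epithelial
    (BM) figures; stage2 separates benign (B) from malignant (M) figures,
    and is applied only to the epithelial output of stage1."""
    out = []
    for x in samples:
        if stage1(x) == "L":
            out.append("L")          # stage 2 never sees lymphocytes
        else:
            out.append(stage2(x))    # "B" or "M"
    return out
```

A design consequence of the cascade is that stage-1 errors are unrecoverable: a lymphocyte misrouted to stage 2 can only come out as B or M.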
The supervised SVM was trained using n = 13,821 labeled nuclei figures (n = 2,260 lymphocyte, n = 3,157 benign epithelial, and n = 8,404 malignant epithelial figures). Both the labeled figures and an additional n = 49,000 unlabeled figures were used by the semi-supervised methods.
For both supervised and semisupervised training, a 5fold crossvalidation was performed to assess the performance of the learning methods. In this experiment, the best parameters were chosen in such a way as to maximize the accuracy.
Once the best parameters had been selected a final model was trained on the whole training dataset using the median of the best parameters in all 5 folds and this was applied to an unseen test set of n = 7,958 nuclei figures to evaluate the generalizability of the trained models.
Results
Comparing Classification Performances
The mean accuracies of the subject-wise cross-validated experiments are shown in Fig. 5 for different numbers of patients chosen to be the labeled set from the pathology triaging image dataset. As can be seen from Fig. 5, our clustering-based SSL technique (S^{3}DB + SVM) achieved superior performance compared to the other state-of-the-art supervised and semi-supervised methods.
Table 2 summarizes the cross-validated performance of the different methods on the triaging image dataset at an operating point of 95% sensitivity. A pairwise Wilcoxon signed-rank test using the 8 cross-validated accuracy values was used to compare each method, at a given number of labeled patients, to our proposed SSL technique. For each method tested we had 6 comparisons; therefore, for a two-tailed test with a 5% type I error, the Bonferroni-adjusted α-value = 0.004. Although none of the comparisons achieved statistical significance against this adjusted threshold, there is an increasing trend in the classification performance of our proposed method compared to the other techniques. It is clear from the specificity column that our method maintained higher specificity values at 95% sensitivity than the other methods in all individual experiments except when the number of labeled patients was 1. It is also interesting to note that the average training time of our method is significantly lower than that of the TSVM technique, which requires a heavy optimization, on a 64-bit Intel(R) Xeon(R) CPU (at 3.50 GHz) machine. The supervised SVM method, however, took the least time to train a model using the labeled data made available to it. Table 3 summarizes the performance of the four methods on a totally unseen test set, using an overall model trained in the first validation phase with the number of labeled patients set to 3. The test set consisted of two patient cases scanned at 10X magnification. It is clear that our method consistently performs better in classifying image patches than the other supervised and semi-supervised techniques on this totally unseen test set.
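The comparison statistic and the Bonferroni adjustment can be sketched as follows (the W statistic only; converting W to a p-value would use its null distribution, e.g. via scipy.stats.wilcoxon):

```python
import numpy as np

def wilcoxon_signed_rank_stat(a, b):
    """Signed-rank statistic W (sum of ranks of positive differences) for
    paired samples; zero differences are dropped, ties get average ranks."""
    d = np.asarray(a, float) - np.asarray(b, float)
    d = d[d != 0]
    order = np.argsort(np.abs(d))
    ranks = np.empty(len(d))
    ranks[order] = np.arange(1, len(d) + 1)
    # Average the ranks of tied absolute differences.
    for v in np.unique(np.abs(d)):
        mask = np.abs(d) == v
        ranks[mask] = ranks[mask].mean()
    return ranks[d > 0].sum()

# 6 comparisons per method, two-tailed, 5% type I error:
# 0.05 / (6 * 2) = 0.0042, i.e. the 0.004 threshold quoted in the text.
bonferroni_alpha = 0.05 / (6 * 2)
```

With only 8 paired accuracy values per comparison, the test has limited power, which is consistent with the large but non-significant differences reported in Table 2.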
Table 4 summarizes the mean performance of the 5-fold cross-validation on n = 13,821 labeled nuclei figures combined with n = 49,000 unlabeled objects using our proposed SSL technique, compared with the supervised SVM method trained on the labeled portion only. No statistically significant difference in performance was observed between the accuracy pairs of the supervised SVM and our proposed SSL method in Table 4 using a pairwise Wilcoxon signed-rank test.
Table 5 summarizes the performance of applying the models generated from both the supervised and semi-supervised methods, trained on the training data, to an independent test set of n = 7,958 nuclei figures. From Tables 4 and 5 it is clear that the proposed SSL method was not well suited to this dataset, and its performance is poor compared to the supervised SVM method.
Comparing Cluster Separability Measures
In order to get a sense of how separable the clusters of each class are with respect to each other, FDR measures are summarized in Table 6. From Table 6 we can see that where the FDR measure is high (relevant vs. irrelevant, and L vs. BM) the classes in each dataset tend to form separable clusters, while in the case of B vs. M, where separable clusters are not formed, the FDR measure is low. Furthermore, in order to visually compare the distribution of the different class labels in the feature spaces of both datasets, their dimensions were reduced using the t-SNE method^{43}. Figure 6 shows the dimensionality-reduced data space of the triaging image dataset, with every point representing an image patch. Similarly, Fig. 7 shows the data space of the nuclei figure classification dataset, with every point representing a nucleus figure. From Fig. 6 it can be seen that the relevant and irrelevant classes form separable clusters in the feature space, while from Fig. 7 it can be observed that the lymphocyte class is better separated than the other two classes. Comparing the benign and malignant epithelial classes in the same figure, we see that they do not tend to form separable clusters of points, thus violating the cluster assumption of SSL.
There is a slight imbalance between the classes in both the triaging image dataset and the nuclei figure dataset. To determine whether this affected performance, we repeated the training and testing with class-weighted SVM models: the weight value for the minority class was set to the ratio of the number of data samples in the majority class to that in the minority class. The classification performance was found to be consistent for both datasets, suggesting that the class imbalance did not affect the classification performance.
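The weighting rule maps directly to code; the returned dictionary is in the spirit of per-class weights such as libsvm's `-wi` options, though the exact interface used is not specified in the text:

```python
import numpy as np

def minority_class_weights(y):
    """Per-class weights: each class is weighted by the ratio of the
    majority-class count to its own count, so the majority class gets 1.0
    and the minority class is up-weighted (as done for the SVM models)."""
    classes, counts = np.unique(y, return_counts=True)
    majority = counts.max()
    return {int(c): float(majority) / float(n) for c, n in zip(classes, counts)}
```

This rescales the misclassification penalty C per class, so errors on the rarer class cost proportionally more during training.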
Discussion and Conclusion
In this study, we proposed a cluster-then-label semi-supervised technique that finds the underlying structure of the data space and uses this knowledge to train a reliable model. We compared and validated this technique against other state-of-the-art supervised and semi-supervised methods for triaging breast digital pathology image patches and classifying nuclei figures. We found that when the method is used on an appropriate dataset, its classification performance is superior and its training time is much lower than those of the other semi-supervised methods.
Our proposed method did not perform as well as TSVM when only one patient's data was made available as the labeled set for the triaging image dataset (Table 2). This is most likely due to a failure of the clustering method, because an insufficient number of labeled points were available to it; the method improved as the number of labeled points increased. Surprisingly, for the triaging image dataset, the ssFCM method did not improve classification performance over the supervised SVM. This may be because ssFCM assumes that the underlying clusters come from a Gaussian distribution; when cluster shapes are not Gaussian, this leads to incorrect label assignment, which in turn produces an incorrect decision boundary. Furthermore, one reason our method outperforms the other cluster-then-label techniques is that it makes no assumptions about the underlying probability distribution of the clusters and can therefore cope with clusters of any shape and form.
Although the improvements in accuracy of our proposed method over the other techniques in Table 2 were not statistically significant after applying a Bonferroni correction, the effect size was large, with improvements of 20.4% and 25.3% in accuracy over the supervised SVM and ssFCM methods respectively. The improvement in accuracy over the TSVM method was about 4%, but the TSVM was very computationally expensive, with each model taking more than 4 days to train.
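The Bonferroni correction used here simply compares each raw p-value against α divided by the number of comparisons; a minimal sketch (the p-values and comparison count below are hypothetical, not the study's):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each raw p-value as significant if it is below alpha / m for m comparisons."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# Hypothetical raw p-values for three pairwise method comparisons
print(bonferroni_significant([0.02, 0.004, 0.2]))  # [False, True, False]
```

Note that a comparison can show a large effect size yet fail this corrected threshold, which is the situation described above.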
Looking at Tables 4 and 5 for the nuclei classification task, our method performs poorly compared with the supervised SVM. The reason could be that the underlying structure of the data points in this dataset does not form proper clusters. This is supported by the FDR measures reported in Table 6 and the t-SNE plot in Fig. 7. As shown in Table 6, the FDR measure for the relevant vs. irrelevant data is higher than those for the L vs. BM and B vs. M data. Furthermore, from Figs. 6 and 7 we can observe that relevant vs. irrelevant and L vs. BM tend to form clusters of points in the dimensionality-reduced t-SNE plots, while B vs. M does not form detectable clusters, thus violating the cluster assumption of SSL. It is also important to note that semi-supervised learning methods are traditionally suited to applications where only limited labeled data are available; SSL methods may not work as well as supervised methods when large amounts of labeled data are present^{1}.
In our preliminary experiments^{30}, we systematically examined, on a subset of our dataset, the effect of k, which controls the number of points that lie within the neighborhood of a labeled point. We found that the performance of our method was stable when a sufficiently large value was chosen for k, with the best performance achieved at k equal to one tenth of the number of points in the dataset.
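As a hypothetical illustration of this heuristic, using scikit-learn's `NearestNeighbors` as a stand-in for the neighborhood query in the clustering step (the feature dimensions and dataset size are invented):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))  # 500 unlabeled feature vectors

k = max(1, len(X) // 10)        # heuristic from the experiments: k = n / 10
nn = NearestNeighbors(n_neighbors=k).fit(X)
dist, idx = nn.kneighbors(X[:1])  # k-neighborhood of the first (labeled) point
print(k, idx.shape)               # 50 (1, 50)
```

Because k scales with the dataset, the neighborhood stays large enough for stable density estimates as more unlabeled data are added.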
Although TSVM is one of the top-performing implementations of semi-supervised SVM, it was not found to outperform the Branch and Bound (BB) technique on small synthetic datasets^{21,23}. The BB method finds the globally optimal solution for semi-supervised learning by efficiently searching through all label combinations in the data space. However, because its search tree grows with the data, its training time is reported to be even slower than that of TSVM, making it infeasible to apply to datasets with more than 200 data points^{21}.
Our proposed semi-supervised cluster-then-label method showed improved performance over other methods on the triaging task; however, it did not perform well on the nuclei classification task. This suggests that although semi-supervised approaches may be useful in digital pathology, where generating sufficiently large labeled datasets is a challenge, additional work is needed to determine whether the clustering assumptions hold for a specific task.
References
Chapelle, O. & Schölkopf, B. Semi-Supervised Learning (The MIT Press, 2006).
Helmi, H., Teck, D., Lai, C. & Garibaldi, J. M. Semi-Supervised Techniques in Breast Cancer Classification. In 12th Annual Workshop on Computational Intelligence (UKCI) (2012).
Shi, M. & Zhang, B. Semi-supervised learning improves gene expression-based prediction of cancer recurrence. Bioinformatics (Oxford, England) 27, 3017–23, https://doi.org/10.1093/bioinformatics/btr502 (2011).
Batmanghelich, K., Ye, D. H., Pohl, K. & Taskar, B. Disease Classification and Prediction via Semi-supervised Dimensionality Reduction. In International Symposium on Biomedical Imaging: From Nano to Macro, 1086–1090 (2011).
Moradi, E., Gaser, C., Huttunen, H. & Tohka, J. MRI based dementia classification using semi-supervised learning and domain adaptation. In MICCAI 2014 Workshop Proceedings, Challenge on Computer-Aided Diagnosis of Dementia, based on Structural MRI Data (2014).
Chapelle, O. & Zien, A. Semi-Supervised Classification by Low Density Separation. In Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS 2005) (2005).
Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. In ACL ’95 Proceedings of the 33rd annual meeting on Association for Computational Linguistics, 189–196 (1995).
Rosenberg, C., Hebert, M. & Schneiderman, H. Semi-Supervised Self-Training of Object Detection Models. In Seventh IEEE Workshops on Applications of Computer Vision (WACV/MOTION'05), vol. 1, 29–36 (IEEE, 2005).
McClosky, D., Charniak, E. & Johnson, M. Effective self-training for parsing. In HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, 152–159 (2006).
Tanha, J., van Someren, M. & Afsarmanesh, H. Semi-supervised self-training for decision tree classifiers. Int. J. Mach. Learn. Cybern., https://doi.org/10.1007/s13042-015-0328-7 (2015).
Callison-Burch, C., Talbot, D. & Osborne, M. Statistical machine translation with word- and sentence-aligned parallel corpora. In Proceedings of the ACL, 175–182 (2004).
Fujino, A., Ueda, N. & Saito, K. Semi-supervised Learning for a Hybrid Generative/Discriminative Classifier based on the Maximum Entropy Principle. IEEE Transactions on Pattern Analysis and Mach. Intell. 30, 424–437 (2008).
Nigam, K. & Ghani, R. Analyzing the effectiveness and applicability of co-training. In Proceedings of the ninth international conference on Information and knowledge management, 86–93 (2000).
He, J., Carbonell, J. & Liu, Y. Graph-Based Semi-Supervised Learning as a Generative Model. In IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence, 2492–2497 (2007).
Talukdar, P. P. & Pereira, F. Experiments in Graph-based Semi-Supervised Learning Methods for Class-Instance Acquisition. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 1473–1481 (2010).
Liu, W., Wang, J. & Chang, S.-F. Robust and Scalable Semi-supervised Learning. Proc. IEEE 100, 2624–2638 (2012).
Chang, K. C.-C. & Lauw, H. W. Graph-based Semi-supervised Learning: Realizing Pointwise Smoothness Probabilistically. In Proceedings of the 31st International Conference on Machine Learning, vol. 32 (2014).
Fernández-Delgado, M. & Cernadas, E. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? J. Mach. Learn. Res. 15, 3133–3181 (2014).
Joachims, T. Transductive Inference for Text Classification using Support Vector Machines. In Proceedings of the International Conference on Machine Learning (ICML) (1999).
Yuille, A. L. & Rangarajan, A. The Concave-Convex Procedure (CCCP). Neural Comput. 15, 915–936 (2003).
Chapelle, O., Sindhwani, V. & Keerthi, S. Branch and Bound for Semi-Supervised Support Vector Machines. In Advances in neural information processing systems (NIPS) (2006).
Chapelle, O. & Zien, A. A Continuation Method for Semi-Supervised SVMs. In International Conference on Machine Learning (2006).
Chapelle, O., Sindhwani, V. & Keerthi, S. Optimization Techniques for Semi-Supervised Support Vector Machines. J. Mach. Learn. Res. 9, 203–233 (2008).
Chapelle, O., Weston, J. & Scholkopf, B. Cluster Kernels for Semi-Supervised Learning. In Advances in Neural Information Processing Systems 15, 601–608 (2003).
Weston, J. et al. Semi-supervised protein classification using cluster kernels. Bioinformatics (Oxford, England) 21, 3241–7, https://doi.org/10.1093/bioinformatics/bti497 (2005).
Dara, R., Kremer, S. & Stacey, D. Clustering unlabeled data with SOMs improves classification of labeled real-world data. In International Joint Conference on Neural Networks, 2002. IJCNN '02. Proceedings of the 2002, 2237–2242 (2002).
Gan, H., Sang, N., Huang, R., Tong, X. & Dan, Z. Using clustering analysis to improve semi-supervised classification. Neurocomputing 101, 290–298, https://doi.org/10.1016/j.neucom.2012.08.020 (2013).
Goldberg, A. B. New Directions in Semi-supervised Learning. Ph.D. thesis, University of Wisconsin-Madison (2010).
Peikari, M., Gangeh, M., Zubovits, J., Clarke, G. & Martel, A. Triaging Diagnostically Relevant Regions from Pathology Whole Slides of Breast Cancer: A Texture Based Approach. IEEE Transactions on Med. Imaging (2015).
Peikari, M., Zubovits, J. T., Clarke, G. M. & Martel, A. L. Clustering Analysis for Semi-supervised Learning Improves Classification Performance of Digital Pathology. In Machine Learning in Medical Imaging - 6th International Workshop MLMI 2015, Held in Conjunction with MICCAI 2015, Munich, Germany, October 5, 2015, Proceedings, 263–270 (2015).
Ankerst, M., Breunig, M. M. & Kriegel, H.-P. OPTICS: Ordering Points To Identify the Clustering Structure. In SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data, 49–60 (1999).
Clarke, G. M. et al. Increasing specimen coverage using digital whole-mount breast pathology: implementation, clinical feasibility and application in research. Comput. Medical Imaging Graphics: Official Journal Comput. Medical Imaging Soc. 35, 531–41 (2011).
Geusebroek, J.-M., Smeulders, A. W. M. & van de Weijer, J. Fast anisotropic Gauss filtering. IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society 12, 938–43, https://doi.org/10.1109/TIP.2003.812429 (2003).
Varma, M. & Zisserman, A. A Statistical Approach to Texture Classification from Single Images. Int. Journal Computer Vision 62, 61–81 (2005).
Chang, C.-C. & Lin, C.-J. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems Technology 2, 27:1–27:27 (2011).
Peikari, M., Salama, S., Nofech-Mozes, S. & Martel, A. L. Automatic Cellularity Assessment from Post-treated Breast Surgical Specimens. Cytometry A (in press), 1–30, https://doi.org/10.1002/cyto.a.23244 (2017).
Martel, A. L., Hosseinzadeh, D., Senaras, C., Madabhushi, A. & Gurcan, M. N. An Image Analysis Resource for Cancer Research: PIIP—Pathology Image Informatics Platform for Visualization, Analysis, and Management. Cancer Res. 77, e83–e87 (2017).
Peikari, M. & Martel, A. L. Automatic cell detection and segmentation from H and E stained pathology slides using color-space decorrelation stretching. In SPIE Medical Imaging (2016).
Wienert, S. et al. Detection and segmentation of cell nuclei in virtual microscopy images: a minimum-model approach. Sci. Reports 2, 503, https://doi.org/10.1038/srep00503 (2012).
Gan, H. et al. Discussion of FCM algorithm with partial supervision. In Proceedings of the Eighth International Symposium on Distributed Computing and Applications to Business, Engineering and Science, 27–31 (2009).
Lai, D. T. C. & Garibaldi, J. M. A Preliminary Study on Automatic Breast Cancer Data Classification using Semi-supervised Fuzzy c-Means. Int. J. Biomed. Eng. Technol. SI: MEDSIP 2012 Inf. Process. 13, 303–322 (2013).
Joachims, T. Making Large Scale SVM Learning Practical. In Support Vector Learning, 169–184 (1999).
van der Maaten, L. Accelerating t-SNE using Tree-Based Algorithms. J. Mach. Learn. Res. 15, 1–21 (2014).
Theodoridis, S. & Koutroumbas, K. Pattern Recognition (Academic Press, New York, 1998).
Lin, T.-H., Li, H.-T. & Tsai, K.-C. Implementing the Fisher's discriminant ratio in a k-means clustering algorithm for feature selection and data set trimming. J. Chemical Information and Computer Sciences 44, 76–87 (2004).
Wang, S., Li, D., Song, X., Wei, Y. & Li, H. A feature selection method based on improved Fisher's discriminant ratio for text sentiment classification. Expert Syst. Appl. 38, 8696–8702 (2011).
Acknowledgements
This research was funded by the Canadian Cancer Society (grant #703006). We would like to thank Dr. Gina Clarke for providing the triaging image dataset used in our experiments. We would also like to sincerely thank Dr. Judit Zubovits for her kind support in reviewing the triaging image dataset used to train and validate the methods presented in this work.
Author information
Contributions
Experimental design, analysis of data, and assembling manuscript figures: M.P. Annotating pathology slides: S.N., S.S. Supervision of project: A.L.M. Manuscript writing: M.P., A.L.M. All authors reviewed manuscript.
Ethics declarations
Competing Interests
The authors declare no competing interests.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Peikari, M., Salama, S., Nofech-Mozes, S. et al. A Cluster-then-label Semi-supervised Learning Approach for Pathology Image Classification. Sci Rep 8, 7193 (2018). https://doi.org/10.1038/s41598-018-24876-0