VOLTA: an enVironment-aware cOntrastive ceLl represenTation leArning for histopathology

In clinical oncology, many diagnostic tasks rely on the identification of cells in histopathology images. While supervised machine learning techniques require labels, providing manual cell annotations is time-consuming. In this paper, we propose a self-supervised framework (enVironment-aware cOntrastive cell represenTation learning: VOLTA) for cell representation learning in histopathology images using a technique that accounts for the cell’s mutual relationship with its environment. We subject our model to extensive experiments on data collected from multiple institutions comprising over 800,000 cells and six cancer types. To showcase the potential of our proposed framework, we apply VOLTA to ovarian and endometrial cancers and demonstrate that our cell representations can be utilized to identify the known histotypes of ovarian cancer and provide insights that link histopathology and molecular subtypes of endometrial cancer. Unlike supervised models, our framework can empower discoveries without any annotated data, even in situations where sample sizes are limited.


Introduction
Cells located within the micro-environment of a tumor have a prominent impact on its developmental process [1][2][3][4][5]. Variations in the micro-environment can influence the epigenetic profiles within the tumor and the heterogeneity in the associated gene expression profiles [6]. Various cell types reside in the tumor microenvironment, and growing evidence suggests that this intratumoral heterogeneity contributes substantially to the therapeutic resistance of the tumor [6,7]. Several studies have shown that higher levels of intratumoral heterogeneity are strongly associated with poor outcomes in lung, ovarian, head and neck, and pancreatic cancers, as such heterogeneity implies that the tumor is more likely to harbor a rare pre-existing resistant subclone [6][8][9][10]. Furthermore, the spatial distribution of immune cells within the tumor microenvironment has a significant impact on prognosis and therapeutic response [4][11][12][13][14]. Therefore, the identification of individual cells within the tumor microenvironment is a vital step for tumor characterization in many complex tasks such as tissue classification, cancer diagnosis, subtyping, and histological grading [15][16][17][18].
The visual assessment of Hematoxylin & Eosin (H&E)-stained tissue slides under the microscope is the conventional and most widely used approach to tumor characterization and cell identification. However, manual cell identification is cumbersome because of both the time needed to assess large numbers of cells (tens of thousands in a single slide) and the intra- and inter-observer variability [19]. Machine learning and deep learning models, coupled with the digitization of pathological material, offer opportunities for computer-aided cell identification [20][21][22]. Despite the long history of machine learning research in cell classification using handcrafted features [23][24][25], significant improvements have been reported by employing deep learning-based models alone [21].
Even though these models can potentially reduce the manual workload of cell identification, they require a large number of cell-level annotations for training. This annotation process still relies on labor-intensive manual examination of the tissue by pathologists. Furthermore, to apply these models to a new tissue type, the data collection and labeling process has to be carried out again. To address this issue, a number of studies have utilized unsupervised approaches for cell representation learning and clustering. Hu et al. [26] adopt InfoGAN [27] to train an implicit classifier, and in another attempt, Vunuu et al. [28] use a deep convolutional auto-encoder (DCAE) to learn the embeddings of cells. However, these studies focus on only one tissue type and also ignore the surrounding environment of a cell, even though many studies have shown that cells are directly impacted by their environment [29][30][31]. The former jeopardizes the generalization of the models from one tissue to another, while the latter can potentially reduce the accuracy of the model.
Recently, self-supervised learning (SSL) techniques have emerged as an important step towards generalizable representation learning. SSL is a technique developed for image representation learning that guides the training of the model by using the augmentations of an image as the label for that image. The utility of this technique has been investigated on different tasks in the natural image domain, where Caron et al. [32] demonstrate its capability in object classification and Sohn et al. [33] show its efficacy in object detection. Although a few studies [34,35] examine the utility of self-supervised methods in patch-level classification, the potential of self-supervised techniques for labeling individual cells (rather than just classifying image patches) has been largely ignored. Furthermore, it is difficult to link the findings of these studies to the biological mechanisms in the tumor micro-environment, as they cannot identify individual cells.
In this paper, we propose a self-supervised framework for cell representation learning in histopathology images by introducing a novel technique to incorporate the mutual relationship between the cell and its environment for improved cell representation. We benchmarked our model on data representing more than 700,000 cells in four cancer histotypes with three to six cell types in each dataset. Results confirm the superiority of our model in memory-efficient cell type representation compared to the state-of-the-art. We further utilized the proposed model in the context of ovarian and endometrial cancers and demonstrated that our cell representations, without any human annotations, can be utilized to identify the known histotypes of ovarian cancer as well as to provide novel insights that link histopathology and molecular subtypes of endometrial cancer. This framework consists of two main blocks: 1) Cell Block; 2) Environment Block. The Cell Block learns the cell embeddings by contrasting individual cell-level images, while the Environment Block incorporates environment-level information into the cell representations.

Cell Block
The architectural design of the Cell Block is similar to our previously proposed model [36], which has shown promising performance in cell representation learning tasks. In this block, cell embeddings are learned by pulling the embeddings of two augmentations of the same image together, while the embeddings of other images are pushed away. Let $X = \{x_i \mid 1 \le i \le N\}$ be the input batch of cell images, where $N$ is the number of images in the batch. Each $x_i$ is a small crop of the H&E image around a cell, chosen so that it includes only that specific cell. Two different sets of augmentations are applied to $X$ to generate $\{q_i \mid 1 \le i \le N\}$ and $\{k_j \mid 1 \le j \le N\}$. We call these sets query and key, respectively; $q_i$ and $k_j$ are augmentations of the same image if and only if $i = j$. The query batch is encoded using a backbone model, a neural network of choice, while the keys are encoded using a momentum encoder, which has the same architecture as the backbone. This momentum encoder is updated using Equation (1), in which $\theta_k^t$ is the parameter of the momentum encoder at time $t$, $m$ is the momentum factor, and $\theta_q^t$ is the parameter of the backbone at time $t$:

$$\theta_k^t = m\,\theta_k^{t-1} + (1 - m)\,\theta_q^t \qquad (1)$$

Subsequently, the obtained query and key representations are passed through separate Multi-Layer Perceptron (MLP) layers called projection heads. The query projection head is trained directly, whereas the key projection head is updated with momentum using the weights of the query projection head. We restrict these layers to be 2-layer MLPs with an input size of 512, a hidden size of 128, and an output size of 64. In addition to the projection head, we use an extra MLP on the query side of the framework, called the prediction head. This network is a 2-layer MLP with input, hidden, and output sizes of 64, 32, and 64, respectively. Similar to the last fully-connected layers of a conventional classification network, the projection and prediction heads provide more representation power to the model.
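The momentum update of the key-side networks can be illustrated with a minimal sketch. It treats parameters as plain lists of floats rather than network tensors, and the function name `momentum_update` is hypothetical; it is not taken from the authors' code.

```python
def momentum_update(key_params, query_params, m=0.999):
    """Exponential moving average: new_key = m * key + (1 - m) * query.

    key_params   -- current parameters of the momentum (key) encoder
    query_params -- current parameters of the backbone (query) encoder
    m            -- momentum factor (0.999 is the value reported in the paper)
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


# Toy example with two scalar "parameters" and a small momentum for visibility.
key = momentum_update([0.0, 1.0], [1.0, 1.0], m=0.9)
```

With a momentum factor close to 1, the key-side parameters change slowly, which is why the momentum encoder behaves like a running average of the backbone over training.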
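The networks of the Cell Block are trained using the InfoNCE loss [37], which is shown in Equation (2):

$$\mathcal{L}_{\text{cell}} = -\frac{1}{|X|} \sum_{i=1}^{|X|} \log \frac{\exp\left(f_q(q_i) \cdot f_k(k_i) / \tau\right)}{\exp\left(f_q(q_i) \cdot f_k(k_i) / \tau\right) + \sum_{j=1}^{Q} \exp\left(f_q(q_i) \cdot f_k(k_j) / \tau\right)} \qquad (2)$$

In this equation, $\tau$ is the temperature that controls the sharpness of the distribution, $|\cdot|$ is the cardinality, $Q$ is the number of items stored in the queue from the key branch (the sum over $j$ runs over these queued keys), $f_q$ is the equivalent function for the combination of the backbone, query projection head, and query prediction head, and $f_k$ is the equivalent function for the momentum encoder and the key projection head.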
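A minimal sketch of an InfoNCE computation in this queue-based setting is given below. It assumes that each query has exactly one positive key, that every queued key is a negative, and that all vectors are already L2-normalized; the function names `dot` and `info_nce` are illustrative rather than the authors' implementation.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(queries, keys, queue, tau=0.07):
    """Mean InfoNCE loss over a batch.

    queries[i] and keys[i] form the positive pair; every vector in
    `queue` is treated as a negative for every query.
    """
    total = 0.0
    for q, k_pos in zip(queries, keys):
        logits = [dot(q, k_pos) / tau] + [dot(q, k_neg) / tau for k_neg in queue]
        # Log-sum-exp with max subtraction for numerical stability.
        mx = max(logits)
        log_denominator = mx + math.log(sum(math.exp(l - mx) for l in logits))
        total += log_denominator - logits[0]
    return total / len(queries)


# A query aligned with its key gives a loss near zero ...
low = info_nce([[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]])
# ... while a query aligned with a negative gives a large loss.
high = info_nce([[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 0.0]])
```

A lower temperature sharpens the softmax, so the penalty for ranking a negative above the positive grows as `tau` shrinks.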
The augmentation pipelines include cropping, color jitter (brightness of 0.4, contrast of 0.4, saturation of 0.4, and hue of 0.1), gray-scale conversion, Gaussian blur (with a random sigma between 0.1 and 2.0), horizontal and vertical flips, and rotation (randomly selected between 0 and 180 degrees). To ensure that the model always sees the whole cell on one side, we remove the cropping operation from one of the pipelines. Therefore, the pipeline with cropping generates local regions of the cell image, while the images generated by the other augmentation pipeline are global, containing the whole-cell view.
Cell embeddings are generated from the trained momentum encoder at inference time and are clustered by applying the K-means algorithm. One can use either the encoder or the momentum encoder for embedding generation; however, the momentum encoder provides more robust representations since it aggregates the learned weights of the encoder network from all of the training steps (an ensembled version of the encoder throughout training) [32].

Environment Block
Many studies have shown that the tumor microenvironment (TME) plays an important role in tumor progression [31,38]. Motivated by these findings, we ask: should the representation of a cell reflect its environment as well? Inspired by this question, we hypothesize that a deeper knowledge of the environment leads to a better general understanding of the cell. In a mathematical formulation, this hypothesis is equivalent to the assumption that there exists mutual information between cells and their environment. Therefore, to validate this hypothesis, we propose to increase the mutual information between the corresponding cell and environment representations during the training process. Previous studies [39] have shown that minimizing the InfoNCE loss maximizes a lower bound on the mutual information between different views of the image. Thus, we use this loss function to achieve the aforementioned target by performing cross-modal contrastive learning as an auxiliary task.
Let $E = \{e_i \mid 1 \le i \le N\}$ be the corresponding environment patches of the cells represented by $X$. Here, we refer to the environment as a large region around a cell that includes the surrounding tissue and cells. Therefore, for all $i \in \{1, 2, \ldots, N\}$, $x_i$ and $e_i$ are centered on the same cell (however, for the cases where the cells are located on the edge of the patch, we limit the patch border to the border of the image). After applying an augmentation pipeline, the environment patches are passed through an encoder network, called the environment encoder. Simultaneously, we apply a new projection head, the environment projection head, to the cell representations obtained from the query backbone in the Cell Block. Finally, the Environment Block is trained on these two sets of representations (environment and cell) using Equation (3). The final loss of the whole framework can then be written as Equation (4), in which $\lambda$ is a hyperparameter. Increasing the value of $\lambda$ prioritizes the mutual information of the cell with its environment over the consistency of the representation for different augmentations of the same cell. The augmentation pipeline of the Environment Block uses the same operations as that of the Cell Block except for cropping.
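Equations (3) and (4) are not reproduced in this text. As a hedged sketch only, one plausible form consistent with the description is given below. The symbols $g_c$ (backbone followed by the environment projection head, applied to the augmented cell image $q_i$), $g_e$ (environment encoder, applied to the augmented and masked patch $e_i$), $\mathcal{L}_{\text{env}}$, and $\mathcal{L}_{\text{cell}}$ are notation introduced here for illustration, and the use of in-batch negatives is an assumption rather than a detail stated in the source.

```latex
% Assumed form of the environment loss (cf. Equation (3)):
\mathcal{L}_{\text{env}} = -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp\left( g_c(q_i) \cdot g_e(e_i) / \tau \right)}
            {\sum_{j=1}^{N} \exp\left( g_c(q_i) \cdot g_e(e_j) / \tau \right)}

% Assumed form of the total loss (cf. Equation (4)):
\mathcal{L} = \mathcal{L}_{\text{cell}} + \lambda \, \mathcal{L}_{\text{env}}
```

Under this form, setting $\lambda = 0$ recovers training with the Cell Block alone, which matches the statement that larger $\lambda$ gives more weight to the cell-environment term.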
To prevent the model from focusing on the overlapping regions between the corresponding cell and environment images (a so-called shortcut [40], meaning that the model uses undesired features to solve the problem), we mask the target cell in the environment patch. Furthermore, the rest of the cells in the environment patch are also masked to ensure that the model does not bias the representation of a cell toward the neighboring cell types. We investigate the effectiveness of the masking operation in the ablation study.
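The masking step can be sketched as follows, assuming that a binary segmentation mask covering every cell in the environment patch is available. The fill value of 0 and the function name `mask_cells` are assumptions; the source does not state what value replaces the masked pixels.

```python
def mask_cells(patch, cell_mask, fill=0):
    """Return a copy of `patch` with every cell pixel replaced by `fill`.

    patch     -- 2D list of pixel values (one channel, for simplicity)
    cell_mask -- 2D list of the same shape; truthy where any cell is present
    fill      -- replacement value for masked pixels (assumed, not from the paper)
    """
    return [
        [fill if is_cell else pixel for pixel, is_cell in zip(row, mask_row)]
        for row, mask_row in zip(patch, cell_mask)
    ]


patch = [[5, 6], [7, 8]]
cell_mask = [[1, 0], [0, 1]]
masked = mask_cells(patch, cell_mask)
```

Because both the target cell and its neighbors are masked, only the non-cellular tissue context remains available to the environment encoder.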

Data Preparation
The aforementioned datasets included patch-level images, while we required cell-level ones for the training of the model. To generate such data, we used the instance segmentation provided in each of the external datasets to find cells and crop a small box around them. However, for the Oracle and SarcCell datasets, the instance segmentation masks were generated by applying HoVer-Net [21] segmentation pre-trained on the PanNuke dataset.
An adaptive window size was used to extract cell images from the H&E slides. The window size was set to twice the size of the cell for the CoNSeP dataset, while it was equal to the size of the cell for the rest of the datasets. Finally, cell images were resized to 32 × 32 pixels and were normalized to zero mean and unit standard deviation before being fed into our proposed framework.
Ground-truth labels for the cells of the Oracle and SarcCell datasets were generated by finding the most expressed biomarker (by intensity and quantity) at the same position in the corresponding IHC image. To accommodate the potential noise associated with image registration, two post-processing steps were performed: 1) the size of the window in the IHC image was set to 5 times the window size in the H&E core (however, this scale was set to 1 for the SarcCell dataset due to more accurate co-registration); 2) the most expressed biomarker was considered as the label only if it accounted for at least 70% of the biomarker distribution in the IHC window.
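The normalization step can be sketched as below. The source does not say whether the mean and standard deviation are computed per image or over the whole dataset; this sketch assumes per-image statistics on a flattened list of pixel values, and the small `eps` term is an added safeguard against division by zero.

```python
import statistics


def standardize(pixels, eps=1e-8):
    """Shift and scale pixel values to zero mean and unit standard deviation.

    pixels -- flat list of pixel intensities for one cell image
    eps    -- small constant to avoid division by zero on constant images
    """
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)
    return [(p - mean) / (std + eps) for p in pixels]


normalized = standardize([10.0, 20.0, 30.0, 40.0])
```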
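The 70% rule in the second post-processing step can be sketched as follows. The way intensity and quantity are combined into a single score per biomarker is not specified in the source, so the sketch takes already aggregated scores as input; the biomarker names and the function name `assign_label` are hypothetical.

```python
def assign_label(biomarker_signal, min_fraction=0.7):
    """Return the dominant biomarker, or None if it is not dominant enough.

    biomarker_signal -- dict mapping biomarker name to its aggregated
                        expression inside the IHC window (aggregation assumed)
    min_fraction     -- required share of the total signal (0.7 in the paper)
    """
    total = sum(biomarker_signal.values())
    if total <= 0:
        return None
    name, value = max(biomarker_signal.items(), key=lambda kv: kv[1])
    return name if value / total >= min_fraction else None


clear_case = assign_label({"CD3": 8.0, "CD20": 2.0})      # dominant share is 0.8
ambiguous_case = assign_label({"CD3": 6.0, "CD20": 4.0})  # dominant share is 0.6
```

Cells whose dominant biomarker falls below the threshold receive no label in this sketch, which is one way to keep registration noise out of the ground truth.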

Implementation Details
The code was implemented in PyTorch (v1.9.0), and the model was run on one and two V100 GPUs for the settings with and without the environment, respectively. The batch size was set to 1024 (unless specified otherwise), the queue size to 65536, and a pre-activated ResNet18 [41] was used for the backbone and momentum encoder in the Cell Block. The LambdaNet model [42] was used as the environment encoder, as it extracts more informative patch representations using self-attention while keeping the computation and memory usage tractable. The stack was trained using the Adam optimizer for 500 epochs (unless specified otherwise) with a starting learning rate of 0.001, a cosine learning rate scheduler, and a weight decay of 0.0001. We also adopted a 10-epoch warm-up step. The momentum factor in the momentum encoders was 0.999, and the temperature was set to 0.07.
In the experiments of Table 1, the training epoch count and batch size of our models were set to 200 and 512 for the PanNuke Breast, Lizard, Oracle, and SarcCell datasets. Additionally, for the training of our model on the Oracle dataset, we used 15,000 randomly selected cells from the training set to reduce the training time.
In the self-supervised to supervised transfer learning step (cell classification), we adopted SGD (Stochastic Gradient Descent) with a starting learning rate of 0.001 and a cosine learning rate scheduler for 300 epochs with a batch size of 1024. The weight decay was set to 0.00001. When the encoder was allowed to be fine-tuned, we set the encoder's learning rate to 0.0001.
It is worth mentioning that for the cell classification of NuCLS, we followed the same super-class grouping as the original paper [22]. In this regard, we only used 3 super-classes out of 5 for cell type classification: tumor, stromal, and sTILs.
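A learning-rate schedule of this kind can be sketched as below. The source states only that a cosine scheduler and a 10-epoch warm-up were used; the linear shape of the warm-up and the decay to zero are assumptions, and the function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(epoch, base_lr=0.001, warmup_epochs=10, total_epochs=500):
    """Learning rate for a zero-based epoch index.

    Linear warm-up (assumed shape) for `warmup_epochs`, followed by
    cosine decay from `base_lr` towards zero over the remaining epochs.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(500)]
```

In this sketch the rate rises to its peak of 0.001 at the end of warm-up and then falls smoothly for the rest of training.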

Baselines
The performance was also compared against five baselines. The pre-trained ImageNet model used weights that were pre-trained on the ImageNet dataset to generate the cell embeddings. The Morphological Features approach [43] adopted morphological features to produce a 30-dimensional feature vector, consisting of geometrical and shape attributes. Prior to clustering, the feature vectors were normalized to zero mean and unit standard deviation, and their dimensionality was reduced to 2 using t-SNE. The third baseline was Manual Feature [26], which used a combination of Scale-Invariant Feature Transform (SIFT) and Local Binary Patterns (LBP) features to provide representations for the cells. As with the previous baseline, we standardized the computed feature vectors. Additionally, our baseline set included two state-of-the-art unsupervised deep learning models. More specifically, the Auto-Encoder baseline adopted a deep convolutional auto-encoder alongside a clustering layer to learn cell embeddings by performing an image reconstruction task [28]. Finally, the last baseline was GAN [26], which adopted the idea of InfoGAN [27] and developed a Generative Adversarial Network (GAN) for cell clustering by increasing the mutual information between the cell representation and a categorical noise vector.

Cell representation learning framework and benchmarking
Figure 1 depicts an overview of our proposed Environment-Aware Contrastive Cell Representation Learning framework (VOLTA). This framework consists of two major blocks. The cell block takes an image of a cell and applies two sets of augmentation operations to create visually distinct views of the cell. These two augmented images are then transformed into their respective representation vectors using a stack of deep neural networks, and given that these representations correspond to the same cell, the models are trained to minimize the distance between the two representations. The environment block of our proposed framework is utilized to increase the mutual information between the cell and a larger patch that captures the environment around it. It accomplishes this by using the InfoNCE loss function [37] to perform contrastive cross-modal learning between the cell representation and that of its environment. To prevent the model from biasing towards other cells appearing in the environment, we mask out these cells in the environment patch before feeding it to the model. Finally, the cell representations for downstream tasks such as cell clustering and classification can be obtained by using the backbone model trained in this setting.
We benchmarked these representations across multiple tasks and datasets. More specifically, seven public and private datasets representing 700,000 cells and four cancer types (colon, breast, and ovarian cancers and sarcomas) were utilized to evaluate the performance of the proposed cell representation model (Supplementary Table A1). Even though our model requires no labels for training, in all settings, we split the data into train and test sets and use the former for the training of the model.
We also conducted ablation studies on different components of our model to measure their effects on the performance (see Supplementary Section A.4). Our experiments suggest that the cell masking operation, whole- and local-view augmentations, memory storage, and momentum encoder provide noticeable performance improvements to our model.

Identification of distinct cell clusters by self-supervised cell representation learning
VOLTA provides cell representations from histopathology images, and such representations should be capable of differentiating between biologically distinct cell types. To test this hypothesis, we applied VOLTA to identify cell clusters in each dataset. In particular, after learning the cell representations in a self-supervised manner using VOLTA, we performed unsupervised clustering on the cell representations and examined the enrichment of the identified clusters with specific cell types. To show the utility of our approach, we compared the performance of VOLTA with the state-of-the-art morphology-based and deep learning-based models for cell representation. As shown in Table 1, our model outperformed all counterparts by a large margin across multiple clustering metrics (adjusted mutual information (AMI), adjusted Rand index (ARI), and Purity; see Supplementary Section A.2) in all datasets, with the exception of ARI on the Oracle dataset, where GAN performs better. In some of the datasets, it reached twice the performance of the best-performing baselines. More importantly, while the performance of the baseline models varies from one cancer to another, our model shows consistent results regardless of the cancer type. For instance, while the morphology-based representation method has the best performance compared to the other baselines on the NuCLS and PanNuke Breast cancer datasets, it has an inferior performance on PanNuke Colon and CoNSeP.
Figure 2 shows the Uniform Manifold Approximation and Projection (UMAP) representations of various cell types that were derived by VOLTA. The learned representations form distinct and separable cell populations, in agreement with the comparison metrics presented in Table 1. One can also observe that our model can differentiate between immune cells (T cells, B cells, and natural killer cells) and tumor cells in the Oracle dataset. Similarly, in the NuCLS dataset, our model is able to differentiate between the stromal tumor-infiltrating lymphocytes (sTILs) and the cancer cells. The same observations can be made in the PanNuke Colon and CoNSeP datasets, where various cell types such as epithelial and inflammatory cells are mapped onto different locations of the embedding space.
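Of the three clustering metrics, purity is the simplest to state in code. The sketch below uses the common definition (the share of samples that carry the majority ground-truth label of their cluster); the paper's exact definition is given in its Supplementary Section A.2, which is not reproduced here, so this should be read as an assumption.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Fraction of samples whose label is the majority label of their cluster."""
    labels_by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        labels_by_cluster[cluster].append(label)
    # For each cluster, count how many members share its most common label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in labels_by_cluster.values()
    )
    return majority_total / len(true_labels)


score = purity(
    [0, 0, 0, 1, 1, 1],
    ["tumor", "tumor", "immune", "immune", "immune", "immune"],
)
```

AMI and ARI are available in scikit-learn as `adjusted_mutual_info_score` and `adjusted_rand_score`; unlike purity, both are corrected for chance agreement.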

Supervised cell classification accuracy and efficiency improvement
We then aimed to assess the effectiveness of the proposed model in few-shot cell classification in a supervised machine learning setting where labeled samples were available. Specifically, we trained the model using our self-supervised framework and utilized the learned cell representations as inputs for training a simple Multi-Layer Perceptron (MLP) for cell classification. The performance of the trained model on the CoNSeP and NuCLS datasets across various settings is shown in Figure 3.
We also demonstrated the effectiveness of our self-supervised cell representation learning framework by using a subset of the labeled cell identities to train the MLP-based cell classifier. Our results showed that the proposed model achieved a reasonable performance with a small subset of the labeled training data (Supplementary Table A2). For instance, with only 0.1% of the training labels, our models achieved 62.7% and 72.6% Top-1 accuracy on the CoNSeP and NuCLS datasets, respectively, while a model that utilized the entire labeled dataset achieved 80.2% and 76.3%. Furthermore, as the number of training labels increased, the classification accuracy consistently improved to an extent that our model outperformed the results of the state-of-the-art HoVer-Net model [21] on the CoNSeP dataset, even with 70% of the training data. Notably, our proposed model has 60% fewer parameters than the HoVer-Net model (Supplementary Table A5). Our model reached an accuracy that was close to that of the Mask R-CNN model, which led to state-of-the-art results on the NuCLS dataset [22]. Given that the training and validation sets of this dataset are collected from different sites, we hypothesize that variations in staining and color profiles could lead to the over-fitting of the models to training data (see Section 3.4).

Self-supervised cell representation learning is robust to undesired color variations
Previous studies have shown that normalization and domain adaptation methods can enhance the performance of supervised models when the train and test datasets are collected from different sites. Therefore, we studied the effect of such methods on our proposed model in both the cell representation learning and the supervised cell classification settings. For this purpose, we used the Vahadane normalization method [44] on the NuCLS dataset, where the slides were stained and scanned in different institutions. Supplementary Table A4 illustrates the effect of the normalization in the self-supervised setting on the NuCLS dataset. Although patch and slide classification tasks can benefit from cross-institution stain normalization, we noticed that our self-supervised cell representation approach does not benefit much from color normalization strategies. This finding can be attributed to the strong augmentations that were utilized in our self-supervised model training. Moreover, we investigated the effect of color normalization in the supervised fine-tuning setting. Interestingly, although self-supervised clustering results were robust to stain normalization, the supervised fine-tuned model benefited from it to an extent that it outperformed the NuCLS model on this dataset (Supplementary Table A3). Notably, the normalization method was only applied to the test set, while the self-supervised model was still trained on the original data (i.e., without any normalization).

VOLTA as a building block for unsupervised cancer subtype identification
We sought to investigate the utility of our proposed self-supervised cell representation model as a building block for annotation-free cancer subtyping. Therefore, we put together a small H&E tissue microarray (TMA) cohort of 14 ovarian cancer cases comprising clear cell, endometrioid, high-grade serous, and low-grade serous ovarian carcinomas. Following the same procedure as described in Section 3.1, we utilized the cells extracted from these images to train our self-supervised model. Subsequently, TMA core images were divided into 400 × 400 pixel patches, our pre-trained VOLTA model was applied to each patch to derive cell representations, cells were divided into multiple clusters, and the number of cells within the predicted clusters was counted for each patch. To prevent biasing towards patches that appear frequently (e.g., stromal regions), we used a clustering method to split all of the patches into multiple distinct clusters based on the aforementioned cell distributions. Afterwards, we selected 100 patches from each cluster and used their cell distributions to predict the clustering label of the original TMA core (see Supplementary Figure A1). The results demonstrate that our model is capable of separating the epithelial ovarian cancer histotypes without a need for annotation or prior knowledge of the histotypes (Figure 4a). In particular, four major clusters, each enriched with one of the four histotypes, were identified, with only two cases grouped with other subtypes. These results suggest an 86% accuracy (12 out of 14 cases correctly grouped) in ovarian cancer subtyping, a finding that is in line with results reported in the literature [45]. To make this process more interpretable, we visualized the cell clusters on multiple patches, asked an expert to label each cluster, and combined the clusters with the same label (referred to as sub-clusters).
We observed that each of the cell clusters is typically enriched with a specific type of cell, demonstrating the capability of the model in capturing morphological differences between cell types (Figures A6 to A10). Supplementary Table A11 presents the cell distributions across the epithelial ovarian histotypes after combining sub-clusters, while Supplementary Figure A2 depicts the boxplot of the cell distributions before combination. Notably, we noticed that 5 identified sub-clusters corresponded to differences in tumor cell morphology between ovarian cancer histotypes. High-grade serous and clear cell tumors were relatively enriched for tumor cell clusters containing larger cells (tumor clusters 2, 4, and 5) compared to low-grade serous and endometrioid tumors (see Figures A3 and A4), consistent with the well-known high-grade nuclear histology of high-grade serous and clear cell carcinomas.
We next utilized VOLTA to demonstrate its application for exploratory cancer subtype discovery. To do so, we scanned 19 whole-section slide images (WSI) corresponding to three molecular subtypes of endometrial cancer (EC): DNA polymerase epsilon (POLE)-mutant cases, cases with mismatch repair deficiency (MMRd), and cases with p53 abnormality as assessed by immunohistochemistry. We then asked whether the proposed model could identify features in the H&E slides that would aid us in identifying molecular subtypes of EC. Following an approach similar to the one we took for the ovarian cancer cohort, we subjected EC WSI representations to clustering and identified three clusters of patients (Figure 4b). Interestingly, each of the three clusters was enriched with a specific molecular subtype of endometrial carcinoma. Similar to the procedure taken for the ovarian dataset, we also visualized the cell clusters within a patch for each of the EC subtypes (Supplementary Figures A11 to A13). Furthermore, the distribution of cell clusters across the predicted histotypes and molecular subtypes before and after sub-cluster combination is shown in Supplementary Table A12 and Figure A5, respectively. In line with recent findings, MMR-deficient tumors had the highest proportion of lymphocytes in the endometrial cancer dataset [46][47][48].
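The per-patch counting step can be sketched as follows. The sketch assumes that each cell in a patch has already been assigned an integer cluster index by clustering its VOLTA embedding; the function name `cluster_counts` is hypothetical, and the later patch-level clustering and core-level aggregation are not shown because the source does not give enough detail to reproduce them.

```python
from collections import Counter


def cluster_counts(cell_cluster_ids, n_clusters):
    """Count how many cells of one patch fall into each cell cluster.

    cell_cluster_ids -- cluster index (0 .. n_clusters - 1) of every cell in the patch
    n_clusters       -- total number of cell clusters
    """
    counts = Counter(cell_cluster_ids)
    return [counts[c] for c in range(n_clusters)]


# A patch with five cells spread over four possible clusters.
patch_vector = cluster_counts([0, 2, 2, 1, 2], n_clusters=4)
```

Each patch is thus summarized by a fixed-length vector, which is what allows patches, and in turn cores, to be compared and clustered without any labels.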

Discussion & Conclusion
In this paper, we proposed a novel self-supervised framework (VOLTA) for learning cell representations from annotation-free H&E images. Our investigations from multiple perspectives confirm the superiority of VOLTA over the state-of-the-art models. Specifically, we demonstrated that VOLTA significantly outperformed the state-of-the-art unsupervised morphology- and deep-learning-based cell clustering methods on seven datasets, four cancer types, and three to six cell type categories within each dataset. Such unsupervised learning of the cell representations introduces unique opportunities for discovery, prediction, and development purposes. For instance, as part of our experiments, we illustrated that VOLTA can be successfully used as a building block for cancer histotype clustering by applying it to 14 cases of ovarian and 19 cases of endometrial cancer, separately. This finding is interesting from two aspects: 1) even though our model does not receive any patient labels at training time, it is able to identify clusters of patients that are similar to pathologist diagnosis or molecular subtypes; 2) VOLTA is data efficient to the extent that it worked on two datasets with 10-20 patient samples, whereas a large dataset is usually a prerequisite for other deep learning models. We also demonstrated that these improvements are not exclusive to the unsupervised aspects of the model but also extend to a supervised setting. By using our pre-trained VOLTA as the initialization for a supervised model, we could achieve a performance equal to that of the state-of-the-art supervised model with as little as 10% of the labeled data, while it surpassed this performance with the full data. Additionally, our self-supervised model is robust to undesired staining biases, which facilitates the utilization of the model on datasets collected across different centers.
Our investigation has demonstrated the efficiency of VOLTA as a tool for cell discovery within multiple pathology pipelines. Because it is self-supervised, the model can be applied to the wealth of histopathology archives accessible from various clinical centers to enable the generation of extensive cell-level representation databases. Furthermore, the model has the potential to alleviate the laborious cell type labeling process by annotating cell clusters instead of individual cells, and it could be used in an interactive pathology pipeline. In addition to its utilization in cell type discovery, we have also demonstrated that the model can serve as a foundational element for both histotype and molecular subtype identification. This illustrates the wide-ranging potential of our model for discovery at multiple levels, from morphological features to molecular basis. These findings point to interesting directions for linking histopathology data to more advanced and in-depth areas such as genomic and molecular information.

The spatial distribution of cells within a tumor has been widely acknowledged to have a profound impact on the progression and prognosis of the disease. As demonstrated by Pogrebniak et al. [6], the bivariate analysis of immune and tumor cells can yield a wealth of information about the underlying biology of the disease. By utilizing metrics such as the Morisita-Horn index [49], Ripley's K function [50], and Intra-Tumor Lymphocyte Ratio (ITLR) [51], researchers have gained meaningful insights into the relationship between the spatial distribution of cells and clinical outcomes, identified immune-cancer hotspots, and predicted chemotherapy response [31,38,52]. Considering the crucial role of cell identification in these applications, our research has the potential to be instrumental in enabling such studies to be conducted at larger scales. This, in turn, can lead to a more profound understanding of the intricate correlation between disease phenotype and the spatial arrangement of the tumor microenvironment.
Supplementary information. This article includes supplementary material.

Fig. 3
Fig. 3 After pre-training using our self-supervised framework, a fully-connected layer (single- or double-layer) was added to the end of the backbone (the model generating the cell representations), and the resulting model was fine-tuned using the labeled data. We compared fine-tuning with both frozen and unfrozen backbones (a: CoNSeP, b: NuCLS). To account for the color differences between the train and test cohorts of the NuCLS dataset, we also performed Vahadane color normalization before the fine-tuning process, which showed a significant boost compared to the unnormalized approach (c). The results demonstrate that our fine-tuned model can achieve the same performance as the supervised baselines (HoVer-Net and NuCLS) using only 20% of the labeled data while outperforming these baselines with the full set of labeled data (a and c).

A.1.4 Lizard
Lizard included colon tissue tiles collected from 6 different sources across multiple sites in the world. The image tiles were scanned at 20× magnification. Cells were divided into 6 categories: epithelial, connective tissue, lymphocyte, plasma, neutrophil, and eosinophil cells. This dataset contained 3 folds, of which the first and third were used for training and testing, respectively.

A.1.5 Oracle
Oracle was an internal ovarian dataset collected in British Columbia, Canada. It included 192 TMA cores stained with H&E alongside the multi-color IHC scans of an adjacent slice for each core. IHC images captured different biomarkers including CD3 (T-cell), CD94 (natural killer cell), FoxP3 (T-cell), CD79a (B-cell), CD8 (killer T-cell), CD68 (macrophages), CD16a (natural killer cell), and PanCK+ (epithelial cancer cells), each of which could be used as a map to find the type of cells in the corresponding TMA core. To this end, we registered each IHC image with its corresponding core. However, due to the circular shape of the images, only 11 of them were visually matched. The cells included in these images were used as the test set.

A.1.6 SarcCell
SarcCell was an internal soft tissue sarcoma dataset collected from epithelioid sarcoma cases. Whole tissue sections were obtained from formalin-fixed paraffin-embedded source blocks and stained with CD3 (T-cell, 1:100, Leica Biosystems), CD20 (B-cell, 1:200, Biocare), and CD208 (mature dendritic cells, 1:50, Novus Biologicals) using multiplex IHC. Adjacent sections were stained with H&E. Stained sections were scanned at 40× and co-registered. IHC biomarkers were used as an indicator of the cell type for label generation. The distribution of extracted cells was extremely skewed towards T-cells (96% of all the cells). Although all the extracted cells were used in the training set, we selected a balanced subset of the training set as the test set (including all the B-cells and dendritic cells, but only a portion of the T-cells).

A.2 Evaluation Metrics
The clustering performance was measured with multiple metrics that compare the cell clusters identified by the model against the ground truth labels: Adjusted Mutual Information (AMI) [55], Adjusted Rand Index (ARI) [56], and Purity.
In particular, AMI captures the agreement between two sets of assignments using mutual information, adjusted to mitigate the effect of chance on the score. ARI is the chance-adjusted form of the Rand index [57], which calculates the quality of the clustering based on the number of matching instance pairs. Purity measures how similar the samples within each cluster are to each other. In other words, it indicates whether each cluster is a mixture of different classes or not.
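To make the pair-counting and majority-class definitions concrete, the following is a minimal sketch of Purity and ARI in plain Python. It is an illustration of the standard definitions, not the evaluation code used in this work; the function names are hypothetical, and at least two samples are assumed.

```python
from collections import Counter
from math import comb


def contingency(labels_true, labels_pred):
    """Count samples in each (predicted cluster, true class) pair."""
    return Counter(zip(labels_pred, labels_true))


def purity(labels_true, labels_pred):
    """Fraction of samples that belong to the majority class of their cluster."""
    best = {}
    for (cluster, _cls), n in contingency(labels_true, labels_pred).items():
        best[cluster] = max(best.get(cluster, 0), n)
    return sum(best.values()) / len(labels_true)


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-adjusted Rand index computed from counts of sample pairs."""
    table = contingency(labels_true, labels_pred)
    sum_cells = sum(comb(n, 2) for n in table.values())
    sum_rows = sum(comb(n, 2) for n in Counter(labels_pred).values())
    sum_cols = sum(comb(n, 2) for n in Counter(labels_true).values())
    total = comb(len(labels_true), 2)
    expected = sum_rows * sum_cols / total  # pair agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case: one cluster and one class
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

Both scores ignore the actual cluster ids, so a relabeled but otherwise identical clustering scores 1.0. In practice, library implementations such as `adjusted_rand_score` and `adjusted_mutual_info_score` in `sklearn.metrics` can be used instead; AMI is omitted here because its chance correction is considerably longer to write out.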

A.3.2 Color Normalization
Since the CoNSeP dataset contains the images from one patient and they were scanned in one institution, we excluded it from this experiment.

A.3.3 Model Size Comparison
Although our model is capable of outperforming the supervised models, it has 70% fewer parameters than HoVer-Net. However, one might attribute this difference to the fact that HoVer-Net performs the 2 tasks of cell detection and classification simultaneously. To address this, we only included the parameters of the common encoder and the classification head of HoVer-Net to obtain an impartial comparison (Table A5).

A.3.4 Cancer subtype classification
To predict the subtype, the TMA core image was divided into 400 × 400 pixel patches, cells were divided into multiple clusters using the K-means algorithm, the number of cells within each predicted cluster was counted for each patch, and then patches were split into multiple groups based on the distributions of cell clusters. Afterward, we selected 100 patches from each group and predicted the clustering label of each original image based on the distribution of cell clusters in the selected patches. The final grouping was performed with a hierarchical clustering algorithm using Ward's linkage method, preceded by dimensionality reduction using PCA.
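The counting step of this procedure can be sketched as follows. This is a simplified illustration under stated assumptions: each patch is represented only by the list of K-means cluster ids of its cells, the patch grouping and the selection of 100 patches per group are skipped, and summarizing a slide by the mean and standard deviation of its patch histograms is our assumption rather than the exact feature definition used in this work. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def patch_histogram(cell_clusters, n_clusters):
    """Proportion of cells assigned to each cell cluster within one patch."""
    counts = Counter(cell_clusters)
    total = len(cell_clusters)
    return [counts.get(k, 0) / total for k in range(n_clusters)]


def slide_features(patches, n_clusters):
    """Summarize a slide by the mean and spread of its patch histograms.

    `patches` is a list of patches, each a list of cell-cluster ids.
    Empty patches are ignored; at least one non-empty patch is assumed.
    """
    hists = [patch_histogram(p, n_clusters) for p in patches if p]
    means = [mean([h[k] for h in hists]) for k in range(n_clusters)]
    spreads = [pstdev([h[k] for h in hists]) for k in range(n_clusters)]
    return means + spreads
```

The resulting per-slide vectors could then be reduced with PCA (for example `sklearn.decomposition.PCA`) and grouped with Ward's linkage (for example `scipy.cluster.hierarchy.linkage(features, method="ward")`), mirroring the final step described above.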

A.4 Ablation Study
To study the effects of each component of the framework, we also conducted multiple ablation studies. In this section, we limited the training duration to 200 epochs and reduced the batch size to 512, and the studies were performed only on the CoNSeP and NuCLS datasets to decrease the time and memory requirements.

A.4.1 Memory Bank
Table A6 shows the unsupervised clustering performance in the presence and absence of the negative sample memory bank. These results confirmed that the memory bank is critical to the performance of the model.
Comparing the results of this table with those of Table 1 (where the model was trained for 500 epochs with a batch size of 1024), we found that although the performance on the CoNSeP dataset dropped, the model produced nearly identical results on the NuCLS dataset. We hypothesized that this is related to the size of the dataset: as the NuCLS dataset was 2× larger than CoNSeP, the model could converge to a steady performance with fewer training epochs.
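For readers unfamiliar with negative-sample memory banks, the sketch below shows one common formulation: a fixed-size first-in-first-out queue of past key embeddings whose entries serve as negatives in an InfoNCE loss. The queue design, the temperature value, and all names are assumptions made for illustration and are not taken from the VOLTA implementation.

```python
from collections import deque
from math import exp, log


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


class MemoryBank:
    """Fixed-size FIFO queue of past key embeddings used as negatives."""

    def __init__(self, size):
        self.keys = deque(maxlen=size)  # oldest keys are dropped automatically

    def enqueue(self, batch_keys):
        self.keys.extend(batch_keys)


def info_nce(query, positive_key, negatives, temperature=0.2):
    """Cross-entropy of picking the positive key among positive + negatives."""
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, k) / temperature for k in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + log(sum(exp(l - m) for l in logits))
    return log_sum - logits[0]
```

Without a bank, the negatives are limited to the other samples in the current batch; with it, `info_nce(q, k_pos, bank.keys)` contrasts each query against many more negatives at no extra encoding cost, which is one plausible reason the ablation is sensitive to its removal.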

A.4.2 Masking cells in the environment patch
In this section, we studied the effects of the cell masking operation. Instead of masking all cells present in the environment patch, we masked only the target cell (the one that already exists in the associated single-cell image fed into the Cell Block). The results (Table A7) showed that masking all cells increased the general performance of the model, implying that a reduced bias towards the types of other cells and a stronger focus on the non-cellular environment, such as tissue structure, were important for environment integration.
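The two masking variants compared here can be illustrated with a small sketch. It assumes that a nucleus instance map is available for the environment patch (0 for background, a positive id per cell) and that masked pixels are replaced by a constant fill value; both the representation and the fill value are assumptions for illustration, and `mask_cells` is a hypothetical helper.

```python
def mask_cells(patch, instance_map, target_id=None, fill=0):
    """Return a copy of `patch` with cell pixels replaced by `fill`.

    instance_map[r][c] is 0 for background and a positive cell id otherwise.
    With target_id=None every cell is masked (the default setting above);
    otherwise only the cell with that id is masked (the ablated variant).
    """
    masked = [row[:] for row in patch]  # leave the input patch untouched
    for r, id_row in enumerate(instance_map):
        for c, cell_id in enumerate(id_row):
            hit = cell_id > 0 if target_id is None else cell_id == target_id
            if hit:
                masked[r][c] = fill
    return masked
```

Calling `mask_cells(patch, instance_map)` removes every nucleus from the environment view, whereas `mask_cells(patch, instance_map, target_id=i)` leaves neighboring cells visible.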

A.4.3 Multi-cropping
We compared the effect of multi-cropping (global-local cropping augmentations) with the conventional local-local setting, where both augmentation pipelines in the Cell Block contained the cropping operation. Surprisingly, although multi-cropping showed a significant improvement on the NuCLS dataset, it reduced the performance of the model on CoNSeP (Table A8). We hypothesized that, because the CoNSeP dataset had fewer data samples, the local augmentations obtained in this setting were not diverse enough for the model to be properly trained. Therefore, we increased the number of training epochs on the CoNSeP dataset and compared the effect of multi-cropping again. Our results (Table A9) demonstrated the positive effect of multi-cropping on the performance when the model was trained for a longer time.

A.4.4 Ensembling
We also compared the impact of using the trained momentum encoder instead of the backbone in the testing phase. As Table A10 suggests, using the momentum encoder improved the performance of the model. This finding was expected, since the momentum encoder update forms a model similar to Polyak-Ruppert averaging [32], aggregating the encoder network weights across all training epochs.
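The averaging behavior can be seen from the usual exponential-moving-average form of a momentum update. The sketch below assumes that form and represents the weights as a flat list of floats; the momentum value `m` and the function name are illustrative and not taken from the paper.

```python
def momentum_update(momentum_weights, backbone_weights, m=0.99):
    """Exponential moving average: w_k <- m * w_k + (1 - m) * w_q."""
    return [m * wk + (1.0 - m) * wq
            for wk, wq in zip(momentum_weights, backbone_weights)]
```

Unrolling this update over training steps shows that the momentum weights are a weighted average of all past backbone weights, with older snapshots discounted geometrically by `m`. This is why the momentum encoder can act like an ensemble of backbone checkpoints at test time.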

Figure 1
Figure 1 depicts an overview of the proposed self-supervised method for cell classification. In this figure, X is the input cell image, Q and K are the augmented cell images, E is the environment image, E_Aug is the augmented environment image, and L_cell and L_Env are the cell and environment loss values.

Fig. 1
Fig. 1 Overview of our proposed framework. The cell block trains the backbone model by applying two augmentations on the same cell image, encoding the images, and bringing their representations close to each other. While the backbone is trained through back-propagation, the momentum encoder averages the weights from the backbone. On the other hand, the environment block combines the cell representation created by the cell block with the surrounding environment (a larger region around the cell). We mask all of the cells in the environment patch to prevent the model from biasing the cell representation toward that of these cells.

Fig. 2
Fig. 2 Embedding space representation of each dataset using UMAP. (a) CoNSeP, (b) NuCLS, (c) PanNuke Colon, (d) PanNuke Breast, (e) Lizard, (f) Oracle, (g) SarcCell. Concentric contours with the same color show the distribution of the representations learned by our model for a specific cell type. Despite not using labeled data in the training process, our model learns to map cells of the same type close to each other.

Fig. 4
Fig. 4 (a) Ovarian cancer and (b) endometrial cancer datasets are hierarchically clustered based on cell cluster proportions. To achieve this, we first train our model to deliver cell representations in a self-supervised manner. Then, the model is applied to patches of a slide, the cell cluster distributions are counted, and the slides are clustered into distinct cohorts based on the variation of the cell cluster distributions across the patches of each slide. In the case of endometrial cancer (b), the supercluster on the right (yellow) represents a cohort of patients that mostly have the POLE subtype (only one p53abn sample is in this group), the supercluster in the middle (red) contains mainly the MMRd patients (with only one POLE case misclassified), and the supercluster on the left (purple) contains the p53abn cases with only one POLE case misplaced.
Fig. A1 shows the flowchart of the process for cancer subtype classification on the ovarian and endometrial datasets. Fig. A3 depicts the boxplot of the cell area for each cell cluster of the ovarian dataset, and Fig. A4 shows the proportion of tumor 2, 4, and 5 clusters with respect to all the tumor clusters.

Fig. A1
Fig. A1 Workflow of unsupervised cancer subtype classification based on the unsupervised cell representation learning

Fig. A3
Fig. A3 Cell area boxplot for cell clusters of the ovarian dataset

Fig. A5
Fig. A5 Boxplot of cell cluster distribution across patches of the Endometrial Cancer for all of the patients

Fig. A7
Fig. A7 Inferred cell cluster labels in a representative example of a high-grade serous ovarian carcinoma. Cell cluster labels are depicted as coloured nuclear outlines, and represent morphologically-distinct clusters that are associated with cell types. Green captures cells with darkly staining nuclei, and brown labels spindled cells that are mostly stromal.

Fig. A10
Fig. A10 Inferred cell cluster labels in a representative example of an endometrioid ovarian carcinoma. Cell cluster labels are depicted as coloured nuclear outlines, and represent morphologically-distinct clusters that are associated with cell types. Green captures cells with darkly staining nuclei, and brown labels spindled cells that are mostly stromal.

Fig. A11
Fig. A11 Inferred cell cluster labels in a representative example of a POLE-mutated endometrial carcinoma.Cell cluster labels are depicted as colored nuclear outlines, and represent morphologically-distinct clusters that are associated with cell types.Green cells comprise a mixture of stromal and cancer cells with elongated nuclei.Brown denotes cells with intermediate-sized nuclei, comprising a mixture of tumor, inflammatory, and stromal cells.In this example, some cancer cells are labeled green, highlighting the uncertainty of cell type classification in morphologically pleomorphic POLE tumors.

Table 1
Unsupervised clustering of cell representations across different methods and datasets. The baseline models include both morphology-based and state-of-the-art deep learning methods for cell representation. Some of the baseline results are listed as "-", meaning that the feature vectors could not be calculated because the model cannot handle small-sized cells.

Table A1
Datasets summary. The datasets are distributed across 4 tissue and 18 cell types to demonstrate the utility of the method.

Table A2
Fine-tuning accuracy for CoNSeP and NuCLS datasets. The supervised baselines demonstrate the performance of the HoVer-Net and NuCLS models on the CoNSeP and NuCLS datasets, respectively.

Table A6
Memory bank ablation study

Table A7
Masking cells in the environment patch ablation study

Table A11
Distribution of cell clusters across epithelial ovarian cancer histotypes (with standard deviation).

Table A12
Distribution of cell clusters across endometrial cancer molecular subtypes (with standard deviation). The non-MMRd group encompasses p53abn and POLE tumors.