PatchSorter: a high throughput deep learning digital pathology tool for object labeling

The discovery of patterns associated with diagnosis, prognosis, and therapy response in digital pathology images often requires intractable labeling of large quantities of histological objects. Here we release an open-source labeling tool, PatchSorter, which integrates deep learning with an intuitive web interface. Using >100,000 objects, we demonstrate a >7x improvement in labels per second over unaided labeling, with minimal impact on labeling accuracy, thus enabling high-throughput labeling of large datasets.


Main
The increasing digitization of routine clinical histology slides into whole slide images (WSI) has spurred great interest in the development of WSI-based biomarkers for diagnosis, prognosis, and therapy response [1][2][3]. These biomarkers are typically based on patterns associated with the location and type of individual histologic objects (e.g., cells: lymphocytes/epithelial; glomeruli: globally sclerotic (GS)/non-sclerotic (non-GS/SS)/segmentally sclerotic (SS); tubules: distal/proximal; tumor buds: present/absent). While current hardware and machine learning algorithms can locate and type objects at scale, the manual assignment and review of the large labeled datasets used to train or validate models remains arduous. For example, a single WSI may contain over 1 million cells, which, at a modest 1 second per cell to label, would require approximately 12 days of non-stop effort. To aid experts (e.g., pathologists) in this labeling process, several image analysis algorithms have been proposed [4][5][6][7][8][9]. However, these algorithms tend to be either (a) not integrated into polished, user-friendly tools, making them unsuitable for use by domain experts, or (b) closed source and for-profit, creating a barrier to their broad usage, which potentially limits their continuous improvement via the facile integration and evaluation of new algorithms [10].
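The labeling-time estimate above can be reproduced with a short calculation. This is only a sketch of the arithmetic; the function name `labeling_days` is introduced here for illustration and is not part of any tool described in this work.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def labeling_days(n_objects: int, seconds_per_object: float) -> float:
    """Continuous labeling time, in days, at a fixed per-object rate."""
    return n_objects * seconds_per_object / SECONDS_PER_DAY


# One million cells at one second per cell, with no breaks.
days = labeling_days(1_000_000, 1.0)
print(f"{days:.1f} days")
```

At one second per cell the result is roughly 11.6 days of uninterrupted work, consistent with the "approximately 12 days" figure quoted in the text.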
Appreciating the need for an open-source force multiplier for labeling histological objects, we here describe and make available to the community PatchSorter (PS). PS is a user-friendly, browser-based tool that allows the user to leverage deep learning (DL) to quickly review and apply labels at a group level, as opposed to a single-object level (Figure 1). We demonstrated that this "bulk" labeling approach improves labeling efficiency across four use cases, spanning three levels of increasing object complexity (i.e., objects composed of an increasing number of cells and cell types) (Table 1). PS enables labeling speed improvements by using DL-derived features to embed patches containing the object of interest (e.g., glomeruli) into a 2-dimensional embedding space, such that similarly presenting objects are proximally located. The user then reviews patches within a localized region that are likely to correspond to the same class, thus enabling assignment of labels in bulk (i.e., assignment of the same label to multiple objects at once) with increased efficiency. The DL model and associated embedding space are then iteratively refined with the user's feedback, yielding improved class separability and further improving subsequent labeling efficiency (Figures S1 and S2).
To evaluate this improved efficiency, a labels per second (LPS) metric was compared between PS and an unaided approach, Quick Reviewer (QR, see Online Methods) [11], across four use cases (Table 1, see Online Methods) totaling over 120,000 objects. QR was used to label a random subset of the data to estimate manual LPS (M_t) per use case. Efficiency improvement was measured as the ratio (θ_t) between PS's LPS (PS_t) and M_t. To ensure labeling efficiency improvements did not come at the cost of label fidelity, concordance between QR- and PS-assigned labels was measured. Labeling for all use cases was conducted by board-certified pathologists after they had received an introduction to the PS and QR user interfaces.
These results indicate that (a) PS provides sizable efficiency improvements in labeling objects of all levels of cellular and structural complexity, while (b) not coming at the cost of a loss of labeling accuracy (Table 1). Interestingly, differences remain between labels generated via PS and QR. This difference can be at least partially attributed to label uncertainty related to ambiguous objects, wherein labeling is likely to suffer from inter/intra-observer variability (Figures S3-S6).
The usage of PS appears to proceed in two distinct workflows: (a) rapid bulk labeling on the periphery of the embedding space, where objects with more obvious labels tend to be grouped, and (b) slower, intricate labeling at the interface between classes, where object labels tend to be more challenging to determine. Notably, these challenging data points often drive improved class separation. As such, our suggested best practice is to alternate between the two workflows: (1) when class separation is high in the embedding plot, the operator should focus on bulk labeling, while (2) if class separation is low, labeling should be performed at the interface between classes. This interface labeling should result in improved class separation in the next embedding iteration, thus again facilitating bulk labeling (Figure S1).
The transition point between these two workflows appears to be use case specific (Figure 2). While in the nuclei use case labeling speed improves with DL training, in the glomerular use case a more time-consuming, careful evaluation is required throughout the task, due to the difficult nature of differentiating between transitioning classes (e.g., SS with small areas of scarring mimicking non-GS/SS or with extensive segmental sclerosis mimicking GS).
Table 1. Description of the datasets used for validating PatchSorter along with the demonstrated efficiency gains in terms of labels per second and concordance with an unaided approach. Total time includes model training and embedding, whereas human time excludes them, as the human reader can be dismissed to perform other non-labeling tasks during these steps. Manual time for the same task is estimated by extrapolating from manual labeling of a subset of the data. Upper and lower intervals for PatchSorter human time are estimated by measuring PS_t in 15-minute intervals, corresponding to total QuickReviewer labeling time. For the nuclei use case, speed increases of up to 22.3x (9.6 LPS) are observable, while PatchSorter is only slightly slower than manual labeling in one of the measured 15-minute intervals. For tubules, tumor buds, and glomeruli, PatchSorter offers a speed increase over manual labeling efforts, even for worst-case estimates. SS = segmentally sclerotic, GS = globally sclerotic, non-SS/GS = non-sclerotic.
From a usage perspective, after PS installation, no internet connection is required, enabling its use in clinical environments where data may not be anonymized. PS can be installed locally on commodity desktops or deployed on servers for remote access by experts (i.e., bringing the expert to the data), as datasets become too large to quickly transfer and clinical environments further restrict the installation of third-party software. While PS has been validated in this study on hematoxylin and eosin (H&E) and periodic acid-Schiff (PAS) staining, given the DL-based back end, PS can be considered agnostic to stain type and be used with any stain, image, or object type.
In conclusion, PS is a user-friendly, high-throughput object labeling tool being publicly released for community usage, review, and feedback. PS has demonstrated significant improvement in object labeling efficiency in the hands of domain experts without sacrificing labeling accuracy. The source code of PS is freely available for use, modification, and contribution at www.patchsorter.com.

PatchSorter workflow
As per the PS workflow (Figure S2), images were uploaded to PS together with a corresponding segmentation mask highlighting object location. PS then extracts patches, with user-configurable patch sizes, around the center of these objects to create an internal database for high-speed training. While a number of different self-supervised approaches are supported by PS (e.g., BarlowTwins [12] and AutoEncoder [13]), a SimCLR [14] model with a ResNet18 [15] backbone was trained using contrastive loss, creating a dataset-specific DL feature space. Feature vectors are computed for each patch using this learned feature space and are subsequently embedded into 2 dimensions using Uniform Manifold Approximation and Projection (UMAP) [16]. As a result of this process, objects that look similar tend to be plotted near each other in the embedding plot. This allows the user to lasso regions on the embedding plot and provide the label for the selection in the grid plot (Figure 1A). As more objects are labeled, PS is increasingly able to learn a more discriminative feature space for the categories of the specific task. As a result, subsequent iterations should demonstrate improved localized clustering "purity" (i.e., objects in the same cluster have the same label). This approach has two consequences: (a) the user can avoid intractably manipulating individual objects and instead provide bulk annotations to groups of objects with a single input, and (b) as the DL model (and thus the embedding space) is refined with the user's feedback, the user can begin to see regions in the 2D space where the underlying model is struggling to differentiate between class types. The visibility of such regions affords the user the opportunity to better invest their time in selecting objects that, when labeled, are most likely to further improve class separability in the next iteration, which in turn further improves subsequent labeling efficiency.
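The lasso-and-label interaction described above can be illustrated with a small sketch. It assumes each object already has 2D embedding coordinates and treats the lasso as a closed polygon, using a standard even-odd ray casting test. The names `point_in_polygon`, `lasso_select`, and `bulk_label` are hypothetical and are not taken from the PS code base, whose actual selection logic may differ.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(p: Point, polygon: Sequence[Point]) -> bool:
    """Even-odd ray casting test: is p inside the closed polygon?"""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through p can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def lasso_select(embedding: Dict[int, Point], lasso: Sequence[Point]) -> List[int]:
    """Return the ids of all objects whose embedding falls inside the lasso."""
    return sorted(i for i, xy in embedding.items() if point_in_polygon(xy, lasso))


def bulk_label(labels: Dict[int, str], selected: List[int], label: str) -> None:
    """Assign one label to every selected object in a single step."""
    for obj_id in selected:
        labels[obj_id] = label


# Toy embedding: three objects cluster together, one lies far away.
embedding = {0: (0.1, 0.2), 1: (0.3, 0.1), 2: (2.0, 2.1), 3: (0.2, 0.4)}
lasso = [(0.0, 0.0), (0.5, 0.0), (0.5, 0.5), (0.0, 0.5)]
labels: Dict[int, str] = {}

selected = lasso_select(embedding, lasso)
bulk_label(labels, selected, "lymphocyte")
print(selected, labels)
```

A single lasso and one label assignment here covers three objects at once, which is the source of the efficiency gain when neighboring points share a class.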
To facilitate the efficiency of this bulk labeling process, features from modern operating systems were implemented, such as drag-select and numerous intuitive keyboard shortcuts for (a) selecting all objects, (b) inverting the selection, and (c) changing the desired label (e.g., "1" selects the first class). If specific objects of interest are sought, PS provides content-based image retrieval, wherein the user may upload a patch of the object of interest, and similar objects from the dataset will appear for labeling within the standard workflow. PS was designed in a decoupled, modular manner such that its backend technologies can easily be exchanged to evaluate different DL technologies, with minimal modifications to the base application. To ease integration with other workflows and pipelines, the output of PS is highly portable: mask images with color indicating class membership (Figure S1D). For more advanced users, the internal database can be directly employed in common downstream tasks, such as training large custom DL models. It is important to note that the user retains full control over the accuracy of object labels at all times, and only confirmed labels are stored. Usefully, these newly generated ground truth labels (as well as predicted labels) can be visualized through PS for rapid tile-level review, wherein individual object labels may still be modified as needed (Figure 1E).
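Content-based image retrieval of the kind mentioned above is commonly implemented as a nearest-neighbor search in feature space. The following sketch assumes the query patch and all dataset patches have already been converted to feature vectors, and it ranks by cosine similarity. The similarity measure and the names `cosine_similarity` and `retrieve_similar` are assumptions made for illustration; the text does not specify how PS ranks candidates.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_similar(
    query: Sequence[float], features: Dict[int, Sequence[float]], k: int
) -> List[int]:
    """Return the ids of the k dataset objects most similar to the query."""
    ranked = sorted(
        features,
        key=lambda obj_id: cosine_similarity(query, features[obj_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy feature vectors; real features would come from the trained DL model.
features = {
    10: [1.0, 0.0, 0.0],
    11: [0.9, 0.1, 0.0],
    12: [0.0, 1.0, 0.0],
    13: [0.0, 0.0, 1.0],
}
query = [1.0, 0.05, 0.0]
print(retrieve_similar(query, features, k=2))
```

The returned ids would then be shown to the user for labeling in the same way as a lasso selection.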

Manual unaided baseline efficiency estimation
Quick Reviewer (QR) [11], an open-source object labeling tool, was employed as the unaided baseline approach for comparison against PS. QR is a simple web-based framework that presents image patches to the user one at a time and collects their label determination via a button click. It should be noted that QR already offers notable efficiency advantages over true unaided manual object labeling pipelines, as objects are directly presented to the user, which obviates the time-consuming process of (a) finding specific objects in WSIs and (b) transitioning between different WSIs. As such, QR times can be considered optimistic compared to a "fully" unaided approach, which is becoming less common in practice.

Metrics for evaluating PS efficiency improvement
For comparing PS to QR, we introduce a labels per second (LPS) metric. For each of the 4 use cases described below, QR was used to label a random subset of the data to estimate LPS and extrapolate manual LPS (M_t) for the entire dataset. For PS, we measure LPS in total time and in human time (PS_t). Total time includes model training and patch embedding, whereas human time excludes them, as the human reader can be dismissed to perform other non-labeling tasks during these steps. Efficiency improvement is then measured as the ratio (θ_t) between PS_t and M_t. To ensure these labeling efficiency improvements did not come at the cost of unacceptable fidelity loss, the subset of data manually labeled is quantitatively compared, using the concordance metric, to the labels produced via PS.
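The metrics defined in this section can be written out as a minimal sketch. The `Session` container and the function names are introduced here for illustration only, and the numbers in the example are invented, not values from Table 1.

```python
from dataclasses import dataclass


@dataclass
class Session:
    n_labels: int                 # number of objects labeled
    human_seconds: float          # time the reader spent labeling
    machine_seconds: float = 0.0  # model training and embedding (0 for QR)


def lps(n_labels: int, seconds: float) -> float:
    """Labels per second over a given duration."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return n_labels / seconds


def human_lps(s: Session) -> float:
    """LPS over human time only (PS_t for PS, M_t for QR)."""
    return lps(s.n_labels, s.human_seconds)


def total_lps(s: Session) -> float:
    """LPS over human time plus training and embedding time."""
    return lps(s.n_labels, s.human_seconds + s.machine_seconds)


def efficiency_ratio(ps: Session, qr: Session) -> float:
    """theta_t: ratio between PS_t and M_t."""
    return human_lps(ps) / human_lps(qr)


def extrapolated_manual_seconds(n_total: int, qr: Session) -> float:
    """Manual time for the full dataset, extrapolated from the QR subset."""
    return n_total / human_lps(qr)


# Illustrative numbers only.
qr = Session(n_labels=900, human_seconds=900.0)
ps = Session(n_labels=18_000, human_seconds=3_600.0, machine_seconds=1_800.0)

print(human_lps(qr), human_lps(ps), total_lps(ps), efficiency_ratio(ps, qr))
print(extrapolated_manual_seconds(18_000, qr))
```

Keeping human time and total time separate makes explicit that the reported ratio θ_t excludes periods in which the reader is free to do other work.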

Use Case 1: Lymphocyte labeling in triple-negative breast cancer
Tumor infiltrating lymphocytes (TILs) have emerged as a biomarker of interest in breast cancer, with mounting evidence of their prognostic and predictive value in triple-negative breast cancer [17]. Nuclei are labeled as lymphocyte or non-lymphocyte in accordance with the immune oncology working group guidelines for immune infiltration scoring in breast cancer [18].
To begin, 2,000 image tiles of 1000x1000 pixels were randomly cropped from n=21 deidentified H&E WSIs scanned at 40x magnification from the MATADOR [19] cohort, ensuring sufficient quality (e.g., exclusion of tissue folds or blurry regions). ROIs were stain normalized based on a reference tile from the MATADOR [19] cohort using the Vahadane stain normalization [20] implementation from StainTools (https://github.com/Peter554/StainTools). Using the HoverNet [21] implementation from histocartography [22], nuclei were segmented to provide the object location information to PS. Following the PS workflow (Figure S2), ROIs and corresponding object segmentation masks were uploaded into PS, where nuclei were extracted from the ROIs into 64x64 pixel patches with the nuclei centered.
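Extracting a fixed-size patch centered on a segmented object can be sketched as follows. The example uses nested lists for the image and a labeled mask so that it runs without dependencies. Shifting the crop window back inside the image at the border is an assumption of this sketch; the text does not state how PS handles objects near ROI edges, and the function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def centroid(mask: List[List[int]], label: int) -> Tuple[int, int]:
    """Integer (row, col) centroid of all mask pixels equal to label."""
    row_sum = col_sum = count = 0
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value == label:
                row_sum += r
                col_sum += c
                count += 1
    if count == 0:
        raise ValueError(f"label {label} not present in mask")
    return row_sum // count, col_sum // count


def crop_box(center: Tuple[int, int], patch_size: int, height: int, width: int) -> Box:
    """Square box around center, shifted if needed to stay inside the image."""
    if patch_size > min(height, width):
        raise ValueError("patch is larger than the image")
    half = patch_size // 2
    top = min(max(center[0] - half, 0), height - patch_size)
    left = min(max(center[1] - half, 0), width - patch_size)
    return top, left, top + patch_size, left + patch_size


def extract_patch(image: List[List[int]], box: Box) -> List[List[int]]:
    """Copy the pixels inside box out of the image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# Toy 8x8 image with one 3x3 object centered at (4, 4).
image = [[r * 8 + c for c in range(8)] for r in range(8)]
mask = [[0] * 8 for _ in range(8)]
for r in range(3, 6):
    for c in range(3, 6):
        mask[r][c] = 1

center = centroid(mask, 1)
box = crop_box(center, patch_size=4, height=8, width=8)
patch = extract_patch(image, box)
print(center, box, patch[0])
```

In the nuclei use case the same idea would apply with a patch size of 64 and one crop per segmented nucleus.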

Use Case 2: Detection of tumor budding in pulmonary squamous cell carcinoma
Tumor buds, defined as clusters of cancer cells composed of fewer than five cells [23], are an invasive pattern that has been described in solid tumors (e.g., colon cancer). Tumor budding has attracted interest as a prognostic biomarker in lung cancer, with the presence of tumor buds being associated with worse patient outcome.
Here, 27 ROIs of 2000x2000 pixels were extracted at 40x from n=3 fully deidentified H&E-stained lung cancer samples. A u-net [24] model was applied to each ROI to segment potential tumor bud candidates for further labeling into absent/present. ROIs were stain normalized using the Vahadane stain normalization [20] method implemented in StainTools, and each ROI was downsampled to 500 by 500 pixels using nearest-neighbor interpolation. In total, 1761 tumor bud candidates were extracted into 64x64 pixel patches by PS with a single potential tumor bud centered.
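Nearest-neighbor downsampling, as used for the ROIs above, keeps one source pixel per output pixel without blending values. The sketch below maps each output index to a source index by integer scaling; image libraries may sample at slightly different positions (e.g., pixel centers), so this illustrates the technique rather than the exact routine used in the study.

```python
from typing import List


def downsample_nearest(image: List[List[int]], out_h: int, out_w: int) -> List[List[int]]:
    """Resize by picking, for each output pixel, a single source pixel."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Toy example with the same 4x reduction as 2000x2000 -> 500x500.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
small = downsample_nearest(image, 2, 2)
print(small)
```

Because no interpolation between neighbors occurs, the method is also safe for label masks, where averaged values would not correspond to a valid class.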
Small changes to the PS user interface were made to show a larger 256x256 image instead of the 64x64 image used for training the DL model. This additional context was requested by the reader to improve their decision-making comfort; these changes are available in the PS code repository. In QR, patches were presented with an overlay of the u-net segmentation mask to indicate tumor bud position in the ROI, as multiple tumor buds might be present in the ROI.
In addition to absent/present, PS and QR were set up to include an 'unsure' label, allowing for the labeling of patches where the pathologist was not comfortable making a definitive decision during the experiment. The reported accuracy is measured between all labels present in QR and PS (absent/present/unsure).
Discussion of the discordant cases between QR and PS indicated that the additional context provided by QR led the pathologist to be less confident in labeling patches as 'absent', while in PS, patch similarities to other 'absent' examples in the embedding space made the pathologist more likely to label these patches as 'absent' (Figure S4). Therefore, the user-perceived agreement between PS and QR is likely higher than the concordance score indicates.

Use Case 3: Renal tubular classification
Tubules are a major component of the nephron, the functional unit of the kidney. The two major types of tubules in the kidney cortex are the proximal and distal tubules, and they are vulnerable to a variety of injuries across diseases (e.g., atrophy, acute injury, osmotic changes, etc.). For this use case, tubules were labeled into four classes: proximal, distal, abnormal, and other (i.e., false positives from the a priori tubule segmentation step and collecting ducts or thin limb of loop of Henle tubules in the medulla). 2516 ROIs were extracted from fully deidentified WSIs from the NEPTUNE [26] PAS WSI cohort at 20x magnification and uploaded into PS. ROIs were stain normalized using the Vahadane stain normalization [20] implementation from StainTools. In total, 10,129 tubules were extracted into 256 by 256 pixel patches with a single tubule centered, based on tubule annotations created in QuPath [27].

Use Case 4: Renal glomerular classification
Glomeruli, the filtration organelles of the kidney nephrons, can undergo a variety of morphologic changes. For this use case, we selected diseases where glomeruli can undergo segmental to global scarring. Glomeruli were labeled into 5 categories: globally sclerotic (GS), segmentally sclerotic (SS), non-sclerotic glomeruli (non-SS/GS), non-glomeruli (i.e., false positives from the a priori glomeruli segmentation step), and uncertain (i.e., distinction between SS and GS is challenging by visual inspection) [28,29]. The high complexity of these organelles, consisting of various cell types, a capillary tuft, a mesangial stalk, a urinary space, and a capsule, together with the high heterogeneity in image presentation of GS and SS glomeruli, allows PS's ability to improve the labeling efficiency of complex objects to be showcased. The reported accuracy is measured between GS, SS, non-GS/SS, and non-glomeruli labels. Cases labeled as uncertain were excluded, as their ambiguous nature would not lead to meaningful conclusions regarding the concordance between PS and QR.
For the experiment, 16,158 glomeruli from 241 fully deidentified NEPTUNE [26] and CureGN [30] PAS WSIs were used. Glomeruli were previously manually segmented using QuPath [27] and preprocessed into 256 by 256 pixel ROIs extracted at 40x magnification, each containing a single glomerulus centered in the ROI. ROIs were normalized using Vahadane stain normalization [20] from the StainTools library. ROIs and corresponding segmentation masks were uploaded into PS according to the PS workflow (Figure S2). Patches were created using the full ROI.
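The exclusion of uncertain cases from the concordance calculation can be sketched as below. Two assumptions are made here because the text does not define them: concordance is taken to be the fraction of jointly labeled objects that receive the same label in QR and PS, and an object is dropped if either tool marked it with an excluded label. The function name `concordance` and the example labels are illustrative.

```python
from typing import Dict, FrozenSet


def concordance(
    qr_labels: Dict[int, str],
    ps_labels: Dict[int, str],
    exclude: FrozenSet[str] = frozenset(),
) -> float:
    """Fraction of shared objects with identical labels, after exclusions."""
    shared = [
        obj_id
        for obj_id in qr_labels
        if obj_id in ps_labels
        and qr_labels[obj_id] not in exclude
        and ps_labels[obj_id] not in exclude
    ]
    if not shared:
        raise ValueError("no comparable objects after exclusion")
    agree = sum(qr_labels[obj_id] == ps_labels[obj_id] for obj_id in shared)
    return agree / len(shared)


# Invented labels for five glomeruli.
qr = {1: "GS", 2: "SS", 3: "non-SS/GS", 4: "uncertain", 5: "non-glomerulus"}
ps = {1: "GS", 2: "GS", 3: "non-SS/GS", 4: "SS", 5: "non-glomerulus"}

print(concordance(qr, ps))
print(concordance(qr, ps, exclude=frozenset({"uncertain"})))
```

For the tumor bud use case, where the 'unsure' label is kept in the comparison, the same function would be called without the `exclude` argument.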

Configuration and hyperparameters
The default version of PS is nearly fully configured. The few hyperparameters of interest are easily modifiable through the configuration file. In the use cases discussed here, the hyperparameters requiring change relate to the patch size used to extract the objects from the ROI images and the encoder size of the DL model, which governs how much information from a given patch can be used by the model to assess patch similarities. Patch size was chosen based on object size and magnification, such that each object is fully visible in a patch. In the use cases presented (Table 1), the encoder size was set equal to the patch size. For example, in the glomeruli classification use case, patch size was configured as 256x256 pixels, with the encoder size configured as 256. This parameter setting approach appears to yield a sufficient starting point for using PS efficiently.
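The convention of setting the encoder size equal to the patch size can be checked programmatically. The sketch below parses an INI-style fragment with Python's `configparser`. The section names `[patches]` and `[model]` and the keys `patch_size` and `encoder_size` are hypothetical placeholders and should not be read as the actual keys of the PS configuration file, which should be consulted directly.

```python
import configparser
from typing import Tuple

# Hypothetical configuration fragment; real PS section and key names may differ.
EXAMPLE_INI = """
[patches]
patch_size = 256

[model]
encoder_size = 256
"""


def load_settings(text: str) -> Tuple[int, int]:
    """Read the (hypothetical) patch size and encoder size settings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    patch_size = parser.getint("patches", "patch_size")
    encoder_size = parser.getint("model", "encoder_size")
    return patch_size, encoder_size


def follows_convention(patch_size: int, encoder_size: int) -> bool:
    """True when the encoder size equals the patch size, as in the use cases here."""
    return patch_size == encoder_size


patch_size, encoder_size = load_settings(EXAMPLE_INI)
print(patch_size, encoder_size, follows_convention(patch_size, encoder_size))
```

The values shown correspond to the glomeruli use case; for the nuclei and tumor bud use cases the same convention would give 64 for both settings.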

Experiment setup
Each experiment was conducted on an Ubuntu Server 20.04 LTS machine equipped with an NVIDIA GeForce RTX 2080 Ti.

Figure 1. PatchSorter user interface. (A) The embedding plot after initial embedding (left) with corresponding grid plot (right). The two-dimensional embedding plot places patches with similar deep learned features in close proximity, causing objects of the same class to cluster. The user lassos points (black contour with green arrow), which then appear in the grid plot for labeling using efficient keyboard shortcuts. In the embedding plot, a subset of patches can be overlaid to aid in selecting regions in the embedding space (orange arrow). (B) The embedding plot allows for coloring patches by prediction and ground truth (purple arrow). The embedding plot shows the same data set as (A) after eight model iterations, where the embedding space is well separated by ground truth labels. Hovering over a point in the embedding space shows the corresponding patch (red arrow). (C) Grid plot coloring shows current predictions and ground truth. The inner square color represents ground truth while the outer square color represents model prediction, with black indicating that the patch is not yet labeled. Right-clicking on a patch in the grid plot shows a larger region of interest (ROI) for context (green arrows). (D) From the image pane, prediction and ground truth labels can be visualized (blue arrow) in the output reviewer. (E) Here, object labels can be updated via a right click on the object (yellow arrow).

Figure 2. Efficiency metric PS_t over time, measured in 5-minute intervals, demonstrating the improvement in labeling speed of PS for the (A) nuclei, (B) glomeruli, (C) tubules, and (D) tumor bud use cases. The x-axis is the human annotation time in minutes and the y-axis is the labeling speed in labels per second for a given time interval. Performance improvement over time varies per use case. For (A) nuclei labeling, a consistent performance increase is noted, consistent with increased class separation. As the entire dataset is labeled, performance decreases as easy-to-discern object labels are exhausted. For (B) glomeruli labeling, the initial embedding allowed for bulk annotation of non-SS/GS, GS, and SS at the edge of the embedding plot, while later, nuanced labeling had to be employed due to the task's difficulty. For (C) tubule labeling, the initial embedding allows for bulk annotation. Although class separation decreased in the subsequent iteration, due to changes to the initially assigned labels and imbalanced labeling of the 4 classes, further iterations led to increased class separability and labeling efficiency. Lastly, for (D) tumor bud candidates, initial labeling efficiency was only marginally higher than manual baseline LPS. As class separability increased, so did labeling efficiency.