Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer

For prostate cancer patients, the Gleason score is one of the most important prognostic factors, potentially determining treatment independent of the stage. However, Gleason scoring is based on subjective microscopic examination of tumor morphology and suffers from poor reproducibility. Here we present a deep learning system (DLS) for Gleason scoring whole-slide images of prostatectomies. Our system was developed using 112 million pathologist-annotated image patches from 1226 slides, and evaluated on an independent validation dataset of 331 slides. Compared to a reference standard provided by genitourinary pathology experts, the mean accuracy among 29 general pathologists was 0.61 on the validation set. The DLS achieved a significantly higher diagnostic accuracy of 0.70 (p = 0.002) and trended towards better patient risk stratification in correlation with clinical follow-up data. Our approach could improve the accuracy of Gleason scoring and subsequent therapy decisions, particularly where specialist expertise is unavailable. The DLS also goes beyond the current Gleason system to more finely characterize and quantitate tumor morphology, providing opportunities for refinement of the Gleason system itself.


Introduction
Adenocarcinoma of the prostate is the second most common cancer diagnosed in men, with approximately one in nine men diagnosed in their lifetime.1 For prostate cancer patients, subjective microscopic tissue examination remains the gold standard for diagnosis. Here, the Gleason score and tumor stage have remained the most powerful predictors of prognosis in virtually every large prostate cancer outcome study.2 The Gleason system was initially developed in 1966 and stratifies prostate malignancies by tumor architectural patterns. The system has since been revised significantly3,4 in an attempt to better reflect tumor biology.
Importantly, the Gleason score (and its associated Gleason grade group2) is central to risk stratification and the National Comprehensive Cancer Network guidelines5, which are widely used clinically to guide standardized patient management decisions. Despite its indisputable role in prognostication and patient management, Gleason scoring by pathologists is a subjective exercise and suffers from suboptimal interobserver and intraobserver variability, with reported Gleason score discordance ranging from 30-53%.6-14 A potential approach to increasing the consistency and accuracy of Gleason grading lies in the field of artificial intelligence, where recent advances have been applied productively to imaging diagnostic tasks across dermatology15,16, ophthalmology17-20, radiology21-23, and histopathology.24-29 Artificial intelligence systems have been developed for prostate cancer detection in needle core biopsies26 and Gleason grading of tissue microarrays27, which comprise carefully selected sub-regions of tumor specimens used for research purposes, outside of routine clinical workflow. These studies have not explored Gleason grading of the whole-slide specimens used for clinical diagnosis.
Expertise and consistency in Gleason scoring have been shown to significantly improve its prognostic utility.9,30 We thus reasoned that the availability of an accurate Gleason scoring tool for the whole-slide sections used in clinical workflows could help address the problem of grading variability, improve prognostication, and optimize patient management. To this end, we developed a DLS to perform Gleason scoring and quantitation on prostatectomy specimens.
We compared the DLS's accuracy against that of a cohort of pathologists, with the reference standard defined by genitourinary specialist pathologists. We further compared the risk stratification provided by our DLS, the cohort of pathologists, and our specialist-defined reference standard in predicting disease progression. Lastly, we explored the potential of artificial intelligence to provide more fine-grained measures of tumor grading and the resulting potential for more precise prognostication.

Overview of the Deep Learning System (DLS) and Data Acquisition.
Our approach is a 2-stage DLS: first, a deep convolutional neural network performs regional Gleason pattern (GP) classification; second, a k-nearest-neighbor model performs whole-slide Gleason grade group classification (Fig. 1). The first stage was trained using image patches extracted from the slide and the corresponding labels derived from pathologist-labeled pixel-level annotations (Fig. 1). In total, we collected and used 112 million image patches derived from 912 slides (approximately 115,000 mm² of tissue), which required approximately 900 pathologist hours to annotate. To our knowledge, this constitutes the largest pixel-annotated histopathology dataset, roughly 4x larger in annotated tissue area than the training slides in the widely used Camelyon16 dataset.24 The second stage was trained using 1,159 slide-level classifications provided by pathologists.
The DLS was evaluated on an independent validation dataset collected from three sources, consisting of 331 slides from 331 patients (Table 1). At least 3 pathologists provided initial reviews for each slide. A genitourinary specialist pathologist subsequently reviewed each slide along with the initial pathologists' comments to provide a final grade for use as the reference standard (Methods). Against this reference standard, the DLS achieved a diagnostic accuracy of 0.70, significantly higher than the mean accuracy of 0.61 among the cohort of 29 general pathologists (p = 0.002). We additionally examined three grade group (GG) decision thresholds: GG≥2, GG≥3, and GG≥4. The DLS achieved areas under receiver operating characteristic curves (AUCs) of 0.95-0.96 at each of these thresholds (Fig. 2c). The largest difference occurred at the GG≥4 threshold, where the DLS demonstrated both higher sensitivity and higher specificity than 9 out of 10 individual pathologists.

Comparison of DLS to pathologists on Gleason Pattern Quantitation.
In addition to the grade group, more granular reporting of the relative amounts of Gleason patterns is recommended by the International Society of Urological Pathology (ISUP), College of American Pathologists (CAP), World Health Organization (WHO), and recent publications. [31][32][33][34] As such, we also compared the DLS's accuracy in Gleason pattern quantitation to that of pathologists.
Relative to the genitourinary pathologist reference standard, the DLS had a 4-6% lower mean absolute error than the average pathologist for quantitation of patterns 3 and 4 (Fig. 3). In subgroup analysis, for slides in grade groups 2 and 3 (where the amount of pattern 4 can change the overall grade group), the DLS again achieved better quantitation (8% lower mean absolute error). The trend for grade groups 4 and 5 (where quantitation of pattern 5 is significant) was similar. More details are available in Supplementary Tables 7-8.
Insights from DLS Region-Level Classifications. Furthermore, we evaluated the DLS's ability to classify tissue regions within each slide, for which we collected exhaustive region-level annotations. Next, we characterized the DLS's predictions for regions where the pathologists were discordant by plotting the confidence score of the DLS for each category as a function of inter-pathologist agreement (Fig. 4b and Supplementary Fig. 2). For tissue regions where pathologists were concordant on Gleason pattern 3, discordant between 3 and 4, or concordant on Gleason pattern 4, the DLS's prediction scores changed smoothly with the pathologists' classification distribution. The same trend was seen moving from Gleason pattern 4 to 5. We further used the DLS's prediction scores directly to classify regions as fine-grained Gleason patterns (e.g. Gleason pattern 3.3 or 3.7). In doing so, the DLS was able to represent a more gradual transition from well-to-poor differentiation than allowed by the canonical coarse Gleason pattern buckets (Fig. 4c and Supplementary Fig. 3).

Measuring Effectiveness of Gleason Scoring in Risk Stratification for Disease
Progression. Lastly, we compared the ability of the DLS, the cohort of pathologists, and the genitourinary specialist pathologists (who provided the reference standard) to risk-stratify patients for biochemical recurrence or disease progression (see Methods). In this analysis, we measured prognostic performance using the c-index, an extension of the AUC that handles censored data in survival analysis. On the validation set, the DLS-predicted Gleason grade group achieved a c-index of 0.65. The pathologist-provided grade groups yielded a median c-index of 0.63 (see Methods), while the genitourinary specialist pathologists achieved a c-index of 0.69. Kaplan-Meier and hazard ratio analyses using a binary GG≥3 threshold to stratify patients into 'high risk' and 'low risk' categories (hazard ratios for GG3 have previously been shown to be three-fold higher than for GG2 (ref. 2)) showed the same trend (Fig. 5).
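As an illustration of the metric, the c-index for censored follow-up data can be computed with a simple pairwise procedure. The sketch below is a minimal Harrell-style implementation, not the study's code; the toy inputs are invented for illustration.

```python
import itertools

def c_index(risk, time, event):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable only if the subject with the shorter
    follow-up time actually experienced the event. The pair is
    concordant if the higher predicted risk goes with the shorter time
    to event; ties in risk count as 0.5."""
    concordant, comparable = 0.0, 0
    for i, j in itertools.combinations(range(len(risk)), 2):
        if time[j] < time[i]:       # order so that i has the shorter time
            i, j = j, i
        if time[i] == time[j] or not event[i]:
            continue                # not comparable under censoring
        comparable += 1
        if risk[i] > risk[j]:
            concordant += 1
        elif risk[i] == risk[j]:
            concordant += 0.5
    return concordant / comparable

# Toy example: higher predicted grade group tracks shorter time to event.
risk = [5, 4, 2, 1]        # e.g. DLS-predicted grade groups
time = [10, 20, 30, 40]    # months to event or censoring
event = [1, 1, 0, 1]       # 1 = progression observed, 0 = censored
print(c_index(risk, time, event))  # 1.0: all comparable pairs concordant
```

With perfectly anti-concordant risks the same call returns 0.0, so the metric behaves like an AUC restricted to comparable pairs.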
In addition to the risk stratification performance of GGs, we also used Cox models35 to evaluate the prognostic ability of the underlying quantified Gleason patterns. The c-indices of these models were 0.697 for the DLS, 0.674 for the cohort of 29 pathologists, and 0.690 for the specialist-defined reference standard. As proof of concept that finer-grained Gleason patterns can improve risk stratification, we also evaluated Cox-regression models trained on a more granular representation of the tumor pattern composition. Adding %GP3.5 to the canonical Gleason patterns (thus summarizing the tumor composition as %GP3, %GP3.5, %GP4, and %GP5) raised the c-index to 0.704. Further adding %GP4.5 resulted in a c-index of 0.702 (Supplementary Table 10).

Discussion
The present study shows that a DLS was more accurate than a cohort of 29 board-certified pathologists in Gleason scoring whole-slide images from prostatectomy patients.
Previous studies have highlighted the value of expertise in pathologic interpretation: central histologic reviews provided by pathologists experienced in genitourinary pathology improved prognostication relative to reviews provided by the local institution.9,30 Encouragingly, the risk stratification performance in this study (as measured by the c-index and hazard ratio) followed the same trend. Due to the importance of genitourinary expertise in pathologic review, a second review has been recommended for high-risk patients after prostatectomy and for needle biopsies prior to prostatectomy.8,9,36 In routine pathologic workflows, DLS-predicted Gleason scores could be computed on demand and serve as a decision support tool. Future research is necessary to evaluate the potential clinical impact of using these predicted Gleason scores for patient prognostication and associated therapy decisions.

Implications of DLS region-level pattern classifications and quantitation.
We further explored the implications of the DLS on each step of Gleason scoring and their respective scoring variability. The first aspect of Gleason scoring is the region-level classification of Gleason patterns across each slide. In this step, two-dimensional histologic examination of three-dimensional tissue structures creates inherent ambiguity. Substantial additional variability arises from applying discrete categorizations to glandular differentiation that lies on a continuous spectrum, such as the Gleason pattern 3/4 transition between small glands and poorly defined acinar structures, or the Gleason pattern 4/5 transition between fused glands and nests or cords.12,37,38 Our data show that for regions where pathologists are discordant in Gleason pattern categorization, where the underlying histology is likely closer to the cusp between patterns, the DLS reflects this ambiguity in its prediction scores (Fig. 4b) and demonstrates the potential to assign finer-grained Gleason patterns (Fig. 4c). This finer-grained categorization provides opportunities to mitigate variability stemming from coarse categorization of a continuum, and opens avenues of research for more precise risk stratification (see Supplementary Table 10).
The next step in Gleason scoring after region-level categorization involves visual quantitation of the relative amounts of each Gleason pattern to determine the most prevalent patterns. Quantitation also allows for more granular prognostication. For example, prior studies have shown that prognosis of grade group 2-3 patients worsened for increases of percent Gleason pattern 4 as small as 5-10%.34 As such, reporting of the quantitation of Gleason patterns is recommended.4,31,39 However, visual quantitation is associated with inherent subjectivity.40 In this regard, the DLS bypasses the variability introduced by visual quantitation through direct quantitation of Gleason patterns from its underlying region categorizations. The DLS's natural advantage in this regard and its more accurate quantitation than the cohort of pathologists (as measured by agreement with a specialist-adjudicated reference standard) suggest opportunity for more precise prognostication.
Relation to previous works. In clinical practice, pathologists can seek additional resources or order consults for challenging cases. To account for this, pathologists in this study were asked to indicate when they would prefer additional resources or consults to provide a more confident diagnosis. A corresponding sensitivity analysis excluding these cases is provided in Supplementary Table 9, showing qualitatively similar results.
Next, this study focuses on grading acinar prostatic adenocarcinoma (the vast majority of prostate cancer cases) in prostatectomy specimens, where the grade group informs postoperative treatment decisions rather than the decision to undergo the prostatectomy itself.
As such, clinical outcomes after prostatectomy are less confounded by divergent treatment pathways than biopsies, supporting analyses of correlations with clinical follow-up data. In addition, prostatectomy specimens contain more tissue than biopsies, providing greater context during histological examination and improving the quality of the reference standard. However, important future work will generalize and validate the DLS for biopsies, other histologic variants, and other prognostic categorizations to aid clinical decisions throughout prostate cancer treatment pathways. Lastly, validation on larger clinically annotated datasets is required to evaluate the statistical significance of trends associated with prognostication demonstrated in this work.

Conclusions
We have developed a DLS that outperforms a cohort of 29 generalist pathologists in Gleason scoring prostatectomy whole-slide images. Additionally, the DLS provides more accurate quantitation of Gleason patterns, finer-grained discretization of the well-to-poor differentiation spectrum, and opportunities for better risk stratification. In doing so, our DLS demonstrates the potential to enhance the clinical utility of the Gleason system for better treatment decisions for patients with prostatic adenocarcinoma.

Methods

Validation Dataset Exclusions. Slides were excluded from the validation set for several reasons: some contained histologic variants (listed in Supplementary Table 2), 2 had extensive artifacts or poor staining that hindered diagnosis, and 20 could not be confidently diagnosed by a genitourinary pathology specialist (Supplementary Table 3). The final validation dataset consisted of the remaining 331 slides (n=183 from TCGA, n=144 from the hospital, and n=4 from the laboratory).
Overview of Pathologists' Annotations and Reviews. A total of 35 pathologists reviewed slides for this study, all of whom completed residency in human anatomical pathology.
Twenty-nine pathologists were U.S.-board-certified (the "cohort of 29"), and another 3 had genitourinary specialization (1 Canadian-board-certified and 2 U.S.-board-certified). The remaining 3 pathologists were formerly board-certified or certified outside of North America, and provided annotations for the training and tuning datasets but not the validation dataset.
We collected slide-level reviews and region-level annotations from pathologists.
Slide-level reviews categorize each slide into its Gleason grade group. Region-level annotations label specific tissue regions (such as specific Gleason patterns) within a slide. We describe the annotation protocol for the validation dataset here, and include additional details and the protocol for the training and tuning datasets in the "Grading" section and Supplementary Figure 5 in the Supplement.
Collection of Slide-Level Reference Standard. The slide-level reference standard was used to validate the DLS's and the general pathologists' performance. For each slide, the reference standard was provided by one genitourinary specialist pathologist. To improve accuracy, the specialist reviewing each slide also had access to initial Gleason pattern percentage estimates and free-text comments from prior reviews by at least 3 general pathologists. The specialist then provided the final grade group, which served as the reference standard.

In the second stage of the DLS, we first obtained a categorical prediction for each region by taking the class with the highest calibrated likelihood, where calibration weights were determined empirically using the tuning set. Next, for each slide, the number of regions predicted as each category was summarized and used for evaluation of Gleason pattern (GP) quantitation (%GP3, %GP4, and %GP5). The three %GPs, together with the tumor involvement, were used as features (Fig. 1), similar to what a pathologist would need for Gleason scoring. Finally, we used these features to predict the slide-level grade group (see "Development of the Deep Learning System" in the Supplement).

For clinical follow-up analysis, the concordance index was used to measure the overall effectiveness of grade group risk stratification with respect to an adverse clinical endpoint (disease progression or biochemical recurrence, as described above). The hazard ratio and associated Kaplan-Meier curves were used to evaluate risk stratification at the binary classification of GG≥3. For these risk stratification analyses, the cohort-of-29 pathologists' grade group classifications were sampled to approximate equal representation of each pathologist (see "Statistical Analysis" in the Supplement). Analysis on the sampled classifications that produced the median concordance index and hazard ratio, respectively, among 999 sampling iterations is reported here.
Confidence intervals for all evaluation metrics were computed using a bootstrap approach (see "Statistical Analysis" in the Supplement).

Fig. 4 (legend, continued): See Supplementary Fig. 2 for a similar analysis on images with mixed-grade labels. c, The continuum of Gleason patterns learned by the DLS reveals finer categorization of the well-to-poorly differentiated spectrum (see "Fine-grained Gleason Pattern" section in the Supplement). Each displayed image region is the region closest (of millions in our validation dataset) to its labeled quantitative Gleason pattern. Columns 1, 4, and 7 represent regions for which the highest-confidence predictions are Gleason patterns 3, 4, and 5, respectively. The columns in between represent quantitative Gleason patterns between these defined categories. See Supplementary Fig. 3 for additional examples.

Acknowledgements. For software infrastructure, logistical support, and slide digitization services, we thank members of the Google AI Healthcare Pathology team. Lastly, we are deeply grateful to the pathologists who provided annotations for this study.

In the training and tuning datasets, 1-7 slides from each patient were used, and each slide was reviewed by 3-5 pathologists. Slides were excluded from training/tuning if any pathologist deemed the slide ungradable due to variants or poor image quality. Slide-level Gleason scores and region-level Gleason pattern annotations were collected for overlapping subsets of these slides, with the breakdown described in the table above.

Supplementary Table 3: Analysis of slides excluded from the validation set due to the genitourinary specialist's lack of confidence when diagnosing. 20 slides were excluded from the analysis in the main text because the specialist adjudicator was not able to provide a confident diagnosis. Consults were subsequently provided by the other two genitourinary experts.
Of the 12 cases where the original adjudicator and two consulting experts came to a consensus, the DLS was concordant on 9 (highlighted in green) and within 1 grade group on the remaining 3 (highlighted in red).

Supplementary

Grading

We derived the slide-level Gleason score (e.g. 3+4) from the predominant GP and the next-most-common GP. This was used instead of the directly provided Gleason scores because we noted inconsistent application of tertiary pattern replacement (replacing the secondary Gleason pattern with '5' if %GP5 is greater than 5%), leading to even greater diagnostic variability.2 The GG (e.g. GG2) was then determined directly from the Gleason score using the published definitions.2 Pathologists were additionally instructed to note if a slide contained histologic variants (listed in Supplementary Table 2), did not contain tumor, or if they were not confident in their diagnosis.

Pathologist Region-Level Annotation Protocol
The region annotations for all datasets (training, tuning, and validation) were performed using free-hand drawing tools in a custom histopathology viewer (see Supplementary Fig. 4) with the ability to zoom between magnifications. Most annotations were performed between 5X and 20X magnification. Artifacts that affected the ability to make a confident interpretation were labeled as artifacts, and regions where the pathologists were not able to assign confident categorizations based on their best clinical judgement were assigned a "consult" label. Regions where different GPs were either ambiguous or difficult to delineate exactly were assigned mixed-grade labels such as '3+4'. Perineural and lymphovascular invasive tumor and intraductal carcinoma were labeled as non-Gleason-gradable tumors.
For the training slides, at least one pathologist non-exhaustively annotated characteristic regions of each slide (annotated tissue per slide ranged from <1% to 100%; median 57%). For the tuning slides, we obtained higher-confidence labels by asking three pathologists for exhaustive annotations. In this set, to improve annotation efficiency (retaining slide-level diversity while reducing the overall annotation workload), the pathologists annotated only a subset of each slide: two 3.8×3.8 mm square regions from each quadrant of the slide. The locations of the two squares within each quadrant were randomly selected, and all three pathologists annotated the same eight regions (annotated tissue per slide ranged from <1% to 35%; median 14%). Only image patches with concordance between at least two annotators were used.
To train the stage-1 DLS, we processed the training dataset annotations to retain only regions with unambiguous labels; where overlapping annotations yielded multiple different labels, the label was resolved by majority vote.
Regions labeled 'artifact' were interpreted as non-tumor to reduce false positive predictions on artifact-containing regions. Regions labeled as 'mixed-grade' were interpreted as the primary pattern (e.g., '5+4' was interpreted as GP5), based on empirical observations of a resultant boost in stage-1 region-level accuracy. For the tuning datasets, only regions for which all three annotators provided a label were considered (similar to the validation dataset). In the main text, we report results only for patches labeled non-tumor, GP3, GP4, or GP5. The analysis of image patches labeled with mixed grades is presented in Supplementary Fig. 2.

Development of the Deep Learning System
We used an Inception-V33 image classification network, with fewer filters per layer (depth_multiplier=0.1) and modified to be fully convolutional to improve inference throughput on whole-slide images (manuscript under review). To avoid introducing grid artifacts, the fully-convolutional modification involved using 'VALID' instead of 'SAME' padding in convolutions and differential cropping of the output of 'branches' in the Inception architecture. This network takes as input image patches of size 911×911 pixels at 10X magnification (equivalent to 911×911 µm). The region "assessed" by the network is a 32×32 µm region centered in each image patch.
The training process involved feeding image patches into the network with a specific sampling strategy to avoid bias towards specific slides or classes: first select a class according to the ratios 4:2:2:1 for the four classes respectively, then select a slide containing regions labeled as that class, and finally select an image patch from that slide. To help improve generalization performance, we applied data augmentation techniques to randomly perturb the actual images seen by the neural network (image perturbations for saturation, contrast, brightness, hue, and orientation) during training.4 Training was performed in TensorFlow5 using an RMSProp optimizer6 and the softmax cross-entropy loss function.
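The hierarchical class-then-slide-then-patch sampling described above can be sketched as follows. This is an illustrative reconstruction, not the study's code; the class names and the structure of the training index are assumptions.

```python
import random

# Assumed class order for the reported 4:2:2:1 sampling ratios.
CLASS_RATIOS = {"non_tumor": 4, "gp3": 2, "gp4": 2, "gp5": 1}

def sample_patch(index, rng=random):
    """Hierarchical sampling to avoid bias toward large slides or
    prevalent classes: pick a class by the fixed ratios, then a slide
    containing that class uniformly, then a patch from that slide.

    `index` is an assumed mapping: class -> slide id -> list of patch ids.
    """
    classes = list(CLASS_RATIOS)
    weights = [CLASS_RATIOS[c] for c in classes]
    cls = rng.choices(classes, weights=weights, k=1)[0]
    slide = rng.choice(sorted(index[cls]))   # uniform over eligible slides
    patch = rng.choice(index[cls][slide])    # uniform over slide's patches
    return cls, slide, patch
```

Because the class is drawn first, a rare class like GP5 is sampled at its target rate regardless of how few patches carry that label.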
Hyperparameters such as the four-class sampling ratios, magnitude of image perturbations, the learning rate decay schedule, and L2 regularization decay were tuned via Gaussian-Bandit search on Google Vizier.7 After tuning model hyperparameters, hard negative mining and ensembling were employed to further improve model performance. See the "Hard-Negative Mining" section below for details.
After model convergence (as determined by the patch-level four-way classification performance on the tuning set, measured by Cohen's kappa), we applied ensembling at three levels. First, the network weights were smoothed using an exponential moving average with a decay constant of 0.9999. Second, for each patch, the model predictions across eight image orientations (four 90° rotations, each with and without a left-right flip) were averaged using the geometric mean. Lastly, these orientation-averaged predictions were again averaged across four independently trained models (each with a separate hard-negative mining process), again using the geometric mean.
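The orientation-level geometric-mean averaging can be sketched as below. `predict_fn` is a hypothetical stand-in for the trained stage-1 network, and the eight views follow the four-rotations-times-two-flips scheme described above; this is a sketch, not the study's code.

```python
import numpy as np

def eight_orientations(patch):
    """All four 90-degree rotations of the patch and of its
    left-right flip (8 views total)."""
    views = []
    for img in (patch, np.fliplr(patch)):
        for k in range(4):
            views.append(np.rot90(img, k))
    return views

def orientation_averaged_probs(predict_fn, patch, eps=1e-9):
    """Geometric mean of class probabilities over the 8 orientations,
    renormalized to sum to 1. `predict_fn` maps a patch to a vector of
    class probabilities."""
    probs = np.stack([predict_fn(v) for v in eight_orientations(patch)])
    gm = np.exp(np.mean(np.log(probs + eps), axis=0))
    return gm / gm.sum()
```

The geometric mean (rather than arithmetic) penalizes a view that assigns near-zero probability to a class, which makes the ensemble conservative about confident disagreements between orientations.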
In the second stage of the DLS, we first calibrated each region's class predicted likelihoods. The calibration weights were determined empirically to produce the best slide-level predictions on the tuning set. Next, to obtain a categorical prediction for each patch, we applied the argmax function. Finally, each slide's patch-level predictions were summarized as four features: %Tumor, %GP3, %GP4, and %GP5.
We linearly rescaled these features to have a minimum of 0 and a maximum of 1 in the training set, and trained a k-nearest-neighbor (kNN) model for each prediction task: 4-way GG classification (GG 1, 2, 3, or 4-5) and each of the three binary classifications GG≥2, GG≥3, and GG≥4. The hyperparameter "k" (number of nearest neighbors) and the neighbor-weighting method (uniform versus reciprocal of distance) were selected based on the performance of each model on the tuning set, as measured by kappa for GG classification and area under the receiver operating characteristic curve (AUC) for the binary predictions. Our final selected hyperparameters were k=24 with uniform neighbor weighting. In addition, we evaluated the performance of several other machine learning algorithms, such as logistic regression and random forest, on the tuning set. kNN was selected to avoid over-fitting given the limited size of the slide-level dataset and for ease of interpretability (as visualized in Fig. 1).
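A minimal stage-2 sketch, assuming min-max rescaling fitted on the training features (%Tumor, %GP3, %GP4, %GP5) and a uniform-weight kNN vote; NumPy only, and not the study's implementation.

```python
import numpy as np

def fit_minmax(X):
    """Min-max rescaling parameters fitted on training feature rows,
    mapping each column to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard constant columns
    return lo, span

def knn_predict(X_train, y_train, x, k=24):
    """Uniform-weight k-nearest-neighbor vote over slide-level feature
    vectors, as in the stage-2 grade group classifier."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    return int(np.argmax(np.bincount(y_train[nearest])))
```

Usage would rescale a new slide's features with the training-set parameters, `x = (features - lo) / span`, before the vote; with so few features and slides, the neighbor set itself doubles as an explanation of the prediction (the visual interpretation referenced in Fig. 1).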

Hard-Negative Mining
Our DLS stage-1 development process includes large-scale, continuous "hard-negative mining", which aims to improve algorithm performance by running inference on the entire training dataset to isolate the hardest examples and further refine the algorithm using these examples.
In hard negative mining, inference was run hourly by applying the partially-trained network to the entire training dataset (over 112 million image patches) for the entire duration of the training. These inference results were then used to alter the patch-sampling probabilities for every slide in the training set.
For a given class in each slide, these sampling probabilities were initialized at the start of training to be uniform across all image patches. After every inference round, the sampling probabilities were updated to be proportional to the cross-entropy loss of each patch, such that incorrect classifications were sampled more frequently. In other words, as training proceeded, the DLS learned from harder and harder examples, which improved its accuracy more efficiently than random examples. While previous works employing deep learning on histopathology images have employed hard negative mining in an offline "batch-mode" 8-10 , we observed that performance improves with the frequency of inference on the entire training dataset, resulting in the "quasi-online" hard-negative mining approach (>30,000 DLS stage-1 inferences per second) used here. We anticipate that the benefits of this continuous hard negative mining approach may be applicable to developing other deep learning algorithms on histopathology images as well.
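The per-slide probability update at the heart of this scheme can be sketched as below. Normalizing each patch's latest cross-entropy loss into a sampling distribution is a minimal reading of the description above, not the exact production logic.

```python
import numpy as np

def update_sampling_probs(losses):
    """Re-weight patch sampling within one slide's class: probabilities
    proportional to each patch's latest cross-entropy loss, so harder
    (higher-loss) patches are drawn more often in the next round."""
    losses = np.asarray(losses, dtype=float)
    total = losses.sum()
    if total == 0:                       # all patches perfectly classified
        return np.full(len(losses), 1.0 / len(losses))
    return losses / total
```

Because the weights are refreshed after every hourly inference pass, a patch that the model learns to classify correctly sees its sampling probability decay toward zero, while stubborn errors keep being revisited.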
For histopathology applications on whole-slide imaging, hard negative mining is a computationally expensive process, requiring inference over 112 million image patches in our training dataset.

Fine-grained Gleason Pattern (GP)
To provide a more quantitative GP that smoothly interpolates between the existing GPs (3, 4, and 5), we processed the calibrated DLS-predicted likelihood for each GP. First, the predictions for the two GPs with the highest confidences were used to interpolate between the two GPs using the formula likelihood₁ / (likelihood₁ + likelihood₂). For example, if the GP3/GP4/GP5 predictions were [0.7, 0.2, 0.1], then the computed value was 0.7 / (0.7 + 0.2) = 0.78, and the quantitative GP was 3 + 0.78 = 3.78. To visualize these quantitative GPs (e.g. in Fig. 4a), we used the International Commission on Illumination "Lab" (CIELAB) color space, which is designed to be perceptually uniform with respect to the underlying numerical values.
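The interpolation can be written directly from the formula and worked example above. Treating the lower of the two top patterns as the base of the interpolation is an assumption consistent with the 3 + 0.78 = 3.78 example, not a detail stated explicitly in the text.

```python
def fine_grained_gp(likelihoods):
    """Quantitative Gleason pattern from calibrated likelihoods for
    GP3, GP4, GP5 (in that order). The two highest-confidence patterns
    are blended via l1 / (l1 + l2), and the fraction is added to the
    lower pattern of the pair (an assumed reading that reproduces the
    worked example in the text)."""
    (l1, gp1), (l2, gp2) = sorted(zip(likelihoods, (3, 4, 5)),
                                  reverse=True)[:2]
    frac = l1 / (l1 + l2)
    return min(gp1, gp2) + frac

print(round(fine_grained_gp([0.7, 0.2, 0.1]), 2))  # 3.78
```

The result varies continuously as the likelihood mass shifts between adjacent patterns, which is what produces the smooth well-to-poor differentiation axis visualized in Fig. 4c.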
To select regions that represent desired quantitative GPs ( Fig. 4c and Supplementary Fig. 3), we located the image patches among all validation dataset slides for which the computed quantitative GP most closely matched the desired GP (e.g. 3.5).

Comparison with the Cohort-of-29
Comparison of the DLS with the cohort-of-29 pathologists required a modified permutation test11 to account for the different numbers of slide-level annotations provided by each pathologist. Specifically, 10 pathologists annotated all the slides (331 annotations each), while 19 pathologists collectively annotated all the slides 3 times (about 50±10 annotated slides per pathologist). The 10 pathologists that annotated all the slides were selected based on slide reviewing speed and availability. To represent each pathologist equally, we modified the permutation test as follows: we defined the test statistic as the difference between the DLS accuracy and the mean accuracy among pathologists in the cohort-of-29. In each iteration of the permutation test, for each slide, we randomly swapped the model's rating with one of the 14 ratings given for that slide (allowing the model to "swap" with itself with probability 1/14), and computed the test statistic on the result. After 5000 iterations, this gives a null distribution of the test statistic against which we compare the observed difference to compute a two-tailed p value.
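A compact sketch of the modified permutation test follows. For brevity it simplifies the cohort accuracy to a mean of per-slide pathologist accuracies (the text averages per pathologist), and all variable names are illustrative.

```python
import random

def permutation_p_value(dls_correct, path_correct, iters=5000, seed=0):
    """Modified permutation test: per slide, the DLS rating is randomly
    swapped into the pool of pathologist ratings (or left in place),
    building a null distribution for the accuracy difference.

    dls_correct[s]  : 1/0, whether the DLS was correct on slide s.
    path_correct[s] : list of 1/0 pathologist correctness on slide s.
    """
    rng = random.Random(seed)

    def stat(dls, paths):
        dls_acc = sum(dls) / len(dls)
        path_acc = sum(sum(p) / len(p) for p in paths) / len(paths)
        return dls_acc - path_acc

    observed = stat(dls_correct, path_correct)
    hits = 0
    for _ in range(iters):
        dls = list(dls_correct)
        paths = [list(p) for p in path_correct]
        for s in range(len(dls)):
            j = rng.randrange(len(paths[s]) + 1)  # +1: "swap with itself"
            if j < len(paths[s]):
                dls[s], paths[s][j] = paths[s][j], dls[s]
        if abs(stat(dls, paths)) >= abs(observed):
            hits += 1
    return hits / iters   # two-tailed p value
```

The swap step encodes the null hypothesis that the DLS rating is exchangeable with any of the human ratings for the same slide.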
In the risk stratification analyses, the cohort-of-29 pathologists' annotations were sampled to approximate equal representation of each pathologist. For each slide, an annotation was selected from one of the 10 available subgroup-of-10 annotations with probability 1/29 each, or from one of the 3 available subgroup-of-19 annotations with probability (19/29)×(1/3) each.
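The two-tier sampling can be sketched as below; the dictionary layout of the per-slide annotations is an assumed structure, not the study's data format.

```python
import random

def sample_cohort_annotation(slide, rng=random):
    """Sample one grade-group annotation per slide so that each of the
    29 cohort pathologists is equally represented in expectation.

    `slide` is an assumed dict with 10 full-coverage annotations
    ("subgroup_of_10") and 3 pooled annotations ("subgroup_of_19")."""
    # Each subgroup-of-10 annotation: probability 1/29 (10/29 total);
    # each subgroup-of-19 annotation: (19/29) * (1/3) (19/29 total).
    weights = [1 / 29] * 10 + [(19 / 29) / 3] * 3
    pool = slide["subgroup_of_10"] + slide["subgroup_of_19"]
    return rng.choices(pool, weights=weights, k=1)[0]
```

The weights sum to 1 by construction (10/29 + 19/29), so each pooled annotation stands in for the 19/3 pathologists it averages over.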

Bootstrap Approach for Confidence Intervals
To compute confidence intervals for the pools of 10, 19, and 29, we bootstrapped both slides and annotators by resampling both with replacement in each iteration of the bootstrap. In the case of the pool of 29, to replicate our experimental design in each iteration, we separately resampled the subsets of 10 and 19.
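A minimal percentile-bootstrap sketch under these assumptions follows; `metric` is a user-supplied function, and the data layout is illustrative rather than the study's code.

```python
import random

def bootstrap_ci(metric, slides, annotators10, annotators19,
                 iters=1000, alpha=0.05, seed=0):
    """Percentile bootstrap over both slides and annotators. Each
    iteration resamples slides with replacement and, mirroring the
    experimental design, resamples the subgroup of 10 and the subgroup
    of 19 annotators separately before recomputing the metric."""
    rng = random.Random(seed)
    stats = []
    for _ in range(iters):
        s = [rng.choice(slides) for _ in slides]
        a10 = [rng.choice(annotators10) for _ in annotators10]
        a19 = [rng.choice(annotators19) for _ in annotators19]
        stats.append(metric(s, a10 + a19))
    stats.sort()
    lo = stats[int((alpha / 2) * iters)]
    hi = stats[int((1 - alpha / 2) * iters) - 1]
    return lo, hi
```

Resampling annotators as well as slides widens the interval to reflect rater variability, not just case-mix variability.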

DLS Region-level Errors
Here, we present a qualitative analysis of the errors made by the DLS's first stage, at the region level. Several errors were related to spatial localization. For example, the spatial extent of each predicted Gleason pattern region was sometimes imprecise; if two tumor-containing regions were separated by a small strip of non-tumor tissue, the DLS would sometimes categorize the intervening non-tumor as tumor.
Similarly, delineating the precise stroma-tumor interface was difficult for the DLS, in particular between GP5 and stroma (non-tumor). This was likely because GP5 can present as individual tumor cells in a background of connective tissue, and outlining each individual cell was impractical. The resulting "impurity" of the underlying region-level annotations made it difficult to develop a DLS that was precise with respect to the boundary.
In many other cases, the errors made by the DLS occurred where the underlying histology was ambiguous, such as when a tangential cut into a GP3 region caused it to resemble the fused-gland pattern that defines GP4. Because the DLS was trained to interpret the image patch surrounding each region, it cannot take into account context beyond its input image patch.
The remaining region-level errors involved true prediction mistakes that may improve with additional training data. The second stage of the DLS is fairly robust to all of these errors because it summarizes the predictions from all regions on the slide as a small number of features.