Association of artificial intelligence-powered and manual quantification of programmed death-ligand 1 (PD-L1) expression with outcomes in patients treated with nivolumab ± ipilimumab

Assessment of programmed death ligand 1 (PD-L1) expression by immunohistochemistry (IHC) has emerged as an important predictive biomarker across multiple tumor types. However, manual quantitation of PD-L1 positivity can be difficult and leads to substantial inter-observer variability. Although the development of artificial intelligence (AI) algorithms may mitigate some of the challenges associated with manual assessment and improve the accuracy of PD-L1 expression scoring, the use of AI-based approaches in oncology biomarker scoring and drug development has been sparse, primarily due to the lack of large-scale clinical validation studies across multiple cohorts and tumor types. We developed AI-powered algorithms to evaluate PD-L1 expression on tumor cells by IHC and compared them with manual IHC scoring in urothelial carcinoma, non-small cell lung cancer, melanoma, and squamous cell carcinoma of the head and neck, for which manual scores had been prospectively determined during the phase II and III CheckMate clinical trials. In total, 1,746 slides were retrospectively analyzed, making this the largest investigation of digital pathology algorithms on clinical trial datasets performed to date. AI-powered quantification of PD-L1 expression on tumor cells identified more PD-L1–positive samples compared with manual scoring at cutoffs of ≥1% and ≥5% in most tumor types. Additionally, similar improvements in response and survival were observed in patients identified as PD-L1–positive compared with PD-L1–negative using both AI-powered and manual methods, while improved associations with survival were observed in patients with certain tumor types identified as PD-L1–positive using AI-powered scoring only. Our study demonstrates the potential for implementation of digital pathology-based methods in future clinical practice to identify more patients who would benefit from treatment with immuno-oncology therapy than are identified under current guidelines using manual assessment.

Digital pathology and artificial intelligence (AI)-powered approaches can aid pathologists in overcoming the challenges associated with manual scoring [16][17][18]. While AI-based methods have demonstrated moderate to high correlation with pathologist scoring in urothelial carcinoma (UC), melanoma (MEL), and breast cancer [19][20][21], studies directly comparing their performance in large randomized controlled trials using traditional response and survival endpoints are limited 22.
In this study, we developed unique AI-powered algorithms to retrospectively evaluate PD-L1 expression on tumor cells (TCs) across multiple tumor types, including samples from patients with non-small cell lung cancer (NSCLC), squamous cell carcinoma of the head and neck (SCCHN), MEL, and UC. The performance of AI-powered analysis was then compared with manual scoring of PD-L1 expression, prospectively generated as part of phase II and III clinical trials, at two different PD-L1 expression cutoffs in patients treated with nivolumab ± ipilimumab (NIVO ± IPI) [23][24][25][26][27][28].

Study procedures
Clinical assessments. Patient responses were assessed according to Response Evaluation Criteria in Solid Tumors v1.1 as previously described [23][24][25][26][27][28]. Responses were categorized as complete response, partial response, stable disease, progressive disease, or response not evaluable. Objective response rate (ORR) was calculated as the percentage of patients who achieved a complete or partial response; patients with stable disease, progressive disease, or a nonevaluable response were counted as nonresponders. Survival was assessed using overall survival (OS) for CheckMate 057, 275, 067, and 141, recurrence-free survival for CheckMate 238, or progression-free survival for CheckMate 026. For more information regarding survival endpoints in each clinical trial, refer to the Supplementary Methods, "Clinical assessments" section.
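As a point of reference, the ORR definition above reduces to a simple proportion. The following is an illustrative sketch only (not the trial software); the function and category labels are hypothetical.

```python
# Minimal, illustrative sketch of the ORR definition above (not the trial software).
# CR and PR count as responders; stable disease (SD), progressive disease (PD),
# and nonevaluable responses (NE) count as nonresponders.
def objective_response_rate(responses):
    """responses: iterable of RECIST v1.1 categories, e.g. 'CR', 'PR', 'SD', 'PD', 'NE'."""
    responses = list(responses)
    if not responses:
        return float("nan")
    responders = sum(r in ("CR", "PR") for r in responses)
    return 100.0 * responders / len(responses)

print(objective_response_rate(["CR", "PR", "SD", "PD", "NE", "PR"]))  # 50.0
```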
Sample preparation and biomarker assessment. Formalin-fixed, paraffin-embedded tissue slides were stained using the Dako PD-L1 IHC 28-8 pharmDx assay (Agilent, Santa Clara, CA, USA) per the manufacturer's instructions as part of the respective clinical trial [23][24][25][26][27][28]. PD-L1 TC expression was calculated as the number of TCs with complete circumferential or partial PD-L1 staining at any intensity divided by the total number of TCs. For more information regarding PD-L1 testing in each clinical trial, refer to the Supplementary Methods, "PD-L1 assessment in each clinical trial" section.
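Expressed as a calculation, the % TC score above is simply the fraction of stained TCs among all TCs. A minimal sketch follows; the names are hypothetical and this is not part of the 28-8 pharmDx assay or trial software.

```python
# Illustrative sketch of the % TC score described above: TCs with complete
# circumferential or partial PD-L1 staining at any intensity, divided by all TCs.
def pd_l1_tc_score(positive_tumor_cells: int, total_tumor_cells: int) -> float:
    if total_tumor_cells == 0:
        return float("nan")
    return 100.0 * positive_tumor_cells / total_tumor_cells

print(pd_l1_tc_score(positive_tumor_cells=37, total_tumor_cells=412))  # ~9.0 (% TC)
```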

Outcomes
Development of PD-L1 AI-powered scoring algorithms. To develop a deep-learning model that can generate an AI-powered PD-L1 expression score, whole slide images (WSIs) of PD-L1-stained slides were generated using the Aperio AT2 image-scanning platform (Leica Biosystems, Vista, CA, USA) at 0.5 microns/pixel resolution (20× objective). These WSIs were used to develop tumor-specific algorithms.
Board-certified pathologists from the PathAI network provided more than 250,000 cell-level annotations on a training set of digital WSIs from a mix of commercial and clinical trial biopsy samples from each tumor type stained for PD-L1 expression by IHC. These included 217 samples from patients with NSCLC, 600 from patients with MEL, 400 from patients with SCCHN, and 293 from patients with UC. Annotations defined PD-L1 expression on individual TCs and immune cells (ICs), including macrophages and lymphocytes. For SCCHN and MEL, deep-learning models were trained to recognize and quantify PD-L1-expressing TCs using these annotations while automatically excluding regions that would interfere with PD-L1 scoring, such as areas of background staining, anthracotic pigment, necrosis, areas of poor image quality, and, in the case of MEL samples, areas of melanin-filled macrophages (melanophages). For NSCLC and UC samples, the algorithms were instead trained to classify areas of background staining, anthracotic pigment, necrosis, and similar regions as negative for PD-L1 expression. Annotations for normal tissue, tumor parenchyma, and tumor stromal regions were also provided.
Outputs consisting of quantitative features summarizing slide-level PD-L1 expression on TCs were generated for each sample (AI-powered score). Tumor samples were then classified as PD-L1-positive or PD-L1-negative (as described in the previous section), using cutoffs of 1% and 5%. Quality control was performed by board-certified pathologists on tissue samples evaluated for PD-L1 expression. A sample was deemed evaluable if there were ≥100 viable TCs that were in focus and not obscured by artifact or background staining.
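To make the slide-level logic concrete, a minimal sketch of the classification and evaluability rules described above follows; names such as SlideResult and ai_tc_score are hypothetical and not part of the PathAI pipeline.

```python
# Hedged sketch of the slide-level rules described above: a sample is evaluable
# only if >=100 viable, assessable TCs are present, and evaluable samples are
# classified as PD-L1-positive or PD-L1-negative at the 1% and 5% cutoffs.
from dataclasses import dataclass

@dataclass
class SlideResult:
    viable_tumor_cells: int
    ai_tc_score: float  # AI-powered % TC score for the slide

def classify_slide(result: SlideResult, cutoff: float) -> str:
    # Evaluability: >=100 viable TCs in focus and not obscured by artifact/background.
    if result.viable_tumor_cells < 100:
        return "not evaluable"
    return "PD-L1-positive" if result.ai_tc_score >= cutoff else "PD-L1-negative"

slide = SlideResult(viable_tumor_cells=850, ai_tc_score=3.2)
print(classify_slide(slide, cutoff=1.0))  # PD-L1-positive
print(classify_slide(slide, cutoff=5.0))  # PD-L1-negative
```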
To ensure that the overall cell- and tissue-level AI classifications were appropriate, pathologists were asked to review PathAI heatmap overlays of regions of interest and to evaluate whether the algorithm accurately determined TC and IC PD-L1 expression. Each region of interest included tumor, intratumoral stroma, and peritumoral stroma, while areas containing crushed tissue or artifacts were excluded. Pathologists were given thresholds of PD-L1 TC expression to choose from (0-5%, 5-25%, 25-50%, >50%). The AI-powered score was marked as correct if it fell within the same threshold as the manual score or as incorrect if it did not. An overview of AI-powered and manual assessment of PD-L1 expression on TCs is provided in Fig. 1.
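The review rule above amounts to checking whether the AI-powered and manual scores fall into the same expression bin. A hedged sketch follows; how exact boundary values were assigned is not specified in the text, so the bin-edge handling below is an assumption.

```python
# Illustrative check of the review rule above: the AI-powered score is "correct"
# if it falls in the same PD-L1 TC expression bin as the pathologist's manual read.
# Bin edges and boundary handling are assumptions.
BINS = [(0, 5), (5, 25), (25, 50), (50, 100)]

def expression_bin(score: float) -> int:
    for i, (lo, hi) in enumerate(BINS):
        if lo <= score < hi or (i == len(BINS) - 1 and score == hi):
            return i
    raise ValueError("score outside 0-100")

def ai_score_correct(ai_score: float, manual_score: float) -> bool:
    return expression_bin(ai_score) == expression_bin(manual_score)

print(ai_score_correct(ai_score=12.0, manual_score=18.0))  # True (both 5-25%)
print(ai_score_correct(ai_score=4.0, manual_score=30.0))   # False
```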
Application of AI-powered scoring algorithm to test set. Model performance for each tumor type was assessed using an independent set of commercial and clinical trial-procured samples (distinct from the training set) that were stained for PD-L1 expression and digitized using the Aperio AT2 image-scanning platform. High-quality 150 × 150-pixel frames of subregions were defined from WSIs. Exhaustive annotations from five different pathologists from LabCorp (Burlington, NC, USA) were used to classify cell types and identify the absolute number of PD-L1-positive TCs in each frame. Additional details on samples used for training of the AI algorithm can be found in Supplementary Table 1. The median number of PD-L1-positive TCs was used to generate a consensus score. Agreement between the pathologist consensus score and the model-generated PD-L1 score was calculated using Pearson's correlation coefficients. An overview of this frame-based validation method is provided in Fig. 2.

AI-powered scoring was compared with manual scoring at cutoffs of ≥1% and ≥5%. For CheckMate 026, only patients with a PD-L1 expression level of ≥1% underwent randomization and were stratified according to a PD-L1 expression level of < or ≥5% 25.
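For the frame-based validation described above, the consensus score is the median of the five pathologists' counts per frame, and agreement with the model is summarized with Pearson's correlation. The sketch below uses made-up counts; the actual per-frame data are not reproduced here.

```python
# Sketch of the frame-level comparison described above (illustrative data only).
import numpy as np
from scipy.stats import pearsonr

# Rows = frames; columns = the five pathologists' PD-L1-positive TC counts (made up).
pathologist_counts = np.array([
    [12, 10, 14, 11, 13],
    [ 0,  1,  0,  0,  2],
    [45, 50, 42, 48, 44],
    [ 7,  9,  6,  8,  7],
])
model_counts = np.array([13, 1, 47, 6])  # hypothetical AI-derived counts per frame

consensus = np.median(pathologist_counts, axis=1)  # per-frame consensus score
r, p_value = pearsonr(consensus, model_counts)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```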

Assessment of AI-powered scoring algorithm across multiple scanners
The AI-powered algorithm, trained and validated as described in the previous "Development of PD-L1 AI-powered scoring algorithms" and "Application of AI-powered scoring algorithm to test set" sections, was used to assess PD-L1 expression on TCs using 20

Statistical analysis
Inter-scanner and inter- and intra-day precision. Average and standard deviation (SD) statistics were computed for PD-L1 expression within each group of images pertaining to the same slide scanned at distinct times with different scanners. Analysis of variance tests were performed to determine the significance of differences in % TC across days, times, and scanners. Coefficients of variation for % TC across all slides were estimated as the SD divided by the mean, multiplied by 100 ([SD/mean] × 100).
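The summaries above can be reproduced with standard statistical tooling. The sketch below uses hypothetical readings, with scipy's one-way ANOVA standing in for the analysis-of-variance tests, to illustrate the CV and significance calculations.

```python
# Illustrative precision summary: per-slide CV of % TC across repeated scans,
# plus a one-way ANOVA across scanners. Readings are made up.
import numpy as np
from scipy.stats import f_oneway

scanner_1 = np.array([22.1, 21.8, 22.5, 22.0])  # % TC for one slide, scanner 1
scanner_2 = np.array([21.9, 22.3, 22.4, 21.7])  # % TC for the same slide, scanner 2

all_reads = np.concatenate([scanner_1, scanner_2])
cv_percent = all_reads.std(ddof=1) / all_reads.mean() * 100  # (SD / mean) * 100
f_stat, p_value = f_oneway(scanner_1, scanner_2)  # one-way ANOVA across scanners

print(f"CV = {cv_percent:.1f}%, ANOVA p = {p_value:.3f}")
```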
Clinical outcomes. Association of PD-L1 expression on TCs with clinical efficacy was assessed using cutoffs of ≥1% and ≥5%, as evaluated by AI-powered and manual scoring. Kendall's tau coefficient was used to evaluate the correlation between AI-powered and manual scores within each trial. Odds ratios (ORs) were calculated using logistic regression to examine associations with objective response. Objective response predictions by AI-powered and manual scoring across all trials were assessed by plotting summary receiver operating characteristic curves and calculating the area under the curve (AUC) using metaROC in R, with fully non-parametric estimation and random effects 29. Hazard ratios were estimated using Cox proportional hazards models to examine associations with progression-free survival, recurrence-free survival, or OS. Kaplan-Meier curves were used to illustrate comparisons of survival in samples identified as positive by both AI-powered and manual scoring, additional samples identified as positive only by AI-powered scoring, additional samples identified as positive only by manual scoring, and samples identified as negative by both AI-powered and manual scoring.
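As an illustration of the outcome models named above (not the actual analysis code), the sketch below fits a logistic regression for the odds ratio of response by PD-L1 status and a Cox proportional hazards model for the hazard ratio, using a made-up data frame with hypothetical column names; the summary ROC/AUC step used metaROC in R and is not reproduced here.

```python
# Hedged sketch of the outcome models described above, on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "pd_l1_positive": rng.integers(0, 2, n),   # 1 = PD-L1-positive at a given cutoff
    "response": rng.integers(0, 2, n),         # 1 = objective response
    "time_months": rng.exponential(12, n),     # follow-up time
    "event": rng.integers(0, 2, n),            # 1 = event observed (e.g., death)
})

# Odds ratio for objective response by PD-L1 status (logistic regression).
logit = sm.Logit(df["response"], sm.add_constant(df["pd_l1_positive"])).fit(disp=False)
print("OR:", np.exp(logit.params["pd_l1_positive"]))

# Hazard ratio from a Cox proportional hazards model.
cph = CoxPHFitter().fit(df[["time_months", "event", "pd_l1_positive"]],
                        duration_col="time_months", event_col="event")
print("HR:", np.exp(cph.params_["pd_l1_positive"]))
```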

Validation of AI-powered scoring algorithm
Validation of the AI-powered scoring algorithm was conducted using a frame-based comparison of AI-powered scoring and pathologist-derived scoring of PD-L1 expression on TCs on WSIs. Using a combination of commercial and clinical tumor samples from patients with NSCLC, SCCHN, MEL, and UC (Supplementary Table 1), AI-based assessment correlated strongly with the median score from manual assessment of PD-L1-expressing TCs by five pathologists (r ranging from 0.73 to 0.85), and its variability fell within the range of the pathologists' scores (Supplementary Fig. 1).
We then compared the performance of the AI-based algorithm with manual scoring to evaluate prevalence of PD-L1-positive patients from multiple clinical trials.

Prevalence of PD-L1 expression by AI-powered and manual scoring
The algorithm tended to identify a higher prevalence of PD-L1-positive patients compared with manual scoring. This trend was observed across the majority of tumor types (Table 1) and was consistent across both the 1% and 5% cutoffs. In patients with NSCLC (CheckMate 057), UC (CheckMate 275), and MEL (CheckMate 067 and 238), the prevalence of PD-L1-positive patients increased by 5% to 39% and by 6% to 25% with AI-powered scoring compared with manual scoring at PD-L1 expression cutoffs of ≥1% and ≥5%, respectively. In patients with SCCHN (CheckMate 141), a lower prevalence of PD-L1-positive patients was seen with AI-powered scoring (42.5% and 28.8%) compared with manual scoring (54.9% and 34.0%) at cutoffs of ≥1% and ≥5%, respectively, though the difference was not significant (Table 1). This could be due to a number of factors, such as the presence of crush artifact or low PD-L1 membrane staining with cytoplasmic positivity (blush) (see Fig. 3 and Discussion). Given the observed trend toward higher prevalence with the AI-powered scoring algorithm, we assessed whether this affected prediction of treatment response.

Comparison of AI-powered and manual scoring as predictors of response
The combined sensitivity and specificity of AI-powered and manual scoring for predicting ORR was assessed for all trials and both PD-L1 expression cutoffs used in this study. AUC values derived from summary receiver operating characteristic curves were similar for AI-powered (AUC = 0.602) (Fig. 4A) and manual scoring (AUC = 0.596) (Fig. 4B), suggesting that the performance of each scoring method was similar in predicting ORR.

Association of PD-L1 expression with ORR
To assess the potential impact, including increased prevalence, of AI-powered scoring on ORR, we reanalyzed PD-L1 expression in each study at the predefined cutoffs of ≥1% and ≥5% using AI-powered assessment and directly compared the values with those obtained using manual scoring as part of the original trial. In general, the majority of OR point estimates suggested a slight increase in the association between ORR and patients identified as PD-L1-positive using AI-powered scoring compared with manual scoring in four out of five studies (Fig. 5).
In all but three studies, ORRs in patients identified as PD-L1-positive were similar regardless of whether the AI-powered or manual method was used, suggesting that AI-powered assessment can correctly identify PD-L1-positive patients who would respond to, and thereby benefit from, immuno-oncology therapy. The exceptions included CheckMate 057 (NSCLC), where patients identified as PD-L1-positive at cutoffs of ≥1% and ≥5% using AI-powered scoring had a lower ORR (21.1% and 25.5%) compared with manual scoring (28.3% and 32.5%) (Fig. 5). Likewise, in the NIVO + IPI arm of CheckMate 067 (MEL), ORR was higher when assessed by manual scoring (73.8%) compared with AI-powered scoring (60.9%) at a cutoff of ≥5%. Conversely, in CheckMate 141 (SCCHN), there was a slight increase in ORR in patients identified as PD-L1-positive using AI-powered scoring (20.0% and 25.0%) compared with manual scoring (16.7% and 21.2%) at cutoffs of ≥1% and ≥5% (Fig. 5). We then determined the impact of AI-powered scoring on survival outcomes as defined in each clinical trial.

Association of PD-L1 expression with survival
PD-L1 expression on TCs at cutoffs of ≥1% and ≥5% assessed using AI-powered or manual scoring was significantly associated with recurrence-free survival in NIVO-treated patients with MEL (CheckMate 238) (Fig. 6A). In patients with NSCLC (CheckMate 026) identified as PD-L1-positive by either AI-powered or manual scoring, progression-free survival was prolonged at a cutoff of ≥5% in patients treated with NIVO (Fig. 6B). Additionally, PD-L1 expression assessed by either method was significantly associated with OS in patients with NSCLC (CheckMate 057) and UC (CheckMate 275) at both cutoffs (Fig. 6C). In patients with MEL (CheckMate 067) treated with NIVO, both methods were significantly associated with OS at a cutoff of ≥1%, but not at the ≥5% cutoff. In the same trial, no association with OS was seen at either cutoff with either AI-powered or manual scoring in patients treated with NIVO + IPI (Fig. 6C). In patients with SCCHN, OS benefit was similar for PD-L1-positive patients identified by AI-powered and manual scoring (Fig. 6C). Across all tumor types and cutoffs, patients identified as PD-L1-positive by both manual and AI-powered scoring demonstrated improved associations with survival compared with patients identified as PD-L1-negative by both manual and AI-powered assessment. Additionally, patients identified as PD-L1-positive by AI-powered scoring alone demonstrated improved associations with survival compared with those identified as PD-L1-negative by both methods in some tumor types and clinical trials across different cutoffs (Supplementary Fig. 2).

Analytical precision of AI-powered scoring algorithm
Inter-scanner and inter- and intra-day reproducibility of the AI-powered scoring algorithm. To assess whether our algorithm can produce consistent results when WSIs are obtained from multiple slides scanned using different scanners, we evaluated inter-scanner precision. No significant variation in % TC values obtained from each slide scanned with either scanner 1 or scanner 2 was observed (Fig. 7). Additionally, mean % TC values did not significantly differ between the days on which slides were scanned (p > 0.05) or between different times on the same day (p > 0.05).

DISCUSSION
Recent approvals of immune checkpoint inhibitors with companion PD-L1 IHC assays in various tumor types demonstrate the increasing utility and widespread clinical use of PD-L1 testing to determine patients who may benefit from these therapies. However, classification and stratification of patients based on manual IHC methods may not always be reproducible, as a number of factors can create challenges for pathologists when scoring PD-L1 on TCs, such as heterogeneous PD-L1 expression within the tumor microenvironment and variable staining patterns in different cellular compartments (e.g., membrane vs. cytoplasmic staining), potentially leading to substantial inter-observer variability 6,15,17,30,31. As demonstrated by our frame-based validation method, the results of AI-powered scoring were comparable to manual assessment of PD-L1 expression on TCs and fell within the limits of inter-observer variability observed across pathologists. However, the reproducibility of AI-powered scoring can reduce inter-observer variability and subjectivity while potentially increasing sensitivity and specificity when scoring and interpreting stains 16. In studies using manual scoring as the reference standard, an AI-powered approach has been shown to increase inter-observer reproducibility and accuracy of biomarker scoring in breast cancer, NSCLC, and MEL samples, leading to better identification of patients who may benefit from ICI therapy 16,17,21,[32][33][34]. AI-powered scoring can also be applied to algorithms that include ICs, such as combined positive score 35. In these algorithms, PD-L1 IC expression can be difficult to assess reliably by eye, and pathologist concordance therefore tends to be lower 13,14,36,37. AI-powered scoring methods may thereby offer more precise and consistent results when defining PD-L1 expression on both TCs and ICs across multiple tumor types and cutoffs. However, despite these advantages, there is a reluctance to utilize digital pathology approaches in biomarker scoring and drug development, due to a lack of large-scale clinical validation studies in the oncology setting.

Fig. 4 Comparison of artificial intelligence-powered and manual scoring as a predictor of ORR across trials. A Artificial intelligence-powered scoring. B Manual scoring. Associations of each trial population at the 1% and 5% cutoffs with ORR are plotted as gray solid lines. The dotted line is the null reference, representing a classifier for which the association of PD-L1 expression with response is no better than chance. The fitted sROC curve and 95% confidence intervals are drawn in blue. For CheckMate 026, only patients with a PD-L1 expression level ≥1% underwent randomization and were stratified according to a PD-L1 expression level of < or ≥5%. No response data are available for the adjuvant CheckMate 238 study, which was therefore excluded from this analysis. AUC area under the curve, ORR objective response rate, PD-L1 programmed death ligand 1, sROC summary receiver operating characteristic.
Of relevance to the current study, associations of manually scored TC PD-L1 expression with clinical benefit of NIVO ± IPI have been studied across multiple tumor types and PD-L1 expression cutoffs, with varying results [23][24][25][26][27][38]. Given the development of AI-based IHC quantitation methods and their potential for scalability and use in routine clinical practice, we sought to evaluate the performance of an AI-based algorithm to quantify PD-L1 expression using samples from several pivotal trials evaluating NIVO ± IPI across multiple tumor types. In one of the largest sample sizes to date (n = 1,746), we assessed both AI-based and manual quantification of PD-L1 expression on TCs and compared their associations with response and survival.
We found that more patients with PD-L1 expression at cutoffs of ≥1% and ≥5% were identified by AI-powered scoring compared with manual scoring in patients with NSCLC, UC, and MEL. This increase in the measured prevalence of positive patients using the AI-based method is likely a result of multiple factors. The algorithm exhaustively analyzes and classifies every cell on the tissue image, thereby providing a highly precise measure of the true PD-L1 positivity on TCs. Although the algorithm is extensively evaluated for accuracy in cell classification, some level of misclassification is expected. In general, the observed discordances between manual and AI-powered scoring were associated with multiple factors. In certain scenarios, the model correctly identifies TCs but does not classify them as PD-L1-positive. These misclassifications could be due to factors such as the presence of clustered, membranous PD-L1-positive TCs overlapping with PD-L1-negative TCs or misclassification of PD-L1-positive ICs as PD-L1-positive TCs. In our frame-based validation analysis (Supplementary Fig. 1), we observed certain discordant frames with examples of both the model and the pathologists overestimating the number of PD-L1-positive TCs. Based on this analysis, such errors were relatively infrequent, and any sample with a large number of misclassifications was flagged during the quality control process.
Conversely, a higher prevalence of PD-L1-positive samples was identified by manual scoring compared with AI-powered scoring at both cutoffs in patients with SCCHN. Interpreting PD-L1 expression requires reproducibility across the spectrum of SCCHN differentiation. Manual assessment of PD-L1 expression in basaloid or poorly differentiated SCCHN tumors can be challenging, due to issues such as crush artifacts from tissue handling; such cases may be accurately identified as PD-L1-positive by manual scoring but misclassified as PD-L1-negative by the algorithm. Additionally, non-specific cytoplasmic blush staining coincident with weak membranous PD-L1-positive staining may lead to under-detection of membrane staining by the stringent AI model developed for SCCHN (examples can be found in Fig. 3). Another challenge pertaining to assessment of PD-L1 expression in moderately to well-differentiated SCCHN is the presence of keratinized, degenerate, and anucleate cells, which may be identified as PD-L1-positive by manual scoring but as PD-L1-negative by the algorithm. The model was intentionally trained to reduce false-positive detection due to these factors, with a consequent decrease in the detection of low membrane staining of PD-L1, especially in basaloid variant tumors. However, the algorithm identified the majority of responders in CheckMate 141, consistent with manual scoring, as demonstrated by the similar ORR in patients identified as PD-L1-positive using AI-powered scoring compared with manual scoring. This demonstrates the need for the development of algorithms optimized to account for morphological features unique to each tumor type. We then assessed clinical endpoints to determine whether the increase in the prevalence of patients identified as PD-L1-positive using AI-powered scoring was associated with clinical benefit.
Among evaluated patients with NSCLC, UC, and MEL, treatment response and survival were similar in patients identified as PD-L1-positive using either AI-powered or manual scoring, suggesting that the additional PD-L1-positive patients identified using AI-powered scoring had treatment responses and survival similar to those of patients identified as PD-L1-positive by both methods. AI-powered scoring of PD-L1 expression may therefore detect patients with PD-L1-positive tumors that express low levels of PD-L1, which may go undetected by manual scoring methods.
Finally, we conducted a separate analysis using our previously trained and validated algorithm to assess the reproducibility of AI-powered scoring of PD-L1 expression across different scanners, days, and times of day. No significant variations in the identification of PD-L1-positive TCs based on day, time of day, or scanner were identified. These results demonstrate the ability of the AI algorithm to overcome analytical factors that may occur during a typical workflow and to produce consistent and accurate results.
To our knowledge, this is the first study to develop and compare the ability of AI-based scoring and manual assessment to identify PD-L1 expression on TCs and its association with clinical efficacy in a large cohort of patients across various tumor types and multiple trials. Previous studies on single tumor types with small numbers of patients have also sought to compare digital and manual assessment of PD-L1 expression. Koelzer et al. sought to create a standardized digital protocol for the assessment of PD-L1 staining in MEL (n = 69) and to compare the output data and reproducibility with conventional assessment by expert pathologists. Consistent with our results, high correlation was observed between digital and manual assessment in MEL samples. Additionally, the image analysis protocol had high inter-reader reproducibility and reduced variability compared with manual assessment of PD-L1 expression 33. Another study compared the results of PD-L1 expression using combined positive score in samples from a small phase II trial in patients with gastric cancer (n = 39), as measured by digital image analysis and pathologist interpretation, and its ability to predict response to pembrolizumab. Similar to our findings, both methods were predictive of response to pembrolizumab in patients with gastric cancer. However, there are important differences between that study and ours, including the use of a small set of samples from one clinical trial and the inability of the image analysis tool to distinguish between PD-L1-positive TCs and ICs, which limit the ability to determine the respective role of each cell type in predicting response 39.
This investigation has limitations due to the retrospective nature of our treatment response and survival analyses. Additionally, since we sought to compare AI-powered scoring with manual scoring carried out as part of the original trials, the majority of which did not assess PD-L1 positivity in immune compartments, we limited our analysis to evaluating PD-L1 expression on TCs only. Therefore, our results cannot be extrapolated to other scoring methods or assays. However, our scoring algorithm has the potential to be used to determine staining in additional cell types 19,40 and warrants further study to include additional scoring methods that incorporate assessment of ICs, such as combined positive score, and application in additional tumor types.
Our study demonstrates that AI-powered quantification of PD-L1 expression on TCs identified more PD-L1-positive samples compared with manual scoring across several of the tumor types explored in this study, while demonstrating consistent associations with response and survival across multiple clinical trial datasets. Compared with manual scoring, our AI algorithm has the potential to identify more patients who may benefit from immuno-oncology therapy. The findings of our study could serve as a framework for incorporation of AI-powered scoring as a precise, reproducible, scalable, and exhaustive approach to quantifying PD-L1 expression on TCs in routine practice, leading the way for application in future prospective large-scale clinical trials.

DATA AVAILABILITY
Any additional data not included in the manuscript or supplementary files that support the findings of this study are available from the corresponding author VB.