Automated tumor proportion score analysis for PD-L1 (22C3) expression in lung squamous cell carcinoma

Programmed cell death ligend-1 (PD-L1) expression by immunohistochemistry (IHC) assays is a predictive marker of anti-PD-1/PD-L1 therapy response. With the popularity of anti-PD-1/PD-L1 inhibitor drugs, quantitative assessment of PD-L1 expression becomes a new labor for pathologists. Manually counting the PD-L1 positive stained tumor cells is an obviously subjective and time-consuming process. In this paper, we developed a new computer aided Automated Tumor Proportion Scoring System (ATPSS) to determine the comparability of image analysis with pathologist scores. A three-stage process was performed using both image processing and deep learning techniques to mimic the actual diagnostic flow of the pathologists. We conducted a multi-reader multi-case study to evaluate the agreement between pathologists and ATPSS. Fifty-one surgically resected lung squamous cell carcinoma were prepared and stained using the Dako PD-L1 (22C3) assay, and six pathologists with different experience levels were involved in this study. The TPS predicted by the proposed model had high and statistically significant correlation with sub-specialty pathologists’ scores with Mean Absolute Error (MAE) of 8.65 (95% confidence interval (CI): 6.42–10.90) and Pearson Correlation Coefficient (PCC) of 0.9436 (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p < 0.001$$\end{document}p<0.001), and the performance on PD-L1 positive cases achieved by our method surpassed that of non-subspecialty and trainee pathologists. Those experimental results indicate that the proposed automated system can be a powerful tool to improve the PD-L1 TPS assessment of pathologists.

www.nature.com/scientificreports/ (IHC) assays has been approved to be associated with remarkably longer overall survival and with fewer adverse events on NSCLC [10][11][12] . The Dako PD-L1 IHC 22C3 and 28-8 pharmaDx assays employ tumor proportion score (TPS) to measure PD-L1 expression. The TPS is determined as the percentage of PD-L1 positive stained tumor cells (TCs) with at least partial membrane staining relative to the total number of TCs, excluding tumor-associated interstitial cells (ICs), necrotic, normal or non-neoplastic cells from the evaluation: Therefore, the pathological assessment of tumoral PD-L1 expression is based on two semi-quantitative information of PD-L1 positive tumor cell number and total viable tumor cell number.
In clinical practice, pathologists determine the TPS by microscopic examinations. Specifically, when the slide has a single PD-L1 positive tumor area, the TPS is determined as the product of the % of positive staining area and the % of positive TCs in the area. Whereas for the slide with heterogeneous tumor areas, TPS is calculated by averaging the stained tumor cell percentages of several divided tumor areas 13 . Manually assessment is obviously time-consuming and subjective. It is hardly possible for pathologists to conduct visually quantitative analysis based on the whole slide images (WSI), which may contain millions of cells. Furthermore, imprecise and subjective definition of stain intensity makes the assessment even more difficult to ensure the inter-and intra-observer reproducibility.
With the advent of whole slide imaging scanners and dramatic improvements in computer vision algorithms, an increasing number of clinicopathologic Computer Aided Diagnosis (CAD) systems have been proposed 14 . Previous works have shown promising performance in many straightforward tasks, such as metastasis detection 15 , tumor region detection 16,17 , and cell detection 18 , etc. Researchers also introduced high level assessment system based on comprehensive information. Liu et al. proposed an end-to-end deep learning framework for automatic histochemical-score assessment for breast cancer tissue microarray 19 . A deep reinforcement learning based model was introduced for HER2 scoring, which was able to select diagnostically relevant regions 20 . Christiansen et al. showed their pioneering research of predicting fluorescent labels from transmitted-light images of unlabeled samples 21 using supervised machine learning techniques. Inter-and intra-observer concordances are essential and widely used for CAD system evaluation. Luo et al. compared their proposed gastrointestinal artificial intelligence diagnostic system with three endoscopists of varying degrees of expertise 22 . Eleven pathologists of four experience levels were involved to assess the deep learning assistant system 23 .
Given the recent advances in the field of deep learning and computer vision, researchers have proposed several deep learning based frameworks for automatic PD-L1 scoring 24,25 . However, prior literature has focused on the methodology. In this study, we introduced computer aided TPS system named Automated Tumor Proportion Scoring System (ATPSS) for PD-L1 tumor proportion score assessment on lung squamous cell carcinoma slides. The system was designed to mimic the scoring process of the pathologists by only using PD-L1 stained IHC digital slides as input. We first constructed a tumor image and a positive tumor image, which were produced by deep learning based model of tumor area segmentation and image processing based module of positive membrane detection. The numbers of total TCs and PD-L1 positive TCs were counted on those two images by nuclei detection model for TPS calculation. To get a better understanding of AI aided PD-L1 TPS system, we conducted a multi-reader multi-case study. The main contribution of this paper is threefold and summarized as follows: • To the best of our knowledge, this is the first comprehensive study involving different experience level pathologists and different PD-L1 expression level cases for the automatic TPS system evaluation. • The proposed system was designed with recently published deep learning techniques and trained with the simplest labels of tumor region and tumor nuclei center. • The results demonstrated that the ATPSS achieved promising results on PD-L1 positive cases, and could significantly improve the assessment accuracy of pathologists with lower levels of relevant expertise.

Materials and methods
The study was approved by the institutional review boards of the participating institution, i.e., the Fudan University Shanghai Cancer Center. Informed consent was obtained from all subjects. All clinical data and digital slides used were anonymized. Methods and experiments in this study were performed in accordance with relevant guidelines and regulations.

Materials.
We collected two PD-L1 IHC WSI datasets, i.e., algorithm development cohort and evaluation cohort, for lung squamous cell carcinoma. Both were obtained from Fudan University Shanghai Cancer Center from 2018 to 2020. For each case in those two datasets, newly cut sections from formalin-fixed paraffin-embedded samples obtained by surgical resection and biopsy were used to avoid degradation of PD-L1 protein due to long-term archiving, so as to ensure the reliability of the immunohistochemical results for PD-L1 expression. It is noted that all cases in evaluation cohort were from surgical resection, while biopsies were collected for algorithm development. Two serial sections were cut from each tissue for Haematoxylin and Eosin (H&E) and PD-L1 IHC. H&E sections were stained using the Sakura Tissue-Tek Prisma staining machine (Sakura Prisma-J2S). PD-L1 IHC slides were performed on the Dako Autostainer Link 48 platform according to the automated staining protocol with the PD-L1 22C3 antibody. All slides were digitized by KFBIO FK-Pro-120 slide scanner. www.nature.com/scientificreports/ The algorithm development set consists of 45 PD-L1 WSIs. According to our proposed algorithm, we extracted two different patch image datasets of TumorSeg and NucleiDetect for tumor region segmentation module and nuclei detection module respectively. TumorSeg consisted of 22,000 patch images of size 512 × 512 captured at 20× optical magnification ( 0.475 µm/pixel) with pixel-wise labels on tumor region. NucleiDetect contained 4600 patch images of 256 × 256 captured at 40× optical magnification, where nuclei centers of tumor cells were manually marked. The patch image ratios of PD-L1 expression negative to positive in both datasets were 3:7. Manual annotations were conducted by two experienced pathologists and four graduate students of pathology. Both datasets were randomly split into training and validation sets for ten-fold cross-validation.
A total of 51 tumor tissues from lung squamous cell carcinoma patients were collected as the evaluation cohort in this study. Inclusion criteria were that tissues contained a sufficient number of viable TCs for PD-L1 IHC testing, and the PD-L1 tumor expression of the evaluation cohort were balanced in three PD-L1 expression levels of < 1% , 1-49%, and ≥ 50% . Detailed patient demographics and PD-L1 results are summarized in Table 1.
Automatic TPS algorithm development. We proposed an artificial intelligence based algorithm combining both deep learning and image processing techniques for PD-L1 TPS assessment. Only PD-L1 stained IHC WS image was utilized as input. The system produced two intermediate results, i.e., tumor mask and positive stain mask, and the final TPS estimation by counting tumor nuclei on those two masked regions. The overview of the automatic TPS prediction framework is shown in Fig. 1. The framework consists of three main modules: We employed the fully convolutional network (FCN) architecture named Res50-UNet to detect and segment the tumor area. The model was built on U-Net 26 , which has a symmetric encoder-decoder architecture with skip connection between downsampling and upsampling paths. We utilized the pre-trained Resnet-50 27 as the encoder, and removed the striding in the last two blocks and applied dilated convolution with rate of 2 and 4. In addition, Atrous Spatial Pyramid Pooling (ASPP) was utilized between encoder and decoder to capture contextual information at multiple scales. The ASPP module contained four parallel atrous convolutions with increasing dilation rates of [1,2,4,7].
The proposed positive membrane detection module does not require manual annotation. The DAB channel image I DAB was first extracted from input original RGB images using color deconvolution 28 . In order to detect the positive stained membrane, we applied a difference of Gaussian (DOG) filter on the DAB channel image, which is then processed by luminance weighted thresholding 19 . Specifically, the Luminance Adaptive Multi-Thresholding (LAMT) 29 was first utilized to classify the positive stained pixels. We assigned luminance values to positive pixels for later thresholding. The idea was that the luminance instead of the value of I DAB could correctly describe the stain intensity. Finally, the positive stained region mask was obtained by post-processing with morphological operations and hole filling. The morphological operations are open and close operations, which arm to eliminate small dot noise and complete the partial membrane respectively. The positive stained cells are determined that if cells fall within the positive stain mask.
The tumor cell counting module was implemented by a nuclei detection model of Micro-Net 30 , which also resembled U-Net. The network was characterized by multi-resolution input and output for extracting multi-scale

Results
Intermediate results of the proposed ATPSS. Comparison of ATPSS to pathologists on WSIs TPS assessment. In order to evaluate the impact of the automatic TPS assessment system, we designed a multi-reader multi-case study. Six pathologists involved www.nature.com/scientificreports/ in this study were divided into three groups based on their experience: pathology trainees (2), non-subspecialty pathologists (2), and subspecialty pathologists (2). Subspecialty and non-subspecialty pathologists have experiences of PD-L1 TPS assessment, while pathology trainees had been trained before the experiment. All pathologists can access the corresponding H&E slide during the assessment, while ATPSS only used PD-L1 IHC whole slide image for prediction. The ground truth TPS results of the evaluation cohort were from the pathology reports, which were interpreted by two subspecialty pathologists blinded to the clinical data under a light microscope (Olympus BX43). We evaluated the results using the 3-class classification (PD-L1 expression levels of negative [<1%], low expression [1-49%], high expression [ ≥ 50%]) accuracy (ACC). Meanwhile, Mean Absolute Error (MAE) and the Pearson correlation coefficient (PCC) between the ground truth and the TPS predicted by ATPSS and pathologists were used as the evaluation metrics. As a reference, the MAE and PCC between the predicted TPS given by the two pathologists with the same experience level were also calculated.
The average accuracy for PD-L1 expression level classification of all 6 pathologists was 84.3%, and the pathologist experience level had a significant effect on TPS assessment accuracy. Among the pathologist subgroups, subspecialists was unsurprisingly accurate on all three expression levels with a classification accuracy of 97.06% and an MAE of 4.17(95% CI 3.13-5.20). The within-group correlation between two sub-specialists is 0.96 ( p < 0.001 ). Whereas the ACC and MAE of non-subspecialists were 84.30% and 7.91 (95% CI 5.84-9.99), and those of trainees were 71.55% and 11.22 (95% CI 8.77-13.68) respectively, which had a significant accuracy drop compared to subspecialists (see Table 2). The within-group correlation of non-subspecialists and trainees were 0.86 ( p < 0.001 ) and 0.90 ( p < 0.001 ) respectively, and also demonstrated unreliability.

Discussions
In this study, we developed an automatic TPS assessment algorithm named ATPSS for IHC PD-L1 whole-slide images of squamous cell lung cancer patients, which is highly helpful to improve pathologists' diagnosis accuracy and efficiency. The ATPSS utilized widely proved deep learning models and image processing techniques, and the intermediate results were calculated according to pathologists' assessing process.
According to the results of ATPSS and the histopathological manifestations of the cases, we can conclude that ATPSS was able to accurately distinguish tumor parenchyma, interstitium, most infiltrating immune cells and tissue cells in lung squamous cell carcinoma, and could precisely identify PD-L1 membrane-positive tumor cells from histiocytes, which showed a high level of agreement with the ground truth. In addition, the performance of tumor region segmentation was consistent on the IHC images with different PD-L1 expression levels. With the uniform thresholds of DAB stains, the positive membrane detection module showed more precise results than pathologists.
However, ATPSS easily mis-classified the PD-L1 negative cases with TPS <1% as low PD-L1 expression of TPS 1-49% (see Fig. 5). Among the 11 cases that were mis-classified as level of 1-49%, most of the cases (8/11,  The last three negative PD-L1 expression cases were assessed with TPS ranging from 15.78 to 27.8% by ATPSS. There were two reasons leading to the large deviation. Firstly, poor tissue processing and staining quality, such as poor immunohistochemical staining, peeling off, tissue-fold, etc, lead to the poor elution of second antibodies and chromogenic reporter, resulting in non-specific staining in many areas, which directly affected the performance of tumor region detection and tumor cell detection. Secondly, those cases had common characteristics, i.e., PD-L1 positive non-tumor cells are closely intermixed with tumor nests, or distributed in clusters on the edge and inside of tumor nests. Furthermore, those cases with large deviation usually showed complex histologic architecture due to poor differentiation, such as poorly cohesive pattern and single cell stromal invasion. Such complex architecture and mixed immune infiltration pattern results in identification mistakes, e.g. mis-recognizing positive non-tumor cells as positive tumor cells.
It should be noted that similar problems of the non-specific staining and complex cancer tissue structure also occurred in the positive expression slides. Nevertheless, such incorrectly identified area was significantly smaller than the whole tumor area, which would not have a great influence on the final TPS assessment. The main causes of large assessment error by ATPSS on positive PD-L1 expression cases were low image quality, such as out-of-focus, tissue-fold, and unclear nuclei caused by strong DAB staining. Therefore, it is of great significance to control the quality of samples submitted for inspection in future works.
The study has several limitations. From the perspective of model development, the unbalanced development cohort was also a possible reason of poor accuracy on PD-L1 negative cases. Although two datasets (e.g., Tumor-Seg and NucleiDetect) had a sufficient amount of PD-L1 negative patch images from PD-L1 positive cases, the morphological particularity of the PD-L1 positive cases were ignored. Therefore, a balanced development cohort may be one of the possible solutions to this problem. The ground truth of the evaluation cohort was assessed manually by pathologists without accurate quantification of tumor areas or tumor cells. As a result, the TPS values have a quantitative step-size of 5. If a WS image was assessed at 70%, it means that the image has a value of around 70%. However, as a fully automatic and quantitative assessment system, the ATPSS predicted TPS at cell level. Therefore, in the strict sense, the ground truth cannot be used as the gold standard for validating ATPSS performance. More accurate ground truth can be obtained by summarizing several experienced pathologists' results or manually correction of ATPSS's intermediate results with computer assisted system. According to the suggested methods for determining TPS provided by 'PD-L1 IHC 22C3 pharmDx Interpretation Manual -NSCLC' 13 , the TPS assessment on the whole slide images are based on selected tumor areas. Therefore, instead of whole slide image assessment, the tile-based assistant experiment (e.g. pathologists first visually divide the tumor area into tiles with equal size, and select several tiles for TPS prediction) may be closer to the clinical diagnosis process. Finally, our study was limited to the cases from a single hospital of Fudan University Shanghai Cancer Center. A further study incorporating larger datasets from multiple hospitals, digitized with different scanners, as well as more pathologists with border experience levels will be necessary to validate the automatic tumor proportion scoring system. Future work can further develop this research in aspects: biopsy ability, adenocarcinoma support, and computer assisted system. First, routine diagnostics of NSCLC are typically needle biopsy. To develop the ability on biopsy, a specialized dataset should be collected for training the tumor area segmentation model, and the model should also have a larger receptive field for extracting long range morphological information. Second, in this work, we only study the squamous cell carcinoma of NSCLC. Adenocarcinoma has a more complex morphology of tumor compared to squamous cell carcinomas. Our preliminary study found that manual annotation is www.nature.com/scientificreports/ time consuming and has serious inter-difference between different pathologists. To solve this, future research can employ manual correction based on the transfer result of the tumor area segmentation model of ATPSS to accelerate the annotation process. Finally, future research can further develop computer assisted system which combines the results of artificial intelligence based algorithm and pathologists for more accurate TPS assessment.
In summary, we have developed computer-aided system using IHC PD-L1 whole slide images of lung squamous cell carcinoma for automatic tumor proportion scoring. ATPSS was able to achieve high assessment accuracy on positive PD-L1 expression cases and significantly surpass to that of non-subspecialists and trainees. In doing so, we show the potential of automatic or computer assisted tumor proportion scoring to improve the effectiveness and accuracy of quantitative PD-L1 assessment.