Introduction

Convolutional neural networks (CNNs) are state-of-the-art machine learning techniques that have led to many breakthroughs in image classification1,2, object detection3,4, and segmentation5,6. Applications of CNNs to medicine have improved the performance of computer-aided diagnosis7,8. The performance of CNNs can be enhanced by training on massive datasets. Although assembling massive manually labeled datasets is highly time-consuming, models trained on such datasets have performed comparably to expert clinicians. For example, the diagnostic performance of a CNN model trained on 130 K fundus images was comparable to that of expert ophthalmologists in diagnosing diabetic retinopathy7. Moreover, a CNN model trained on 130 K dermoscopy and patients' skin images was as accurate as dermatologists in distinguishing skin carcinoma from benign lesions8.

CNNs can be used to develop fully automatic pathologic diagnosis systems. Although it would be ideal to examine the entire area of a specimen with a light microscope, closely examining every region of each specimen is impossible in real clinical settings; pathologists therefore alter the magnification to reach a final diagnosis. Moreover, pathologic images are very complex, and eye fatigue reduces diagnostic accuracy over time. In addition, subjective evaluations may be susceptible to inter- and/or intra-observer variations. These drawbacks may be overcome by CNNs. For example, CNNs have been applied to giga-pixel immunostained images to detect breast cancer metastases in sentinel lymph nodes9,10,11 and prostate cancer in biopsy specimens11,12. CNN models have also been applied to immunostained images to detect brain and colon cancers13,14.

The demand for kidney transplants is increasing worldwide. Renal biopsy is the gold standard for the evaluation of allograft rejection. Deposition of C4d in peritubular capillaries (PTCs), the tiny blood vessels surrounding renal tubules, is an established marker of antibody-mediated allograft rejection15. The C4d score, defined as the proportion of C4d positive PTCs on immunostaining16, is one of the most important factors in the diagnosis of antibody-mediated rejection. Ideally, the C4d score should be determined by counting all C4d positive and negative PTCs. In practice, however, pathologists cannot quantify all PTCs, as microscopic evaluation of PTCs is too time-consuming, poorly reproducible, and labor-intensive. Pathologists must therefore visually estimate the proportion of C4d positive PTCs. However, they may overlook some microscopic foci or inaccurately estimate the proportion of C4d positive PTCs, and their estimates may be susceptible to inter- and/or intra-observer variations17,18,19. Because automated PTC counting may result in a more accurate diagnosis, deep learning studies using CNN models are needed to support the diagnosis of allograft rejection in kidney transplant recipients.

To develop a clinically applicable system that identifies regions of interest and detects C4d positive and negative peritubular capillaries in giga-pixel immunostained slides, we proposed and evaluated three components: more efficient deep-learning-assisted labeling; a detection model enhanced with pathologists' insights through enlarged masks; and a fully automated system combining CNN-based classification and detection, mirroring pathologists' routine workflow, to predict renal allograft rejection. The overall procedure is described in Fig. 1. The system scans digital images from immunostained pathologic slides and removes background areas by Otsu's thresholding20. Histogram equalization is then applied to reduce variations such as illumination or degree of staining. After selecting all candidate regions of interest (ROIs) with sufficient tissue in each slide, the CNN model classifies ROIs as feasible or non-feasible and detects all C4d positive and negative PTCs in feasible ROIs to determine C4d scores. To train the CNN detection model, enlarged masks with margins of a certain size are used as input, so that each enlarged mask includes neighborhood information, such as renal tubules, present near the PTCs. Deep-learning-assisted labels are generated on an independent dataset by confirming results produced by a detection model trained on manually labeled data. The effectiveness of the enlarged masks and of deep-learning-assisted labeling was assessed by comparing FROC scores. The CNN detection models trained with enlarged masks, with margin sizes of 50 and 40 pixels for C4d positive and negative PTCs respectively, performed better than those trained without enlarged masks. In comparisons of deep-learning-assisted labeling, CNN detection models trained with deep-learning-assisted labels, alone or combined with manual labels, performed better than the model trained with manually labeled data only.
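
The preprocessing steps above can be sketched as follows. This is a minimal illustration using OpenCV; the tissue-fraction threshold is an assumed parameter, not a value reported in the paper.

```python
# Minimal sketch of slide preprocessing: Otsu background removal plus
# histogram equalization. Threshold values are illustrative assumptions.
import cv2

def preprocess_slide_region(rgb):
    """Remove background via Otsu's thresholding and equalize the histogram."""
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    # Otsu's method picks a global threshold separating tissue from background.
    _, tissue_mask = cv2.threshold(gray, 0, 255,
                                   cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Equalize the luminance channel only, to reduce illumination/staining
    # variation while preserving color information.
    ycrcb = cv2.cvtColor(rgb, cv2.COLOR_RGB2YCrCb)
    ycrcb[..., 0] = cv2.equalizeHist(ycrcb[..., 0])
    equalized = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2RGB)
    return equalized, tissue_mask

def has_sufficient_tissue(tissue_mask, min_tissue_fraction=0.33):
    """Keep a candidate region only if it contains enough tissue (assumed cutoff)."""
    return (tissue_mask > 0).mean() >= min_tissue_fraction
```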

Figure 1. Overall procedure of our proposed method.

Materials and Methods

The institutional review board for human investigations at Asan Medical Center (AMC) approved the study protocol with removal of all patient identifiers from the images, and waived the requirement for informed consent, in accordance with the retrospective design of this study. A total of 380 needle biopsies of renal allografts were obtained from patients who underwent renal transplantation at AMC from 2009 to 2016; all samples had been stored in the Department of Pathology.

To obtain representative samples, cases from this period were randomly selected without regard to specific pathologic diagnoses. Consequently, 108 C4d positive and 272 C4d negative cases were retrieved, including 46 zero-day allograft biopsies. Two pathologists meticulously reviewed all slides and corrected false negative results; finally, 189 cases were classified as C4d positive and 191 as C4d negative. C4d was assessed immunohistochemically using a Ventana BenchMark XT autostainer (Ventana Medical Systems, Tucson, AZ, USA), and all 380 whole slides were imaged using a digital slide scanner (Pannoramic 250 Flash, 3DHISTECH, Budapest, Hungary) with a 20× objective lens (specimen-level pixel size, 0.221 × 0.221 μm). All samples were anonymized before analysis and labeling.

The 380 slides were divided into subsets 1 and 2, consisting of 200 and 180 slides, respectively, to validate the feasibility of deep-learning-assisted labeling. Subset 1 was used to train the CNN models for classification and detection, and subset 2 was used to validate them. The slides in each subset were randomly divided into training (60%), test (20%), and validation (20%) sets. Feasible ROI classification and PTC detection each take constant time per unit of input. The average (±standard deviation) processing time of the automatic system was 785.81 (±176.97) sec per slide, comprising 712.23 (±132.62) sec for classification per slide and 0.49 (±0.02) sec for detection per ROI, with an average of 147.61 (±105.02) feasible ROIs per slide.
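
A minimal sketch of this slide-level 60/20/20 split; the seed and list-based bookkeeping are illustrative assumptions.

```python
# Minimal sketch of the 60/20/20 slide-level split described above,
# assuming slide identifiers in a Python list; the seed is arbitrary.
import random

def split_slides(slide_ids, seed=0):
    ids = list(slide_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_test = int(0.6 * n), int(0.2 * n)
    return (ids[:n_train],                  # training (60%)
            ids[n_train:n_train + n_test],  # test (20%)
            ids[n_train + n_test:])         # validation (20%)
```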

Feasible ROI Classification

All randomly identified candidate ROIs were independently labeled by three pathologists as feasible or non-feasible. Sensitivity was maximized by identifying as many feasible ROIs as possible within each slide. An ROI was classified as non-feasible when more than two-thirds of its image consisted of suboptimal areas, defined as 1) artifacts or poorly stained areas that limited proper interpretation; 2) areas without PTCs, such as large vessels, glomeruli, or vacant areas; and 3) scarred or infarcted areas21. Examples of these criteria are shown in Fig. 2.

Figure 2. Decision criteria for classifying feasible and non-feasible ROIs. (a) Feasible ROI; (b–d) non-feasible ROIs dominated by ambiguous regions, including scar, glomerulus, and vessels.

An ROI size of 1024 × 1024 pixels was defined by the pathology team, as an image of this size provides a field of view similar to a 400× optical microscopic view, the maximum magnification used in routine practice. The total number of ROIs in subset 1 was 2723, including 2134 feasible and 769 non-feasible ROIs.
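
Candidate-ROI extraction can be illustrated by tiling the tissue mask from the preprocessing sketch above at the defined ROI size; the stride and tissue-fraction threshold below are assumptions, as the paper does not state them.

```python
# Minimal sketch of extracting 1024x1024 candidate ROIs over a slide's
# tissue mask. Stride and tissue threshold are illustrative assumptions.
ROI_SIZE = 1024  # field of view comparable to a 400x optical view

def candidate_rois(tissue_mask, stride=ROI_SIZE, min_tissue_fraction=0.33):
    """Yield top-left (x, y) coordinates of ROIs containing sufficient tissue."""
    h, w = tissue_mask.shape
    for y in range(0, h - ROI_SIZE + 1, stride):
        for x in range(0, w - ROI_SIZE + 1, stride):
            window = tissue_mask[y:y + ROI_SIZE, x:x + ROI_SIZE]
            if (window > 0).mean() >= min_tissue_fraction:
                yield (x, y)
```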

The CNN classification model was trained to classify feasible and non-feasible ROIs using an Inception v3 network22 initialized with an ImageNet pre-trained model23. To prevent overfitting due to unbalanced data, adjacent regions were sampled to equalize the proportions of the two classes. Real-time augmentation included horizontal and vertical flipping, rotation (0–90°), and zooming in and out (0–10%). The model was implemented in Keras with the TensorFlow backend on an NVIDIA GTX 1080 Ti GPU, using binary cross-entropy loss, a stochastic gradient descent (SGD) optimizer with a learning rate of 10−5, and dropout with a probability of 0.5. The learning rate was reduced to one-tenth after each one-third of the 2000 total epochs; more detailed parameters are listed in Table 1. Training was terminated at the lowest loss on the test set. The performance of the CNN classification model was evaluated by its sensitivity and specificity.
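
A hedged Keras sketch of this training setup (Inception v3 with ImageNet weights, binary cross-entropy, SGD at 10−5, dropout 0.5, and the stated augmentations); the classification head and step-decay details are assumptions, not the authors' released code.

```python
# Sketch of the ROI classifier, assuming a single sigmoid output head.
from tensorflow import keras
from tensorflow.keras import layers

def build_roi_classifier(input_shape=(1024, 1024, 3)):
    base = keras.applications.InceptionV3(
        include_top=False, weights="imagenet",
        input_shape=input_shape, pooling="avg")
    x = layers.Dropout(0.5)(base.output)
    out = layers.Dense(1, activation="sigmoid")(x)  # feasible vs. non-feasible
    model = keras.Model(base.input, out)
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=1e-5),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Real-time augmentation roughly matching the paper's description.
augmenter = keras.preprocessing.image.ImageDataGenerator(
    horizontal_flip=True, vertical_flip=True,
    rotation_range=90, zoom_range=0.1)

# Step decay: reduce the LR to one-tenth after each one-third of 2000 epochs.
def step_decay(epoch, lr):
    return lr * 0.1 if epoch in (2000 // 3, 2 * 2000 // 3) else lr

lr_callback = keras.callbacks.LearningRateScheduler(step_decay)
```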

Table 1 Parameters used for training the CNN classification and detection models.

PTC Detection

Three pathologists independently labeled C4d positive and C4d negative PTCs in the feasible ROIs of subset 1 by hand drawing with in-house software. After completing these tasks independently, they met to discuss conflicting cases and reach a consensus. In addition, pseudo-negative PTCs, consisting of non-PTC regions such as tubules and glomeruli that can be confused with PTCs, were drawn to train the model robustly.

The widths and heights of labeled PTC masks ranged from 25 to 392 pixels. A total of 1823 PTCs were identified by manual labeling, including 549 C4d positive and 1274 C4d negative PTCs, whereas a total of 3836 PTCs were identified by deep-learning-assisted labeling, including 1597 C4d positive and 2239 C4d negative PTCs. Examples are shown in Fig. 3.

Figure 3. Gold-standard examples of C4d positive and negative PTCs. Blue and red rectangles indicate positive and negative PTCs in (a) and (b), respectively.

In general object detection tasks, the object is normally placed on a complex or object-independent background, and object boundaries are determined using tightly fitted coordinates. Pathologic detection of PTCs is different. Because PTCs are capillaries located near tubules, the surrounding tissue constitutes additional information during training when an enlarged rather than a fitted mask is used, as shown in Fig. 4. An enlarged mask was used because the boundaries of tubules near PTCs could help the model recognize PTCs on slides. To determine the optimal margin size around the manually labeled data, various margin sizes (0–80 pixels at 10-pixel intervals) were tested when training the detection model.
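
A minimal sketch of the mask-enlargement step, assuming boxes in (x1, y1, x2, y2) pixel coordinates clipped to the 1024-pixel ROI.

```python
# Enlarge a fitted PTC mask by a fixed pixel margin, clipped to the ROI bounds.
def enlarge_box(box, margin, roi_size=1024):
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(roi_size, x2 + margin), min(roi_size, y2 + margin))

# Example: a 50-pixel margin, the best-performing setting for C4d positive PTCs.
enlarged = enlarge_box((400, 380, 470, 455), margin=50)
```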

Figure 4. Example of a labeled C4d positive PTC with various margin sizes: (a) 0, (b) 10, (c) 20, (d) 30, and (e) 40 pixels.

Labeling massive amounts of data to enhance deep learning performance is time- and labor-intensive. These drawbacks may be overcome by using deep-learning-assisted labels in place of, or in addition to, manually labeled data. A deep learning model trained for detection can first propose candidate objects, which are then confirmed or modified with little labor to acquire massive data; we call the result deep-learning-assisted labeling. Figure 5 shows the process used to acquire deep-learning-assisted labels for the detection of C4d positive and negative PTCs. First, two types of CNN models were trained on subset 1 with feasible ROIs and manually labeled masks (Fig. 5(a,b)). The CNN classification model trained on subset 1 was used to identify feasible ROIs in subset 2 (Fig. 5(c)), and the CNN detection model trained on subset 1 was used to identify candidate C4d positive and negative PTCs in all feasible ROIs. Finally, deep-learning-assisted labels were obtained by confirming all candidate PTCs as C4d positive or C4d negative using an in-house re-labeling tool (Fig. 5(d)). This procedure was also used to check for false negative PTCs missed by the model. If the center of a bounding box identified by the detection model did not deviate significantly from the center of the actual PTC, the PTC was confirmed as C4d positive or negative.
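
The confirmation rule based on center deviation might look like the following sketch; the distance tolerance is an illustrative assumption, as no threshold is reported in the paper.

```python
# Accept a candidate box when its center lies close to the center of an
# actual PTC. The 40-pixel tolerance is an assumed, illustrative value.
import math

def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def confirm_candidate(candidate_box, true_box, max_center_dist=40.0):
    (cx, cy), (tx, ty) = box_center(candidate_box), box_center(true_box)
    return math.hypot(cx - tx, cy - ty) <= max_center_dist
```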

Figure 5. Sequence for deep-learning-assisted labeling. All slides are randomly divided 6:2:2 into training, test, and validation sets in subsets 1 and 2. (a) Training the classification model with feasible ROIs in subset 1. (b) Training the detection model with manually labeled masks in the feasible ROIs. (c) Extracting candidate feasible ROIs in subset 2 with the classification model and candidate PTCs with the detection model. (d) Confirming the candidate PTCs to produce deep-learning-assisted labels.

C4d negative and positive PTCs were detected using the region-proposal-based Faster R-CNN detection algorithm4 with an ImageNet pre-trained ResNet-50 backbone6. Real-time augmentation included horizontal and vertical flipping, rotation (0–90°), and zooming in and out (0–10%). The model was implemented in Keras with the TensorFlow backend on an NVIDIA GTX 1080 Ti GPU. Smooth L1 loss was used for bounding-box regression and categorical cross-entropy loss for the classification network in the backbone, with Adam optimizers at a learning rate of 10−5 for both the region proposal and classification networks. More detailed parameters are listed in Table 1. Training was terminated at the lowest loss on the test set.
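
For reference, the smooth L1 loss used for bounding-box regression has the standard Fast/Faster R-CNN form sketched below; this is the textbook formulation, not the authors' exact implementation.

```python
# Smooth L1 (Huber-like) loss for bounding-box regression:
# 0.5*(sigma*x)^2 if |x| < 1/sigma^2, else |x| - 0.5/sigma^2.
import tensorflow as tf

def smooth_l1(y_true, y_pred, sigma=1.0):
    diff = tf.abs(y_true - y_pred)
    cutoff = 1.0 / (sigma ** 2)
    return tf.reduce_mean(tf.where(
        diff < cutoff,
        0.5 * (sigma ** 2) * tf.square(diff),
        diff - 0.5 / (sigma ** 2)))
```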

To evaluate the effectiveness of enlarged margins with the Faster R-CNN detection algorithm, and of deep-learning-assisted labeling with both Faster R-CNN and the one-shot-based YOLO v2 detection algorithm24, FROC scores were calculated, defined as the average sensitivity at seven predefined false positive rates (1/8, 1/4, 1/2, 1, 2, 4, and 8) per ROI.
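
The FROC score as defined here can be computed as in the following sketch, which assumes detections have already been matched to ground-truth PTCs (the matching logic is omitted).

```python
# Average sensitivity at seven predefined false-positive rates per ROI.
import numpy as np

FP_RATES = (1/8, 1/4, 1/2, 1, 2, 4, 8)

def froc_score(scores, is_true_positive, n_ground_truth, n_rois):
    """scores: confidence per detection; is_true_positive: bool per detection."""
    order = np.argsort(scores)[::-1]            # sort detections by confidence
    matched = np.asarray(is_true_positive)[order]
    tp = np.cumsum(matched)                     # true positives at each cutoff
    fp = np.cumsum(~matched)                    # false positives at each cutoff
    sensitivities = []
    for rate in FP_RATES:
        below = fp / n_rois <= rate             # operating points within budget
        sensitivities.append(
            tp[below].max() / n_ground_truth if below.any() else 0.0)
    return float(np.mean(sensitivities))
```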

A stress test was conducted to determine whether the 380-slide dataset, containing 2146 positive and 3513 negative PTC masks, was sufficient. To compare the performance of Faster R-CNN detection models trained with different amounts of training data, all labeled data from subsets 1 and 2 were used. All data were shuffled and divided into a training pool (80%) and a fixed validation set (20%). From the training pool, training data were randomly selected at rates of 40%, 60%, 80%, and 100% to train each model. A test set for tuning each detection model was randomly selected as 10% of each training fraction. Performance in detecting C4d positive and negative PTCs was measured as relative sensitivity.
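
A minimal sketch of this stress-test protocol, under the assumption that the nested training fractions are prefixes of one shuffled training pool; the seed is arbitrary.

```python
# Nested random training fractions against a fixed validation set.
import random

def stress_test_splits(labeled_rois, fractions=(0.4, 0.6, 0.8, 1.0), seed=0):
    data = list(labeled_rois)
    random.Random(seed).shuffle(data)
    n_pool = int(0.8 * len(data))
    train_pool, validation = data[:n_pool], data[n_pool:]  # 80% / fixed 20%
    splits = {}
    for frac in fractions:
        subset = train_pool[:int(frac * n_pool)]
        n_test = int(0.1 * len(subset))  # 10% of each fraction held out for tuning
        splits[frac] = {"train": subset[n_test:], "test": subset[:n_test],
                        "validation": validation}
    return splits
```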

Results

Feasible ROI Classification

The CNN classification model trained from subset 1 was tuned with high specificity to minimize false positives. The sensitivity and specificity of the CNN classification model were 0.7951 and 0.9941, respectively.

To validate the use of deep-learning-assisted labeling, the CNN classification model trained on subset 1 was used to determine candidate feasible ROIs in subset 2. This model extracted feasible ROIs from all tissue regions of subset 2 with high specificity. The mean ± SD number of ROIs per slide was 89.23 ± 34.22. An example of classification results for all regions of a slide is shown in Fig. 6. Tissue regions containing feasible ROIs were colored red. ROIs containing tubules with PTCs were classified as feasible, whereas ROIs containing scars and glomeruli were classified as non-feasible.

Figure 6. Feasible and non-feasible ROI classification results. Tissues including feasible ROIs are colored red.

PTC Detection

The performance of validation using margin sizes of 0–70 pixels to detect C4d positive and negative PTCs on manually labeled data with the Faster R-CNN detection algorithm is compared in Fig. 7 and Table 2. FROC scores and overall sensitivities for the detection of C4d positive and negative PTCs increased as margin size increased. However, overall sensitivities and FROC scores in detecting C4d positive PTCs peaked at a margin size of 50 pixels and decreased at 60 pixels (Fig. 7(a)). Similarly, overall sensitivities and FROC scores in detecting C4d negative PTCs peaked at 40 pixels (Fig. 7(b)). Thus, FROC scores were highest for models trained with margin sizes of 50 and 40 pixels for the detection of C4d positive and negative PTCs, respectively.

Figure 7. FROC comparisons at different margin sizes on manually labeled data. Detection of (a) C4d positive and (b) C4d negative PTCs.

Table 2 Sensitivities and FROC scores for Faster R-CNN detection of C4d positive and negative PTCs with various margin sizes (0 to 70 pixels) at different mean numbers of false positives per feasible ROI.

The CNN detection models trained with 40-pixel margins for the detection of C4d positive and negative PTCs were tuned for maximum sensitivity to generate as many deep-learning-assisted labels as possible. The candidate labels generated by these CNN detection models had a recall and precision of 0.8821 and 0.9384, respectively, for C4d positive PTCs, and of 0.8094 and 0.7108, respectively, for C4d negative PTCs. The characteristics of manual and deep-learning-assisted labeling differed slightly: manually labeled data included only masks fitted to both classes, whereas deep-learning-assisted labels also included masks that were slightly misplaced locally. Figure 8(a,b) shows inter- and intra-observer variations, respectively, between subset 1 (manual labeling) and subset 2 (deep-learning-assisted labeling). To validate the feasibility of deep-learning-assisted labeling, FROC scores and sensitivities were compared among models trained with manually labeled data, deep-learning-assisted labels, and both together, at different mean numbers of false positive PTCs per feasible ROI and with two different types of detection algorithms. In detecting C4d positive and negative PTCs, the Faster R-CNN model was more accurate than the YOLO v2 model. In addition, for both algorithms, models trained on subset 2 or on the fusion of subsets 1 and 2 were more accurate than the model trained on subset 1 alone (Fig. 8(c,d) and Table 3).

Figure 8. FROC comparisons validating the feasibility of deep-learning-assisted labeling. (a,b) Inter- and intra-observer variation between different validation sets for detection of (a) C4d positive and (b) C4d negative PTCs with the Faster R-CNN detection algorithm. (c,d) Effectiveness of deep-learning-assisted labeling for detection of (c) C4d positive and (d) C4d negative PTCs with the Faster R-CNN and YOLO v2 detection algorithms.

Table 3 Sensitivities and FROC scores for Faster R-CNN and YOLO v2 detection of C4d positive and negative PTCs with models trained on different datasets, at different mean numbers of false positives per feasible ROI (0 to 2 for positive and 0 to 8 for negative PTC detection). Model 1: trained on subset 1; Model 2: trained on subset 2; Model 3: trained on the fusion of subsets 1 and 2.

The CNN detection models for C4d positive and negative PTCs trained with different amounts of training data are compared in Fig. 9. Performance in detecting C4d positive and negative PTCs saturated at around 300 slides.

Figure 9. Comparisons of relative sensitivities of detection models trained with different amounts of training data for detecting (a) C4d positive and (b) C4d negative PTCs with the Faster R-CNN detection algorithm.

Discussion

To develop a clinically applicable system, this study combined more efficient deep-learning-assisted labeling, a detection model enhanced with pathologists' insights, and CNN-based classification and detection in a fully automated system. Training the CNN detection models with enlarged masks surrounding the PTC region was inspired by pathologists' actual practice and substantially enhanced detection performance (Fig. 7 and Table 2). Deep learning models generally need massive data for training. To overcome the limitations of a small dataset, we assessed the feasibility of deep-learning-assisted labeling generated from an independent dataset with little labor, which could alleviate massive manual labeling effort. In our experiments, training with deep-learning-assisted labels enhanced the performance of the detection model compared with training on manually labeled data alone (Fig. 8(c,d) and Table 3), because deep-learning-assisted labeling helps draw PTC masks more robustly, with fewer variations due to inter- and/or intra-observer differences, illumination, and degree of staining. In addition, a massive dataset would further improve the performance of the CNN detection models. In the stress test, we showed that 380 slides were sufficient to train the CNN detection models for finding positive and negative PTCs.

Deep learning on pathologic images can be affected by various scanning conditions, including illumination, degree of staining, and differences in equipment. To address these problems, we used a network pre-trained on a tremendous number of images covering a variety of complex variations, histogram equalization as a stain normalization method, and a dataset doubled by deep-learning-assisted labeling, making our deep learning model robust to such conditions.

The most important aspects of applying this system to pathology are full automation for objective diagnosis and alleviation of manual labor. This study proposed a fully automated two-step CNN system for the diagnosis of allograft rejection. The first step uses a CNN classification model to identify feasible ROIs in all tissue regions, and the second step uses a CNN detection model to identify and count C4d positive and negative PTCs, a marker of allograft rejection in kidney transplant recipients. These findings suggest that this system may be applicable to most tasks in digital pathology.

Classifying all tissue regions as feasible or non-feasible ROIs with the CNN classification model is practical, as pathologists cannot examine all feasible ROIs in a tissue sample and have difficulty identifying negative PTCs. By contrast, the CNN classification model can precisely evaluate the entire specimen, and the CNN detection model can accurately count the numbers of C4d positive and negative PTCs in all feasible ROIs. Determining both C4d positive and C4d negative PTCs may alter clinical diagnoses.

In addition, two kinds of performance comparisons were conducted. First, models trained with different margin sizes around the PTC region were compared. Enlarged masks of an appropriate size improved the CNN detection model, a method that mimics real clinical practice. This approach of including surrounding regions could be applied widely to similar tasks. Second, models trained with manually labeled data and with deep-learning-assisted labels were compared. Generating labels with deep-learning assistance and confirming them with expert pathologists may help improve model performance. Pathologic labeling is very difficult, even for expert pathologists, whereas the deep-learning-assisted method generated relatively robust labels.

Several obstacles should be overcome before clinical application. The sample size (380 slides) is about 1.2 times the average number of renal allograft biopsies per year at this center, one of the largest medical centers in South Korea. Although this is also relatively larger than in other studies of pathologic assessment using convolutional neural networks9,11,12, we plan to further evaluate performance on real-world datasets from larger populations. In addition, all cases were recruited from a single center using only one slide scanner, which could limit variations such as background illumination or degree of staining; further multi-center studies are needed to evaluate the robustness of this method. Finally, comparisons of the performance and outcomes of this method with those of pathologists are needed to determine the clinical effectiveness of this system.

Conclusion

Pathologic examination is time-consuming, involving the examination of all areas of a specimen under the microscope. Subjectively determined pathologic diagnoses may differ among observers and are susceptible to inter- and/or intra-observer variations. These drawbacks may be overcome by an automatic method of PTC scoring using two types of trained models (ROI classification and PTC detection). Classification, detection, and scoring comparisons showed that this method yielded reasonable results when evaluating stained giga-pixel digital slides. This system may also be diagnostically feasible for detecting other diseases and conditions.