Introduction

Advanced machine learning algorithms, in particular convolutional neural network (CNN)-based deep learning, have recently been adopted to develop models that outperform conventional algorithms in computer vision tasks1,2,3. Supervised learning, a machine learning method for training a function to correctly map an input to an output using corresponding labels, generally performs well when a sufficient labeled dataset is fed into the CNN-based architecture during training, making the model robust when predicting on unseen data. In the real world, however, it is difficult not only to obtain the dataset but also to label it, owing to the high cost of expert labeling4,5. To overcome this issue, transfer learning has been used to train CNN models effectively with limited data: model performance can be enhanced by starting from an initial model trained on a large public dataset from a different domain6,7.

As the need for computer-aided diagnosis (CAD) has increased in medical fields such as radiology and pathology, CNN-based CAD models have been studied and developed, and some studies have validated that their performance is comparable with that of experts8,9,10. Deep learning is becoming increasingly popular in pathology, where CNNs have been applied to numerous tasks such as detection and segmentation in various organs, including the breast, lung, and colon. For detection, mitosis11,12,13 and breast cancer metastasis detection14,15 have been studied. For segmentation, colon glands16, breast cancer metastases17, and multiple cancer subtype regions of the lung18 have been studied as well. Breast cancer is the most common cancer in women globally, and the primary treatment for patients with localized breast cancer is surgical removal of the tumor19. Fast and accurate diagnosis based on sentinel lymph node (SLN) biopsy with the frozen section20, whose quality is inferior to that of the formalin-fixed, paraffin-embedded (FFPE) section, is required; however, this is difficult because of artifacts such as compression, nuclear ice crystals, and nuclear chromatic changes. Moreover, labeling data to train CNN-based deep learning models is cost- and time-intensive owing to the section quality and limited samples.

Transfer learning has been widely used to train models effectively with limited datasets and thereby mitigate these cost and time burdens21. It enables models to be trained quickly and accurately because useful spatial features, learned from a large dataset in a different domain, are available from the beginning of training. In the CAMELYON challenge for metastasis detection in FFPE SLN biopsies, an ImageNet pre-trained model was used to initialize the training parameters for efficient training22,23. In our previous study on metastasis classification in frozen sections24, most participants also used an ImageNet pre-trained model for initialization. However, to date, no study has validated the effectiveness of the CAMELYON dataset for metastasis classification in frozen sections.

In this study, we propose an effective method, based on the concept of transfer learning, for training a deep learning model with a limited dataset to classify tumor pathology at the patch and whole slide image (WSI) levels. We compared a randomly initialized model with two pre-trained models: one based on the ImageNet dataset, a public dataset for classification of 1000 classes of natural scenes, and one based on the CAMELYON16 dataset, a public dataset for tumor slide classification of FFPE biopsies in digital pathology. Our contribution is to evaluate and compare models trained from scratch, from CAMELYON16-based pre-trained weights, and from ImageNet-based pre-trained weights, using stress tests and various learning configurations.

Materials and methods

Data acquisition

This retrospective study was conducted according to the principles of the Declaration of Helsinki and was performed in accordance with current scientific guidelines. The study protocols were approved by the Institutional Review Board Committees of Asan Medical Center (AMC), University of Ulsan College of Medicine, Seoul, Korea, and Seoul National University Bundang Hospital (SNUBH), Seoul National University College of Medicine, Gyeonggi, Korea. The requirements for informed patient consent were waived by the Institutional Review Board Committees of the AMC and the SNUBH.

All sentinel lymph nodes (SLNs) in the AMC and the SNUBH were obtained during routine frozen-section surgical procedures24,25. At the AMC, the WSIs of 297 SLNs were scanned with a digital microscopy scanner (Pannoramic 250 FLASH; 3DHISTECH Ltd., Budapest, Hungary) in MIRAX format (.mrxs). More details of the AMC dataset were introduced in our previous study24. At the SNUBH, the WSIs of 228 SLNs were scanned with a digital microscopy scanner (Pannoramic 250 FLASH II; 3DHISTECH Ltd., Budapest, Hungary) in MIRAX format (.mrxs). The AMC WSIs were split into 157, 40, and 100 WSIs as training, validation, and test sets, respectively. All 228 SNUBH WSIs were used for external validation. Demographics of the training, validation, and test datasets are described in Table 1. For the CAMELYON16 dataset, only the 270 WSIs provided for training were used; they were split into 216 and 54 WSIs (8:2) as training and validation sets to build the CAMELYON16-based pre-trained model. The AMC training dataset (157 WSIs) was subsampled at 2, 4, 8, 20, 40, and 100%, corresponding to 3, 6, 12, 25, 50, and 157 WSIs and 12 K, 25 K, 52 K, 101 K, 199 K, and 519 K patches, respectively.

Table 1 Clinicopathologic characteristics of the patients in AMC and SNUBH datasets.

Reference standard

All WSIs in the AMC dataset were manually segmented by one rater, and the annotations were confirmed by two expert pathologists with 6 and 20 years' experience in breast pathology. All WSIs in the SNUBH dataset were manually segmented by one rater, and the annotations were confirmed by an expert breast pathologist with 15 years' experience. Whole regions of metastatic carcinoma larger than 200 μm in the greatest dimension on the WSI were annotated. More details of the reference standard are introduced in our previous study24.

Study design

In this study, three model types with different initial weights were used to validate the effectiveness of transfer learning, as shown in Fig. 1: (1) scratch-based initial weights that were randomly assigned, (2) ImageNet-based initial weights trained on a public dataset (ImageNet, 1000 classes of natural images), and (3) CAMELYON16-based initial weights trained on another public dataset (CAMELYON16, 2 classes of H&E pathology images). Specifically, the three sets of weights were used to initialize models trained with different ratios of the training dataset: 2, 4, 8, 20, 40, and 100% of the total. The resulting models were evaluated in terms of sensitivity, specificity, accuracy, and area under the curve (AUC) for patch- and slide-level classification. For external validation, the SNUBH dataset was used for slide-level classification in terms of AUC.

Figure 1

A validation workflow for transfer learning. (a) Two initial classification models are trained on two public datasets (i.e., the ImageNet and CAMELYON16 datasets). (b) The private AMC dataset is used to train classification models with different initial weights: scratch-based (random initialization), ImageNet-based (pre-trained), and CAMELYON16-based (pre-trained). Dataset ratios of 2, 4, 8, 20, 40, and 100% of the total AMC dataset are selected to train models for validating the effectiveness of transfer learning.

Patch extraction

Tissue region masks were extracted by combining the H and S color channels after Otsu thresholding, in order to extract tumor and non-tumor patches from each WSI26. Subsequently, 448 × 448 pixel patches at level zero, whose field of view is approximately 100 × 100 µm, were randomly selected within the tissue region, and each patch was classified as tumor or non-tumor according to a fixed criterion (tumor patch: over 80% of the patch covered by the tumor region; non-tumor patch: no tumor region).
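As an illustration, a minimal sketch of this tissue-masking and patch-labeling step is given below, assuming OpenCV and NumPy. The function names, the HSV conversion, and the AND-combination of the channel masks are our assumptions; the text specifies only Otsu thresholding on the H and S channels and the 80% tumor-area criterion.

```python
# Hedged sketch of tissue masking and patch labeling (details assumed).
from typing import Optional
import cv2
import numpy as np

def tissue_mask(rgb_thumbnail: np.ndarray) -> np.ndarray:
    hsv = cv2.cvtColor(rgb_thumbnail, cv2.COLOR_RGB2HSV)
    h, s = hsv[..., 0], hsv[..., 1]
    # Otsu thresholding on each channel, then combine to keep stained tissue.
    _, h_mask = cv2.threshold(h, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    _, s_mask = cv2.threshold(s, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return cv2.bitwise_and(h_mask, s_mask)

def label_patch(tumor_mask_patch: np.ndarray) -> Optional[str]:
    """Apply the paper's criterion to a 448 x 448 binary tumor mask."""
    tumor_ratio = (tumor_mask_patch > 0).mean()
    if tumor_ratio > 0.8:
        return "tumor"       # > 80% of the patch area is tumor
    if tumor_ratio == 0:
        return "non-tumor"   # no tumor region at all
    return None              # ambiguous patches are skipped
```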

The CAMELYON16 training and validation sets comprise 330 K and 88 K patches, respectively, with a 1:1.2 ratio of tumor to non-tumor patches. The training, validation, and test sets of our dataset comprise 519 K, 92 K, and 27 K patches, respectively, at the same ratio.

CNN training

All models in this study, i.e., the CAMELYON16 dataset-based pre-trained model and the scratch-, ImageNet-, and CAMELYON16-based models, were trained under the same conditions. Inception v.327 was selected as the classification architecture for validating transfer learning. It consists of 48 layers that efficiently extract spatial features at different image levels, and it improved on earlier Inception models through label smoothing and factorized 7 × 7 convolutions. Inception v.3 showed higher accuracy with a lower memory footprint than other architectures such as the VGG28 and ResNet29 series. The classification models were trained with the same optimizer (SGD, learning rate: 5e−4, momentum: 0.9), loss function (binary cross-entropy), augmentations (zoom range: 0.2, rotation range: 0.5, width and height shift range: 0.1, horizontal and vertical flips), and dropout (0.5). The best model was selected at the lowest validation loss. Stain normalization was used to make the models robust to different scanning conditions.
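The sketch below illustrates this training configuration in Keras (an assumed framework; the paper does not name its implementation). The `weights` argument stands in for the three initializations: `None` for scratch and `"imagenet"` for the ImageNet pre-trained model, while CAMELYON16-based weights would be loaded from a hypothetical checkpoint afterwards.

```python
# Minimal sketch of the training setup described above (Keras assumed).
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras import layers, models, optimizers

def build_model(weights=None):
    # weights=None -> scratch; "imagenet" -> ImageNet pre-trained.
    base = InceptionV3(include_top=False, weights=weights,
                       input_shape=(448, 448, 3), pooling="avg")
    x = layers.Dropout(0.5)(base.output)            # dropout rate from the text
    out = layers.Dense(1, activation="sigmoid")(x)  # tumor vs. non-tumor
    model = models.Model(base.input, out)
    model.compile(
        optimizer=optimizers.SGD(learning_rate=5e-4, momentum=0.9),
        loss="binary_crossentropy",
        metrics=["accuracy"])
    return model
```

The augmentations (zoom 0.2, rotation 0.5, width/height shifts 0.1, and flips) would then be applied in the input pipeline, for example with Keras' ImageDataGenerator.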

Confidence maps were generated by interpolating tumor confidence over the entire WSI. The stride, a parameter defining how densely the confidence map is sampled, was 320 pixels; i.e., the resolution of a confidence map is 1/320 of that of the original WSI. To reduce noise in the confidence map, we applied 3 × 3 Gaussian filtering. The maximum value of the smoothed map was taken as the slide-level confidence used to compute the receiver operating characteristic (ROC) curve.
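A minimal sketch of this slide-level scoring is given below, assuming NumPy and SciPy; the Gaussian sigma is our assumption, since the text specifies only a 3 × 3 kernel.

```python
# Hedged sketch: smooth the patch-confidence map, then take its maximum
# as the slide-level tumor confidence.
import numpy as np
from scipy.ndimage import gaussian_filter

def slide_confidence(conf_map: np.ndarray) -> float:
    # sigma=1.0 with truncate=1.0 gives a kernel of radius 1, i.e. 3 x 3;
    # the paper states only "3 x 3 Gaussian filtering" (sigma is assumed).
    smoothed = gaussian_filter(conf_map, sigma=1.0, truncate=1.0)
    return float(smoothed.max())
```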

Statistical evaluation

For statistical analysis of classification performance, we compared AUCs using the Hanley & McNeil method32, taking the CAMELYON16-based model as the reference for tumor patch and slide classification. AUCs of the CAMELYON16-based model and of the other models trained with the same training dataset ratio were compared statistically to determine whether models with different initial weights performed significantly better. Statistical analyses were performed using MedCalc statistical software (MedCalc Software, version 18.2.1, Ostend, Belgium).
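For reference, the Hanley & McNeil test can be sketched as follows; the correlation term r between the two AUC estimates (needed when both models are evaluated on the same cases) is left to the caller, and in practice the MedCalc implementation was used.

```python
# Hedged sketch of the Hanley & McNeil AUC comparison.
import math
from scipy.stats import norm

def hanley_mcneil_se(auc, n_pos, n_neg):
    # Standard error of an AUC (Hanley & McNeil, 1982).
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    return math.sqrt((auc * (1 - auc)
                      + (n_pos - 1) * (q1 - auc**2)
                      + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg))

def compare_aucs(auc1, auc2, n_pos, n_neg, r=0.0):
    # Two-sided z-test for the difference of two AUCs;
    # r is the correlation between the two AUC estimates.
    se1, se2 = (hanley_mcneil_se(a, n_pos, n_neg) for a in (auc1, auc2))
    z = (auc1 - auc2) / math.sqrt(se1**2 + se2**2 - 2 * r * se1 * se2)
    return 2 * (1 - norm.cdf(abs(z)))   # p-value
```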

Results

Figure 2 shows the training curves (loss and accuracy on the validation dataset) for the three model types trained with 4, 20, and 100% of the training dataset. Scratch-based models trained with 2 to 8% of the training dataset showed no change in validation loss or accuracy, as shown in Fig. 2a, indicating that training from scratch failed with such minuscule training datasets. With 40% or less of the training dataset, CAMELYON16-based models showed not only lower loss and higher accuracy at the first epoch, resulting in fast convergence, but also lower loss and higher accuracy at the last epoch, as shown in Fig. 2b. With 100% of the training dataset, the ImageNet- and CAMELYON16-based models showed the same tendency of decreasing loss and increasing accuracy, and both still achieved lower loss and higher accuracy than the scratch-based model, as shown in Fig. 2c.

Figure 2

Validation loss and accuracy comparison of scratch-, ImageNet-, and CAMELYON16-based models for different AMC dataset ratios. (a) 4%, (b) 20%, (c) 100% of the total training dataset.

Performance comparisons of patch- and slide-level classification (whether each patch or slide contains tumor) are listed in Table 2. For patch-level performance on the AMC dataset, the CAMELYON16-based models trained with all training dataset ratios demonstrated significantly higher AUCs (0.843, 0.881, 0.895, 0.912, 0.929, and 0.944) than the scratch- and ImageNet-based models, except for the ImageNet-based model trained with 100% of the training dataset. For slide-level performance on the AMC dataset, the CAMELYON16-based models trained with all training dataset ratios showed significantly higher AUCs (0.814, 0.874, 0.873, 0.867, 0.878, and 0.886) than the scratch- and ImageNet-based models, except for the ImageNet-based models trained with 40 and 100% of the training dataset.

Table 2 Performance comparison of models based on different initial weights for different ratios of the training dataset.

On the external validation dataset from the SNUBH, the AUCs of the ImageNet-based and CAMELYON16-based models trained with 2, 4, 8, 20, 40, and 100% of the training set were 0.592, 0.625, 0.667, 0.723, 0.819, and 0.798, and 0.689, 0.667, 0.695, 0.763, 0.749, and 0.804, respectively. The performance of the scratch-based models ranged between 0.437 and 0.540. The CAMELYON16-based models showed higher AUCs than the ImageNet-based models at all training set ratios except 40%.

Figure 3 shows examples of Grad-CAMs for a tumor patch with scratch-, ImageNet-, and CAMELYON16-based models trained with different dataset ratios. Confidence denotes the output value of the last fully connected layer of each model. All models trained with 20 and 100% of the training dataset correctly predicted the input patch as a tumor patch, whereas with 4% of the training dataset, only the CAMELYON16-based model predicted correctly, with a confidence of 0.82. Under the same condition, the scratch- and ImageNet-based models incorrectly predicted the input patch as normal, with low confidence levels of 0.49 and 0.36, respectively.

Figure 3

Examples of Grad-CAMs for a tumor patch with scratch-, ImageNet-, and CAMELYON16-based models trained with different dataset ratios. The confidence below each result indicates the output value of the last fully connected layer of each model. (a) Input patch; Grad-CAM results of models trained with (b) 4%, (c) 20%, and (d) 100% of the total AMC dataset.
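As a reference for how such heat maps are produced, below is a minimal Grad-CAM sketch for a Keras model (an assumed framework, as before); the layer name "mixed10" refers to the last convolutional block of InceptionV3.

```python
# Hedged Grad-CAM sketch: weight the last convolutional feature maps by
# their pooled gradients with respect to the tumor score.
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer="mixed10"):
    grad_model = tf.keras.models.Model(
        model.input,
        [model.get_layer(conv_layer).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        score = preds[:, 0]                  # tumor confidence (sigmoid output)
    grads = tape.gradient(score, conv_out)   # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))  # global-average-pool grads
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]                 # keep positive evidence only
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```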

Discussion

We validated transfer learning using the ImageNet and CAMELYON16 datasets, with the aim of reducing intensive labeling costs, by comparison with a scratch-based model. Three model types, based on scratch, ImageNet, and CAMELYON16 initial weights, were trained and evaluated with different ratios (2, 4, 8, 20, 40, and 100%) of the training dataset.

The scratch-based models failed to train with ratios of 8% or less of the training dataset because of the very limited data, as shown in Fig. 2a. However, the scratch-based models began to train successfully as the training dataset increased, as shown in Table 2. The CAMELYON16-based models trained with all ratios of the training dataset achieved lower validation losses and higher validation accuracies than the scratch-based models at both the first and last epochs. This tendency resulted in significantly higher AUCs for patch- and slide-level tumor classification, and it could also save time in training models effectively.

Comparing the ImageNet- and CAMELYON16-based models, the CAMELYON16-based models showed significantly higher AUCs for patch- and slide-level classification when trained with 40% or less and 20% or less of the training dataset, respectively. The losses and accuracies of both models trained with 100% were close at every epoch, resulting in comparable AUCs for patch-level classification (0.943 and 0.944) and slide-level classification (0.888 and 0.886).

External validation with the SNUBH dataset was conducted to validate the effectiveness of transfer learning. Although the overall AUCs of the ImageNet- and CAMELYON16-based models at all training dataset ratios were slightly lower than those on the AMC dataset, the AUCs tended to increase as more data were used for model training. The primary difference between the AMC and SNUBH datasets is the MPP (micrometers per pixel), which defines the physical size of each pixel and is determined when the whole slide is scanned. The MPPs of the AMC and SNUBH datasets are 0.24 and 0.50, respectively, so for patches at the same level, the resolution of an SNUBH patch is approximately half that of an AMC patch. A stronger resizing augmentation was used during training to reduce the variance between the AMC and SNUBH datasets and overcome this issue.
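To make the scale relationship concrete, the sketch below rescales a patch from a source MPP to a reference MPP (OpenCV assumed). This is purely illustrative: the paper applied random resizing augmentation during training rather than this deterministic correction.

```python
# Illustrative only: a patch scanned at 0.50 MPP covers about twice as much
# tissue per pixel as one at 0.24 MPP, so matching physical scale means
# upsampling by mpp_src / mpp_ref (~2.08x here).
import cv2
import numpy as np

def rescale_to_reference(patch: np.ndarray, mpp_src: float,
                         mpp_ref: float = 0.24) -> np.ndarray:
    scale = mpp_src / mpp_ref  # > 1 upsamples coarser (larger-MPP) scans
    h, w = patch.shape[:2]
    return cv2.resize(patch, (round(w * scale), round(h * scale)),
                      interpolation=cv2.INTER_LINEAR)
```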

Grad-CAM heat maps of the three model types trained with 4, 20, and 100% of the training dataset were examined, as shown in Fig. 3, to observe how reasonably each model was trained. The Grad-CAMs of all model types showed higher confidence for tumor patches as the training dataset increased. However, the CAMELYON16-based model trained with 4% of the training dataset showed higher confidence for the tumor patch (0.82) than the scratch- and ImageNet-based models (0.49 and 0.36). Scratch-based models trained with 8% or less of the training dataset predicted all patches as non-tumor patches with a confidence of 0.51, which indicates overfitting to the non-tumor class.

Additional experiments with a range of learning rates, which are considered a major factor affecting model performance, were conducted by varying the value from 5e−2 (0.05) to 5e−6 (0.000005) to observe the tendency of model performance in terms of patch-level AUC on the AMC dataset. The results are listed in Table 3. Learning rates from 5e−4 to 5e−5 were sufficient to train all model types efficiently, including the scratch-based model. With learning rates of 5e−3 or greater, most training runs diverged at the beginning of training. With learning rates below 5e−5, training needed more epochs to converge and converged at a relatively higher loss, which decreased model performance.

Table 3 Performance comparison of models based on different initial weights and learning rates for different ratios of the training dataset.
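The sweep itself can be sketched as follows, reusing the hypothetical `build_model` from the earlier training sketch; `train_ds` and `val_ds` are placeholder input pipelines, and the epoch count is an assumption.

```python
# Hedged sketch of the learning-rate sweep (rate values from the text).
import tensorflow as tf

results = {}
for lr in (5e-2, 5e-3, 5e-4, 5e-5, 5e-6):
    model = build_model(weights="imagenet")  # or None / CAMELYON16 checkpoint
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr,
                                                    momentum=0.9),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    history = model.fit(train_ds, validation_data=val_ds,
                        epochs=30)  # epoch count is a placeholder
    results[lr] = max(history.history["val_auc"])  # best validation AUC
```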

Our study has several limitations. First, tumor patches were selected for training only when the tumor region covered more than 80% of the patch, which ties the resulting models to this criterion; patch classification performance depends on the selection criterion, and different criteria may slightly change the models' characteristics. Second, the slide-level classification method, which takes the maximum value of the confidence map, is so simple that noise in the confidence map could significantly affect the results. This strategy should be replaced with a more robust method, such as a random forest classifier using local and global features.

Conclusions

Training a CNN model with a limited number of pathology datasets is essential for reducing intensive labeling costs. To address this issue for intraoperative frozen SLN biopsy tissue, we validated the effectiveness of transfer learning with models pre-trained on open datasets, namely the ImageNet dataset and the CAMELYON16 dataset obtained from FFPE tissue, by comparing them with a scratch-based model. In our results, the CAMELYON16-based models showed significantly higher AUCs than the scratch-based models for patch- and slide-level tumor classification at all training dataset ratios. When 40% or less (i.e., 62 WSIs) of the training set was used, the CAMELYON16-based models showed higher AUCs than the ImageNet-based models for patch classification. With limited datasets, transfer learning using an FFPE tissue-based dataset can thus enhance model performance for tumor classification on an intraoperative frozen tissue-based dataset.