Effectiveness of transfer learning for enhancing tumor classification with a convolutional neural network on frozen sections

Fast and accurate confirmation of metastasis on frozen tissue sections from intraoperative sentinel lymph node biopsy is an essential tool for critical surgical decisions. However, accurate diagnosis by pathologists is difficult within the time limitations. Training a robust and accurate deep learning model is also difficult owing to the limited number of frozen-section datasets with high-quality labels. To overcome these issues, we validated the effectiveness of transfer learning from CAMELYON16 for improving the performance of a convolutional neural network (CNN)-based classification model on our frozen-section dataset (N = 297) from Asan Medical Center (AMC). Among the 297 whole slide images (WSIs), 157 and 40 WSIs were used as training and validation sets, respectively, and deep learning models were trained with different training dataset ratios of 2, 4, 8, 20, 40, and 100%. The remaining 100 WSIs were used to evaluate model performance in terms of patch- and slide-level classification. An additional 228 WSIs from Seoul National University Bundang Hospital (SNUBH) were used as an external validation. Three initial weights, i.e., scratch-based (random initialization), ImageNet-based, and CAMELYON16-based, were used to validate their effectiveness in the external validation. In the patch-level classification results on the AMC dataset, CAMELYON16-based models trained with a small dataset (up to 40%, i.e., 62 WSIs) showed a significantly higher area under the curve (AUC) of 0.929 than the scratch- and ImageNet-based models at 0.897 and 0.919, respectively, while CAMELYON16-based and ImageNet-based models trained with 100% of the training dataset showed comparable AUCs of 0.944 and 0.943, respectively. In the external validation, CAMELYON16-based models showed higher AUCs than the scratch- and ImageNet-based models. The feasibility of transfer learning to enhance model performance was validated for frozen-section datasets with limited numbers.

Scientific Reports | (2020) 10:21899 | https://doi.org/10.1038/s41598-020-78129-0

As development needs for computer-aided diagnosis (CAD) have increased in medical fields such as radiology and pathology, CNN-based CAD models have been studied and developed. Some studies validated that CNN-based model performance was comparable with that of experts 8-10 . Deep learning in pathology is becoming increasingly popular. CNN-based deep learning has been applied to numerous tasks, such as detection and segmentation, for various organs such as the breast, lung, and colon. With regard to detection, mitosis 11-13 and breast cancer metastasis detection 14,15 have been studied. For segmentation, colon glands 16 , breast cancer metastases 17 , and multiple cancer subtype regions of the lung 18 have been studied as well. Breast cancer is the most common cancer in women globally, and the treatment for patients with localized breast cancer is surgical removal of the primary tumor 19 . Fast and accurate diagnosis based on sentinel lymph node (SLN) biopsy with frozen sections 20 , whose quality is inferior to that of formalin-fixed, paraffin-embedded (FFPE) sections, is required; however, this is difficult because of artifacts such as compression, nuclear ice crystals, and nuclear chromatic change. Moreover, labeling data to train CNN-based deep learning models is cost- and time-consuming owing to the section quality and limited samples.
Transfer learning has been widely used to effectively train models with limited datasets and thereby overcome these cost and time issues 21 . It enables models to be trained quickly and accurately by reusing, at the beginning of training, spatial features learned from a large dataset in a different domain. In the CAMELYON challenge for detection of metastases in SLN biopsies of FFPE sections, the ImageNet pre-trained model was used to initialize the training parameters for efficient training 22,23 . In our previous study on metastasis classification on frozen sections 24 , most participants also used an ImageNet pre-trained model for initialization. However, no study has so far validated the effectiveness of the CAMELYON dataset for metastasis classification on frozen sections.
In this study, we propose an effective method of training a deep learning-based model with a limited dataset for tumor classification on pathology whole slide images (WSIs) using transfer learning. We compared a randomly initialized (scratch) model with two pre-trained models: one based on the ImageNet dataset, a public dataset for classifying 1,000 classes of natural scenes, and one based on the CAMELYON16 dataset, a public dataset for tumor classification on WSIs of FFPE biopsies in digital pathology. Our contribution is to evaluate and compare models trained from scratch and from CAMELYON16-based and ImageNet-based pre-trained weights under stress tests and various learning configurations.
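The core idea evaluated here, i.e., reusing weights learned on a large source dataset to initialize training on a small target dataset, can be illustrated with a deliberately tiny sketch. This is a toy numpy logistic regression, not the paper's CNN pipeline; all data and names are illustrative, and zero initialization stands in for random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(X, y, w):
    """Binary cross-entropy loss of linear-logistic weights w."""
    p = np.clip(sigmoid(X @ w), 1e-9, 1 - 1e-9)
    return -float(np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def train(X, y, w_init, steps, lr=0.5):
    """Full-batch gradient descent starting from w_init."""
    w = w_init.copy()
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

w_true = np.array([1.0, -1.0])

# "Source domain": plentiful labeled data (stand-in for CAMELYON16).
Xs = rng.normal(size=(2000, 2))
ys = (Xs @ w_true > 0).astype(float)
w_pretrained = train(Xs, ys, np.zeros(2), steps=500)

# "Target domain": the same task but only 20 labeled samples
# (stand-in for a small frozen-section dataset).
Xt = rng.normal(size=(20, 2))
yt = (Xt @ w_true > 0).astype(float)

w_scratch = train(Xt, yt, np.zeros(2), steps=5)    # scratch init
w_transfer = train(Xt, yt, w_pretrained, steps=5)  # transferred init
```

In this toy setup, the transferred initialization starts near a good solution, so after the same five fine-tuning steps it reaches a lower target loss than training from scratch, mirroring the fast-convergence behavior the study examines.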

Materials and methods
Data acquisition. This retrospective study was conducted according to the principles of the Declaration of Helsinki and was performed in accordance with current scientific guidelines. The study protocols were approved by the Institutional Review Board Committees of Asan Medical Center (AMC), University of Ulsan College of Medicine, Seoul, Korea, and Seoul National University Bundang Hospital (SNUBH), Seoul National University College of Medicine, Gyeonggi, Korea. The requirements for informed patient consent were waived by the Institutional Review Board Committees of AMC and SNUBH.
All sentinel lymph nodes (SLNs) at the AMC and the SNUBH were obtained during routine frozen-section surgical procedures 24,25 . At the AMC, the WSIs of 297 SLNs were scanned by a digital microscopy scanner (Pannoramic 250 FLASH; 3DHISTECH Ltd., Budapest, Hungary) in MIRAX format (.mrxs). More details of the AMC dataset were introduced in our previous study 24 . At the SNUBH, the WSIs of 228 SLNs were scanned by a digital microscopy scanner (Pannoramic 250 FLASH II; 3DHISTECH Ltd., Budapest, Hungary) in MIRAX format (.mrxs). All WSIs from the AMC were split into 157, 40, and 100 WSIs as training, validation, and test sets, respectively. All 228 SNUBH WSIs were used as an external validation. Demographics for the training, validation, and test datasets are described in Table 1. For the CAMELYON16 dataset, only the 270 WSIs given for training were split into 216 and 54 WSIs (8:2) as training and validation sets to build the CAMELYON16-based pre-trained model. The full training set of 157 WSIs was split into 2, 4, 8, 20, 40, and 100% subsets of 3, 6, 12, 25, 50, and 157 WSIs, containing 12 K, 25 K, 52 K, 101 K, 199 K, and 519 K patches, respectively.
Reference standard. All WSIs from the AMC were manually segmented by one rater, and the annotations were confirmed by two expert pathologists with 6 and 20 years' experience in breast pathology. All WSIs from the SNUBH were manually segmented by one rater, and the annotations were confirmed by an expert breast pathologist with 15 years' experience. Whole regions of metastatic carcinoma larger

Study design. In this study, three model types with different initial weights, i.e., (1) a scratch-based initial weight that was randomly assigned, (2) an ImageNet-based initial weight trained on a public dataset (ImageNet, 1,000 classes of natural images), and (3) a CAMELYON16-based initial weight trained on another public dataset (CAMELYON16, 2 classes of H&E pathology images), were used to validate the effectiveness of transfer learning, as shown in Fig. 1. In more detail, the three different weights were used as initial weights of models trained with different training dataset ratios of 2, 4, 8, 20, 40, and 100% of the total training dataset. Models with different initial weights and training dataset ratios were evaluated in terms of sensitivity, specificity, accuracy, and area under the curve (AUC) for patch- and slide-level classification. As an external validation, the SNUBH dataset was used for slide-level classification in terms of AUC.
Patch extraction. Tissue region masks were extracted by applying Otsu thresholding to the combined H and S color channels in order to extract tumor and non-tumor patches from each WSI 26 . Subsequently, 448 × 448-pixel patches at level zero, whose field of view is approximately 100 µm × 100 µm, were randomly selected within the tissue region, and each patch was classified as either a tumor or a non-tumor patch according to the following criterion: a tumor patch contains over 80% tumor region, and a non-tumor patch contains none. The total numbers of CAMELYON16 patches were 330 K and 88 K for the training and validation sets, respectively, at a 1:1.2 ratio of tumor to non-tumor patches. The total numbers of patches in our dataset for the training, validation, and test sets at the same ratio were 519 K, 92 K, and 27 K, respectively.

CNN training. All models in this study, i.e., the CAMELYON16 dataset-based pre-trained model and the scratch-based, ImageNet-based, and CAMELYON16-based models, were trained under the same training conditions. Inception v.3 27 was selected as the classification architecture for validating transfer learning. It consists of 48 layers that efficiently extract spatial features at different image scales; within the Inception family, it enhanced model performance with label smoothing and factorized 7 × 7 convolutions. Inception v.3 showed higher accuracy with a smaller memory footprint than other architectures such as the VGG 28 and ResNet 29 series. The classification models were trained under the same conditions, i.e., optimizer (SGD, learning rate: 5e−4, momentum: 0.9), loss function (binary cross-entropy), augmentations (zoom range: 0.2, rotation range: 0.5, width and height shift range: 0.1, horizontal and vertical flips), and dropout (0.5). The best model was selected at the lowest validation-set loss. Stain normalization was used to make the models robust to different scanning conditions.
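The patch-labeling criterion described above (tumor if over 80% of the patch area is tumor region, non-tumor if none of it is, otherwise discarded) can be sketched as follows; the function name, the `tumor_threshold` argument, and the toy masks are illustrative, not the study's actual code.

```python
import numpy as np

def label_patch(tumor_mask, tumor_threshold=0.8):
    """Label a patch from its pixel-level tumor annotation mask.

    Returns "tumor" if more than `tumor_threshold` of the patch area
    is tumor, "non-tumor" if the patch contains no tumor at all, and
    None (patch discarded) for the ambiguous cases in between.
    """
    frac = float(np.mean(tumor_mask > 0))
    if frac > tumor_threshold:
        return "tumor"
    if frac == 0.0:
        return "non-tumor"
    return None

# Toy 448 x 448 masks standing in for annotation crops.
full_tumor = np.ones((448, 448), dtype=np.uint8)
no_tumor = np.zeros((448, 448), dtype=np.uint8)
half_tumor = no_tumor.copy()
half_tumor[:224, :] = 1   # 50% tumor: neither class, so discarded
```

Discarding the middle band keeps the two classes cleanly separated, at the cost of ignoring boundary patches, which is also the source of the first limitation discussed later.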
Confidence maps were generated by interpolating tumor confidence over all WSI regions. The stride, a parameter defining how densely the confidence map is sampled, was 320 pixels, i.e., the resolution of the confidence maps is 1/320 that of the original WSIs. To reduce noise in the confidence map, we applied 3 × 3 Gaussian filtering. The maximum value of the confidence map was taken as the slide-level confidence used to calculate the receiver operating characteristic (ROC) curve.

Figure 2 shows training curves, i.e., loss and validation accuracy, for the three model types trained with 4, 20, and 100% of the training dataset. Training scratch-based models with 2 to 8% of the training dataset produced no change in loss or validation accuracy, as shown in Fig. 2a, indicating that training scratch-based models failed with minuscule training datasets. With 40% or less of the training dataset, CAMELYON16-based models showed not only lower loss and higher accuracy at the first epoch, resulting in fast convergence, but also lower loss and higher accuracy at the last epoch, as shown in Fig. 2b. With 100% of the training dataset, the ImageNet- and CAMELYON16-based models showed the same tendency of decreasing loss and increasing accuracy, and both still showed lower loss and higher accuracy than the scratch-based model, as shown in Fig. 2c.
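The slide-level scoring just described, i.e., smooth the patch-level confidence map and take its maximum, reduces to a few lines. In this sketch a 3 × 3 mean filter stands in for the paper's 3 × 3 Gaussian, and the map is synthetic.

```python
import numpy as np

def slide_score(conf_map):
    """Slide-level tumor confidence: smooth the patch-level confidence
    map with a 3 x 3 averaging filter (a stand-in for the paper's
    3 x 3 Gaussian) and take the maximum value."""
    padded = np.pad(conf_map, 1, mode="edge")
    smoothed = np.zeros_like(conf_map, dtype=float)
    h, w = conf_map.shape
    for i in range(h):
        for j in range(w):
            smoothed[i, j] = padded[i:i + 3, j:j + 3].mean()
    return float(smoothed.max())

# A mostly-benign map with one isolated noise spike and one coherent
# 2 x 2 hot region: smoothing suppresses the lone spike relative to
# the cluster, so the slide score is driven by the coherent region.
cmap = np.full((8, 8), 0.05)
cmap[1, 1] = 0.95        # isolated false positive
cmap[5:7, 5:7] = 0.9     # coherent tumor-like region
score = slide_score(cmap)
```

Note that the smoothed maximum (here about 0.43) is well below the raw spike of 0.95, which illustrates why the discussion flags the max-over-map strategy as sensitive to residual noise.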

Results
Performance comparisons of patch- and slide-level classifications, i.e., whether each patch or slide contains a tumor, are listed in Table 2. For patch-level performance on the AMC dataset, CAMELYON16-based models trained with all training dataset ratios demonstrated significantly higher AUCs, at 0.843, 0.881, 0.895, 0.912, 0.929, and 0.944, than the scratch- and ImageNet-based models, except for the ImageNet-based model trained with 100% of the training dataset. For slide-level performance on the AMC dataset, CAMELYON16-based models trained with all training dataset ratios showed significantly higher AUCs, at 0.814, 0.874, 0.873, 0.867, 0.878, and 0.886, except for the ImageNet-based models trained with 40 and 100% of the training dataset.
On the external validation dataset from SNUBH, the AUCs of the ImageNet-based and CAMELYON16-based models trained with 2, 4, 8, 20, 40, and 100% of the training set were 0.592, 0.625, 0.667, 0.723, 0.819, and 0.798, and 0.689, 0.667, 0.695, 0.763, 0.749, and 0.804, respectively. For the scratch-based models, performance was between 0.437 and 0.540. The CAMELYON16-based models trained with all ratios of the training set showed higher AUCs than the ImageNet-based models, except at 40% of the training dataset. Figure 3 shows examples of Grad-CAMs for a tumor patch with scratch-, ImageNet-, and CAMELYON16-based models trained with different dataset ratios. Confidence denotes the output value of the last fully connected layer of each model. All models trained with 20 and 100% of the training dataset correctly predicted the input patch as a tumor patch, whereas with 4% of the training dataset, only the CAMELYON16-based model predicted correctly, with a higher confidence of 0.82. Under the same condition, the scratch- and ImageNet-based models incorrectly predicted the input patch as a normal patch, with low confidence levels of 0.49 and 0.36, respectively.
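The Grad-CAM computation behind Fig. 3 reduces to a few lines: the channel weights are the global-average-pooled gradients of the tumor score with respect to a convolutional layer's activations, and the map is the ReLU of the weighted sum of activation channels. A self-contained numpy sketch on synthetic activations (not tied to the paper's Inception v.3 features):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM from a conv layer's activations and the gradients of
    the tumor logit w.r.t. those activations.

    activations, gradients: arrays of shape (H, W, K).
    Returns an (H, W) map rescaled to [0, 1].
    """
    alpha = gradients.mean(axis=(0, 1))                       # (K,) channel weights
    cam = np.maximum((activations * alpha).sum(axis=2), 0.0)  # ReLU of weighted sum
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example: channel 0 fires top-left with a positive gradient;
# channel 1 fires bottom-right with a negative gradient, so only the
# top-left region should light up in the map.
acts = np.zeros((4, 4, 2))
acts[:2, :2, 0] = 1.0
acts[2:, 2:, 1] = 1.0
grads = np.zeros((4, 4, 2))
grads[..., 0] = 0.5
grads[..., 1] = -0.5
cam = grad_cam(acts, grads)
```

The ReLU matters: regions whose activations push the prediction away from the tumor class are zeroed out rather than shown as negative evidence.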

Discussion
We validated transfer learning using ImageNet and CAMELYON16 to reduce intensive labor costs by comparing with the scratch-based model. Three model types based on scratch, ImageNet, and CAMELYON16 weights, trained with different ratios of the training dataset, i.e., 2, 4, 8, 20, 40, and 100%, were evaluated.
The scratch-based models failed to train with ratios of 8% or less of the training dataset owing to the very limited data, as shown in Fig. 2a. The scratch-based models, however, began to train successfully as the training dataset increased, as shown in Table 2. CAMELYON16-based models trained with all ratios of the training dataset reached lower losses and higher accuracies on the validation dataset than the scratch-based models at both the first and last epochs. This tendency resulted in significantly higher AUCs for patch- and slide-level tumor classification, and could also save time in effective model training.
Comparing the ImageNet- and CAMELYON16-based models, the CAMELYON16-based models showed significantly higher AUCs for patch- and slide-level classification when trained with 40% or less and 20% or less of the training dataset, respectively. The loss and accuracy of both models trained with 100% were close to each other at every epoch, resulting in comparable AUCs of 0.943 and 0.944 for patch-level classification and 0.888 and 0.886 for slide-level classification.
External validation with SNUBH was conducted to validate the transfer learning effectiveness. Although the overall AUCs of the ImageNet-based and CAMELYON16-based models for all training dataset ratios were slightly lower than on the AMC dataset, a tendency of increasing AUC was observed as additional data were used for model training. The primary difference between AMC and SNUBH is the MPP (micrometers per pixel), which defines the physical size of each pixel and is determined when scanning the whole slide. The MPPs of the AMC and SNUBH datasets are 0.24 and 0.50, respectively. Considering patches at the same level, the patch resolution in the SNUBH dataset is approximately half that of the AMC dataset patches. To overcome this issue, a harsher resizing augmentation during training was used to reduce variance between the AMC and SNUBH datasets.

Grad-CAM heat-maps of the three model types trained with 4, 20, and 100% of the training dataset were examined, as shown in Fig. 3, to observe how reasonably each model was trained. Grad-CAMs of all model types showed higher confidence for tumor patches as the training dataset increased. However, the CAMELYON16-based model trained with 4% of the training dataset showed a higher tumor-patch confidence of 0.82 than the scratch- and ImageNet-based models at 0.49 and 0.36. Scratch-based models trained with 8% or less of the training dataset predicted all patches as non-tumor patches at 0.51 confidence, which indicates overfitting to the non-tumor class.
Additional experiments with various learning rates, considered a major factor affecting model performance, were conducted by setting the value from 5e−2 (0.05) to 5e−6 (0.000005) to observe the tendency of model performance in terms of AUC for patch-level evaluation on the AMC dataset. Model performances are listed in Table 3. In these results, learning rates from 5e−4 to 5e−5 were sufficient to efficiently train all model types, including the scratch-based model. With learning rates of 5e−3 or greater, most trainings diverged at the beginning. With learning rates below 5e−5, training required more epochs to converge and converged at a relatively higher loss, which decreased model performance.
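The sweep described above can be summarized as follows; the `bucket` helper merely encodes our paraphrase of the qualitative outcomes reported in the text and Table 3, not a general rule about learning rates.

```python
# Learning-rate grid used in the sweep: factors of 10 from 5e-2 to 5e-6.
grid = [5 * 10.0 ** (-k) for k in range(2, 7)]

def bucket(lr):
    """Qualitative training outcome observed for each band (our
    paraphrase of the reported results; boundaries as described)."""
    if lr >= 5e-3:
        return "diverges early"
    if lr >= 5e-5:
        return "trains efficiently"
    return "slow convergence, higher loss"
```

Spacing candidate rates by constant factors, as here, is the usual way to probe several orders of magnitude with few runs.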
Our study has several limitations. First, a tumor patch was selected for training only when the tumor region covered more than 80% of the patch, which resulted in models specific to that criterion. Patch classification performance depends on the selection criteria, and different criteria may slightly affect the models' characteristics. Second, the method of slide-level classification using the maximum value of the confidence map is so straightforward that noise in the confidence map could significantly affect the results. This strategy should be addressed with a more robust method, such as a random forest classifier with local and global features.

Table 2. Performance comparison of models based on different initial weights for different ratios of the training dataset. On the AMC dataset, sensitivity, specificity, and accuracy at a threshold of 0.5, and the AUC, were measured for patch-level evaluation; the AUC for slide-level evaluation was measured on the AMC and SNUBH datasets. Statistical comparisons between the AUCs of the CAMELYON16-based model and the other models trained with the same ratio of the training dataset were performed to determine whether differences in performance with different initial weights were significant. AMC, Asan Medical Center; SNUBH, Seoul National University Bundang Hospital. *p-value < 0.05, **p-value < 0.0005.

Conclusions
Training a CNN model with a limited number of pathology datasets is an essential task to reduce intensive labor costs. To overcome this issue on intraoperative frozen SLN biopsy tissue, we validated the effectiveness of transfer learning with models pre-trained on open datasets, i.e., the ImageNet dataset and the CAMELYON16 dataset obtained from FFPE tissue, by comparing them with a scratch-based model. In our results, the CAMELYON16-based