Classification of subtypes including LCNEC in lung cancer biopsy slides using convolutional neural network from scratch

Identifying the lung carcinoma subtype in small biopsy specimens is an important part of determining a suitable treatment plan but is often challenging without the help of special and/or immunohistochemical stains. Pathology image analysis that tackles this issue would be helpful for diagnoses and subtyping of lung carcinoma. In this study, we developed AI models to classify multinomial patterns of lung carcinoma; ADC, LCNEC, SCC, SCLC, and non-neoplastic lung tissue based on convolutional neural networks (CNN or ConvNet). Four CNNs that were pre-trained using transfer learning and one CNN built from scratch were used to classify patch images from pathology whole-slide images (WSIs). We first evaluated the diagnostic performance of each model in the test sets. The Xception model and the CNN built from scratch both achieved the highest performance with a macro average AUC of 0.90. The CNN built from scratch model obtained a macro average AUC of 0.97 on the dataset of four classes excluding LCNEC, and 0.95 on the dataset of three subtypes of lung carcinomas; NSCLC, SCLC, and non-tumor, respectively. Of particular note is that the relatively simple CNN built from scratch may be an approach for pathological image analysis.

Lung cancer is most common cause of cancer death (18.0%) while it was the second most commonly diagnosed cancer (2,206,771; 11.4%) worldwide in 2020 behind female breast cancer (2,261,419; 11.7%) 1 . The operability of lung cancer is determined by the cancer stage. At the time of diagnosis, approximately 70% of lung cancer patients have advanced stages and are inoperable. In many cases, a final histological diagnosis is made from a small biopsy tissue. After a small biopsy, lung carcinomas are classified as adenocarcinoma (ADC), squamous cell carcinoma (SCC), small cell lung cancer (SCLC), large cell neuroendocrine carcinoma (LCNEC), or "non-small cell carcinoma (NSCC), not otherwise specified (NOS)", etc. by additional special and/or immunohistochemical stains, if necessary. The classification "NSCC, NOS" is the diagnostic term used when a lung carcinoma cannot be classified into a specific type from a small biopsy specimen despite the use of the additional stains or when no stains are available 2 . Subtyping for lung carcinoma in small biopsy specimens is often challenging without these special and/or immunohistochemical stains.
Histological subtyping of lung carcinoma is a vital part of determining a suitable treatment plan. In cases of small cell lung cancer, chemotherapy is usually used alone without any surgical treatment because of the advanced cancer stage at the time of diagnosis; it is hard to meet surgical specimens of small cell lung cancer in practice. In the patients with NSCC, surgery is determined by the cancer stage, and the chemotherapy regimen depends on the histological subtype 3,4 .
With the emergence of digital pathology, there have been studies that attempt to analyze digital slide images of lung cancers using deep learning [5][6][7][8][9][10][11][12] . Deep learning-based pathology image analysis that can accurately diagnose and subtype lung carcinoma would be useful in daily practice as an auxiliary means. There are several deep learning studies that have proposed technologies that are capable of distinguishing non-neoplastic lung tissue and lung carcinoma subtype including adenocarcinoma, squamous cell carcinoma, and small cell lung cancer, which are frequently seen in lung biopsy 8,10,12 , however, LCNEC was not included in the studies.  16 , which were benchmarked on ImageNet dataset. A simple CNN model dedicated to a specific task, such as pathological image analysis, could produce more efficient results. In this study, we propose a CNN model built from scratch along with pre-trained CNN models for classification of subtypes including LCNEC in lung cancer biopsy slides as an efficient method for pathology image analysis.

Results
Overview of AI models for classification of lung cancer subtypes. We developed and evaluated AI models with a convolutional neural network structure for multi-classification of lung cancer subtypes ( Table 2, Fig. 1c-1) either built from scratch on the Keras Sequential API (https:// keras. io/) or based on four well-known, pre-trained CNNs using transfer learning ( Fig. 1c-2). Pathology slides of lung or bronchus biopsies from 203 patients and surgical specimens from 2 patients were collected at the Gyeongsang National University Hospital. Cancer regions on the 205 WSIs were annotated by pathologists, and the tumor areas in these images were extracted and used to generate non overlapping patches 256 × 256 pixels in size at a magnification of 20 × using DeepPATH 7 , as shown in Fig. 1a. Figure 1b shows the dataset containing a total of 5 classes; 21 of ADC, 14 of LCNEC, 22 of SCC, 21 of SCLC whole slide images (WSIs), and 127 of non-tumor WSIs. 10,049 sample patch images were generated from 205 WSIs to represent the four lung cancer subtypes as well as the negative case. Out of the total of 9849 patches, 7089 patches were used to construct a training set, 785 patches were selected for a validation set, and 2175 patches were assigned to the test set as shown in Table 1. Test set consisted of 168 patches from 3 of ADC slides, 539 patches from 2 of LCNEC slides, 346 patches from 3 of SCC slides, 243 patches from 3 of SCLC slides, and 879 patches from 11 of non-tumor slides as shown in Table 1. The architecture of the CNN built from scratch is described in Table 2. Four pre-trained convolutional neural networks, ResNet152 15 , VGG19 16 , Xception 17 , and NASNETLarge 18 , were also used in this study. To compile the model for each CNN, nadam and categorical cross entropy were chosen as the respective optimizer and loss function. Each AI model diagnosed the images and output either one of the four types of lung cancer or the negative case using the two test sets, as shown in Fig. 1d. Diagnostic performances of AI models for multi-classification. To evaluate the diagnostic performance of each multi-classification AI model, confusion matrices and the area under the curve (AUC) of the receiver operating characteristic curve (ROC) for each model over the test sets were calculated as shown in Fig. 2. For the confusion matrices, each row of one of the matrices represents the number of patches in each predicted class according to the AI models, while each column represents the actual number of instances in The model takes a tensor with dimensions of (244, 244, 3) as input, the CNN consists of four convolution blocks with a max pool layer in each. The 1st and 2nd hidden layer have 16 and 32 filters, respectively, with a kernel size of (2, 2) and use a rectified linear unit (ReLU) as their activation functions. The 3rd and 4th hidden layer have 64 filters with a kernel size of (2, 2) and also use rectified linear units (ReLU). The CNN also contains a fully connected dense layer with 5 units that uses softmax as its activation function. When compiling the model, Nadam and categorical cross entropy were chosen as the optimizer and loss function, respectively. (c-2) Transfer Learning with pre-trained CNNs. Four pre-trained convolutional neural networks based on ResNet152 15 , VGG19 16 , Xception 17 , and NASNETLarge 18 were evaluated in this study. After the pre-trained models were chosen, we repurposed their already learned knowledge and carried out fine-tuning for our task. (d) Diagnosis. The AI models diagnosed the input images from the test sets as either one of four types of lung cancer or as a negative case.  Fig. 2b. It produced the weighted average of precision, recall, and f1-score of 0.5384, 0.5237, and 0.4685 as shown in Table 3. The VGG19 model achieved AUC of 0.87 for the macro-average ROC curve, and its AUCs for classifying subtypes of lung cancer ranged from 0.79 to 0.95 on the test set, as shown in Fig. 2d. The weighted average of precision, recall, and f1-score of the VGG19 were 0.6767, 0.6161, and 0.6009, respectively. Figure 2e,f show the confusion matrix and AUROC of the Xception model. The Xception model achieved AUC of 0.90 for the macro-average ROC curve, which is the highest among the four pre-trained CNN models, and its AUCs for classifying each cancer subtype ranged from 0.79 to 0.95. The weighted average of precision, recall, and f1-score of the Xception were 0.6502, 0.6708, and 0.6485, respectively. The NASNetLarge model achieved the AUC of 0.86 for the macro-average ROC curve, and its AUCs for classifying subtypes of lung cancer ranged from 0.77 to 0.93 on the test set, as shown in Fig. 2d. The weighted average of precision, recall, and f1-score of the NASNetLarge were 0.6904, 0.6506, and 0.6293, respectively. Figure 2i,j showed diagnostic performance of the CNN built from scratch. It produced an AUC of 0.90 for the macro-average ROC curve, its AUCs for classifying subtypes of lung cancer ranged from 0.68 to 0.98. The weighted average of precision, recall, and f1-score of the scratch CNN were 0.7632, 0.7503, and 0.7428, respectively. The model Xception and ResNet152 models had the highest and the lowest performance among the pre-trained CNNs. The CNN built from scratch showed similar performance to the model with Xception as the results of AUC, but showed better performance than the Xception model considering overall diagnostic performance (i.e. precision, recall f1-score, and accuracy). Table 4 displays the detailed classification results (i.e. the precision, recall, f1 score, and accuracy) of the CNN models for each class.
Predictive analysis of CNN from scratch for discriminating patterns. We evaluated the diagnostic performance of the AI models for multinomial classification of lung cancer subtypes on the test sets. Dataset′ consisted of four classes excluding LCNEC from the original dataset, and dataset″ composed three classes with NSCLC (combining ADC, LCNEC, and SCC), SCLC, and non-tumor. The confusion matrices and the area under the curve (AUC) of the receiver operating characteristic curve (ROC)s were evaluated for the CNN model built from scratch over the test sets as shown in Fig. 3. Table 4 displays the classification results of the CNN from scratch over the test sets. The built from scratch CNN model produced the AUC of 0.90 for the macro-average ROC curve, its AUCs for classifying subtypes of lung cancer ranged from 0.68 to 0.98 in the original test set as shown in Fig. 3a,b. On the other hand, the macro-average AUC of the model raised up to 0.98 on the test set www.nature.com/scientificreports/ from the dataset′, which excluded the class of LCNEC (see in Fig. 3c,d). Table 3 showed that the weighted average of precision, recall, and f1-score of the scratch CNN raised from 0.7632 to 8788, from 0.7503 to 8674, and from 0.7428 to 8708, respectively as the dataset was excluded LCNEC. The prediction analysis of the scratch CNN for the test set from the dataset reconstructed into new classes (NSCLC, SCLC, and non-tumor; dataset″) were evaluated as shown in Table 4 and Fig. 3e,f. The scratch CNN model produced the AUC of 0.96 for the macro-average ROC curve, its AUCs for classifying subtypes of lung cancer ranged from 0.90 to 0.99 as shown in Fig. 3e,f. The weighted average of precision, recall, and f1-score of the scratch CNN model for the dataset″) were 0.9099, 0.8920, and 0.8964, respectively as Table 3.

Discussion
In this work, we generated various deep learning-based AI models for classification of subtypes in lung cancer biopsy slides. The AI models for the classification of subtype lung carcinoma proceeded with the flow of digital WSI data preparation, convolutional neural networks, loss functions, and diagnostic performance evaluation. Previous research involving lung cancer image analysis with deep learning has focused on differentiating between non-cancer tissue and lung cancer 5,6 to distinguish adenocarcinoma, squamous cell carcinoma 7 , or looked to identify small cell carcinoma in addition to the cancers previously mentioned 8 www.nature.com/scientificreports/ distinguish LCNEC from small cell carcinoma using deep learning, however, this is based on cytology images as the input information 9 . We could not find any studies that have attempted to classify the major carcinoma subtypes including LCNEC, using lung biopsy images as the input.
In this study, we took two approaches to developing AI models that distinguish multinomial patterns of lung carcinoma: transfer learning with pre-trained CNNs and a built from scratch CNN. First, we used transfer learning with pre-trained convolutional neural networks based on ResNet152 15 , VGG19 16 , Xception 17 , and NASNETLarge 18 . Among those pre-trained CNN models, Xception achieved the highest overall performance and ResNet152 produced the lowest performance. These results demonstrate that transfer learning with pre-trained CNNs on ImageNet 19 is relatively easy to access and ensures obtaining a certain level of verified accuracy. However, these results also imply, even among the pre-trained CNN models, the performances of each model may vary in pathology image analysis, as shown in Table 3. The model with ResNet152 achieved the lowest performance despite being known as proven performance. ResNet152 showed significantly lower performance, especially for ADC and LCNEC, than the results of other models. In addition, an interesting result is that ResNet15 misclassified the SCLC cases as well as most cases of ADC and LCNEC, while the other models easily diagnosed the class of SCLC. ResNet152 may require more fine-tuning to analyze for pathology images. The previous studies have tended to use pre-trained CNNs on a large benchmark dataset, for example, ImageNet 19 . However, our dataset consisted only of the scanned hematoxylin-eosin stained pathology slides, which is away from the ImageNet 19 . Our results imply that using a pre-trained CNN with proven performance is one method, but it does not necessarily guarantee high performance. In other words, it may be crucial to choose an appropriate model for the specific task, such as analyzing pathological images.
The second approach was to customize a new convolutional neural network model. The built from scratch CNN model performed similar or better than the transfer learning using pre-trained CNNs on a large benchmark dataset. The CNN built from scratch showed similar performance to the model with Xception as the results of AUC, but showed better performance than the Xception model considering overall diagnostic performance (i.e. precision, recall f1-score, and accuracy). These results imply that the built from scratch CNN model fitted to the specific task at hand with a certain amount of data, such as pathological diagnosis of lung cancer subtypes, could be expected to produce better performances rather than that of transfer learning using the pre-trained CNNs.
For each class, Xception, NASNetLarge, and CNN from scratch were relatively good at distinguishing the classes of SCLC and non-tumor tissue, but not at distinguishing NSCLC subtypes (ADC, LCNEC, SCC; see in Table 3). Although it did not show that the performance of the scratch CNN was superior for ADC, LCNEC, and SCC to others, overall, those were similar or better than the performance of the other models.
The AUC for LCNEC on the CNN from scratch was exceptionally low (AUC = 0.68) compared to others (see in Fig. 2j). Although the reason for the low AUC for LCNEC on our model is not clearly known, the use of surgical specimens in the test set may be one possible reason. Further studies are needed for degree of performance on deep-learning models trained with lung cancer biopsy slides depending on sample subtypes.  www.nature.com/scientificreports/ In additional analyses, the CNN from scratch showed better performance for distinguishing 4 classes (excluding LCNEC) and 3 classes (NSCLC, SCLC, and non-tumor) as shown in Table 4. For classification excluding LCNEC, the CNN from scratch showed similar AUCs (0.94 for ADC, 0.96 for SCC, and 0.99 for SCLC) with Kanavati's study 10 (0.814-0.987 for ADC, 0.959-0.989 for SCC, and 0.994-0.999 for SCLC). The performance change when LCNEC is not classified can be interpreted as follows. One possible reason is that classification of LCNEC makes prediction of other cancer subtypes difficult. Actually, distinguishing among poorly differentiated ADC, poorly differentiated SCC, and LCNEC is challenging without immunohistochemistry, and LCNEC may also be difficult to distinguish from SCLC in biopsy in some cases. Travis et al. reported that unanimous agreements in LCNEC and SCLC were 40% and 70% of cases, respectively, and the most common disagreement occurred between LCNEC and SCLC in surgically resected pulmonary neuroendocrine tumors reviewed independently by five pulmonary pathologists 20 .
Another possibility is increasing the number of classes may cause the prediction to be more challenging in general. www.nature.com/scientificreports/ There are some limitations to our study. First, our data set may be a relatively small number of samples to train and test the models for classifying into five classes. The number of samples in each class might be small, especially in the class of LCNEC. In addition, LCNEC cases in the test set were surgical specimens. However, even with relatively small data, a similar rate of performance results was obtained compared to previous studies with large amounts of data. This suggests that, in the development of AI models, finding a quantitative criterion of data for training and performance testing should be conducted in future research. Secondly, our study was conducted at patch level compared to the actual pathological diagnosis performed at slide level. The diagnoses of biopsy are more difficult than that of surgical specimens for pathologists. Therefore, it appears that NSCLC classification performance is lower than expected, as it may be more difficult to differentiate cancer subtypes at the patch level than at the slide level. Diagnosis of AI models at slide-level is generally performed first at the patch level. A class is determined first at the patch level, and then diagnosed at slide level by the proportion of the certain class patches on the slide. Thus, the performance of AI models at the slide level should be similar or better than that of at the patch level. In previous studies 10 have shown that the prediction rate of the AI model at the patch level was similar to or lower than the prediction rate at the slide level. This study was conducted to evaluate the optimal diagnostic performance at the patch level. The third limitation of our study is that our dataset is "controlled' data, not "real-world" data. The dataset consisted of lung or bronchus biopsy slides from Gyeongsang National University Hospital in 2012. Due to the relatively small number of LCNEC biopsy cases (n = 2) in 2012, we added more LCNEC biopsy cases from 2013 to 2018 to ensure appropriate analysis for LCNEC and to balance among the diagnoses. Our dataset does not represent real-world distribution of diagnoses, so this may produce lower performance on real-world dataset. However, small-scale research is impossible if only realworld data is used when AI pathology studies for cancers with a low incidence rate such as pulmonary LCNEC. Lastly, our models were built using an intra-hospital dataset. In other studies, public data such as the TCGA dataset was used to classify lung cancer versus normal tissue or the multinomial classification of the various cancer subtypes 4,5 , however, at the time of writing, there are no public databases that include LCNEC and small cell carcinoma slides available.
In conclusion, we compared two types of deep learning-AI models for classification of subtypes including LCNEC in lung cancer biopsy slides. The first type of model was a CNN pre-trained using transfer learning, and our experiments showed that this model was able to consistently classify the various classes of lung cancer type at a certain level of verified accuracy using actual pathological data as the input. The performance achieved in the experiments conducted demonstrates that a CNN model has the potential to be the basis for developing diagnostic workflow systems of the diagnosis and subtyping of lung cancers. However, the pre-trained CNNs on ImageNet 19 are generally complicated and it requires a time-consuming process to run. In addition, expensive equipment and lots of electrical energy are indispensable for experiments. This seems to be challenging for AI models to apply and practical use to all hospitals immediately. Nevertheless, studies on AI in pathology have been actively conducted, and it will be expected to continue and produce a lot of progress. The information gathered in this study suggests that one approach to pathology image analysis is to use a relatively simple CNN built from scratch model fitted to the specific task, like the model demonstrated in this paper.

Materials and methods
Patients. We collected hematoxylin-eosin stained pathology slides from the pathology reports of 171 patients that underwent lung or bronchus biopsies at Gyeongsang National University Hospital, Jinju, Korea, in 2012, and the pathology slides of 12 patients diagnosed with large cell neuroendocrine carcinoma in their biopsy at Gyeongsang National University Hospital from 2012 to 2018. Out of the patients, the pathology slides were taken from 18, 19, and 18 of patients were respectively diagnosed with adenocarcinoma, squamous cell carcinoma, and small cell carcinoma, while other slides (n = 116) all came from non-tumor cases. Those slides were constructed as the training and validation in the dataset as shown in Table 1. The test set was composed of additional cases from 2013 and 2015 (3 of ADC, 2 of LCNEC, 3 of SCC, 3 of SCLC, and 11 of non-tumor slides). Among these slides, the two cases of LCNEC (one was from 2013, the other was from 2015) were surgical specimens. The biopsy was not done for the two cases, and those were entirely independent cases from the slides in the training and validation set. Each diagnosis was histopathologically confirmed by two experienced pathologists. This study was approved by the Institutional Review Board of Gyeongsang National University Hospital with a waiver for informed consent (2021-04-016), and all methods were performed in accordance with the relevant guidelines and regulations. Data preprocessing. A total of 205 WSIs were acquired from 205 pathology slides with an Aperio AT2 slide scanner (Leica Biosystems Division of Leica Microsystems Inc., IL, USA) and 400 ×. Two experienced pathologists annotated the cancer regions on the WSIs with Aperio ImageScope v12.4.3 (Leica Biosystems Division of Leica Microsystems Inc., IL, USA). Tumor areas were extracted from the annotated whole slide images (WSIs), the extracted areas were used to generate non-overlapping patches 256 × 256 pixels in size at a magnification of 20 × using DeepPATH based on the OpenSlide library in Python (Fig. 1a) 7 . The patches were removed if the percentage of background in the patch was above 25% according to the DeepPATH 7 program. In this process, 10,049 patches were generated from the original 205 WSIs containing either one of the four lung cancer subtypes or a negative case, as shown in Table 1. www.nature.com/scientificreports/ respectively. The test set was generated by the additional slides; 3 of ADC, 2 of LCNEC, 3 of SCC, 3 of SCLC, and 11 of non-tumor slides independent from the training and validation set. The test set consisted of 2175 patches in five classes. The detailed information of the dataset is seen in Table 1. To compare the performance of AI models, we reconstructed the dataset into dataset′ and dataset″. Dataset′ consisted of only four classes excluding LCNEC from the original dataset in Table 1. Dataset″ is the dataset consisting of three classes; NSCLC, SCLC, and non-tumor. The class of NSCLC was generated by combining ADC, LCNEC and SCC patches from the dataset in Table 1.

Convolutional neural networks.
A deep neural network (DNN) is a supervised classifier that contains multiple layers between input and output layers 21 . A convolutional neural network (CNN, or ConvNet) is a specialized kind of a DNN, CNNs are known to perform particularly when analyzing images 19-22 . We constructed a convolutional neural network model for the multinomial classification of lung cancer biopsies with the possible outputs being the four lung cancer types or the negative case. Our CNN was built on the Keras Sequential API (https:// keras. io/), written in Python and running on TensorFlow (https:// www. tenso rflow. org/) 23 . CNN models take tensors of a certain shape as input, for image analysis CNNs the shape of these tensors are dictated by the height of input images, width, and color channels. Our model takes inputs with dimensions of 244 × 244 × 3 and consists of four convolution blocks with a max pool layer in each. The 1st and 2nd hidden layers of the model have 16 and 32 filters, respectively, with a kernel size of (2, 2) and use a rectified linear unit (ReLU) as their activation functions. The 3rd and 4th hidden layers have 64 filters with a kernel size of (2, 2) and use a rectified linear unit (ReLU) as their activation functions, as shown in Table 2. The fully connected dense layer of the model has 5 units and uses a softmax activation function. Batch size of 200 and 100 epochs were determined as the optimum values for the model when considering both time and computational costs. When compiling the model, Nadam was chosen as the optimizer and categorical cross entropy was selected for the loss function.
Transfer learning with pre-trained ConvNets. We evaluated four AI models that used transfer learning to implement state-of-the-art pre-trained convolutional neural networks. Transfer learning is a subfield of machine learning and artificial intelligence which uses the learned weights of an already trained model to solve a different problem instead of starting the training process of a model over from scratch, this approach saves time and computational costs 24 . Transfer learning for computer vision problems is normally executed by applying pre-trained ConvNet architectures (e.g. VGG, ResNet, Xception etc.) that were trained on large benchmark datasets (e.g. ImageNet 19 ) to solve a particular problem. The pre-trained convolutional neural networks (ConvNets) allow us to build AI models for image classification with relatively high accuracy and diagnostic performance even if the target dataset is small or if the people tackling the problem do not have the required expertise to train a CNN from scratch. Pre-trained image classification networks are trained on a subset of the ImageNet database used in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 25 . Four pre-trained convolutional neural networks; ResNet152 15 , VGG19 16 , Xception 17 , and NASNETLarge 18 were used in this study. ResNet152 is a newer version of ResNet (Residual Network), which is a convolutional neural network built and trained by Microsoft 15 . VGG19, which has a depth of 19 layers, was established by the University of Oxford in 2014 16 . Xception which was built and trained by Google, is a novel deep convolutional neural network inspired by Inception 17 . It slightly outperformed InceptionV3 (GoogLeNet) 26 on the ImageNet database. Lastly, NASNET-Large is a state-of-art neural image classification model built and trained by Google in 2018 18 . The pre-trained models are freely accessible through the Keras Application (https:// keras. io/ api/ appli catio ns/), which is a deep learning library. After the pre-trained models were chosen, we repurposed the knowledge that had already been learned; the layers, features, weights, and biases by fine-tuning to generate the correct outputs for our problem. Batch sizes of 20 and 10 epochs were determined as the optimum values for the pre-trained CNNs in consideration of time and computational costs. When compiling each model, Nadam was chosen as the optimizer and categorical cross entropy was selected for the loss function. www.nature.com/scientificreports/ All statistical analyses were performed using the scikit-learn library (https:// scikit-learn. org/) from Python version 3.8.3 (https:// www. python. org/).

Data availability
The dataset used in this study might be shared upon reasonable request to Jung Wook Yang, MD, PhD.