Introduction

The 2017 WHO classification defines more than 100 subtypes of lymphoid neoplasm [1]. Pathological differentiation among these subtypes is achieved by investigating clinical, morphological, immunohistochemical, and genomic features [1]. The number of antibodies required for diagnostic immunohistochemical staining is increasing, along with those applied in differential diagnosis [1]. Moreover, it has been reported that even experienced hematopathologists find it difficult to make a diagnosis using standardized diagnostic criteria [2]. Therefore, there is room for improvement in establishing standardized diagnostic procedures, including computer-aided diagnosis.

Deep learning is a computational approach composed of multiple processing layers, and it has considerably improved the state of the art in various fields, including speech recognition, visual object recognition, and object detection [3]. Deep convolutional neural networks are a deep learning architecture used for image recognition, and achieved the highest score in an image recognition competition in 2012 [4]. Classifiers based on deep convolutional neural networks output a confidence score for each label given an input image. Each confidence score ranges from 0 to 1, and the confidence scores sum to 1. In general, the label with the highest confidence score is used as the predicted label.
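As a minimal illustration of how such scores arise, the sketch below applies a softmax, the standard way a network's raw outputs (logits) are converted into scores that sum to 1, and takes the highest-scoring label as the prediction. The logit values are arbitrary and chosen only for illustration.

```python
import numpy as np

# Toy example: raw network outputs (logits) for one image patch,
# one value per class (DLBCL, FL, RL). Values are illustrative.
logits = np.array([2.1, 0.3, -0.8])

# Softmax converts logits into confidence scores that each lie in
# [0, 1] and sum to 1, as described above.
scores = np.exp(logits - logits.max())
scores /= scores.sum()

labels = ["DLBCL", "FL", "RL"]
# The label with the highest confidence score is the predicted label.
predicted = labels[int(np.argmax(scores))]
print(dict(zip(labels, scores.round(3))), "->", predicted)
```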

Numerous studies have recently been conducted on histopathological image analysis using deep neural networks. Coudray et al. distinguished normal tissue from tumors, and adenocarcinoma from squamous cell carcinoma, using pathological images [5]. Steiner et al. detected breast cancer metastases in lymph nodes [6]. In the area of malignant lymphoma, Janowczyk and Madabhushi used deep neural networks to distinguish three lymphoma subtypes, namely chronic lymphocytic leukemia, follicular lymphoma (FL), and mantle cell lymphoma [7]. However, these studies only evaluated the abilities of the classifiers and did not compare the results with evaluations by pathology experts.

Diffuse large B-cell lymphoma (DLBCL) and FL are the most common subtypes of non-Hodgkin lymphoma [8]. DLBCL and FL are diagnosed mainly by their morphological and immunohistochemical features: DLBCL exhibits a diffuse proliferation of CD20-positive large B-cells [9, 10], whereas FL shows a nodular pattern of medium-sized B-cells expressing CD20, CD10, and bcl2 [11, 12]. In routine practice, pathological differentiation among DLBCL, FL, and reactive lymphoid hyperplasia (RL) is mostly performed as part of the hematopathological diagnosis.

In this study, to investigate the feasibility of computer-aided diagnosis of malignant lymphoma, we aimed to distinguish DLBCL, FL, and RL using deep learning, and compared the results with diagnoses based on hematoxylin and eosin (H&E) staining made by pathologists, including an experienced hematopathologist.

Materials and methods

Samples

This study analyzed 388 sections, comprising 259 DLBCLs, 89 FLs, and 40 RLs, from nodal and extranodal lesions (including the gastrointestinal tract) obtained by core needle biopsy (23), biopsy (333), and excision (32). The diagnosis of DLBCL, FL, or RL had previously been confirmed using immunohistochemistry, including CD3, CD20, CD10, and Bcl2. All sections were diagnosed at Kurume University between 2010 and 2017. The use of the samples in this study was approved by the ethics review committee of Kurume University in accordance with the recommendations of the Declaration of Helsinki.

Image preparation

Histopathological images were prepared using the following procedure (Fig. 1). First, we prepared glass slides of H&E-stained, formalin-fixed, paraffin-embedded tissue of the lesion areas. Second, all glass slides were converted into whole slide images (WSIs) using a whole slide scanner (Aperio AT2; Leica Biosystems, Inc.). Third, the pathological findings on the WSIs were annotated by hematopathologists; the total number of annotations was 6,183 (DLBCL, 3,865; FL, 1,209; RL, 1,109). Fourth, image patches of 2048 × 2048 pixels were cropped at the center of each annotation in each WSI at a magnification of ×5. In addition, image patches of 2048 × 2048 pixels were randomly cropped from the periphery of each annotation at magnifications of ×20 and ×40. If the average RGB value over all pixels was 200 or more, the image patch was discarded, because such a patch was considered either too white or to show too little of the specimen.

Fig. 1: Procedure for creating image patches.

Experienced hematopathologists annotated whole slide images (WSIs) of hematoxylin and eosin (H&E) staining. Image patches of 2048 × 2048 pixels were cropped at the center of each annotation in each WSI at a magnification of ×5, and image patches of 2048 × 2048 pixels were randomly obtained from the periphery of each annotation at magnifications of ×20 and ×40.
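The whiteness filter in the last step above can be written compactly. A minimal sketch follows, assuming each patch is stored as an image file; the function name and the file-based interface are illustrative, not the original pipeline.

```python
import numpy as np
from PIL import Image

WHITE_THRESHOLD = 200  # mean RGB value at or above which a patch is discarded

def keep_patch(patch_path: str) -> bool:
    """Return True if the patch shows enough tissue to keep.

    A patch whose average RGB value over all pixels is 200 or more is
    considered mostly background (too white) and is discarded.
    """
    rgb = np.asarray(Image.open(patch_path).convert("RGB"), dtype=np.float64)
    return rgb.mean() < WHITE_THRESHOLD
```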

Different additional cropping procedures were used during the training and testing phases. During the training phase, data augmentation was adopted to improve the generalization performance of the classifiers: smaller image patches were randomly cropped from the 2048 × 2048 pixel image patches. Specifically, 128 × 128 pixel patches were randomly cropped at the ×5 magnification, and 64 × 64 pixel patches were randomly cropped at the ×20 and ×40 magnifications (Fig. 2a). During the testing phase, smaller image patches were cropped sequentially (Fig. 2b), as sketched in the code after Fig. 2. The classifiers produced prediction values for the sequentially cropped image patches, and these values were averaged to obtain the final prediction.

Fig. 2: Preprocessing during the training and testing phases.

a Preprocessing during the training phase. Image patches of 128 × 128 pixels were randomly cropped from the 2048 × 2048 pixel image patches at a magnification of ×5 (left). Image patches of 64 × 64 pixels were randomly cropped from the 2048 × 2048 pixel image patches at magnifications of ×20 and ×40 (right). b Preprocessing during the testing phase. Image patches of 128 × 128 pixels were sequentially cropped from the 2048 × 2048 pixel image patches at a magnification of ×5 (left). Image patches of 64 × 64 pixels were sequentially cropped from the 2048 × 2048 pixel image patches at magnifications of ×20 and ×40 (right).
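The two cropping regimes can be sketched as follows. The model interface and the non-overlapping tiling stride at test time are assumptions (the exact stride is not specified in the text); this is an illustration of the procedure, not the original code.

```python
import numpy as np

def random_crop(patch: np.ndarray, size: int) -> np.ndarray:
    """Training phase: crop a size x size window at a random position."""
    h, w = patch.shape[:2]
    y = np.random.randint(0, h - size + 1)
    x = np.random.randint(0, w - size + 1)
    return patch[y:y + size, x:x + size]

def sequential_crops(patch: np.ndarray, size: int):
    """Testing phase: tile the 2048 x 2048 patch with sequential
    (here non-overlapping) windows."""
    h, w = patch.shape[:2]
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            yield patch[y:y + size, x:x + size]

def predict_patch(model, patch: np.ndarray, size: int) -> np.ndarray:
    """Average the model's confidence scores over all sequential crops
    to obtain the final prediction for the large patch."""
    scores = [model(crop) for crop in sequential_crops(patch, size)]
    return np.mean(scores, axis=0)
```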

Evaluation

The deep neural network classifier was evaluated through cross-validation and through a comparison with pathologists.

Cross-validation

Cross-validation is an established method for statistically evaluating models and assessing how accurately they predict unknown data [13,14,15,16]. Cross-validation randomly divides the data into training and test sets, which are used for training and evaluation, respectively. K-fold cross-validation divides the samples into K groups: one group is used as the test set, and the remaining K − 1 groups are used as the training set. The test can be performed K times by shifting the test group each time, and the results of all folds are combined and averaged. In this way, all data are effectively used for both training and testing, and the evaluation results are statistically stable and accurate.

In this study, we prepared test sets at magnifications of ×5, ×20, and ×40, with 100 image patches in each test set. The remaining patches were randomly divided into five equal groups, and the test was conducted five times. In each run, one group served as the validation set, which was not used for training but for evaluating the classifier performance after every epoch, and the other four groups were used as the training set. An epoch is the unit of training in which a classifier is trained once on the full training set; the total number of epochs was 30. The classifier with the best validation accuracy in each run was chosen and then evaluated on the test set. This process was repeated five times, so that every group served as the validation set once. The accuracy, defined as the number of correct predictions divided by the total number of predictions, and the area under the curve (AUC) of the receiver operating characteristic (ROC) curve were calculated.
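A schematic of this protocol is sketched below, assuming PyTorch-style model objects and hypothetical helpers train_one_epoch and evaluate_accuracy; it illustrates the procedure, not the original training code.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, X_test, y_test, build_model, n_epochs=30):
    """Five runs: in each, one group is the validation set and four are
    the training set; the model with the best validation accuracy over
    the 30 epochs is evaluated on the fixed 100-patch test set."""
    test_accuracies = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True).split(X):
        model, best_val, best_state = build_model(), -1.0, None
        for _ in range(n_epochs):
            train_one_epoch(model, X[train_idx], y[train_idx])            # hypothetical helper
            val_acc = evaluate_accuracy(model, X[val_idx], y[val_idx])    # hypothetical helper
            if val_acc > best_val:  # keep the best-validating classifier
                best_val, best_state = val_acc, model.state_dict()
        model.load_state_dict(best_state)
        test_accuracies.append(evaluate_accuracy(model, X_test, y_test))
    return np.mean(test_accuracies)
```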

Comparison with results from pathologists based on evaluation of H&E staining

For the comparisons, seven pathologists, including pathological trainees, general pathologists, and experienced hematopathologists, evaluated both the H&E WSIs of the 388 sections, in which the overall architecture of the lesion could be assessed at low- to high-power magnification, and the 100 image patches at magnifications of ×5, ×20, and ×40 prepared as the test set for the classifier, as described above. For the WSI evaluation, the seven pathologists comprised two pathological trainees, two general pathologists, and three experienced hematopathologists; for the evaluation of the 100 image patches, they comprised two pathological trainees, four general pathologists, and one experienced hematopathologist.

The performances of the seven pathologists and the classifier on the test set of image patches at magnifications of ×5, ×20, and ×40 were compared. A model ensemble was adopted to evaluate the classifier: we averaged the confidence scores of all 15 classifiers trained in the cross-validation, namely five classifiers each at magnifications of ×5, ×20, and ×40, to make the final prediction.

The precision, defined as the number of true positives divided by the number of true positives plus false positives; the recall, defined as the number of true positives divided by the number of true positives plus false negatives; and the F1 score, the harmonic mean of precision and recall, were calculated for comparison.
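In symbols, with TP, FP, and FN denoting the numbers of true positives, false positives, and false negatives for a given class:

Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 × Precision × Recall / (Precision + Recall).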

Results

Data preparation proceeded as follows. Image patches for training were cropped from the glass slides (Fig. 1). The total numbers of image patches were 6,083 (×5), 42,674 (×20), and 42,446 (×40). Example image patches are shown in Fig. 3. Smaller image patches were then cropped to augment the training data (Fig. 2a, b).

Fig. 3: Examples of image patches.

The top row indicates the disease names, the second row shows image patches at a magnification of ×5, the third row shows those at a magnification of ×20, and the bottom row shows those at a magnification of ×40.

The classifier used in this study was based on a deep neural network consisting of 11 layers, including four convolutional layers and two fully connected layers (Fig. 4). The classifier had previously been validated in several medical image analysis tasks [17, 18].

Fig. 4: Deep neural network classifier used in this study.

The classifier consists of 11 layers, including 4 convolutional layers and 2 fully connected layers.
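For illustration, one plausible PyTorch layout of such a network is sketched below. The filter counts, kernel sizes, pooling layers, and 64 × 64 input size are assumptions, not the actual architecture (which is given in Fig. 4 and refs. [17, 18]), and layer-counting conventions (e.g., whether pooling layers are counted) vary.

```python
import torch.nn as nn

class PatchClassifier(nn.Module):
    """Sketch of a small CNN with four convolutional and two fully
    connected layers for three-class classification (DLBCL, FL, RL)."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 64 -> 32
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 32 -> 16
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 16 -> 8
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 8 -> 4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, n_classes),            # softmax applied at inference
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```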

The classifier was trained with the image patches; examples of the predicted images are shown in Fig. 5. The red areas were predicted as DLBCL with a score of 0.7 or higher, the green areas as FL, and the yellow areas as RL. The white areas indicate that the confidence scores of all three categories were below 0.7, i.e., no clear prediction was made.

Fig. 5: Examples of predicted images.

The red areas are predicted as diffuse large B-cell lymphoma (DLBCL) with a score of 0.7 or higher, the green areas as follicular lymphoma (FL), and the yellow areas as reactive lymphoid hyperplasia (RL). White (transparent) areas are indistinguishable because the scores of all three categories are below 0.7, or show only bare glass.
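This coloring rule can be stated compactly. The sketch below assumes the per-region confidence scores have already been arranged into an H × W × 3 array; the array layout and function name are illustrative.

```python
import numpy as np

COLORS = {  # RGB colors used in Fig. 5
    0: (255, 0, 0),    # DLBCL: red
    1: (0, 255, 0),    # FL: green
    2: (255, 255, 0),  # RL: yellow
}

def colorize(score_map: np.ndarray, threshold: float = 0.7) -> np.ndarray:
    """Map per-region confidence scores (H x W x 3 classes) to colors.

    A region is colored only if its highest confidence score reaches the
    threshold; otherwise it stays white (no clear prediction)."""
    h, w, _ = score_map.shape
    overlay = np.full((h, w, 3), 255, dtype=np.uint8)  # white by default
    best = score_map.argmax(axis=-1)
    confident = score_map.max(axis=-1) >= threshold
    for label, rgb in COLORS.items():
        overlay[confident & (best == label)] = rgb
    return overlay
```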

Cross-validation

The results of the cross-validation are summarized in Table 1 and Fig. 6. The average accuracies of our classifiers were 87.0, 91.0, and 89.0%, and the best accuracies were 94.0, 93.0, and 92.0%, at magnifications of ×5, ×20, and ×40, respectively (Table 1). Almost all ROC curves of the cross-validation showed an AUC higher than 0.95 (Fig. 6).

Table 1 Accuracy of cross-validation. A total of 100 image patches with a pixel resolution of 2048 × 2048 (at magnifications of ×5, ×20, and ×40) were used as the test set.
Fig. 6: Receiver operating characteristic curves of cross-validation.

The black line on the right indicates DLBCL, the dark gray line indicates FL, and the light gray line indicates RL. Almost all ROC curves of the cross-validation showed an area under the curve (AUC) higher than 0.95.

Comparison with pathologists

The accuracies of the seven pathologists were 83.3, 81.4, 79.1, 72.4, 71.7, 70.4, and 66.8% for the evaluation of the WSIs, and 76.0, 75.0, 73.0, 72.0, 66.0, 60.0, and 49.0% for the image patches at magnifications of ×5, ×20, and ×40.

The study used a model ensemble, a technique for developing a better classifier by combining the outputs of multiple classifiers. Although several ensemble variants exist, "averaging" was adopted in this study: the predicted probabilities of all models are averaged (Fig. 7).

Fig. 7: Model ensemble.

The "averaging" method, a type of model ensemble that develops a better classifier by combining values from multiple classifiers, was adopted. This method averages the predictions of multiple models.
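A minimal sketch of this averaging follows, assuming each of the 15 trained models exposes a call that returns a length-3 vector of confidence scores for (DLBCL, FL, RL); in practice each model receives crops at its own magnification, which is omitted here for simplicity.

```python
import numpy as np

def ensemble_predict(classifiers, patch) -> int:
    """Averaging ensemble: take the mean of the confidence scores of
    all models and predict the class with the highest averaged score."""
    scores = np.mean([clf(patch) for clf in classifiers], axis=0)
    return int(np.argmax(scores))
```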

As a result, the best accuracy of our ensembled classifier on the image patches at magnifications of ×5, ×20, and ×40 was 97.0%, higher than that of any of the pathologists (Table 2). In addition, the AUCs of the ensembled classifier for DLBCL, FL, and RL were 1.00, 0.99, and 1.00, respectively. The ROC curves of the ensembled classifier showed a higher performance than almost all of the pathologists (Fig. 8). The recall, precision, and F1 score were also higher for the ensembled classifier than for the pathologists (Table 3). These results show that the ensembled classifier outperformed the pathologists. The confusion matrices are shown in Fig. 9: although some of the pathologists confused DLBCL with RL, the classifier made only a few errors.

Table 2 Comparison among seven pathologists and the ensembled deep neural network classifier for evaluation of 100 test patches at magnifications of ×5, ×20, and ×40.
Fig. 8: ROC curves of the ensembled classifier as compared with the diagnosis of the pathologists.

The performances of the pathologists are indicated by dots, and the classifier performance is represented by the ROC curves. The ROC curves of the ensembled classifier show a higher performance than that of almost all of the pathologists.

Table 3 Recall, precision, and F1 score of the classifier and pathologists.
Fig. 9: Confusion matrices of classifiers and pathologists.

The horizontal axis shows the ground-truth diagnosis, and the vertical axis shows the predicted diagnosis. The first row/column is DLBCL, the second FL, and the third RL. Although some pathologists confused DLBCL with RL, the classifier made only a few errors.

Discussion

In this study, we showed that the deep neural network classifiers achieved an excellent accuracy and outperformed the pathologists in terms of the multi-class classification of three lesions when using only H&E stained images.

In a differential diagnosis using only H&E staining among the three lesion types (RL, FL, and DLBCL), the diagnosis of DLBCL is considered the most straightforward. This may be because the morphological features of DLBCL differ markedly from those of FL and RL: DLBCL exhibits a diffuse proliferation of large neoplastic lymphoid cells, which can be detected at both high and low magnification [4,5,6,7]. Indeed, higher accuracy in the diagnosis of DLBCL was achieved by both the classifiers and the pathologists (Table 3).

By contrast, the morphological distinction between RL and FL is generally difficult without immunohistochemistry for CD10 and bcl2, because these lesions show a similar nodular proliferation of medium to large lymphocytes, although there are morphological differences in the shape and density of the follicles, as well as in the presence of polarity and tingible body macrophages within the follicles [4]. The neural classifier used in this study achieved the highest accuracy for RL and FL, probably because it can detect these morphological differences in much greater detail than the pathologists. Moreover, the classifier may detect morphological findings that pathologists do not focus on, or pixel-level image features that pathologists cannot recognize at all.

The classifier used in this study achieved a high accuracy in diagnosing certain types of malignant lymphoma using only H&E slide images. According to the WHO classification [1], the differential diagnosis of malignant lymphoma requires immunohistochemical staining for more than 30 antigens, which demands manpower, specific techniques, time, and cost. A neural classifier might overcome these restrictions and provide a simpler diagnosis in developing countries, where immunohistochemical analyses are difficult to conduct.

Recently, several studies have reported the usefulness of diagnosis by artificial intelligence (AI) for malignant lymphoma [19, 20]. Achi et al. achieved an accuracy of 95% with a convolutional neural network for the pathological diagnosis of reactive lymphadenopathies, DLBCL, Burkitt lymphoma, and small lymphocytic lymphoma [19]. For the pathological evaluation of DLBCL and Burkitt lymphoma, a convolutional neural network developed by Mohlman et al. showed an ROC AUC of 0.92 [20]. Further investigation may confirm the usefulness of AI for the pathological diagnosis of malignant lymphoma.

Although the utility of a neural classifier for the pathological diagnosis of malignant lymphoma was shown, this study has certain limitations that should be overcome before application to practical diagnosis. First, our classifier requires manual annotation, which demands effort from a hematopathologist; a method without such annotation should be developed. Second, there are more than 100 subtypes of lymphoproliferative disorders according to the WHO classification, and a neural classifier must be adapted to the diagnosis of other subtypes for practical use. Third, the classifier is specialized for WSIs, which are suitable for deep learning owing to the amount of data they provide. Because whole slide systems are generally expensive, the development of neural classifiers for other imaging media, including CCD cameras and camera phones, is desirable.

In conclusion, the neural classifier can conduct a pathological diagnosis using only H&E staining, and its performance exceeded that of the pathologists. AI technology might support pathological diagnosis, provide new findings that cannot be detected by pathologists, and enable pathological diagnosis at lower cost in developing countries.