Abstract
A pathological evaluation is one of the most important methods for the diagnosis of malignant lymphoma, but a standardized diagnosis is occasionally difficult to achieve even for experienced hematopathologists. Established procedures, including computer-aided diagnosis, are therefore desired. This study aims to classify histopathological images of malignant lymphoma through deep learning, a computer algorithm and a type of artificial intelligence (AI) technology. We prepared hematoxylin and eosin (H&E) slides of the lesion areas from 388 sections, namely, 259 with diffuse large B-cell lymphoma, 89 with follicular lymphoma, and 40 with reactive lymphoid hyperplasia, and created whole slide images (WSIs) using a whole slide system. The lesion areas of the WSIs were annotated by experienced hematopathologists. Image patches were cropped from the WSIs to train and evaluate the classifiers. Image patches at magnifications of ×5, ×20, and ×40 were randomly divided into a test set and a training and evaluation set. After training, the classifier was assessed on the test set through a cross-validation. The classifier achieved the highest accuracies of 94.0%, 93.0%, and 92.0% for image patches at magnifications of ×5, ×20, and ×40, respectively, in classifying diffuse large B-cell lymphoma, follicular lymphoma, and reactive lymphoid hyperplasia. When the diagnostic accuracies of the proposed classifier and seven pathologists, including experienced hematopathologists, were compared on the test set of image patches at magnifications of ×5, ×20, and ×40, the best accuracy demonstrated by the classifier was 97.0%, whereas the average accuracy achieved by the pathologists using WSIs was 76.0%, with the highest accuracy reaching 83.3%. In conclusion, the neural classifier can outperform pathologists in a morphological evaluation. These results suggest that an AI system can potentially support the diagnosis of malignant lymphoma.
Introduction
In the 2017 WHO classification, more than 100 subtypes of lymphoid neoplasm are defined [1]. Pathological differentiation among these subtypes is achieved by investigating the clinical, morphological, immunohistochemical, and genomic features [1]. The number of antibodies used in the immunohistochemical staining required for diagnosis is increasing, along with those applied in a differential diagnosis [1]. Moreover, it has been reported that even experienced hematopathologists find it difficult to conduct a diagnosis using standardized diagnostic criteria [2]. Therefore, there is room for improvement in establishing standardized diagnostic procedures, including computer-aided diagnosis.
Deep learning is a computational algorithm composed of multiple processing layers, and has considerably improved state-of-the-art technologies in various fields including speech recognition, visual object recognition, and object detection [3]. The deep convolutional neural network is a deep learning architecture used for image recognition, and achieved the highest score in an image recognition competition in 2012 [4]. Classifiers based on deep convolutional neural networks output a confidence score for each label based on the input image. Each confidence score ranges from 0 to 1, and the sum of all confidence scores is equal to 1. In general, the label with the highest confidence score is used as the predicted label.
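As a concrete illustration (not the study's implementation), converting raw network outputs into confidence scores that sum to 1 is typically done with a softmax; the label order (DLBCL, FL, RL) and the logit values below are hypothetical:

```python
import math

def softmax(logits):
    """Convert raw network outputs (logits) into confidence scores
    that are non-negative and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the three labels (DLBCL, FL, RL).
scores = softmax([2.0, 0.5, 0.1])
# The label with the highest confidence score becomes the prediction.
predicted = max(range(len(scores)), key=lambda i: scores[i])
```

Because the softmax is monotonic, the predicted label is simply the one with the largest raw output; the normalization only makes the scores interpretable as confidences.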
Numerous studies have recently been conducted on histopathological image analysis using deep neural networks. Coudray et al. distinguished normal tissue from tumors, and adenocarcinoma from squamous cell carcinoma, using pathological images [5]. Steiner et al. detected the metastasis of breast cancer in lymph nodes [6]. In the area of malignant lymphoma, Janowczyk and Madabhushi attempted to distinguish three lymphoma subtypes, namely, chronic lymphocytic leukemia, follicular lymphoma (FL), and mantle cell lymphoma, using deep neural networks [7]. However, these studies only evaluated the abilities of the classifiers and did not compare the results with evaluations by pathological experts.
Diffuse large B-cell lymphoma (DLBCL) and FL are the most common subtypes of non-Hodgkin lymphoma [8]. Both are diagnosed mainly by their morphological and immunohistochemical features: DLBCL exhibits a diffuse proliferation of CD20-positive large B-cells [9, 10], whereas FL shows a nodular pattern of medium-sized B-cells expressing CD20, CD10, and bcl2 [11, 12]. In routine pathological practice, the differentiation among DLBCL, FL, and reactive lymphoid hyperplasia (RL) is performed mostly through a hematopathological diagnosis.
In this study, to investigate the feasibility of a computer-aided diagnosis of malignant lymphomas, we aimed at distinguishing DLBCL, FL, and RL using deep learning, and compared the results with those of a diagnosis with hematoxylin and eosin (H&E) conducted by pathologists, including an experienced hematopathologist.
Materials and methods
Samples
This study analyzed samples of 388 sections composed of 259 DLBCLs, 89 FLs, and 40 RLs, comprising nodal and extranodal lesions, including those in the gastrointestinal tract, obtained by 23 core needle biopsies, 333 biopsies, and 32 excisions. The diagnosis of DLBCL, FL, or RL was previously confirmed using immunohistochemistry, including CD3, CD20, CD10, and Bcl2. All sections were diagnosed at Kurume University from 2010 to 2017. The use of the samples in this study was approved by the ethics review committee of Kurume University in accordance with the recommendations of the Declaration of Helsinki.
Image preparation
Histopathological images were prepared using the following procedure (Fig. 1). First, we prepared glass slides of H&E-stained formalin-fixed paraffin-embedded tissue of the lesion areas. Second, all glass slides were converted into whole slide images (WSIs) using a whole slide system (Aperio AT2, Leica Biosystems, Inc.). Third, the pathological findings in the WSIs were annotated by hematopathologists; the total number of annotations was 6,183 (DLBCL, 3,865; FL, 1,209; RL, 1,109). Fourth, image patches of 2048 pixels × 2048 pixels were cropped at the center of each annotation in each WSI at a magnification of ×5, and image patches of the same size were randomly cropped from the periphery of each annotation at magnifications of ×20 and ×40. If the average RGB value of all pixels was 200 or more, the image patch was discarded because it was considered to be too white, i.e., to show too little of the specimen.
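The white-patch filter described above can be sketched as follows; the threshold of 200 comes from the text, while the array shapes and example pixel values are illustrative:

```python
import numpy as np

WHITE_THRESHOLD = 200  # average RGB value at or above which a patch is discarded

def keep_patch(patch):
    """Return True if the patch shows enough of the specimen.

    `patch` is an H x W x 3 uint8 RGB array; patches whose mean pixel
    value is 200 or more are considered mostly background (too white)
    and are discarded.
    """
    return patch.mean() < WHITE_THRESHOLD

# A nearly white patch is rejected; a stained (darker) patch is kept.
white = np.full((2048, 2048, 3), 230, dtype=np.uint8)
tissue = np.full((2048, 2048, 3), 150, dtype=np.uint8)
```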
Different additional cropping tasks were conducted during the training and testing phases. During the training phase, data augmentation was adopted to improve the generalization performance of the classifiers, in which smaller image patches were randomly cropped from the 2048 pixel × 2048 pixel image patches: 128 pixel × 128 pixel patches from the ×5 magnification images, and 64 pixel × 64 pixel patches from the ×20 and ×40 magnification images (Fig. 2a). During the testing phase, smaller image patches were cropped sequentially (Fig. 2b); the classifiers produced prediction values for the sequentially cropped patches, and these values were averaged to calculate the final prediction.
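A minimal sketch of the two cropping schemes, assuming NumPy arrays for image patches and a constant stub standing in for the trained classifier (the crop sizes follow the text; everything else is illustrative):

```python
import numpy as np

def random_crop(patch, size, rng):
    """Training-phase augmentation: crop one random size x size window."""
    h, w = patch.shape[:2]
    y = int(rng.integers(0, h - size + 1))
    x = int(rng.integers(0, w - size + 1))
    return patch[y:y + size, x:x + size]

def sequential_crops(patch, size):
    """Testing-phase cropping: tile the patch into non-overlapping
    size x size windows."""
    h, w = patch.shape[:2]
    return [patch[y:y + size, x:x + size]
            for y in range(0, h - size + 1, size)
            for x in range(0, w - size + 1, size)]

def predict_patch(patch, size, model):
    """Average the model's confidence scores over all sequential crops."""
    preds = [model(crop) for crop in sequential_crops(patch, size)]
    return np.mean(preds, axis=0)

# A 2048 x 2048 patch tiles into 16 x 16 = 256 crops of 128 pixels.
patch = np.zeros((2048, 2048, 3), dtype=np.uint8)
n_crops = len(sequential_crops(patch, 128))
# Stub model returning a fixed confidence vector over (DLBCL, FL, RL).
dummy_model = lambda crop: np.array([0.7, 0.2, 0.1])
final = predict_patch(patch, 128, dummy_model)
```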
Evaluation
The deep neural network classifier was evaluated through a cross-validation and a comparison with pathologists.
Cross-validation
A cross-validation is an established method for statistically evaluating models and assessing how accurately they predict unknown data [13,14,15,16]. A cross-validation randomly divides the objective data into training and test sets, which are used for training and evaluation, respectively. A K-fold cross-validation divides the objective samples into K groups: one group is used as the test set, and the remaining K − 1 groups are used as the training set. The test can be performed K times by shifting the test group each time, and the results of all K runs are combined and averaged. In this way, all data are effectively used for both training and testing, and the evaluation results are statistically stable and accurate.
In this study, we prepared test sets at magnifications of ×5, ×20, and ×40, with 100 image patches in each test set. The remaining image patches were equally and randomly divided into five groups, and the test was conducted five times. In every test, one group served as the validation set, which was not used for training but for evaluating the classifier performance at every epoch during training, and the other four groups were used as the training set. An epoch is the unit of training, in which a classifier is trained once with the training set; the total number of epochs in this study was 30. The classifier with the best validation accuracy in every test was chosen and, after training, was evaluated using the test set. This process was repeated five times so that all the data were treated as the validation set once. The accuracy, which is the number of correct predictions divided by the total number of predictions, and the area under the curve (AUC) of the receiver operating characteristic (ROC) curve were calculated.
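The five-fold division of the non-test data might be sketched as follows (the sample identifiers are hypothetical, and the fixed 100-patch test set is assumed to have been held out beforehand):

```python
import random

def five_fold_splits(samples, seed=0):
    """Shuffle samples and yield (train, validation) pairs for 5-fold CV.

    In each fold, one group serves as the validation set and the
    remaining four as the training set, so every sample is used for
    validation exactly once across the five folds.
    """
    rng = random.Random(seed)
    data = samples[:]
    rng.shuffle(data)
    k = 5
    folds = [data[i::k] for i in range(k)]  # five equally sized groups
    for i in range(k):
        val = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, val

# Hypothetical pool of 100 non-test image-patch identifiers.
splits = list(five_fold_splits(list(range(100))))
```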
Comparison with results from pathologists based on evaluation of H&E staining
For the comparisons, seven pathologists, including pathological trainees, general pathologists, and experienced hematopathologists, evaluated the H&E-stained WSIs of the 388 sections, in which the overall architecture of the lesion could be examined at low- to high-power magnification, as well as the 100 image patches at magnifications of ×5, ×20, and ×40 prepared as the test set for the classifier, as described above. For the WSI evaluation, the seven pathologists comprised two pathological trainees, two general pathologists, and three experienced hematopathologists; for the image patch evaluation, two pathological trainees, four general pathologists, and one experienced hematopathologist participated.
The performances of the seven pathologists and the classifier were compared on the test set of image patches at magnifications of ×5, ×20, and ×40. The model ensemble method was adopted to evaluate the classifier: we averaged the confidence scores of all 15 classifiers trained during the cross-validation, i.e., five classifiers each at magnifications of ×5, ×20, and ×40, to make the final prediction.
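The averaging ensemble can be sketched as below; the two stub models and their confidence vectors are purely illustrative stand-ins for the 15 trained classifiers:

```python
import numpy as np

def ensemble_predict(models, image):
    """'Averaging' ensemble: the confidence scores of all models are
    averaged, and the label with the highest mean score is predicted.

    In the study, `models` would hold the 15 trained classifiers
    (five per magnification); here two stubs illustrate the idea.
    """
    scores = np.mean([m(image) for m in models], axis=0)
    return int(np.argmax(scores)), scores

# Stub classifiers returning confidence vectors over (DLBCL, FL, RL).
m1 = lambda img: np.array([0.6, 0.3, 0.1])
m2 = lambda img: np.array([0.2, 0.6, 0.2])
label, mean_scores = ensemble_predict([m1, m2], None)
```

Averaging tends to smooth out errors made by individual classifiers, which is why an ensemble often outperforms its best single member.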
The precision, defined as the number of true positives over the number of true positives plus the number of false positives; the recall, defined as the number of true positives over the number of true positives plus the number of false negatives; and the F1 score, namely, the harmonic mean of the precision and recall, were calculated for comparison.
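These three metrics follow directly from the counts of true positives, false positives, and false negatives; the counts below are illustrative and not taken from the study:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 score from raw counts."""
    precision = tp / (tp + fp)      # TP / (TP + FP)
    recall = tp / (tp + fn)         # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Illustrative counts for one class.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=6)
```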
Results
The data preparation proceeded as follows. Image patches for training were prepared by cropping from the WSIs of the glass slides (Fig. 1). The total number of image patches was 6,083 (×5), 42,674 (×20), and 42,446 (×40). Example image patches are shown in Fig. 3. Moreover, smaller image patches were cropped to increase the training data (Fig. 2a, b).
The classifier used in this study was based on a deep neural network consisting of 11 layers, including four convolutional layers and two fully connected layers (Fig. 4). This architecture had previously been validated in several medical image analysis tasks [17, 18].
The classifier was trained with the image patches, and examples of the predicted images are shown in Fig. 5. Red areas were predicted as DLBCL with a confidence score of 0.7 or higher, green areas as FL, and yellow areas as RL. White areas indicate regions where the confidence scores of all three categories were <0.7, i.e., no clear prediction was made.
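The coloring rule of Fig. 5 can be sketched as follows (the label-to-color mapping and the 0.7 threshold follow the text; the score dictionaries are hypothetical):

```python
def region_color(scores, threshold=0.7):
    """Map per-region confidence scores to the overlay colors of Fig. 5.

    `scores` maps each label to its confidence; a region is colored
    only when some label reaches the threshold, otherwise it stays
    white (no clear prediction).
    """
    colors = {"DLBCL": "red", "FL": "green", "RL": "yellow"}
    label = max(scores, key=scores.get)
    if scores[label] >= threshold:
        return colors[label]
    return "white"  # all confidence scores below 0.7
```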
Cross-validation
The results of the cross-validation are summarized in Table 1 and Fig. 6. The average accuracies of our classifiers were 87.0%, 91.0%, and 89.0%, and the best accuracies were 94.0%, 93.0%, and 92.0%, at magnifications of ×5, ×20, and ×40, respectively (Table 1). Almost all ROC curves of the cross-validation showed an AUC of higher than 0.95 (Fig. 6).
Comparison with pathologists
The accuracies of the seven pathologists were 83.3, 81.4, 79.1, 72.4, 71.7, 70.4, and 66.8% for the evaluation of the WSIs, and 76.0, 75.0, 73.0, 72.0, 66.0, 60.0, and 49.0% for the image patches at magnifications of ×5, ×20, and ×40.
The study used a model ensemble, a technique for developing a better classifier by combining the values output by multiple classifiers. Although several variations of model ensembling exist, "averaging" was adopted in this study; with this method, the probabilities of all models are averaged (Fig. 7).
As a result, the best accuracy of our ensembled classifier for image patches at magnifications of ×5, ×20, and ×40 was 97.0%, which was higher than that of any of the pathologists (Table 2). In addition, the AUCs of DLBCL, FL, and RL of the ensembled classifier were 1.00, 0.99, and 1.00, respectively. The ROC curves of the ensembled classifier showed a higher performance than almost all of the pathologists (Fig. 8). The recall, precision, and F1 score were higher for the ensembled classifier than for the pathologists (Table 3). These results show that the ensembled classifier outperformed the pathologists. The confusion matrices are shown in Fig. 9. Although some of the pathologists were confused regarding the diagnosis between DLBCL and RL, the classifier made only a few errors.
Discussion
In this study, we showed that the deep neural network classifiers achieved an excellent accuracy and outperformed the pathologists in terms of the multi-class classification of three lesions when using only H&E stained images.
Based on a differential diagnosis using only H&E staining among the three lesion types (RL, FL, and DLBCL), the diagnosis of DLBCL is considered the most straightforward. This is likely because the morphological features of DLBCL differ markedly from those of FL and RL: DLBCL exhibits a diffuse proliferation of large neoplastic lymphoid cells, which can be detected at both high and low magnification [4,5,6,7]. Indeed, a higher accuracy in the diagnosis of DLBCL was achieved by both the classifiers and the pathologists (Table 3).
By contrast, the morphological distinction between RL and FL is generally difficult without immunohistochemistry for CD10 and bcl2, because both lesions show a similar nodular proliferation of medium- to large-sized lymphocytes, although there are morphological differences in the shape and density of the follicles, as well as in the presence of polarity and tingible body macrophages within the follicles [4]. The neural classifier used in this study nevertheless achieved a high accuracy for RL and FL, probably because it can detect these morphological differences in much greater detail than pathologists. Moreover, the classifier may detect morphological findings that pathologists do not focus on, as well as pixel-level image features too subtle for pathologists to recognize.
The classifier applied in this study achieved a high accuracy in the diagnosis of certain types of malignant lymphoma using only H&E slide images. According to the WHO classification [1], the differential diagnosis of malignant lymphoma requires immunohistochemical staining for more than 30 antigens, which demands manpower, specialized techniques, time, and cost. A neural classifier might overcome these restrictions and provide a simpler diagnosis in developing countries, where immunohistochemical analyses are difficult to conduct.
Recent studies have also reported the usefulness of diagnosis by artificial intelligence (AI) for malignant lymphoma [19, 20]. Achi et al. achieved an accuracy of 95% with a convolutional neural network for the pathological diagnosis of reactive lymphadenopathies, DLBCL, Burkitt lymphoma, and small lymphocytic lymphoma [19]. For the pathological evaluation of DLBCL and Burkitt lymphoma, the convolutional neural network by Mohlman et al. showed an ROC AUC of 0.92 [20]. Further investigation might confirm the usefulness of AI for the pathological diagnosis of malignant lymphoma.
This study has certain limitations that should be overcome before application to practical diagnosis, although the utility of a neural classifier for the pathological diagnosis of malignant lymphoma was shown. First, our classifier requires manual annotation, which demands effort from a hematopathologist; a method without such annotation should therefore be developed. Second, there are more than 100 subtypes of lymphoproliferative disorders according to the WHO classification, and a neural classifier must be adapted to the diagnosis of the other subtypes for practical use. Third, the classifier is specialized for WSIs, which are suitable for deep learning owing to the large amount of data they provide. Because whole slide systems are generally expensive, the development of neural classifiers for other imaging devices, including CCD cameras and camera phones, is desirable.
In conclusion, the neural classifier can conduct a pathological diagnosis using only H&E staining, and its performance exceeds that of pathologists. AI technology might support pathological diagnosis, provide new findings that cannot be detected by pathologists, and enable pathological diagnosis at lower cost in developing countries.
References
Swerdlow SH, Campo E, Harris NL, Jaffe ES, Pileri SA, Stein H, et al. World Health Organization classification of tumours of haematopoietic and lymphoid tissues. Revised 4th ed. Lyon, IARC Press; 2017.
Piccaluga PP, Fuligni F, De Leo A, Bertuzzi C, Rossi M, Bacci F, et al. Molecular profiling improves classification and prognostication of nodal peripheral T-cell lymphomas: Results of a phase III diagnostic accuracy study. J Clin Oncol. 2013;31:3019–25.
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44.
Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. NIPS'12 Proceedings of the 25th International Conference on Neural Information Processing Systems. 2012;1:1097–105.
Coudray N, Ocampo PS, Sakellaropoulos T, Narula N, Snuderl M, Fenyö D, et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat Med. 2018;24:1559–67.
Steiner DF, MacDonald R, Liu Y, Truszkowski P, Hipp JD, Gammage C, et al. Impact of deep learning assistance on the histopathologic review of lymph nodes for metastatic breast cancer. Am J Surg Pathol. 2018;42:1636–46.
Janowczyk A, Madabhushi A. Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. J Pathol Inform. 2016;7:29.
Muto R, Miyoshi H, Sato K, Furuta T, Muta H, Kawamoto K, et al. Epidemiology and secular trends of malignant lymphoma in Japan: Analysis of 9426 cases according to the World Health Organization classification. Cancer Med. 2018;7:5843–58.
Gascoyne RD, Campo E, Jaffe ES, Chan WC, Chan JKC, Rosenwald A, et al. World Health Organization classification of tumours of haematopoietic and lymphoid tissues. Revised 4th ed. Lyon, IARC Press; 2017. p. 291–7.
Kawamoto K, Miyoshi H, Yoshida N, Nakamura N, Ohshima K, Sone H, et al. MYC translocation and/or BCL 2 protein expression are associated with poor prognosis in diffuse large B-cell lymphoma. Cancer Sci. 2016;107:853–61.
Jaffe ES, Harris NL, Swerdlow SH, Ott G, Nathwani BN, de Jong D, et al. World Health Organization classification of tumours of haematopoietic and lymphoid tissues. Revised 4th ed. Lyon, IARC Press; 2017. p. 267–73.
Shimono J, Miyoshi H, Yoshida N, Kato T, Sato K, Sugio T, et al. Analysis of GNA13 protein in follicular lymphoma and its association with poor prognosis. Am J Surg Pathol. 2018;42:1466–71.
Erickson BJ, Korfiatis P, Akkus Z, Kline TL. Machine learning for medical imaging. Radiographics. 2017;37:505–15.
Uthoff J, Sieren JC. Information theory optimization based feature selection in breast mammography lesion classification. International Symposium on Biomedical Imaging (ISBI). 2018. p. 817–21.
Oksuz I, Ruijsink B, Puyol-Anton E, Sinclair M, Rueckert D, Schnabel JA, et al. Automatic left ventricular outflow tract classification for accurate cardiac MR planning. International Symposium on Biomedical Imaging (ISBI). 2018. p. 462–5.
Zhang Z, Xiao J, Wu S, Lv F, Gong J, Jiang L, et al. Deep convolutional radiomic features on diffusion tensor images for classification of glioma grades. J Digit Imaging. 2020 (Epub ahead of print).
Sakamoto M, Nakano H, Zhao K, Sekiyama T. Multi-stage neural networks with single-sided classifiers for false positive reduction and its evaluation using lung X-ray CT images. ICIAP; 2017. p. 370–9.
Kabeya Y, Takeuchi Y, Nakano H, Nishino I, Okubo M, Inoue M, et al. Physician-level muscle disease classifier for computer-aided diagnostics with deep neural networks. International Symposium on Biomedical Imaging (ISBI). 2018.
Achi HE, Belousova T, Chen L, Wahed A, Wang I, Hu Z, et al. Automated diagnosis of lymphoma with digital pathology images using deep learning. Ann Clin Lab Sci. 2019;49:153–60.
Mohlman JS, Leventhal SD, Hansen T, Kohan J, Pascucci V, Salama ME. Improving augmented human intelligence to distinguish Burkitt lymphoma from diffuse large b-cell lymphoma cases. Am J Clin Pathol. 2020 (Epub ahead of print).
Acknowledgements
Chugai Pharmaceutical Co., Ltd provided funding for this study based on a joint research contract.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Miyoshi, H., Sato, K., Kabeya, Y. et al. Deep learning shows the capability of high-level computer-aided diagnosis in malignant lymphoma. Lab Invest 100, 1300–1310 (2020). https://doi.org/10.1038/s41374-020-0442-3