Introduction

The grading of invasive pulmonary adenocarcinoma (hereinafter referred to as lung adenocarcinoma) can guide physicians in developing treatment plans and assessing patient prognosis1,2,3,4. The specific criteria of the grading system for invasive lung adenocarcinoma, as described in the 2021 edition of the WHO classification of thoracic tumors, are shown in Table 15. This grading system was proposed by the pathology committee of the International Association for the Study of Lung Cancer (IASLC)6 and has also been adopted by the Chinese Medical Association. As pathological images become increasingly important and abundant, pathologists face growing diagnostic pressure, and the demand for computer-aided diagnostic models in the medical system continues to rise. Therefore, computer-assisted diagnostic models that help pathologists grade lung adenocarcinoma, a task marked by small inter-class and large intra-class differences, have important application value.

Table 1 Grading scheme for invasive pulmonary adenocarcinomas6.

However, research on lung adenocarcinoma histopathological images7,8,9 primarily focuses on the intelligent classification of pathological types while overlooking pathological grading, which to some extent hinders physicians in formulating personalized treatment plans. Histopathological images of lung adenocarcinoma at different grades often exhibit large intra-class differences and small inter-class differences. If an attention mechanism such as Coordinate Attention (CA)10 is applied directly to the lung adenocarcinoma grading task, two issues arise. First, CA extracts global features only through average pooling, risking excessive compression of feature information and weakening the model's ability to compare local features of lung adenocarcinoma. Second, CA performs too few operations within the same dimension and lacks embedding interactions within the same dimension, which limits the model's perception of global information in lung adenocarcinoma histopathological images. Therefore, to address the difficulty of grading lung adenocarcinoma, we propose the Dimension Information Embedding Attention Net (DIEANet), which enables the model to focus on the local growth morphology of individual cells while also considering the overall growth pattern of cell clusters. Specifically, our contributions are as follows:

  (1) We integrate different pooling techniques to automatically select local regions representing key growth patterns of lung adenocarcinoma cells, enhancing the model's attention to local information.

  (2) We employ an interactive fusion approach to focus on feature information both within the same dimension and across dimensions, improving the model's capacity to embed dimension information and capture long-range dependencies. This enables the model to better consider the overall growth patterns of cell clusters.

  (3) In terms of objective metrics, DIEANet achieved an accuracy of 88.19%, an AUC of 96.61%, an MCC of 81.71%, and a Kappa of 81.16%. Compared with seven other attention mechanisms, DIEANet achieved state-of-the-art objective metrics, and under subjective observation it aligns better with the visual attention of pathologists.

The remainder of the paper is structured as follows: Section “Related work” reviews the evolution of deep learning techniques for lung cancer diagnosis in recent years and the application and development of Coordinate Attention (CA). Section “Methods” describes the structure of DIEANet in detail. Section “Experiments” presents the lung adenocarcinoma dataset and experimental setup, including comparison experiments and ablation experiments. Section “Summary and outlook” concludes with a summary and outlook.

Related work

Currently, research in computer-aided diagnosis of lung cancer mainly focuses on the classification of subtypes such as adenocarcinoma (LUAD), squamous cell carcinoma (LUSC), and large cell carcinoma. In 2017, Teramoto et al.11 correctly classified 71% of lung cancer cytological images into the three subtypes LUAD, LUSC, and small cell carcinoma, using threefold cross-validation to train a Deep Convolutional Neural Network (DCNN). In 2018, Coudray et al.12 and Khosravi et al.13 each published work on deep learning-based classification of LUAD and LUSC; the former demonstrated that deep learning can effectively help pathologists detect lung cancer subtypes and even genetic mutations, while the latter investigated the effect of fine-tuning pre-trained CNN models on classification. In 2020, Moitra et al.14 used a one-dimensional CNN model to study the staging and grading of non-small cell lung cancer (NSCLC) based on CT images and clinical data; they achieved good results but did not collect the important histopathological images. Beyond improving model performance, in 2022, Civit-Masot et al.9 turned their attention to interpretability: their proposed system outputs not only the diagnostic results but also the image regions the model focuses on, providing supporting information to physicians. This inspired us to use attention mechanisms to enhance the model's focus on the regions where cancer cells are densest and malignancy is highest.

The aforementioned studies consider only the task of classifying different types of lung cancer and overlook the significance of lung adenocarcinoma grading for personalized treatment and prognosis management. Identifying different grades of lung adenocarcinoma requires models to discern cell growth patterns more intricately, attending to both the local growth morphology of individual cells and the overall growth pattern of cell clusters. We found that Coordinate Attention (CA), proposed by Hou et al.10, captures not only inter-channel information but also directional and dimensional information, which can enhance the network's perception and performance. This aligns with our approach; hence, we have organized and summarized CA as well as its applications and developments in recent years, as shown in Table 2.

Table 2 Coordinate attention (CA) and its applications and developments.

Different from the above studies, we are concerned with two problems in CA: severely compressed information and insufficient modeling of information relevance within the same dimension. These problems reduce the degree of dimensional information embedding and prevent CA from adequately handling the large intra-class differences and small inter-class differences present in lung adenocarcinoma histopathology images. Therefore, this paper proposes the DIEANet structure for lung adenocarcinoma grading, starting from two aspects: extracting more effective feature information and embedding dimensional information more comprehensively. In the following, we provide a detailed description of DIEANet.

Methods

Overview

The complete structure of DIEANet is illustrated in Fig. 1 and can be divided into three stages. The first stage is data preprocessing, including color normalization and random cropping of histopathological images, detailed in Section “Preprocessing”. The second stage is feature extraction, where ResNet34 is utilized; the implementation uses the ResNet34 provided by the torchvision library on the PyTorch platform, as sketched below. The third stage is attention evaluation, consisting of Dimension Information Embedding Attention (DIEA), a classifier, and Grad_Cam, elaborated in Section “Dimension information embedding attention (DIEA)”. This stage validates the performance of DIEANet through both quantitative metrics and visualization.
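For reference, a minimal sketch of the feature-extraction stage, assuming the standard torchvision ResNet34; the wrapper class and its name are ours, not the paper's code:

```python
import torch
import torch.nn as nn
from torchvision import models

class ResNet34Features(nn.Module):
    """Wraps torchvision's ResNet34, dropping the classification head so the
    output is a C x H x W feature map for the subsequent attention stage."""
    def __init__(self, pretrained: bool = True):
        super().__init__()
        backbone = models.resnet34(pretrained=pretrained)
        # Keep conv1 ... layer4; discard the final avgpool and fc layers.
        self.features = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x)

feat = ResNet34Features()(torch.randn(1, 3, 256, 256))  # -> (1, 512, 8, 8)
```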

Figure 1

Dimension Information Embedding Attention Net (DIEANet).

Dimension information embedding attention (DIEA)

To boost the model's efficacy and elevate its focus on lesion sites, we propose the DIEA structure. DIEA comprises three operations: feature compression decomposition, feature fusion embedding, and feature map recovery. The detailed structural diagram is shown in Fig. 2.

Figure 2

Dimension information embedding attention (DIEA).

Feature compression decomposition stage

First, the formula for the feature compression decomposition operation is expressed as follows:

$$a_{c}^{h} (h) = \frac{1}{W}\sum\limits_{0 \le i < W} {x_{c} } (h,i)$$
(1)
$$a_{c}^{w} (w) = \frac{1}{H}\sum\limits_{0 \le j < H} {x_{c} } (j,w)$$
(2)
$$m_{c}^{h} (h) = \max_{0 \le i < W} x_{c} (h,i)$$
(3)
$$m_{c}^{w} (w) = \max_{0 \le j < H} x_{c} (j,w)$$
(4)

where \({\mathbf{x}}\) denotes the input features, and \(h\) and \(w\) are the indices along the feature map height and width, respectively. \(a_{c}^{h} (h)\) and \(a_{c}^{w} (w)\) denote the features obtained by strip average pooling in the horizontal and vertical dimensions, respectively; \(m_{c}^{h} (h)\) and \(m_{c}^{w} (w)\) denote the features obtained by strip max pooling in the horizontal and vertical dimensions, respectively. The four 1D feature tensors are then combined pairwise according to pooling type to obtain two 1D feature description tensors of length \((H + W)\):

$${\mathbf{a}} = [{\mathbf{a}}^{h} ,({\mathbf{a}}^{w} )^{T} ]$$
(5)
$${\mathbf{m}} = [{\mathbf{m}}^{h} ,({\mathbf{m}}^{w} )^{T} ]$$
(6)

where \(T\) denotes the transpose operation and [*,*] denotes the splicing operation. In this stage, \({\mathbf{a}}\) captures the long-range dependencies of the features, \({\mathbf{m}}\) highlights the key growth morphology of cancer cells, and combining the two provides more effective feature information for the subsequent feature embedding operation. The experimental results in Section “Experimental results and comparison experiments” demonstrate this point.
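For concreteness, a minimal PyTorch sketch of Eqs. (1)–(6), assuming an input of shape (B, C, H, W); the function name and batch handling are ours:

```python
import torch

def compress_decompose(x: torch.Tensor):
    B, C, H, W = x.shape
    # Strip average pooling (Eqs. 1-2): one value per row / per column.
    a_h = x.mean(dim=3, keepdim=True)   # (B, C, H, 1)
    a_w = x.mean(dim=2, keepdim=True)   # (B, C, 1, W)
    # Strip max pooling (Eqs. 3-4).
    m_h = x.amax(dim=3, keepdim=True)   # (B, C, H, 1)
    m_w = x.amax(dim=2, keepdim=True)   # (B, C, 1, W)
    # Transpose the width strips and splice along the spatial axis
    # (Eqs. 5-6), yielding two 1D descriptors of length H + W.
    a = torch.cat([a_h, a_w.permute(0, 1, 3, 2)], dim=2)  # (B, C, H+W, 1)
    m = torch.cat([m_h, m_w.permute(0, 1, 3, 2)], dim=2)  # (B, C, H+W, 1)
    return a, m
```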

Feature fusion embedding processing

In the second stage, in order to better fuse the features with each other, we concatenate \({\mathbf{a}}\) and \({\mathbf{m}}\) into a one-dimensional feature description tensor of length \((2H + 2W)\), which is split again after nonlinear processing. The process is represented as follows:

$${\mathbf{h}} = \delta (BN(F_{1} ([{\mathbf{a}},{\mathbf{m}}])))$$
(7)
$${\mathbf{h}}^{a} = BN(F_{2} ({\mathbf{h}}_{{\left[ {0:(H + W)} \right]}} ))$$
(8)
$${\mathbf{h}}^{m} = BN(F_{3} (({\mathbf{h}}_{{\left[ {(H + W):(2H + 2W)} \right]}} )^{T} ))$$
(9)

\(F_{1} (),F_{2} (),F_{3} ()\) denote \(1 \times 1\) convolutions, \(BN()\) denotes Batch Normalization, and \(\delta ()\) denotes the h_swish activation function, proposed in MobileNetV319 and built from ReLU6 and linear functions as an efficient approximation of swish. The convolution \(F_{1} ()\) applies the squeeze operation of the SE block, reducing the channel dimension with a \(1 \times 1\) kernel, so that the resulting tensor is \({\mathbf{h}} \in {\mathbb{R}}^{C//r \times (2H + 2W) \times 1}\). \({\mathbf{h}}^{a} \in {\mathbb{R}}^{C \times (H + W) \times 1}\) and \({\mathbf{h}}^{m} \in {\mathbb{R}}^{C \times 1 \times (H + W)}\) denote two one-dimensional feature description tensors of length \((H + W)\).
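A hedged PyTorch sketch of Eqs. (7)–(9); the module name, the default reduction ratio r, the floor of 8 on the reduced channels, and the use of nn.Hardswish for h_swish are our assumptions:

```python
import torch
import torch.nn as nn

class FusionEmbed(nn.Module):
    def __init__(self, channels: int, r: int = 32):
        super().__init__()
        mid = max(8, channels // r)                       # squeeze width (assumed floor)
        self.f1 = nn.Conv2d(channels, mid, kernel_size=1)  # F1, Eq. 7
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()                          # h_swish
        self.f2 = nn.Conv2d(mid, channels, kernel_size=1)  # F2, Eq. 8
        self.bn2 = nn.BatchNorm2d(channels)
        self.f3 = nn.Conv2d(mid, channels, kernel_size=1)  # F3, Eq. 9
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, a: torch.Tensor, m: torch.Tensor):
        hw = a.shape[2]                                    # H + W
        # Concatenate, squeeze channels, and apply BN + h_swish (Eq. 7).
        h = self.act(self.bn1(self.f1(torch.cat([a, m], dim=2))))
        # Split back into the two halves and re-expand channels (Eqs. 8-9).
        h_a = self.bn2(self.f2(h[:, :, :hw]))                       # (B, C, H+W, 1)
        h_m = self.bn3(self.f3(h[:, :, hw:].permute(0, 1, 3, 2)))   # (B, C, 1, H+W)
        return h_a, h_m
```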

Recovery feature map processing

In the third stage, in order to recover the feature map, we multiply \({\mathbf{h}}^{a}\) and \({\mathbf{h}}^{m}\) to obtain the feature map \({\mathbf{h}}^{am} \in {\mathbb{R}}^{C \times (H + W) \times (H + W)}\) containing dimensional embedding information, and this part of the operation is expressed as:

$${\mathbf{h}}^{am} = {\mathbf{h}}^{a} \times {\mathbf{h}}^{m}$$
(10)

where \({\mathbf{h}}^{a}\) and \({\mathbf{h}}^{m}\) contain the transformed features \(\left[ {{\mathbf{a}}^{h} ,({\mathbf{a}}^{w} )^{T} } \right]\) and \(\left[ {({\mathbf{m}}^{h} )^{T} ,{\mathbf{m}}^{w} } \right]\), respectively. After multiplying them, the feature map \({\mathbf{h}}^{am}\) contains four kinds of information, i.e., \(({\mathbf{a}}^{h} \times ({\mathbf{m}}^{h} )^{T} )\), \(({\mathbf{a}}^{h} \times {\mathbf{m}}^{w} )\), \((({\mathbf{a}}^{w} )^{T} \times ({\mathbf{m}}^{h} )^{T} )\), and \((({\mathbf{a}}^{w} )^{T} \times {\mathbf{m}}^{w} )\). In contrast, CA generates only one kind of information, \(({\mathbf{a}}^{h} \times {\mathbf{a}}^{w} )\), when recovering the feature map: although it achieves the mutual embedding of features across the horizontal and vertical dimensions, it ignores the mutual embedding of features within the same dimension, i.e., horizontal with horizontal and vertical with vertical. To better distinguish DIEA from CA, the structure of CA is shown in Fig. 3.

Figure 3

Coordinate attention (CA).

Finally, \({\mathbf{h}}^{am}\) is downsampled and embedded in the input feature \({\mathbf{x}}\). The equation is expressed as follows:

$${\mathbf{g}} = \sigma (F \downarrow ({\mathbf{h}}^{am} ))$$
(11)
$${\mathbf{y}} = {\mathbf{x}} \times {\mathbf{g}}$$
(12)

where \(F \downarrow ()\) denotes a convolution with kernel size 3 × 3, stride 2, and padding 1, \(\sigma\) represents the sigmoid function, and \({\mathbf{g}} \in {\mathbb{R}}^{C \times H \times W}\) has the same size as the input feature \({\mathbf{x}}\). \({\mathbf{y}}\) is the final attention-enhanced feature map.
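A minimal sketch of Eqs. (10)–(12); note that the stride-2 convolution maps \((H+W) \times (H+W)\) back to \(H \times W\) exactly when the feature map is square (H = W), as with the 256 × 256 px inputs used here:

```python
import torch
import torch.nn as nn

class RecoverMap(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # F_down in Eq. 11: 3x3 conv, stride 2, padding 1.
        self.down = nn.Conv2d(channels, channels, kernel_size=3,
                              stride=2, padding=1)

    def forward(self, x, h_a, h_m):
        # Eq. 10: the outer product embeds both same-dimension and
        # cross-dimension interactions into one (H+W) x (H+W) map.
        h_am = torch.matmul(h_a, h_m)        # (B, C, H+W, H+W)
        g = torch.sigmoid(self.down(h_am))   # (B, C, H, W) when H == W
        return x * g                         # Eq. 12: attention-weighted features
```

Chaining compress_decompose, FusionEmbed, and RecoverMap from the sketches above reproduces the full DIEA block of Fig. 2.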

Ethics approval

This study was approved by the Medical Ethics Committee of the Key Laboratory of Kelamayi Central Hospital on February 10, 2023. The ethics approval number is "K202302-29". Informed consent was obtained from the subjects.

Experiments

Dataset

We utilized two datasets to assess the model's performance. A private dataset was employed for a thorough examination within the paper's core application domain, while a publicly accessible dataset was utilized to validate the model's effectiveness and its capacity for generalization.

Dataset for lung adenocarcinoma histopathological image grading: This dataset was provided by the Karamay Central Hospital of Xinjiang and approved by the Medical Ethics Committee of the Key Laboratory of Kelamayi Central Hospital; it contains three labels: low-grade, intermediate-grade, and high-grade lung adenocarcinoma. All experiments adhered to pertinent guidelines and regulations. All cases were diagnosed between November 2016 and August 2019 and confirmed by histopathological examination, and informed consent was obtained from the participants. The dataset contains a total of 107 cases: 30 low grade, 38 intermediate grade, and 39 high grade. All slides were stained with hematoxylin–eosin (HE), and the grading labels of the images were retrieved from the hospital's data system. Since cancer cells in lung adenocarcinoma coexist in multiple growth patterns, for example, a small number of low-grade growth patterns may appear in intermediate-grade lung adenocarcinoma, we asked two experienced pathologists from the Karamay Central Hospital to outline, for each case, a representative lesion area with the highest and densest malignancy, so as to obtain more accurate lesion images. The digital slide scanning system PRECICE 500b was then used to scan within the ranges marked by the physicians. Depending on the size of the marked lesion area in each section, we scanned 2–10 random images in jpg format from each case, each of size 1665 × 1393 px. A total of 893 images were obtained: 260 low grade, 316 intermediate grade, and 317 high grade.

PCam20020: This dataset is derived from the Camelyon2016 challenge dataset21 by sliding-window extraction and is used to train models to detect lymph node metastasis. Tumor patches and normal patches are cropped from the annotated regions of tumor slides and the tissue regions of normal slides, respectively. The cropped patches are then resized to 512 × 512 pixels using bicubic interpolation. After balancing the numbers of tumor and normal samples, a total of 56,703 images are randomly split into training, validation, and testing sets, with patches from the same slide placed in the same set (slide-level splitting). Detailed information about the dataset is shown in Table 3.

Table 3 Composition of the PCam200 dataset.

Preprocessing

We observed excessive color differences between the HE-stained histopathology images and therefore used the widely adopted Vahadane22 method for stain normalization of the dataset; the results are shown in Fig. 4.

Figure 4

Stain normalization effect. The top row is the original image, the bottom row is the image after Vahadane processing.
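For reference, the normalization step can be sketched with the third-party staintools package; this choice of implementation is our assumption (the paper does not name its tooling), and the file paths are illustrative:

```python
import staintools

# Fit the Vahadane normalizer on a reference slide, then map each image's
# HE stain appearance onto that reference.
target = staintools.read_image("reference_slide.jpg")   # illustrative path
source = staintools.read_image("case_patch.jpg")        # illustrative path

normalizer = staintools.StainNormalizer(method="vahadane")
normalizer.fit(target)                  # learn the target stain matrix
normalized = normalizer.transform(source)
```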

We followed common processing methods for histopathological image datasets as well as the cross-validation training approach23,24. First, at the case level, we used five-fold cross-validation to partition the dataset into five folds. Then, we extracted three images per normalized image through random cropping, ensuring that all images from the same case are used only for training or only for validation. In addition, to ensure experimental rigor, we removed images in which blank regions accounted for more than 75% of the area, as sketched below. The final number of images in each category is shown in Table 4.
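A minimal sketch of such a blank-patch filter, assuming "blank" means near-white pixels; the brightness threshold of 220 and the patch_paths list are illustrative assumptions:

```python
import numpy as np
from PIL import Image

def is_mostly_blank(path: str, blank_ratio: float = 0.75,
                    white_thresh: int = 220) -> bool:
    """Flag an image whose near-white fraction exceeds blank_ratio."""
    gray = np.asarray(Image.open(path).convert("L"))
    return (gray > white_thresh).mean() > blank_ratio

# patch_paths is assumed to list the randomly cropped image files.
kept = [p for p in patch_paths if not is_mostly_blank(p)]
```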

Table 4 Composition of lung adenocarcinoma dataset.

Experimental setup details

The experiments were conducted with Python 3.7.13, torch 1.10.2 + cu102, and an NVIDIA V100-PCIE-16 GB GPU. The batch size is set to 16, the input image size to 256 × 256 px, and the optimizer is AdamW with a learning rate of 1e−4 and a weight decay of 1e−2. The learning rate schedule uses a linear warm-up during the first epoch, followed by a cosine annealing strategy for epochs 1–200 that gradually reduces the learning rate to 0 according to a cosine function. Cross-entropy loss, commonly used in classification models, is used to train the model to convergence.
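This recipe can be sketched as follows; the SequentialLR wiring and the warm-up start factor are our assumptions, and model is assumed to be the DIEANet classifier:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
# One epoch of linear warm-up, then cosine annealing to zero over the rest.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01,
                                           total_iters=1)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=199,
                                                    eta_min=0.0)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine],
                                                  milestones=[1])
criterion = torch.nn.CrossEntropyLoss()
```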

For the lung adenocarcinoma grading dataset, we employed stratified five-fold cross-validation during model training, with each fold trained for 200 epochs. At the case level, we randomly divided patients into five folds, and when reading images, we assigned each image to a fold based on its pathological identification number. This approach keeps the training and validation sets both balanced and independent of each other, thereby ensuring the authenticity of the experimental results.
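A hedged sketch of such case-level stratified splitting using scikit-learn's StratifiedGroupKFold; the tool is our choice rather than the paper's, and image_paths, labels, and case_ids are illustrative names:

```python
from sklearn.model_selection import StratifiedGroupKFold

# Grouping by pathological identification number keeps all images of one
# case entirely in the training fold or entirely in the validation fold.
skf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(
        skf.split(image_paths, labels, groups=case_ids)):
    ...  # train on train_idx, validate on val_idx
```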

For the PCam200 dataset, we trained for 200 epochs using the training, validation, and test splits provided by the dataset authors.

Experimental results and comparison experiments

We selected three backbones frequently used in cancer pathology image classification tasks to validate the performance of DIEA, namely ResNet3425,26,27, MobileNetV228,29, and EfficientNet30,31. ImageNet pre-trained weights provided in torchvision.models were used for all training.

To measure the performance of the model from multiple angles, a total of nine metrics were used, covering two aspects. First, for computational complexity, we report the Number of Parameters (Params) and Giga Floating-Point Operations (GFLOPs), which measure the model's computational resource requirements and efficiency; both are computed with an input image size of 256 × 256 px. Second, for performance, we used seven metrics32,33: Accuracy (ACC), F1-Score (F1), Precision (PRE), Recall, Area Under the Curve (AUC), Matthews Correlation Coefficient (MCC), and Cohen's Kappa score (Kappa). These assess the performance and prediction quality of the model: ACC measures overall prediction accuracy, and F1 combines PRE and Recall, which measure the precision of positive predictions and the ability to identify positive cases, respectively. Furthermore, given that the task is three-class classification, we report two averaging methods to showcase performance comprehensively: "_m" in the tables indicates macro-average, where all classes are weighted equally regardless of sample distribution, and "_w" denotes weighted-average, which reflects overall classification performance more faithfully. AUC measures the discriminative performance of the classifier, while MCC and Kappa assess prediction quality and consistency. The experimental results report the means and standard deviations over five-fold cross-validation. A sketch of computing these metrics is shown below.
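A minimal sketch of these metrics via scikit-learn, assuming y_true, y_pred, and per-class probabilities y_prob come from the trained model:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score,
                             matthews_corrcoef, cohen_kappa_score)

metrics = {
    "ACC":      accuracy_score(y_true, y_pred),
    "F1_m":     f1_score(y_true, y_pred, average="macro"),
    "F1_w":     f1_score(y_true, y_pred, average="weighted"),
    "PRE_m":    precision_score(y_true, y_pred, average="macro"),
    "Recall_m": recall_score(y_true, y_pred, average="macro"),
    # One-vs-rest AUC over the three classes, from predicted probabilities.
    "AUC":      roc_auc_score(y_true, y_prob, multi_class="ovr",
                              average="macro"),
    "MCC":      matthews_corrcoef(y_true, y_pred),
    "Kappa":    cohen_kappa_score(y_true, y_pred),
}
```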

With ResNet34 as the backbone, the comparison results of DIEANet against seven attention mechanisms, SimAM34, BAM35, Criss-Cross Attention (CCA) in CCNet36, CBAM37, CA, EMA38, and ACmix18, are shown in Table 5.

Table 5 Comparison experiments with different attentions when ResNet34 is the backbone, where the bolded numbers are the highest scores in a column and the underline indicates the second highest in a column.

According to Table 5, our model achieved the highest values in 8 out of 10 objective metrics, with the remaining two ranking second, and the standard deviations generally being low. Particularly noteworthy is the five-fold average accuracy for lung adenocarcinoma grading, which reached 88.19% ± 0.0426, demonstrating outstanding performance along with good generalization and robustness.

We show the gradient-weighted class activation map generated by different attentions in Fig. 5 to analyze the spatial localization ability of DIEANet.

Figure 5

Gradient-weighted class activation map for each model when ResNet34 is the backbone.

As shown in Fig. 5, DIEANet focuses on a more comprehensive and accurate region; compared with the other attention mechanisms, it clearly identifies global information better while still attending to location information, and it can assist pathologists in locating regions of interest.
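For reference, a minimal hook-based Grad-CAM sketch; model (the trained classifier), target_layer (e.g. its last convolutional stage), and image are assumed to exist, and this is our own re-implementation of the Grad_Cam step rather than the paper's code:

```python
import torch
import torch.nn.functional as F

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

logits = model(image)                      # (1, num_classes)
logits[0, logits.argmax()].backward()      # gradient of the predicted class

weights = grads["v"].mean(dim=(2, 3), keepdim=True)  # GAP over gradient maps
cam = F.relu((weights * acts["v"]).sum(dim=1))       # (1, h, w) activation map
cam = cam / (cam.max() + 1e-8)                       # normalize to [0, 1]
```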

We then repeated the comparison with the other attention mechanisms using MobileNetV2 and EfficientNet as backbones.

According to Table 6, when using MobileNetV2 as the backbone, DIEANet achieved the highest values in all 10 performance metrics, with low standard deviations across the board. The five-fold average accuracy for lung adenocarcinoma grading reached 87.22% ± 0.0444, demonstrating outstanding performance. Similarly, as indicated in Table 7, when employing EfficientNet as the backbone, DIEANet attained the highest values in 9 out of 10 performance metrics, with generally low standard deviations. The five-fold average accuracy for lung adenocarcinoma grading reached 87.70% ± 0.0446, also showcasing remarkable performance.

Table 6 Comparison experiments with different attentions when MobileNetV2 is the backbone, where the bolded numbers are the highest scores in a column and the underline indicates the second highest in a column.
Table 7 Comparison experiments with different attentions when EfficientNet is the backbone, where the bolded numbers are the highest scores in a column and the underline indicates the second highest in a column.

In PCam200, we compared the performance of DIEANet with seven other attention mechanisms using ResNet34 as the backbone. The results on the test set are presented in Table 8.

Table 8 Performance of different attention mechanisms on PCam200 with ResNet34 as backbone.

Based on Table 8, it is evident that our model achieved the highest values in six out of seven metrics, with accuracy reaching 93.57% and recall reaching 90.91%. The significantly higher recall value compared to other models is crucial for reducing false negatives in medical diagnosis. These results further confirm the suitability of DIEANet for processing histopathological images.

Ablation experiments

We conducted ablation experiments that extract features using strip average pooling alone or strip max pooling alone, in order to validate the effectiveness of combining the two. The results reveal that fusing the two pooling methods helps the model extract more effective feature information, which enhances the network's performance. The experimental results are presented in Tables 9 and 10.

Table 9 Ablation experiments, where the bolded numbers are the highest scores in a column.
Table 10 Ablation experiments, where the bolded numbers are the highest scores in a column.

Summary and outlook

In this study, we proposed the Dimension Information Embedding Attention Net (DIEANet) for lung adenocarcinoma histopathological image grading, building upon Coordinate Attention (CA). Specifically, DIEANet combines different pooling techniques to automatically select local regions representing key growth patterns of lung adenocarcinoma cells, enhancing the model's focus on local information. Furthermore, it adopts an interactive fusion method to concentrate on feature information both within the same dimension and across dimensions, improving the model's capacity to embed dimensional information and thus its perception of global features. Extensive experiments demonstrate that, objectively, DIEANet achieves state-of-the-art performance on the lung adenocarcinoma grading task; subjectively, it also aligns better with the visual attention of pathology experts. In the future, we will collect pathological images of lung adenocarcinoma tissues from different research institutions to expand our dataset. In addition, considering the significant cost of annotating medical images, we plan to introduce weakly supervised techniques: by learning features from publicly available lung adenocarcinoma pathological images without grading labels, we aim to further enlarge the dataset and enhance model performance.