Asymmetric coordinate attention spectral-spatial feature fusion network for hyperspectral image classification

In recent years, hyperspectral classification algorithms based on deep learning have received widespread attention, but existing network models have high model complexity and are time-consuming. To further improve the accuracy of hyperspectral image classification and reduce model complexity, this paper proposes an asymmetric coordinate attention spectral-spatial feature fusion network (ACAS2F2N) to capture discriminative hyperspectral features. Specifically, adaptive asymmetric iterative attention is proposed to obtain discriminative spectral-spatial features. Unlike common feature fusion methods, this fusion method can adapt to most skip-connection tasks and requires no manual parameter setting. Coordinate attention is used to obtain accurate coordinate information and channel relationships. The strip pooling module is introduced to increase the network's receptive field and avoid the irrelevant information brought by conventional convolution kernels. The proposed algorithm is tested on mainstream hyperspectral datasets (IP, KSC, and Botswana); experimental results show that the proposed ACAS2F2N achieves state-of-the-art performance with lower time complexity.

The construction of network models has become the main technology of hyperspectral image classification and is also a current research focus. In terms of graph convolutional networks, Mou et al. 19 proposed a nonlocal graph convolutional network (GCN). Wan et al. 20,21 explored a multi-scale dynamic GCN based on superpixel segmentation with simple linear iterative clustering (SLIC). Yang et al. proposed GCN-based hyperspectral classification by sampling and aggregating, referred to as GraphSAGE 22 . Liu et al. 23 studied GCN-based hyperspectral classification with label consistency and multi-scale convolutional networks. Ding et al. 24 proposed a globally consistent GCN based on SLIC. Hong et al. fused CNN features and GCN features to perform hyperspectral classification 25 . Sha et al. proposed a graph attention network (GAT) based on KNN and an attention layer 26 . Zhao et al. explored a spectral-spatial GAT to preserve the discriminative features of hyperspectral images 27 . In short, GCNs play an important role in hyperspectral image classification.
Summary of current research: For hyperspectral classification tasks, researchers have explored many classic algorithms, and these studies have further improved the accuracy of hyperspectral image classification. The relevant research is summarized as follows: (1) Network model: Residual networks and densely connected networks are the main network types in current hyperspectral classification. The fusion of 2D CNN and 3D CNN is also a trend in hyperspectral research. Inserting a network module into the backbone network (such as a residual network) may improve hyperspectral classification performance, but it increases the model parameters and time consumption.
(2) Multi-feature fusion: Hyperspectral image data have rich spectral-spatial features, so multi-feature fusion is the main method of hyperspectral classification. Related technologies include multi-scale fusion, multi-branch fusion, and spectral-spatial fusion. The adaptability and effectiveness of feature fusion still need to be considered, so as to avoid manual parameter adjustment and improve the accuracy of hyperspectral classification.
(3) Attention mechanism: The attention mechanism can effectively enhance hyperspectral features and select discriminative spectral-spatial features. In computer vision, most existing attention mechanisms have high complexity, and low-rank attention networks have been explored less. Expanding the receptive field of the network and choosing a suitable attention model are therefore important.
The main contributions of this paper. To further improve the accuracy of hyperspectral image classification and reduce model complexity compared with most existing hyperspectral classification baselines, this paper proposes a new network model, called ACAS2F2N, for hyperspectral classification tasks. The main contributions of this paper are summarized as follows: (1) This paper proposes an asymmetric coordinate attention spectral-spatial feature fusion network (ACAS2F2N) to complete the hyperspectral classification task. ACAS2F2N is an asymmetric, end-to-end feature learning model. The proposed algorithm improves the performance of hyperspectral image classification and has lower model complexity. (2) The adaptive iterative attention feature fusion method is adopted to extract discriminative spectral-spatial features. Different from common feature fusion methods, this method can adapt to most skip-connection tasks and requires no manual parameter setting. (3) Coordinate attention is used to obtain accurate coordinate information and channel relationships. The strip pooling module is introduced to increase the network's receptive field and avoid irrelevant information brought by conventional convolution kernels. (4) The proposed algorithm is tested on mainstream hyperspectral datasets (IP, KSC, and Botswana); experimental results show that the proposed ACAS2F2N achieves state-of-the-art performance with lower time complexity.
The rest of the paper is organized as follows. "Related work" section reviews related studies. "ACAS2F2N network architecture" section describes the proposed network architecture. "Results" section shows the experimental results. "Conclusions" section elaborates the conclusion.

Related work
In hyperspectral classification tasks, generative adversarial networks [28][29][30] , long short-term memory 31 , network architecture search 32 , and capsule networks [33][34][35] have all been used. In addition, Hao et al. proposed a hyperspectral classification algorithm based on a recurrent neural network and a geometry-aware loss, referred to as Geo-DRNN 36 . Regarding the attention mechanism, Xue et al. proposed an attention-based second-order covariance pooling network, which reduces model complexity while ensuring classification accuracy 37 . The attention mechanism is also a key technology to improve the performance of hyperspectral classification. The construction of the network model is still the main technology of hyperspectral classification. Specifically, Sun et al. 44 proposed a fully convolutional segmentation network whose core structure is still a residual block. Pan et al. 45 presented a semantic segmentation network to reduce the manual parameter adjustment of the model. Transfer learning is also one of the research directions of hyperspectral feature extraction 46,47 . A GhostNet module and channel attention have been selected to improve the accuracy of hyperspectral classification 48 . A nonlocal module and a fully convolutional network were chosen to improve the effectiveness of hyperspectral classification 49 . Li et al. proposed a multi-layer fusion dense network based on the 3D dense module to extract spectral-spatial features 50 . Mu et al. separated low-rank components and sparse components and implemented a two-branch network 59 . The convolutional layer and residual module still play an important role in hyperspectral classification [60][61][62][63][64] . In addition, support vector machines 65 , self-learning 66 , and multi-view features 67 are used to obtain discriminative features for hyperspectral classification.
Some methods based on RNN and LSTM are also widely used in the field of histopathology [68][69][70] . The goal of these methods is to obtain highly discriminative spectral-spatial fusion features. In short, hyperspectral classification has become a research hotspot in the field of remote sensing and related research is of great significance.

ACAS2F2N network architecture
Research motivation. At present, some problems in hyperspectral classification still need to be addressed: (1) hyperspectral images have rich spectral and spatial information, but the spatial and spectral information of hyperspectral images is not fully utilized; (2) most existing hyperspectral classification models have high model complexity, i.e., the network model has many parameters.
Although many attention-based algorithms have been proposed in the field of computer vision, directly inserting an attention mechanism into the backbone network inevitably increases the time complexity of the algorithm. In addition, directly fusing two independent networks also increases the time and model complexity. Based on these considerations, the goal of this paper is to independently build a hyperspectral classification network that achieves better classification performance with lower model complexity.
Algorithm advantages. To further improve the accuracy of hyperspectral image classification and reduce the time cost of hyperspectral classification, this paper proposes the asymmetric coordinate attention spectral-spatial feature fusion network (ACAS2F2N). The proposed algorithm has the following major advantages: (1) Different from most existing algorithms, this paper does not use residual networks or densely connected modules to obtain hyperspectral image features, so the proposed algorithm does not increase model complexity.
(2) Coordinate attention is used to extract discriminative spectral-spatial features. Different from existing hyperspectral classification algorithms, coordinate attention can accurately obtain coordinate information and channel relationships. In addition, coordinate attention is applied here for the first time in hyperspectral processing. The feature map ( C ) acquisition process is asymmetric, so the module is named A2IAFFM. Through A2IAFFM, ACAS2F2N can obtain spectral-spatial features with discriminative capability. In addition, compared with most existing feature fusion modules, A2IAFFM requires no additional manual parameter settings while acquiring multi-scale information. After A2IAFFM, the feature map ( D ) is obtained, and its size is C × M × N; (5) hyperspectral image classification is completed based on mean pooling and a fully connected layer. The size of the mean pooling output is N × 3C . Finally, the hyperspectral classification task is completed through the fully connected layer and the cross-entropy loss. The working principle of ACAS2F2N is as follows: (1) obtain the feature map ( A ); (2) coordinate attention is used to obtain accurate coordinate information and channel relationships; (3) the strip pooling module is introduced to increase the network's receptive field and avoid irrelevant information brought by conventional convolution kernels; (4) the adaptive iterative attention feature fusion method is adopted to extract discriminative spectral-spatial features; (5) hyperspectral image classification is completed based on mean pooling and fully connected layers.
In Fig. 1, the symbols are defined as follows: N is the batch size, and C is the channel size, which equals the number of hyperspectral bands. M × N is the size of the hyperspectral image spatial domain, where M is the height and N is the width of the spatial domain. H × W is the size of the feature map, where H is the height and W is the width of the feature map.
Coordinate attention module. In existing hyperspectral image classification algorithms, location information (the hyperspectral spatial domain) and channel relationships (the hyperspectral bands) have received less attention. Hyperspectral images have rich spectral-spatial information, so the coordinate attention module 71 is used in this paper to obtain accurate position information and channel relationships of hyperspectral images.
Channel attention pays less attention to location information, and spatial attention pays less attention to channel relationships. The mixed-domain attention mechanism can consider location information and channel information at the same time. Although the mixed-domain attention mechanism is widely used in computer vision, most existing attention models have high computational complexity 72,73 . In addition, related studies have shown that low-rank attention and lightweight attention are less studied in computer vision 74,75 .
In order to obtain the precise location information of the hyperspectral, we use coordinate attention to extract the feature map ( B ). The coordinate attention module mainly captures position information, and the structure of the coordinate attention module is shown in Fig. 2.
In Fig. 2, the coordinate attention module encodes only the H and W directions. In a hyperspectral image, given a position (i, j), the pixel value on channel c is x_c(i, j).
The outputs of W mean pooling and H mean pooling are defined as follows:

$$y^{i}_{c}(i) = \frac{1}{W}\sum_{0 \le j < W} x_c(i, j), \qquad y^{j}_{c}(j) = \frac{1}{H}\sum_{0 \le i < H} x_c(i, j)$$

The pooled descriptors are then concatenated and transformed:

$$y = \delta\big(F\big([\,y^{i}, y^{j}\,]\big)\big)$$

where [ , ] is the concatenate operation, F is a 1 × 1 convolution, and δ is the ReLU activation function. y is the output feature map of the ReLU layer.
After the split operation, y is decomposed into y^i and y^j. Next, y^i completes the weighting of the H direction through a convolution and an activation function, and y^j completes the weighting of the W direction:

$$w^{i} = \sigma\big(F_i(y^{i})\big), \qquad w^{j} = \sigma\big(F_j(y^{j})\big)$$

where F_i is the convolution operation on H with input y^i, and w^i is the adaptive weight along the H direction of the hyperspectral data. F_j is the convolution operation on W with input y^j, and w^j is the adaptive weight along the W direction of the hyperspectral data. σ is the sigmoid activation function.
The output feature map ( B ) of the coordinate attention module is defined as follows:

$$f_c(i, j) = x_c(i, j) \times w^{i}_{c}(i) \times w^{j}_{c}(j)$$

where c is the c-th channel, w^i_c(i) is the weight of the i-th position in the H direction, and w^j_c(j) is the weight of the j-th position in the W direction. Given a position (i, j), x_c(i, j) is the value of the input feature map ( A ), and f_c(i, j) is the value of the output feature map ( B ).
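To make the pooling-and-weighting structure above concrete, the following is a minimal numpy sketch of the coordinate attention forward pass. It is illustrative only: the learned 1 × 1 convolutions F, F_i, F_j and the ReLU step are replaced by identity maps, so only the W/H mean pooling and the per-direction sigmoid weighting of f_c(i, j) are shown.

```python
import numpy as np

def coordinate_attention(x):
    """Sketch of coordinate attention for an input feature map (A).

    x: array of shape (C, H, W). The learned convolutions are replaced
    by identities purely to illustrate the data flow.
    """
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    # W mean pooling: average over the width axis -> y^i, shape (C, H)
    y_i = x.mean(axis=2)
    # H mean pooling: average over the height axis -> y^j, shape (C, W)
    y_j = x.mean(axis=1)
    # direction-wise weights w^i, w^j (sigmoid of the pooled descriptors)
    w_i = sigmoid(y_i)                   # weight per height position
    w_j = sigmoid(y_j)                   # weight per width position
    # output feature map (B): f_c(i,j) = x_c(i,j) * w^i_c(i) * w^j_c(j)
    return x * w_i[:, :, None] * w_j[:, None, :]

x = np.ones((2, 4, 5))                   # toy input, C=2, H=4, W=5
b = coordinate_attention(x)
```

In the real module, y_i and y_j would first be concatenated, passed through the shared 1 × 1 convolution F and ReLU, then split and transformed by F_i and F_j before the sigmoid.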
Long-term strip pooling module. Different from traditional kernel functions, the strip pooling module uses a narrow-band kernel to enlarge the network's receptive field. In addition, the strip pooling module can capture long-range hyperspectral information and avoid the influence of negative information in hyperspectral images. In Fig. 3, strip pooling uses a narrow-band kernel to focus on the regional information of the hyperspectral image. The input of the long-term strip pooling module is the feature map ( B ), namely f_c(i, j). The output of W strip pooling is defined as follows:

$$g^{i}_{c}(i) = \frac{1}{W}\sum_{0 \le j < W} f_c(i, j)$$

The output of H strip pooling is defined as follows:

$$g^{j}_{c}(j) = \frac{1}{H}\sum_{0 \le i < H} f_c(i, j)$$

Next, the strip pooling module executes Conv2d and BatchNorm2d:

$$h^{i} = BN\big(M_i(g^{i})\big), \qquad h^{j} = BN\big(M_j(g^{j})\big)$$

where M_i is a Conv2d operation with kernel size 1 × W, M_j is a Conv2d operation with kernel size H × 1, and BN is a BatchNorm2d operation. h^i and h^j are the respective outputs of the BatchNorm2d layers, and c is the c-th channel. Feature fusion, ReLU, Conv2d, and the activation function are then defined as follows:

$$N = f_c(i, j) \otimes \sigma\Big(f_2\big(f_1(h^{i}_{c}(i) + h^{j}_{c}(j))\big)\Big)$$

where f_1 is the ReLU operation, f_2 is the Conv2d operation, and σ is the sigmoid function. N is the feature map ( C ), and N is also the output of the long-term strip pooling module.
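The narrow-band pooling and gating just described can be sketched in numpy as follows. This is a simplified sketch: the learned layers M_i, M_j (Conv2d), BatchNorm2d, f_1 (ReLU), and f_2 (Conv2d) are omitted, so only the 1 × W / H × 1 strip averaging and the sigmoid gate applied to the input map are illustrated.

```python
import numpy as np

def strip_pooling(f):
    """Sketch of the long-term strip pooling module.

    f: feature map (B) of shape (C, H, W). Learned conv/BN/ReLU layers
    are left out -- only the strip averaging and gating are shown.
    """
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    # W strip pooling: a 1 x W strip averaged per row -> shape (C, H)
    g_i = f.mean(axis=2)
    # H strip pooling: an H x 1 strip averaged per column -> shape (C, W)
    g_j = f.mean(axis=1)
    # broadcast-add the two strips back to (C, H, W), gate the input
    fused = g_i[:, :, None] + g_j[:, None, :]
    gate = sigmoid(fused)                # stands in for sigma(f2(f1(...)))
    return f * gate                      # element-wise product -> map (C)

f = np.zeros((3, 4, 4))
c = strip_pooling(f)
```

Because each output position sees an entire row and an entire column of the input, the effective receptive field is a long, narrow band rather than a square neighborhood, which is the property the module exploits.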
Asymmetric adaptive iterative attention feature fusion module (A2IAFFM). In Fig. 4, the goal of A2IAFFM is to complete the fusion of feature map ( B ) and feature map ( C ). A2IAFFM can acquire hyperspectral features with discriminative ability. The output of A2IAFFM is the feature map ( D ). A2IAFFM includes local attention, global attention, and adaptive weighting.
To simplify the description, L represents the local attention function, G represents the global attention function, and σ is the sigmoid function. The adaptive weighting is defined as follows:

$$W = \sigma\big(L(X + Y) + G(X + Y)\big)$$

$$Z = W \otimes X + (1 - W) \otimes Y$$

where W is the output weight of the sigmoid function, X and Y are the two input feature maps, and ⊗ is the element-wise product.
The first weight calculation process of A2IAFFM is defined as follows:

$$W_1 = \sigma\big(L(B + C) + G(B + C)\big) \tag{13}$$

where the input feature maps are B and C, and the output is W_1. The first adaptive weighting process of A2IAFFM is defined as follows:

$$Z_1 = W_1 \otimes B + (1 - W_1) \otimes C \tag{14}$$

The second weight calculation process of A2IAFFM is defined as follows:

$$W_2 = \sigma\big(L(Z_1) + G(Z_1)\big) \tag{15}$$

The second adaptive weighting process of A2IAFFM is defined as follows:

$$Z_2 = W_2 \otimes B + (1 - W_2) \otimes C \tag{16}$$

From formula (16), the output feature map ( D ) of A2IAFFM is obtained, denoted by Z_2.
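Assuming A2IAFFM follows the two-round adaptive weighting described above (in the style of iterative attentional feature fusion), the data flow can be sketched as follows. The local and global attention functions here are placeholder stand-ins (identity and global mean), not the learned L and G, so the sketch shows only the weighting structure.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def iterative_attention_fusion(b_map, c_map,
                               local=lambda t: t,          # stand-in for L
                               glob=lambda t: t.mean()):   # stand-in for G
    """Sketch of A2IAFFM's two weighting rounds for maps (B) and (C)."""
    # first weight: W1 = sigma(L(B + C) + G(B + C))
    s = b_map + c_map
    w1 = sigmoid(local(s) + glob(s))
    # first adaptive weighting: Z1 = W1 (x) B + (1 - W1) (x) C
    z1 = w1 * b_map + (1.0 - w1) * c_map
    # second weight computed from the intermediate fusion Z1
    w2 = sigmoid(local(z1) + glob(z1))
    # second adaptive weighting gives feature map (D)
    return w2 * b_map + (1.0 - w2) * c_map

b_map = np.ones((2, 2))
c_map = np.ones((2, 2))
d_map = iterative_attention_fusion(b_map, c_map)
```

The second round re-estimates the fusion weight from the first fused result Z_1, which is what makes the scheme iterative; because B and C enter the weighting asymmetrically, no manually tuned fusion coefficient is needed.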
Other related technical modules. As shown in Fig. 1, the other related technical modules are mainly the mean pooling layer and the fully connected layers. The mean pooling layer is based on the AdaptiveAvgPool2d operation. There are two fully connected layers: the number of input nodes is the number of bands of the feature map, and the number of output nodes is the number of hyperspectral classes. In addition, this paper uses the cross-entropy loss to optimize the network parameters of the proposed ACAS2F2N.

Parameter configuration. Figure 1 shows the overall architecture of the proposed ACAS2F2N, while Figs. 2, 3 and 4 show the network structures of the coordinate attention module, strip pooling module, and A2IAFFM, respectively. The parameter configuration of the proposed algorithm is shown in Table 1. The core of the coordinate attention module contains 2 mean pooling layers ( W mean pooling and H mean pooling) and 3 convolutional layers (Conv2d). The strip pooling module contains 3 convolutional layers, namely ConvH, ConvW, and Conv2d. A2IAFFM contains 8 convolutional layers (Conv2d). The outputs of the three modules (coordinate attention, strip pooling, and A2IAFFM) are combined through a concat operation to obtain the hyperspectral feature map ( D ), whose number of channels is 3C. Therefore, the input of the fully connected layer (FC) is 3C, and the output of the FC is the number of categories.
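As a sketch of the classification head just described (concat of the three C-channel branch outputs into 3C channels, global average pooling in the spirit of AdaptiveAvgPool2d, then a fully connected layer mapping 3C to the class count), assuming hypothetical fc_weight/fc_bias parameters supplied by the caller:

```python
import numpy as np

def classification_head(feats, fc_weight, fc_bias):
    """Sketch of the head in Fig. 1.

    feats: list of three branch outputs, each of shape (C, M, N).
    fc_weight: (num_classes, 3C) and fc_bias: (num_classes,) are
    assumed learned parameters (hypothetical here).
    """
    d = np.concatenate(feats, axis=0)        # concat -> (3C, M, N)
    pooled = d.mean(axis=(1, 2))             # global average pool -> (3C,)
    logits = fc_weight @ pooled + fc_bias    # FC layer -> (num_classes,)
    return logits

feats = [np.ones((2, 3, 3)) for _ in range(3)]   # three branches, C=2
logits = classification_head(feats, np.ones((4, 6)), np.zeros(4))
```

During training, the logits would be passed to a cross-entropy loss, matching the optimization objective stated above.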

Results
To evaluate the effectiveness of the proposed ACAS2F2N in hyperspectral classification tasks, this paper presents experimental comparison and analysis, covering the dataset and experimental parameter settings, baseline algorithms, algorithm performance comparison, ablation analysis, and algorithm performance and complexity analysis. The experiments were performed on a Tesla V100 GPU with 16 GB of memory. The evaluation indicators are overall accuracy (OA), average accuracy (AA), and the kappa coefficient (Kappa). On the three datasets, the ratio of training, validation, and test samples is 3%:3%:94%.
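The three evaluation indicators (OA, AA, Kappa) can all be computed from a confusion matrix; a small numpy sketch (assuming every class appears at least once in the ground truth):

```python
import numpy as np

def oa_aa_kappa(y_true, y_pred, num_classes):
    """Overall accuracy, average accuracy, and Cohen's kappa from labels."""
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                    # rows: true class, cols: predicted
    n = cm.sum()
    # OA: fraction of all samples classified correctly
    oa = np.trace(cm) / n
    # AA: mean of per-class (producer's) accuracies
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))
    # Kappa: agreement corrected for chance agreement p_e
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / (n * n)
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

oa, aa, kappa = oa_aa_kappa([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

OA weights every sample equally, AA weights every class equally (so it is more sensitive to small classes), and Kappa discounts agreement that would occur by chance, which is why all three are reported together.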
Baseline selection. To verify the classification accuracy and real-time performance of the proposed ACAS2F2N, this paper compares it with 7 baselines, including A2S2KResNet 4 (TGRS, 2020) and DBDA 13 .

Algorithm performance comparison. To evaluate algorithm performance fairly, all algorithms use the same parameter settings. On the IP dataset, 3% of the data is used for training, the number of epochs is 200, the code is run 3 times, and the patch length is 4. The experimental results on IP are shown in Table 2 and Fig. 5. They show that the proposed ACAS2F2N outperforms the other baselines while having lower time complexity; that is, it achieves high classification accuracy at low time cost.
Specifically, in Table 2, ACAS2F2N accurately captures position information and enhances the receptive field. In addition, the proposed algorithm uses strip pooling to mine regional information. The proposed ACAS2F2N achieves better classification performance than all the baselines in Table 2 when 3% of the data is used to train the network. In Fig. 5, the classification map of the proposed ACAS2F2N is closer to the ground truth (GT), so the experiment verifies the effectiveness of ACAS2F2N on IP. Compared with A2S2KResNet, the OA of the proposed ACAS2F2N increased by 3.28%, AA increased by 2.52%, and Kappa increased by 3.60%. In terms of time complexity, the test time was reduced from 8.92 s to 7.09 s. Therefore, the proposed ACAS2F2N converges faster on the Botswana dataset.
In Table 4, among the 7 baseline algorithms, DBDA and FDSSC outperform the proposed ACAS2F2N in OA, AA, and Kappa. However, the model complexity of DBDA and FDSSC is higher, and their time consumption is much greater than that of the proposed ACAS2F2N. Comprehensively comparing Tables 2 and 3, the performance of the proposed algorithm is better than that of DBDA and FDSSC.

Ablation analysis. Next, this paper presents the ablation analysis of the algorithm: Table 5 shows the ablation analysis on the IP dataset, Table 6 on the KSC dataset, and Table 7 on the Botswana dataset. For convenience, the coordinate attention module is abbreviated as CAM, the strip pooling module as SPM, and the asymmetric adaptive iterative attention feature fusion module as A2IAFFM.
The overall architecture of the algorithm is shown in Fig. 1. The network modules examined in the ablation analysis are CAM, SPM, and A2IAFFM; the feature fusion methods include the concat and add operations. If the A2IAFFM module is removed, the algorithm only includes the CAM or SPM network blocks and no feature fusion method. The ablation experiments are shown in Tables 5, 6 and 7. The experimental results show that the mentioned modules (CAM, SPM, and A2IAFFM) all have a positive effect on hyperspectral classification. In this paper, concat is used to fuse the features of the three modules; with this configuration, the algorithm achieves the best hyperspectral classification performance.
The impact of training size. The number of training samples has a great influence on hyperspectral classification performance. Therefore, we analyze the impact of training size on OA, Kappa, training time, and test time. Figure 8 shows the OA results of different algorithms under different training sizes. As the training size increases, the OA of the proposed ACAS2F2N also increases. On the IP and KSC datasets, the proposed ACAS2F2N shows the best performance across all training sizes, with less time complexity. On the Botswana dataset, the classification accuracy of the proposed ACAS2F2N also increases with the training size; although its accuracy is not optimal on this dataset, the algorithm has the lowest time complexity and is easier to extend. Figure 9 shows the Kappa performance under different training sizes; Figs. 8 and 9 support similar conclusions and verify the effectiveness of the proposed ACAS2F2N. Specifically, on the IP dataset, when the training size is 2%, 3%, 5%, 10%, and 20%, the OA of the proposed ACAS2F2N is improved by 5.21%, … compared to the A2S2KResNet (TGRS 2020) algorithm. On the KSC dataset, for the same training sizes, the OA of the proposed ACAS2F2N is improved by 2.41%, 1.60%, 0.17%, 0.29%, and 1.49%, respectively, and the Kappa is improved by 2.71%, 1.79%, 0.19%, 0.32%, and 1.66%, respectively, compared to A2S2KResNet.

Time consumption analysis. Time consumption analysis demonstrates the convergence of the algorithm.
Time consumption analysis is also a concrete manifestation of model complexity. This paper gives the time consumption of all algorithms under different training sizes, including training time and test time. Figure 10 shows the training time and Fig. 11 the test time under different training sizes. Figures 10 and 11 jointly illustrate the time complexity of the proposed algorithm and the baselines, and show that the proposed ACAS2F2N has lower time consumption. In Figs. 8 and 9, we illustrated the impact of different training sizes on OA and Kappa. In terms of both time complexity and classification accuracy, the proposed ACAS2F2N outperforms the baselines. The experiments verify the effectiveness of the proposed algorithm.
Training and verification accuracy. Figure 12 shows the training and verification accuracy, and shows that the proposed ACAS2F2N performs better than A2S2KResNet. In short, the proposed ACAS2F2N achieves better classification performance on the three hyperspectral datasets, with high classification accuracy accompanied by lower time consumption.

Conclusions
In this paper, we propose an asymmetric coordinate attention spectral-spatial feature fusion network (ACAS2F2N) to complete the hyperspectral classification task. Compared with the baselines, the proposed algorithm improves the performance of hyperspectral image classification and has lower model complexity. Specifically, coordinate attention is used to obtain accurate coordinate information and channel relationships. The strip pooling module is introduced to increase the network's receptive field and avoid irrelevant information brought by conventional convolution kernels. The asymmetric adaptive iterative attention feature fusion module is adopted to extract discriminative spectral-spatial features. The experiments were performed on three datasets (IP, KSC, and Botswana), and the results show that the proposed ACAS2F2N outperforms the baselines while having low model complexity and time consumption.