Introduction

Remote sensing technology is widely used in various fields such as urban planning1,2, land resource utilization3,4,5, and precision agriculture4,6. Semantic segmentation of remote sensing images (RSI) is an important research direction, and various segmentation methods have been developed and applied in practice. The threshold-based segmentation method7,8 realizes semantic segmentation by partitioning the image gray-level histogram with different gray thresholds. The edge-based segmentation method applies edge detection operators such as Roberts9, Sobel10,11, and Prewitt12,13, among others14,15, to identify and connect boundary pixels into object contours. The region-based segmentation method groups pixels into regions according to their similarity; region growing and split-and-merge techniques are frequently employed16,17,18. The traditional approaches mentioned above require manually set parameters, achieve low segmentation accuracy, and cannot adapt to segmentation tasks with rich semantic information.

In recent years, deep learning has achieved profound success in remote sensing image applications19,20,21, especially in semantic segmentation22,23,24. Zheng et al.25 applied the U-Net26 model, widely used in medical image segmentation, to RSI and trained it on the GF-2 RSI dataset. Xuan et al.27 suggested a multipath encoder structure for feature extraction to improve the boundary classification accuracy of target objects in RSI. Zheng et al.28 developed a semantic segmentation model that acquires spatial context with a Markov random field model to enhance the segmentation accuracy of different land categories. Sun et al.29 proposed an improved U-Net that groups channels in a multitasking manner and processes heterogeneous image segmentation through information fusion. Chen et al.30 presented an improved network framework for RSI semantic segmentation based on a spatial-channel fusion squeeze-and-excitation module. Fan et al.31 improved DeepLab32 for extracting cultivated land information, introducing a parameter to adjust the dilated convolution kernel and adding a more precise decoder group to the model structure. Wang et al.33 used ResNet-3434 as the backbone and built a double-branch encoder to extract lakes and water bodies on the Qinghai-Tibet Plateau.

Transformer is a deep learning model based on the self-attention mechanism. Because the transformer captures long-distance dependencies between local and global features by comparing their correlations across all spatial positions, it offers stronger modeling capability. Therefore, more and more researchers are applying it to computer vision tasks. Zhang et al.35 proposed a semantic segmentation model that uses a transformer as the backbone network to better capture long-range spatial dependencies. Wang et al.36 combined the Swin Transformer with a densely connected feature aggregation module to propose a new semantic segmentation model for remote sensing images.

Generative Adversarial Networks (GANs)37 belong to the family of generative models. Luc et al.38 first introduced GANs into image semantic segmentation. Because large-scale annotated datasets are costly and time-consuming to produce, many researchers have shifted their attention to GAN-based semantic segmentation. Li et al.39 proposed a distribution-aligned semantic segmentation network based on GANs. Ma et al.40 suggested a novel GAN that integrates additional discriminators to learn domain-specific features and captures cross-domain dependencies of semantic feature representations through mutually enhancing attention transformers. GAN-based algorithms can generate samples and judge their authenticity, but their performance remains limited in large-scale training.

In summary, the feature learning ability of the neural networks mentioned above has shown substantial advantages in the semantic segmentation of RSI. However, RSI is prone to class imbalance, and object sizes may differ greatly across classes. These characteristics result in insufficient network learning, classification errors, and missed detection of small target objects, decreasing overall segmentation accuracy. In response to these issues, this paper presents a new deep neural network for RSI segmentation. The main contributions of this study can be summarized as follows:

  • A new neural network, MFCA-Net, is proposed for the semantic segmentation of RSI. Moreover, the results of the proposed MFCA-Net are superior to those of other approaches under limited training sample scenarios.

  • In IMV2, attention mechanisms are introduced in the shallow and deep feature maps respectively to improve the segmentation accuracy of the network.

  • The MFDF module captures a wider range of contextual information and denser feature sampling points, effectively alleviating the problems of class imbalance and low segmentation accuracy for small target objects.

Methods

The overall framework of MFCA-Net adopts an encoding–decoding structure, as shown in Fig. 1. We introduce MobileNet V241 as the backbone and improve it. The attention mechanism is used in the shallow and deep feature layers. We add the MFDF module, which not only obtains a larger receptive field but also attempts to solve the problem of identifying small sample targets through denser sampling points. In decoding, three branches are introduced from the feature extraction module, fused, and then upsampled to achieve pixel-level classification of RSI.

Figure 1
figure 1

The overall architecture of the proposed MFCA-Net network (this figure was drawn by Visio 2021, which can be available at https://www.microsoftstore.com.cn/software/office/visio-standard-2021).

Encoder

IMV2

The feature extraction module uses the lightweight MobileNet V2 to ensure the learning performance and efficiency of the network. Owing to its depthwise and pointwise convolutions, the parameter count of MobileNet V2 is only 1/9 to 1/8 of that of standard convolution. Nevertheless, all channels in the feature map are assigned the same weight in MobileNet V2. We improve it into IMV2 (improved MobileNet V2) by introducing a channel attention (CA) mechanism after the shallow feature map Bottleneck1 and the deep feature map Bottleneck6, respectively. The operation process of CA includes compression, activation, and scale operations.
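As an illustration of the parameter saving mentioned above, the following sketch builds a depthwise separable 3 × 3 convolution in PyTorch (the framework used in this work); the channel width of 256 in the comment is an arbitrary example, not a value from the model.

```python
import torch.nn as nn

def depthwise_separable(c_in, c_out, stride=1):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in, bias=False),  # depthwise
        nn.BatchNorm2d(c_in), nn.ReLU6(inplace=True),
        nn.Conv2d(c_in, c_out, 1, bias=False),                                        # pointwise
        nn.BatchNorm2d(c_out), nn.ReLU6(inplace=True))

# Convolution weights for c_in = c_out = 256 (BatchNorm parameters not counted):
#   standard 3x3 convolution:  256 * 256 * 3 * 3       = 589,824
#   depthwise separable:       256 * 3 * 3 + 256 * 256 =  67,840  (about 1/8.7 of the standard conv)
```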

Compression operation

First, global average pooling is applied to the feature map. Then, the feature vector is compressed into a one-dimensional vector through convolution and batch normalization (BN) layers. Each dimension of this one-dimensional vector represents the weight of the corresponding channel. The operation can be expressed as follows:

$${\text{z}}={{\text{F}}}_{{\text{sq}}}\left({\text{f}}\right)=\frac{1}{{\text{H}}\times {\text{W}}}\sum_{{\text{i}}=1}^{{\text{H}}}\sum_{{\text{j}}=1}^{{\text{W}}}{\text{f}}\left({\text{i}},{\text{j}}\right),$$
(1)

where \({{\text{F}}}_{{\text{sq}}}\) is the compression operation function, \({\text{f}}\in {{\text{R}}}^{{\text{H}}\times {\text{W}}}\) is a set of two-dimensional feature maps; \({\text{f}}\left({\text{i}},{\text{j}}\right)\) is one of the elements, H and W are the height and width of the feature map, respectively; z is the output of compression operation.

Activation operation

The channel dimension of the feature vector is reduced to 1/r of the original through the first fully connected layer (FC1), resulting in a 1 × 1 × C/r feature map, where r is the dimensionality reduction ratio. After that, the Funnel activation (FReLU)42 performs the nonlinear processing. The activation functions used in the MobileNet series, whether ReLU or ReLU6, model the one-dimensional linear space of each pixel in isolation, so the characteristics of the pixels surrounding the center point are easily lost, reducing the learning ability of the model. FReLU uses a funnel condition to take the maximum of the center point and its spatial context. The formula is as follows:

$$FReLU={\text{max}}\left({{\text{x}}}_{{\text{c}},{\text{i}},{\text{j}}},{\text{T}}\left({{\text{x}}}_{{\text{c}},{\text{i}},{\text{j}}}\right)\right),$$
(2)

where \({{\text{x}}}_{{\text{c}},{\text{i}},{\text{j}}}\) is the input value at position \(({\text{i}},{\text{j}})\) on channel c, and \({\text{T}}\left({{\text{x}}}_{{\text{c}},{\text{i}},{\text{j}}}\right)={{\text{x}}}_{{\text{c}},{\text{i}},{\text{j}}}^{{\text{w}}}\cdot {{\text{p}}}_{{\text{c}}}^{{\text{w}}}\) is the funnel condition, where \({{\text{x}}}_{{\text{c}},{\text{i}},{\text{j}}}^{{\text{w}}}\) denotes the pooling window centered at position \(({\text{i}},{\text{j}})\) and \({{\text{p}}}_{{\text{c}}}^{{\text{w}}}\) are the parameters shared by this window within the same channel. This funnel-shaped two-dimensional feature extractor therefore captures richer image context information, which helps improve segmentation accuracy. The feature vector is then raised back to the original number of channels through the second fully connected layer (FC2) and transformed into a normalized weight vector, with values between 0 and 1, by a sigmoid function.
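For reference, a minimal PyTorch sketch of the funnel activation described above is given below; following the original FReLU design, the funnel condition T(x) is implemented as a depthwise convolution followed by BN, and the 3 × 3 kernel size is an assumption.

```python
import torch
import torch.nn as nn

class FReLU(nn.Module):
    """Funnel activation: y = max(x, T(x)), where T is a per-channel spatial condition."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # Depthwise convolution implements the funnel condition T(x) over a local window
        self.conv = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return torch.max(x, self.bn(self.conv(x)))
```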

Scale operation

Each channel of the original input feature map is multiplied by its normalized weight to generate the weighted feature map. The formula is

$${\text{x}}={{\text{F}}}_{{\text{scale}}}\left({\text{f}},{\text{s}}\right)={\text{s}}\cdot {\text{f}}\left({\text{i}},{\text{j}}\right),$$
(3)

where \({{\text{F}}}_{{\text{scale}}}\) is the scale operation; \({\text{x}}\) is a value in the final output X of the attention module; \({\text{X}}=\left[{{\text{x}}}_{1},{{\text{x}}}_{2},\ldots ,{{\text{x}}}_{{\text{c}}}\right]\). The entire process is learnable, and the contribution weights of different channels are obtained through backpropagation training. The structure of IMV2 is given in Table 1.

Table 1 The structure of IMV2.

In Table 1, \(t\) is the expansion factor; \(c\) is the depth of the output feature matrix; \(n\) is the number of repetitions of the bottleneck; \({\text{s}}\) is the stride.
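Putting the compression, activation, and scale operations together, a minimal sketch of the CA module is shown below. It reuses the FReLU class from the previous sketch as the nonlinearity after FC1; the reduction ratio r = 16 and the implementation of FC1/FC2 as 1 × 1 convolutions are assumptions, and the conv + BN compression step is folded into FC1 for brevity.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: squeeze (global pooling), excite (FC1 -> FReLU -> FC2 -> sigmoid), scale."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                # compression: global average pooling
        self.fc1 = nn.Conv2d(channels, channels // r, 1)   # FC1: reduce channels to C/r
        self.act = FReLU(channels // r)                    # nonlinearity (FReLU, see previous sketch)
        self.fc2 = nn.Conv2d(channels // r, channels, 1)   # FC2: restore channels to C
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):
        s = self.sigmoid(self.fc2(self.act(self.fc1(self.pool(f)))))
        return f * s                                       # scale: channel-wise reweighting
```

In IMV2, one such module would be inserted after Bottleneck1 and another after Bottleneck6.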

MFDF

The atrous spatial pyramid pooling (ASPP) proposed by DeepLab V243 concatenates feature maps computed with different dilation rates. Although this method obtains a larger receptive field, it is effective mainly for large objects, and few sampling points are captured for minority categories and small target objects. The MFDF design aims to address these issues. This study fuses convolution feature maps with dilation rates of 3, 6, 12, 18, and 24. Adaptive average pooling can integrate a broad range of spatial information and prevent overfitting, so an adaptive average pooling branch is added to this module. These six branches are densely connected, and the overall MFDF structure is depicted in Fig. 2.

Figure 2
figure 2

The structure of MFDF, and c represents concatenation operation (this figure was drawn by Visio 2021, which can be available at https://www.microsoftstore.com.cn/software/office/visio-standard-2021).

Each dilation layer can be represented as follows:

$${{\text{x}}}_{{\text{l}}}={{\text{H}}}_{{\text{K}},{{\text{d}}}_{{\text{l}}}}\left(\left[{{\text{x}}}_{{\text{l}}-1},{{\text{x}}}_{{\text{l}}-2},\dots ,{{\text{x}}}_{0}\right]\right),$$
(4)

where \({{\text{d}}}_{{\text{l}}}\) is the dilation rate of layer l; […] denotes the concatenation of feature layers; \(\left[{{\text{x}}}_{{\text{l}}-1},{{\text{x}}}_{{\text{l}}-2},\dots ,{{\text{x}}}_{0}\right]\) is the concatenated output of all preceding layers. For a dilated convolution layer with dilation rate d and convolution kernel size K, the receptive field size is computed as follows:

$$R=\left(d-1\right)\times (K-1)+K$$
(5)

Stacking convolution layers enlarges the receptive field. If two convolution layers with receptive fields K1 and K2 are superimposed, the combined receptive field is

$$K={K}_{1}+{K}_{2}-1.$$
(6)

The above formulas indicate that the receptive field of the densely connected feature map is 128, whereas the receptive field of ASPP with the same dilation rates is only 51; that is, the receptive field of MFDF is more than twice as large as that of ASPP.
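A minimal PyTorch sketch of the MFDF module described above is given below; the intermediate channel width of 256 and the final 1 × 1 projection are assumptions, and the adaptive-pooling branch is upsampled back to the input resolution before fusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFDF(nn.Module):
    """Densely connected dilated convolutions (rates 3, 6, 12, 18, 24) plus an adaptive pooling branch."""
    def __init__(self, in_ch, mid_ch=256, rates=(3, 6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList()
        ch = in_ch
        for d in rates:
            self.branches.append(nn.Sequential(
                nn.Conv2d(ch, mid_ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True)))
            ch += mid_ch  # dense connection: the next branch sees all previous outputs
        self.pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.ReLU(inplace=True))
        self.project = nn.Sequential(
            nn.Conv2d(ch + mid_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        feats = [x]
        for branch in self.branches:
            feats.append(branch(torch.cat(feats, dim=1)))   # densely connected dilated branches
        pooled = F.interpolate(self.pool(x), size=x.shape[2:],
                               mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```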

Decoder

Relevant studies44,45 have indicated that fusing shallow feature maps, which contain fine details, can improve segmentation accuracy. The present research therefore makes fuller use of the low-level feature maps. After 3 × 3 and 1 × 1 convolutions that adjust their channel numbers, the feature maps of bottleneck1 and bottleneck2 are fused and then downsampled by a convolution with stride 2. After further fusion with bottleneck3, the result is combined with the deep feature map. Finally, fourfold bilinear interpolation is performed for upsampling to produce the segmentation image.
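The following sketch outlines this decoding path under several assumptions: the intermediate channel widths (48 and 256), the classifier head, and the use of bilinear interpolation to align feature maps of different spatial sizes are not specified in the text and are chosen here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of the decoder: fuse bottleneck1/2, downsample, fuse with bottleneck3 and the deep map, upsample x4."""
    def __init__(self, c1, c2, c3, deep_ch, num_classes, mid=48):
        super().__init__()
        self.adj1 = nn.Conv2d(c1, mid, 3, padding=1, bias=False)   # 3x3 convolution on bottleneck1
        self.adj2 = nn.Conv2d(c2, mid, 1, bias=False)              # 1x1 convolution on bottleneck2
        self.adj3 = nn.Conv2d(c3, mid, 1, bias=False)              # 1x1 convolution on bottleneck3
        self.down = nn.Conv2d(2 * mid, mid, 3, stride=2, padding=1, bias=False)  # stride-2 downsampling
        self.head = nn.Sequential(
            nn.Conv2d(deep_ch + 2 * mid, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, b1, b2, b3, deep):
        size = b2.shape[2:]
        b1 = F.interpolate(self.adj1(b1), size=size, mode='bilinear', align_corners=False)
        shallow = self.down(torch.cat([b1, self.adj2(b2)], dim=1))        # fuse bottleneck1/2, downsample
        size = shallow.shape[2:]
        b3 = F.interpolate(self.adj3(b3), size=size, mode='bilinear', align_corners=False)
        deep = F.interpolate(deep, size=size, mode='bilinear', align_corners=False)
        out = self.head(torch.cat([shallow, b3, deep], dim=1))            # fuse with bottleneck3 and deep map
        return F.interpolate(out, scale_factor=4, mode='bilinear', align_corners=False)  # 4x upsampling
```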

Loss function

The loss function most often used in semantic segmentation is the cross-entropy loss, which assigns equal weight to all categories. The present study adds weighting factors to the loss function to increase the importance of minority classes and balance the loss distribution, adopting the focal loss function46. The formula is as follows:

$${L}_{FL}\left({p}_{t}\right)=-{\alpha }_{t}{\left(1-{p}_{t}\right)}^{\gamma }\mathrm{log}\,{p}_{t},$$
(7)

where α is a category balance parameter used to adjust the degree of class balance; γ is the focusing parameter used to emphasize hard samples; pt is the predicted probability of the true category. Experiments revealed that the weight adjustment slightly improves the results.
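A minimal implementation of Eq. (7) for dense prediction is sketched below; the values α = 0.25 and γ = 2 are common defaults from the focal loss paper and are assumptions here, not values reported by this study.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Multi-class focal loss: -alpha * (1 - p_t)^gamma * log(p_t), averaged over pixels."""
    ce = F.cross_entropy(logits, target, reduction='none')  # per-pixel -log(p_t)
    pt = torch.exp(-ce)                                     # p_t, probability of the true class
    return (alpha * (1.0 - pt) ** gamma * ce).mean()        # down-weight well-classified pixels
```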

Experiments

We design two experiments to verify the performance of the proposed MFCA-Net: (i) an experimental investigation of the superiority of the proposed approach over six state-of-the-art methods, namely, SegNet47, U-Net26, PSPNet48, DANet49, DeepLab V3+50, and A2-FPN51. SegNet proposed an unpooling structure based on max-pooling indices, improving the recognition of segmentation boundaries. U-Net is an entirely symmetric semantic segmentation model; the first half of its structure performs feature extraction, and the second half performs upsampling. PSPNet introduces a pyramid pooling module to capture contextual information at different scales, thereby improving semantic segmentation performance. The DANet model introduces both position and channel attention and uses ResNet as the backbone network; it reduces the downsampling factor from 32 to 8 to retain more detailed information and improve segmentation performance. DeepLab V3+ uses atrous spatial pyramid pooling to concatenate feature maps obtained through convolutions with different dilation rates, achieving multi-scale feature extraction. The A2-FPN model performs semantic segmentation of fine-resolution remote sensing images by adding an attention aggregation module to the feature pyramid network. (ii) An ablation experiment that verifies the contribution of each component of the proposed MFCA-Net.

The present study uses pixel accuracy (PA), mean PA (MPA), mean intersection over union (MIoU), and frequency-weighted intersection over union (FWIoU) to measure segmentation accuracy. The operating system of this experiment is Windows 10, the graphics card is an NVIDIA GeForce RTX 3060, the CUDA version of the parallel computing architecture is 11.0, and the deep learning framework is PyTorch 1.7.
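These four metrics can all be derived from the class confusion matrix; a sketch using their standard definitions (not code from this study) is shown below.

```python
import numpy as np

def segmentation_metrics(conf):
    """PA, MPA, MIoU, and FWIoU from a confusion matrix whose rows are ground-truth
    classes and columns are predicted classes."""
    conf = conf.astype(np.float64)
    tp = np.diag(conf)
    gt = conf.sum(axis=1)            # pixels per ground-truth class
    pred = conf.sum(axis=0)          # pixels per predicted class
    with np.errstate(divide='ignore', invalid='ignore'):
        pa = tp.sum() / conf.sum()
        mpa = np.nanmean(tp / gt)
        iou = tp / (gt + pred - tp)
        miou = np.nanmean(iou)
        freq = gt / conf.sum()
        fwiou = np.nansum(freq * iou)
    return pa, mpa, miou, fwiou
```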

Dataset description

Two datasets, Vaihingen (https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx) and the Gaofen Image Dataset (GID) (https://www.cvmart.net/dataSets/detail/765?channel_id=op10&utm_source=cvmartmp&utm_campaign=datasets&utm_medium=article), are used to assess the effect of MFCA-Net. The Vaihingen dataset was collected by airborne imaging equipment over the small village of Vaihingen in Germany. The images consist of three bands: near-infrared, red, and green. The average image size is 2494 × 2064 pixels, and the images are cropped into fixed 512 × 512 tiles with 75% overlap between adjacent tiles. A total of 3300 images are obtained after horizontal flip, vertical flip, and rotation augmentation. The dataset includes six classes: impervious surfaces, buildings, low vegetation, trees, cars, and background. The proportions of these six categories are 27.8%, 26%, 22.9%, 21.3%, 1.2%, and 0.8%, respectively, showing the class imbalance problem.
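For illustration, the following sketch tiles an image into 512 × 512 crops; interpreting the stated 75% coverage as a 75% overlap between adjacent tiles gives a stride of 128 px. Border remainders are ignored here, which is an assumption about the preprocessing.

```python
import numpy as np

def tile_image(img, tile=512, overlap=0.75):
    """Cut an (H, W, C) array into fixed-size tiles with the given fractional overlap."""
    stride = int(tile * (1 - overlap))          # 512 * 0.25 = 128 px
    h, w = img.shape[:2]
    tiles = []
    for top in range(0, h - tile + 1, stride):
        for left in range(0, w - tile + 1, stride):
            tiles.append(img[top:top + tile, left:left + tile])
    return tiles
```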

GID is a large-scale, high-resolution remote sensing land-cover dataset based on imagery from China's Gaofen-2 satellite. The images were taken over more than 60 cities in China, and each image is clear and of high quality, without cloud or fog obstruction. The GID dataset shows rich diversity in spectrum, texture, and structure, which is very close to the natural distribution characteristics of land features. GID includes 10 images with a spatial resolution of 4 m and an image size of 6908 × 7300 pixels. High inter-class similarity and low intra-class discrimination are characteristics of GID images. Similarly, 31,500 images with a size of 512 × 512 are obtained after cropping and data augmentation. In light of the large dataset size, the present study randomly selects 5000 images to create a smaller dataset. The dataset is classified into six categories: background, buildings, cultivated land, woodland, grassland, and water. The class imbalance problem is also apparent: the proportion of grassland is tiny, only 1.6%, while, except for the background, the proportion of cultivated land is the highest, close to 30%.

Quantitative comparison and visual performance

Experiments on Vaihingen

Table 2 lists the results on the Vaihingen test set, with the best performance highlighted in bold. The experimental results show that the segmentation accuracies of the DANet, DeepLab V3+, and A2-FPN models are similar, with the A2-FPN model proposed in 2022 achieving higher segmentation accuracy. MFCA-Net ranks highest on all metrics except MPA, where it is less than 1% below DeepLab V3+. Compared with A2-FPN, its MIoU and FWIoU are 3.18% and 2.86% higher, respectively.

Table 2 Results on Vaihingen.

The visual inspection is presented in Fig. 3. We randomly select three samples and predict their pixel-wise labels. Among all the compared methods, MFCA-Net performs best on vehicle recognition. For low vegetation and trees, which are easily confused, the proposed MFCA-Net delineates boundaries more accurately.

Figure 3
figure 3

Visualization of the results of the Vaihingen testing set: (a) image (b) ground truth, (c) SegNet47, (d) U-Net26, (e) PSPNet48, (f) DANet49, (g) DeepLab V3+50, (h) A2-FPN51, and (i) Our proposed approach. (This figure was drawn by Visio 2021, which can be available at https://www.microsoftstore.com.cn/software/office/visio-standard-2021, The visualization was achieved in Visdom under the PyTorch framework. Vaihingen can be available at https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx).

Experiments on GID

Table 3 shows the experimental results of the various methods on GID. The results show that the segmentation accuracies of SegNet and U-Net are relatively low, while those of PSPNet and DANet are close. DeepLab V3+ has the highest accuracy among the six baseline models, whereas the segmentation accuracy of A2-FPN is only higher than that of SegNet and U-Net. A likely reason is that the inter-class variance between woodland and grassland is small and the proportions of woodland, grassland, and buildings are tiny, resulting in low segmentation accuracy for these three categories; after frequency weighting, the overall accuracy indices are lowered. For datasets with small inter-class variance, the segmentation accuracy of A2-FPN is low. The MFCA-Net proposed in this paper outperforms the best baseline, DeepLab V3+, on all indicators: its PA, MPA, MIoU, and FWIoU are 2.60%, 5.19%, 4.51%, and 3.86% higher, respectively.

Table 3 Results on GID.

For qualitative evaluation, three samples of the GID testing set are predicted and illustrated in Fig. 4. In this dataset, the proportion of grassland is tiny, and SegNet, U-Net, and the A2-FPN proposed in 2022 show poor recognition performance on grassland. A2-FPN also did not perform as well as expected in identifying cultivated land and woodland. Compared with the other six models, the proposed MFCA-Net achieves better recognition for all classes and smoother segmentation boundaries.

Figure 4
figure 4

Visualization of the results of the GID testing set: (a) image, (b) ground truth, (c) SegNet47, (d) U-Net26, (e) PSPNet48, (f) DANet49, (g) DeepLab V3+50, (h) A2-FPN51, and (i) Our proposed approach. (This figure was drawn by Visio 2021, which can be available at https://www.microsoftstore.com.cn/software/office/visio-standard-2021, The visualization was achieved in Visdom under the PyTorch framework. GID can be available at https://www.cvmart.net/dataSets/detail/765?channel_id=op10&utm_source=cvmartmp&utm_campaign=datasets&utm_medium=article).

Ablation study

The ablation study is implemented under the same hyperparameters and runtime environment. As presented in Table 4, MPA and MIoU are reported to analyze the effects. We first list the segmentation accuracy of MobileNet V2 as the baseline network. Next, we investigated how IMV2 influences the segmentation performance: the MPA index improved by 4.08% and 5.39% on the two datasets, respectively, and the MIoU index improved by 2.23% and 1.99%, respectively. Similarly, the contribution of the MFDF module was verified: the MPA index increased by 2.51% and 3.96%, respectively, and the MIoU index improved by 1.58% and 1.54%, respectively.

Table 4 Result of the ablation study.

Figure 5 shows the performance of IMV2 and MFDF on two randomly selected images. The first two columns are the input images and ground truth. The third column shows the results of the baseline network, whose performance is unsatisfactory for the small-proportion clutter marked in red and the cars marked in yellow. The fourth column shows the significant improvement after replacing MobileNet V2 with IMV2. The last column shows the segmentation performance after further adding the MFDF module, which recognizes small-sample classes and small target objects better and is closer to the ground truth.

Figure 5
figure 5

Visualization of the effect of IMV2 and MFDF on Vaihingen: (a) image (b) ground truth, (c) baseline, (d) IMV2, and (e) IMV2 + MFDF (this figure was drawn by Visio 2021, which can be available at https://www.microsoftstore.com.cn/software/office/visio-standard-2021, The visualization was achieved in Visdom under the PyTorch framework. Vaihingen can be available at https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx).

Figure 6 shows the segmentation results of IMV2 and MFDF on the GID dataset. In the upper image, the baseline network fails to identify the buildings marked in red, and they are still missed after adding IMV2; the buildings are ultimately identified once the MFDF module is added. In the lower image, the woodland marked in blue and the grassland marked in yellow are difficult to distinguish, and the recognition improves as the IMV2 and MFDF modules are successively added.

Figure 6
figure 6

Visualization of the effect of IMV2 and MFDF on GID: (a) image, (b) ground truth, (c) baseline, (d) IMV2, and (e) IMV2 + MFDF (this figure was drawn by Visio 2021, which can be available at https://www.microsoftstore.com.cn/software/office/visio-standard-2021, The visualization was achieved in Visdom under the PyTorch framework. GID can be available at https://www.cvmart.net/dataSets/detail/765?channel_id=op10&utm_source=cvmartmp&utm_campaign=datasets&utm_medium=article).

Conclusion

This paper proposes a novel MFCA-Net to improve the semantic segmentation performance of RSI. We introduced a channel attention module after the shallow and deep feature maps of the feature extraction network, respectively, and adopted the two-dimensional activation function FReLU, which captures context information. The MFDF module was designed after deep feature extraction, and the upsampling process fused three branches of shallow feature maps from the backbone network. The proposed MFCA-Net achieved better performance and higher segmentation accuracy than the state-of-the-art methods. The advantages of the proposed MFCA-Net can be briefly summarized as follows: (1) MFCA-Net obtains advanced semantic segmentation results; the experimental results indicate that MFCA-Net outperforms six widely used semantic segmentation methods in both visual observation and quantitative evaluation criteria. (2) The proposed MFCA-Net may achieve quick and effective learning and can be readily promoted in practical engineering applications; the relationship between the loss value and the training epoch indicates the rapid convergence of MFCA-Net. These characteristics are acceptable and even preferred in practical applications. In our future studies, we plan to collect large-area datasets, combine the proposed network with other change detection methods, and apply it to further test its robustness and adaptability.