Dynamic hierarchical multi-scale fusion network with axial MLP for medical image segmentation

Medical image segmentation provides effective means to improve the accuracy and robustness of organ segmentation, lesion detection, and classification. Medical images have fixed structures, simple semantics, and diverse details, so fusing rich multi-scale features can improve segmentation accuracy. Because the density of diseased tissue may be comparable to that of the surrounding normal tissue, both global and local information are critical for segmentation results. Considering the importance of multi-scale, global, and local information, we propose the dynamic hierarchical multi-scale fusion network with axial MLP (multilayer perceptron) (DHMF-MLP), which integrates the proposed hierarchical multi-scale fusion (HMSF) module. Specifically, HMSF reduces the loss of detail information by integrating the features of each encoder stage and also covers different receptive fields, thereby improving the segmentation results for small lesions and multi-lesion regions. In HMSF, we propose the adaptive spatial attention mechanism (ASAM) to adaptively adjust the semantic conflicts arising during the fusion process, and we introduce Axial-mlp to improve the global modeling capability of the network. Extensive experiments on public datasets confirm the strong performance of the proposed DHMF-MLP. In particular, on the BUSI, ISIC 2018, and GlaS datasets, IoU reaches 70.65%, 83.46%, and 87.04%, respectively.


Zhikun Cheng & Liejun Wang *
Because medical images are affected by equipment, the partial volume effect, and patient movement, they inevitably contain noise and artifacts. At the same time, lesion areas are complex and diverse. Both factors hinder the physician's diagnosis. Computer-assisted diagnosis has therefore been adopted to increase the efficiency and accuracy of diagnosis.
With the development of deep learning, convolutional neural networks (CNNs) [1] have played a huge role in medical image segmentation. CNNs perform well in many segmentation tasks, such as multi-organ segmentation from abdominal CT images [2][3][4], lesion detection [5][6][7], cell segmentation [8][9][10], and heart segmentation [11][12][13]. Unfortunately, the feature maps of the high-level layers of a segmentation network contain less detail information because of their low resolution, while the feature maps of the low-level layers carry more noise. The low-level layers also have a small receptive field and a weak ability to represent semantic information. However, both high-level semantic information and low-level features are extremely important to the final segmentation result. Effective multi-scale feature fusion helps the network identify segmentation targets more accurately and is an important way to improve segmentation performance. To guide the segmentation of small lesions and multi-lesion regions and to increase prediction accuracy, many CNNs have been proposed that fuse low-level features with high-level semantic information. For example, the purely convolutional U-Net [14] fuses low-level features into the up-sampling path through skip connections. U-Net [14] has become the baseline for most medical image segmentation tasks and has inspired a large number of U-shaped semantic segmentation networks. V-Net [15], which is used for 3D image segmentation, also uses skip connections to transmit low-level features. However, these simple skip connections do not achieve cross-scale interaction. Later, U-Net++ [16] was proposed to indirectly fuse features of several different levels through short skip connections and up-down sampling. MDU-Net [17] extracts rich semantic information through multi-scale densely connected encoders, decoders, and skip connections. As the network deepens, the features of the deep layers are greatly offset from the features of the shallow layers, and direct feature fusion leads to semantic conflicts. These conflicts inhibit the learning of detail information, hinder the establishment of context information for multi-scale features, and have a negative impact on segmentation results.

College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China. * email: wljxju@xju.edu.cn

For the reasons outlined above, many researchers have proposed a variety of attention mechanisms to make networks focus on features of greater interest. SE-Net [18] and Coordinate Attention [19] use the generated weight sequence to explicitly model the dependencies between channels, which increases the sensitivity of the model to channel information and makes channel information contribute more to the final decision. CBAM [20] further combines channel attention with spatial attention and performs better. However, these networks ignore that the proportions of foreground and background information differ between feature maps at different sampling stages.
Based on the above analysis, we propose the dynamic hierarchical multi-scale fusion network with axial MLP (DHMF-MLP) for medical image segmentation, in which we integrate the hierarchical multi-scale fusion (HMSF) module. We generate features with rich semantic and spatial information by fusing features from each stage of the encoder several times. To alleviate semantic conflicts in multi-scale feature fusion and to enhance the global modeling ability of the network, we propose the dynamic spatial linear attention module (DSLA) as a component of HMSF. DSLA includes two parts: the adaptive spatial attention mechanism (ASAM) and the global branch of the multi-gated MLP [21] (Axial-mlp [21]). In the ASAM module, the semantic conflicts between multi-scale features are adjusted adaptively through learnable parameters, and the noise that inhibits segmentation performance is filtered out so that important features receive more attention. Axial-mlp [21] addresses the shortcoming of the baseline (UNeXt [22]) in global information modeling while keeping linear computational complexity.
The contribution of this paper can be summarized as follows:

Related work
Traditional image fusion methods. Traditional medical image fusion algorithms operate in the spatial domain, in the transform domain, or in a combination of the two. Principal component analysis [23] is a common spatial-domain fusion technique for medical imaging. Nevertheless, spatial-domain approaches produce spectral and spatial distortion in the merged images. Researchers have therefore turned to the transform domain to improve the fusion results. The contour transform [24], discrete wavelet transform [25], and pyramid transform [26] are common examples. Although transform-domain approaches produce noise during the fusion process, they preserve structure well and avoid distortion. Better fusion results are obtained when the two approaches are combined. However, traditional fusion methods have two drawbacks. On the one hand, they are compelled to apply the same transform to different source images when extracting features in order to guarantee that the subsequent feature fusion is feasible. Because this process disregards the differences between the characteristics of the source images, it can result in a poor representation of the extracted features. On the other hand, the conventional feature fusion strategies are too coarse and their performance is insufficient. Integrating deep learning into image fusion overcomes these drawbacks of conventional approaches.
CNN-based methods. The emergence of CNNs has led to rapid development in the field of image segmentation. FCN [27] pioneered the use of CNNs for image segmentation and opened a new era of encoder-decoder structures. Subsequently, U-Net [14] combined encoder features from different levels to reduce the information loss caused by pooling and to achieve more accurate pixel boundary localization, and it inspired a plethora of efficient U-shaped segmentation architectures [28,29]. Some researchers have further improved the structure of CNN-based networks, for example with Dilated Convolution [30,31], RefineNet [32], and PSPNet [33]. These networks are widely used in the field of image segmentation. However, because of its inherently local nature, convolution lacks the ability to model global context.

Attention mechanisms.
The attention mechanism is designed to focus the network on more important features. Channel attention weights features along the channel direction to automatically obtain the contribution of each channel to the segmentation result. Representative networks are SE-Net [18], ECANet [34], and FcaNet [35]. The spatial attention mechanism weights features along the spatial dimension so that the network can weaken background noise and pay more attention to foreground information, for example, GE-Net [36], RA-Net [37], and SPA-Net [38].

Research motivation. As down-sampling proceeds, image information is lost and feature offsets can occur. By fusing encoder features layer by layer, interaction between higher-level features and their relatively lower-level features can be achieved, and the bias between features can be reduced. Semantic conflicts arise during the fusion process, and by adaptively adjusting these conflicts, consistent multi-scale feature sequences can be generated, which facilitates the learning of important features. The fused features contain rich semantic information and are up-sampled from the bottom of the decoder, which reduces the semantic gap in the skip connections and improves prediction accuracy. Furthermore, human tissues are highly similar, and both global and local information are critical. While UNeXt [22] ...

... $\{f_1, f_2, f_3, f_4, f_5\}$, which is used for the skip connections and as the input of HMSF. To mitigate the simple semantic properties of medical images, the multi-scale feature sequence is fed into the HMSF module. In this part, features containing rich detail information and high-level semantic information are generated for use as the input for up-sampling. In the decoder part, the generated multi-scale feature is up-sampled by bilinear interpolation and passed through Tok-MLP and convolution to obtain the final prediction map. For all experiments with the DHMF-MLP network, we set C1, C2, C3, C4, and C5 to 32, 64, 128, 160, and 256, respectively.

Tok-MLP.
For the input feature $T$, the channels are divided into $h$ parts, axially shifted along the W-dimension, and tokenized to obtain $T_W$, as given in formula (1):

$$T_W = \mathrm{Tokenize}(\mathrm{shift}_w(\rho(T))) \qquad (1)$$

where $\rho$ and $\mathrm{shift}_w$ indicate the division along the channel dimension and the shift along the W-dimension, respectively.

An MLP is applied to $T_W$ along the channel dimension to map the number of channels to 768, followed by a 3 × 3 DWConv and GELU to obtain $T_1$, as given in formula (2):

$$T_1 = \mathrm{GELU}(\mathrm{DWConv}(\mathrm{MLP}(T_W))) \qquad (2)$$

where DWConv indicates the 3 × 3 depth-wise convolution.

$T_1$ is similarly shifted along the H-dimension to obtain $T_H$. The module output is obtained by mapping $T_H$ back to the original input feature dimension and adding the residual connection from the input. By generating a random window, the module extracts excellent local features. Formulas (3) and (4) are as follows:

$$T_H = \mathrm{Tokenize}(\mathrm{shift}_h(\rho(T_1))) \qquad (3)$$

$$T_{out} = T \oplus \mathrm{FC}(T_H) \qquad (4)$$

where FC and $\mathrm{shift}_h$ indicate the fully connected layer and the shift along the H-dimension, respectively, and $\oplus$ denotes element-wise addition.
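The channel split, axial shift, and tokenization described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `axial_shift` and `tokenize`, the default of five channel groups, the zero-centred shift offsets, and the wrap-around behaviour of `np.roll` at the borders are all assumptions made for illustration.

```python
import numpy as np

def axial_shift(x, axis, num_groups=5):
    """Split the channels of x (C, H, W) into groups and shift each group
    by a different offset along `axis` (1 for H, 2 for W).

    Offsets are centred on zero (e.g. -2..2 for five groups). np.roll wraps
    around at the borders, which is a simplification assumed for this sketch.
    """
    groups = np.array_split(x, num_groups, axis=0)
    offsets = range(-(num_groups // 2), num_groups - num_groups // 2)
    shifted = [np.roll(g, off, axis=axis) for g, off in zip(groups, offsets)]
    return np.concatenate(shifted, axis=0)

def tokenize(x):
    """Flatten a (C, H, W) feature map into a token sequence of shape (H*W, C)."""
    c, h, w = x.shape
    return x.reshape(c, h * w).T

x = np.arange(5 * 4 * 6, dtype=float).reshape(5, 4, 6)
shifted_w = axial_shift(x, axis=2)   # shift along the W-dimension
t_w = tokenize(shifted_w)            # tokens fed to the channel MLP
print(t_w.shape)
```

The same helper with `axis=1` gives the shift along the H-dimension used in the second half of the block.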

Hierarchical multi-scale fusion module (HMSF).
It is well known that the low-level features of a segmentation network contain more fine-grained information, which is helpful for the segmentation of small lesions. The deep layers of the network extract more high-level semantic information, which can improve the accuracy of segmentation. Moreover, rich multi-scale information, which fuses features with different receptive fields, facilitates the segmentation of multi-lesion regions.
In this paper, we propose the HMSF module. The structure of the HMSF module is shown in Fig. 2. HMSF performs two fusions. In the first, the features from each encoder stage are fused with their relatively low-level features. In the second, the results of the first fusion are fused again, and the fused result is used as the input for up-sampling.
Formally, the HMSF module has five input scales $f_i$ ($i = 1, 2, 3, 4, 5$). For $f_1$, there is no relatively low-level feature, so neither feature fusion nor semantic conflict adjustment is required. Only Axial-mlp [21] is applied to build global context and obtain $f'_1$, as given in formula (5):

$$f'_1 = \mathrm{MLP}(f_1) \qquad (5)$$

where MLP represents Axial-mlp [21]. It is a branch of the DSLA, as discussed in detail in "Dynamic spatial linear attention module (DSLA)".

For feature $f_i$ ($i = 2, 3, 4, 5$), its relatively low-level feature $f'_{i-1}$ is down-sampled by a 3 × 3 DWConv to the resolution of $f_i$. The down-sampled feature is concatenated with $f_i$ to retain more channel information, and the DSLA module is applied to obtain the new fusion feature $f'_i$. This intermediate feature is retained and serves as input for the next stage of fusion. In this way, we generate consistent multi-scale sequences with rich detail and high-level semantic information, as given in formula (6):

$$f'_i = \mathrm{DSLA}(\mathrm{Concat}(\mathrm{DWConv}_{3\times3}(f'_{i-1}), f_i)), \quad i = 2, 3, 4, 5 \qquad (6)$$

where Concat is the concatenation operation and $\mathrm{DWConv}_{3\times3}$ represents the 3 × 3 depth-wise convolution.

Then $\{f'_1, f'_2, f'_3, f'_4\}$ are down-sampled to the size of $f'_5$ by AdaptiveMaxPool. The down-sampled features are concatenated along the channel dimension, and a 3 × 3 convolution is applied to obtain the final output $f_{out}$, as given in formula (7):

$$f_{out} = \mathrm{Conv}_{3\times3}(\mathrm{Concat}(\mathrm{pool}(f'_1), \mathrm{pool}(f'_2), \mathrm{pool}(f'_3), \mathrm{pool}(f'_4), f'_5)) \qquad (7)$$

where $f'_i$ denotes the output of the $i$-th encoder layer in the first fusion process, $f_{out}$ is the final fusion output of the HMSF module, and pool denotes AdaptiveMaxPool.
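The second fusion step (pooling the earlier first-fusion outputs to the size of the last one and concatenating along channels) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names are hypothetical, each first-fusion output is assumed to keep the channel count of its encoder stage (32, 64, 128, 160, 256), the spatial sizes assume a 256 × 256 input whose resolution halves at every stage, the pooling helper only handles sizes that divide evenly, and the final 3 × 3 convolution is omitted.

```python
import numpy as np

def adaptive_max_pool(x, out_hw):
    """Max-pool a (C, H, W) map down to (C, out_h, out_w).

    Simplification for this sketch: H and W must be multiples of the target size.
    """
    c, h, w = x.shape
    oh, ow = out_hw
    assert h % oh == 0 and w % ow == 0
    return x.reshape(c, oh, h // oh, ow, w // ow).max(axis=(2, 4))

def second_fusion(features):
    """Pool all but the last feature to the last feature's size, then concatenate on channels."""
    target = features[-1].shape[1:]
    pooled = [adaptive_max_pool(f, target) for f in features[:-1]]
    return np.concatenate(pooled + [features[-1]], axis=0)

channels = [32, 64, 128, 160, 256]   # C1..C5 as stated in the text
sizes = [128, 64, 32, 16, 8]         # assumed spatial sizes per stage
rng = np.random.default_rng(0)
feats = [rng.standard_normal((c, s, s)) for c, s in zip(channels, sizes)]
fused = second_fusion(feats)
print(fused.shape)  # channel count is the sum of the five stage widths
```

Under these assumptions the concatenated tensor has 640 channels before the final convolution reduces it.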

Dynamic spatial linear attention module (DSLA).
As down-sampling proceeds, a positional deviation arises between low-level and high-level features. To resolve the semantic conflicts that occur when they are fused and to enhance the global modeling capability of the network, this paper proposes the DSLA module. As shown in Fig. 3, the DSLA consists of two parts. On the one hand, ASAM is used for feature selection. On the other hand, Axial-mlp [21] is used to enhance the global contextual information of the fused features.

Adaptive spatial attention mechanism (ASAM).
Inspired by SE-Net [18], we propose an efficient mechanism, which is shown in Fig. 3a. In this part, we apply Avgpool and Maxpool to the input feature $F_i \in \mathbb{R}^{B \times C \times H \times W}$ along the channel dimension. To adaptively adjust the balance between redundant background information and foreground information according to the characteristics of the features at different scales, we multiply the Avgpool and Maxpool outputs by the learnable parameters $\mu$ ($0 < \mu < 1$) and $1 - \mu$, respectively. The two weighted features are summed to obtain $F_{add} \in \mathbb{R}^{B \times 1 \times H \times W}$. A sigmoid is applied to $F_{add}$ to obtain the adaptive weight $w \in \mathbb{R}^{B \times 1 \times H \times W}$, which is used for feature selection. The ASAM module is calculated by formula (8):

$$F_{add} = \mu \otimes C_{Avg}(F_i) \oplus (1 - \mu) \otimes C_{Max}(F_i), \qquad w = \sigma(F_{add}), \qquad F_{i+1} = w \otimes F_i \qquad (8)$$

where $F_{i+1}$ represents the output after feature selection, $C_{Avg}$ is the average pooling that compresses the features into a single channel along the channel dimension, and $C_{Max}$ is the max pooling that compresses the features into a single channel along the channel dimension. $\otimes$ denotes element-wise multiplication, $\oplus$ denotes element-wise addition, and $\sigma$ is the sigmoid function.
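As a rough illustration of the mechanism described in this subsection, the following NumPy sketch pools the input over the channel axis with average and max pooling, blends the two maps with weights μ and 1 − μ, passes the sum through a sigmoid, and rescales the input by the resulting spatial weight. It is a sketch under assumptions: the function name `asam` is hypothetical, μ is passed as a fixed scalar rather than learned, and how the paper constrains μ to (0, 1) during training is not specified in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def asam(f, mu):
    """Adaptive spatial attention sketch.

    f:  feature map of shape (B, C, H, W)
    mu: scalar in (0, 1); learnable in the paper, fixed here for illustration
    Returns the re-weighted features and the (B, 1, H, W) weight map.
    """
    avg = f.mean(axis=1, keepdims=True)        # channel-wise average -> (B, 1, H, W)
    mx = f.max(axis=1, keepdims=True)          # channel-wise maximum -> (B, 1, H, W)
    f_add = mu * avg + (1.0 - mu) * mx         # weighted sum of the two maps
    w = sigmoid(f_add)                         # spatial weight in (0, 1)
    return w * f, w                            # broadcast over the channel axis

rng = np.random.default_rng(0)
f = rng.standard_normal((2, 8, 16, 16))
out, w = asam(f, mu=0.3)
print(out.shape, w.shape)
```

A larger μ gives more weight to the averaged response, while a smaller μ emphasises the strongest activation at each position.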
Axial-mlp [21]. To enhance the ability of the network to capture global context information while reducing computational complexity, Axial-mlp [21] processes non-overlapping image patches of fixed size. The structure of Axial-mlp [21] is shown in Fig. 3b. For the input feature $F^{C \times H \times W}$, the channel dimension is mapped to 2C, and the new feature is then gridded (formula (9)). We fix the size of the grid to $d \times d$; in this paper, we set $d = 8$. In formula (9), LN and FC represent LayerNorm and fully connected layers, respectively, $\sigma$ denotes GELU, and $\delta$ denotes the grid operation.
After gridding, the channel dimension is divided into two branches, and the branch on which the MLP is performed is fused with the other branch through a multiplication gate (formula (10)), where $\otimes$ denotes element-wise multiplication.
The output of the multiplication gate undergoes reshape and grid reassembly operations to obtain $F_5^{C \times H \times W}$. Finally, the output of Axial-mlp [21] is obtained by adding $F^{C \times H \times W}$ to $F_5^{C \times H \times W}$ (formulas (11) and (12)), where FC represents fully connected layers, $\phi$ denotes the reshape and ungrid operation, and $\oplus$ denotes element-wise addition.
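The grid, gate, and ungrid steps can be sketched with NumPy as below. This is only an illustration under assumptions: the exact tensor layout after gridding is not recoverable from the text, so the layout `(cells, d*d, channels)` is a choice made here; the function names `grid`, `ungrid`, and `gated_mix` are hypothetical; the MLP is reduced to a single linear map over the positions inside each cell; and LayerNorm, GELU, and the fully connected projections are omitted.

```python
import numpy as np

def grid(x, d):
    """Partition (C, H, W) into non-overlapping d x d cells -> (num_cells, d*d, C)."""
    c, h, w = x.shape
    gh, gw = h // d, w // d
    x = x.reshape(c, gh, d, gw, d).transpose(1, 3, 2, 4, 0)   # (gh, gw, d, d, C)
    return x.reshape(gh * gw, d * d, c)

def ungrid(p, h, w, d):
    """Inverse of grid: (num_cells, d*d, C) -> (C, H, W)."""
    gh, gw = h // d, w // d
    c = p.shape[-1]
    x = p.reshape(gh, gw, d, d, c).transpose(4, 0, 2, 1, 3)   # (C, gh, d, gw, d)
    return x.reshape(c, h, w)

def gated_mix(p, weight):
    """Split channels into two halves, mix one half over the in-cell positions
    with a linear map, and use it to gate the other half by multiplication."""
    u, v = np.split(p, 2, axis=-1)
    v = np.einsum("ij,njc->nic", weight, v)
    return u * v

d, h, w = 8, 16, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((8, h, w))           # stands in for the 2C-channel feature
cells = grid(x, d)                           # (4, 64, 8)
mixed = gated_mix(cells, np.eye(d * d))      # identity weight keeps the example checkable
y = ungrid(mixed, h, w, d)                   # (4, 16, 16), back on the image lattice
print(cells.shape, y.shape)
```

In a trained model the identity matrix would be replaced by learned weights, and the result would be projected back to C channels and added to the input as a residual.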

Experiments and analysis
Datasets. (1) Breast UltraSound Images (BUSI) [47]: the dataset contains ultrasound images and corresponding segmentation masks of normal, benign, and malignant breast cancer cases. We use only the benign and malignant images (647 images) and resize all images to 256 × 256. (2) International Skin Imaging Collaboration (ISIC 2018) [48]: the dataset consists of skin images and the corresponding segmentation masks of skin lesions, with a total of 2594 images. We resize all images to 512 × 512. (3) GlaS [49]: the dataset consists of 165 microscopic images of hematoxylin and eosin-stained slides, all of which are resized to 256 × 256.

Implementation details.
We use the PyTorch framework to develop DHMF-MLP. Consistent with the loss weighting of UNeXt [22], we adopt a combination of binary cross entropy (BCE) and dice loss (Dice) for training. The total loss $L$ between prediction $\hat{y}$ and target $y$ is expressed as:

$$L = 0.5\,\mathrm{BCE}(\hat{y}, y) + \mathrm{Dice}(\hat{y}, y)$$

We use the Adam optimizer to train the model with an initial learning rate of 1e-4 and a momentum of 0.9. The model is trained for 400 epochs. A batch size of 8 is used on the BUSI and ISIC 2018 datasets, and a batch size of 4 is used on the GlaS dataset. Rotation and flipping are adopted as data augmentation methods to force the model to learn more robust features and thus improve its generalization ability. We randomly divide each dataset 8:2 into training and testing sets. We evaluate our method on the three datasets using IoU, Dice, Sensitivity (SE), Accuracy (Acc), Precision (PPV), and Specificity (SP). All training is done on a Tesla V100-PCIE GPU.
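For illustration, the combined loss can be sketched in NumPy. The weighting of 0.5 on the BCE term follows the UNeXt formulation that the text says it is consistent with; the smoothing constants, the clipping of predictions, and the function names are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Mean binary cross entropy; predictions are probabilities in [0, 1]."""
    p = np.clip(pred, eps, 1.0 - eps)          # avoid log(0)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))

def dice_loss(pred, target, smooth=1e-5):
    """1 - Dice coefficient, with a small smoothing term for empty masks."""
    inter = np.sum(pred * target)
    return float(1.0 - (2.0 * inter + smooth) / (np.sum(pred) + np.sum(target) + smooth))

def total_loss(pred, target):
    """Combined loss: 0.5 * BCE + Dice."""
    return 0.5 * bce(pred, target) + dice_loss(pred, target)

target = np.array([[1.0, 0.0], [0.0, 1.0]])
good = np.array([[0.9, 0.1], [0.2, 0.8]])
bad = 1.0 - good
print(total_loss(good, target), total_loss(bad, target))
```

The prediction that agrees with the mask yields the smaller loss, and a perfect prediction drives both terms close to zero.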
Evaluation metrics. We use the IoU, Dice, SE, Acc, PPV, and SP segmentation metrics to quantify the segmentation ability of DHMF-MLP. For instance, IoU assesses the degree of similarity between the prediction and the ground truth, and SE measures the ability to correctly identify the pixels that belong to the region of interest. The formulas are shown below:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{Dice} = \frac{2TP}{2TP + FP + FN}, \qquad \mathrm{SE} = \frac{TP}{TP + FN}$$

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{PPV} = \frac{TP}{TP + FP}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}$$

where TP denotes that the sample is judged to be positive and is, in fact, positive. TN denotes that the sample is judged to be negative and is, in fact, negative. FP denotes that the sample is judged to be positive but is actually negative. FN denotes that the sample is judged to be negative but is actually positive.
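The six metrics can be computed directly from the four confusion counts. The short Python sketch below uses the standard definitions of these metrics; the function name and the example counts are illustrative and do not come from the paper's experiments, and the sketch does not guard against empty denominators.

```python
def segmentation_metrics(tp, tn, fp, fn):
    """Standard pixel-level metrics from confusion counts (no zero-division guard)."""
    return {
        "IoU": tp / (tp + fp + fn),
        "Dice": 2 * tp / (2 * tp + fp + fn),
        "SE": tp / (tp + fn),                    # sensitivity / recall
        "SP": tn / (tn + fp),                    # specificity
        "PPV": tp / (tp + fp),                   # precision
        "Acc": (tp + tn) / (tp + tn + fp + fn),
    }

# Illustrative counts only.
m = segmentation_metrics(tp=60, tn=30, fp=5, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Note that Dice and IoU are monotonically related (Dice = 2·IoU / (1 + IoU)), so they always rank two predictions of the same image in the same order.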
Training process. Figure 4 shows a relatively "perfect" loss curve. At the beginning of the training phase, the loss value decreases significantly, indicating a suitable learning rate and an effective gradient descent process. After a certain stage of learning, the loss curve plateaus.
Comparative experiment. To further measure the effectiveness of the proposed DHMF-MLP network for lesion segmentation, we conduct comparative tests on the BUSI, ISIC 2018, and GlaS datasets. The network architectures used in our comparative experiments include advanced CNN-based networks, such as U-Net [14], U-Net++ [16], U-Net3+ [28], and Att-Unet [29], and transformer-based architectures, such as TransUnet [42] and MedT [50]. We also compare with UNeXt [22], a network based on MLP. In the following, we conduct quantitative and qualitative analyses of the comparative test results. The number of parameters of each network is reported to two decimal places.

On the BUSI dataset, the methods based on traditional convolution still have good performances. U-Net3+ [28] even outperforms the transformer-based MedT [50] network, with the best PPV and SP. However, there is a big gap between the overall performance of the CNN-based methods and that of DHMF-MLP. The IoU, Dice, SE, and ACC of DHMF-MLP are 8.14%, 5.76%, 10.95%, and 0.59% higher than those of U-Net3+ [28], respectively. We note that the 4.54 M parameters of DHMF-MLP are also relatively few compared with the 26.97 M parameters of U-Net3+ [28]. This shows that DHMF-MLP achieves its segmentation performance efficiently.
Qualitative result analysis. The qualitative comparison results of different methods on the BUSI dataset are presented in Fig. 5. According to the third row of Fig. 5, because of the inherently local nature of traditional convolution, the global modeling ability is insufficient, which results in under-segmentation. In contrast to the transformer-based methods, we not only achieve cross-scale interaction but also adaptively adjust the semantic conflicts that arise during fusion according to the image characteristics. As can be seen in Fig. 5, the DHMF-MLP segmentation is more accurate and complete. The qualitative analysis thus verifies the validity of the HMSF module.
Evaluation of the ISIC 2018 dataset. Quantitative result analysis. Table 2 shows the results of the quantitative comparison of different methods on the ISIC 2018 dataset. According to Table 2, DHMF-MLP achieves the best results on all segmentation metrics. Among the other baseline models, the transformer-based TransUnet [42] has the best PPV and SP. However, DHMF-MLP has far fewer parameters than TransUnet [42] and is a relatively lightweight model. These experiments are consistent with the preceding analysis.
Qualitative result analysis.

Evaluation of the GlaS dataset. Quantitative result analysis. Table 3 shows the results of the quantitative comparison of different methods on the GlaS dataset. This dataset is characterized by inconsistencies in shape and size and by numerous small lesion areas. Both local feature extraction and global context feature extraction are extremely important for the segmentation results. As can be seen in Table 3, TransUnet [42] improves its performance by using a CNN and a Transformer to extract local and global contextual information, respectively. UNeXt [22] achieves a strong competitive advantage by extracting excellent local features through CNN and shifted MLP. DHMF-MLP considers both local and global feature extraction and obtains further improved segmentation results by exploiting the characteristics of medical images. Our proposed network (DHMF-MLP) has the best IoU, Dice, ACC, PPV, and SP, which are 3.87%, 2.10%, 2.10%, 3.42%, and 3.46% higher than those of UNeXt [22], respectively. It should be noted that the proposed DHMF-MLP is also a relatively lightweight model, which makes it more feasible in clinical scenarios. DHMF-MLP therefore holds a clear advantage over these advanced methods.
Qualitative result analysis. Based on the above analysis of the characteristics of the GlaS dataset and on the qualitative results in the first row of Fig. 7, U-Net++ [16] and DHMF-MLP enable cross-scale interaction and thereby reduce the interference of redundant information compared with U-Net [14]. In the second row of results, TransUnet [42] and MedT [50] combine the CNN with the Transformer and give better segmentation results at the junction of the lesion area than U-Net++ [16]. Compared with TransUnet [42], DHMF-MLP further considers the feature conflict during fusion and proposes ASAM. As shown in Fig. 7, our method effectively balances foreground and background information. The comparison with the other methods verifies the feasibility of DHMF-MLP for segmentation.
Analytical study. To verify the individual contribution of each module in DHMF-MLP, we perform ablation experiments on the three datasets and compare them with the baseline model (UNeXt [22]).

The addition of Axial-mlp [21] enhances the boundary segmentation effect, validating the module's ability to improve the network's extraction of global contextual information. Compared with DHMF-MLP without (lp and Axial-mlp [21]), DHMF-MLP without Axial-mlp [21] takes into account the difference between the foreground and background information of features at different scales and adjusts itself automatically through the learnable parameters. The segmentation results of the two corresponding columns in Fig. 8 demonstrate the necessity of the learnable parameters.
As depicted in the third row of Fig. 9, DHMF-MLP without DSLA produces a much sharper edge profile than UNeXt [22]. That is, fusing multi-scale features extracts rich semantic information and improves the segmentation effect. As shown in the second row of Fig. 9, DHMF-MLP without ASAM and DHMF-MLP without Axial-mlp [21] achieve better boundary preservation than DHMF-MLP without DSLA by utilizing global contextual information and by adjusting the semantic conflicts that arise during the fusion process, respectively. The segmentation results in Fig. 9 show that the learnable parameters facilitate the adaptive adjustment of the semantic conflicts generated during the fusion process. The combination of the three components significantly improves the segmentation effect and brings the result closer to the ground truth.
According to the location of the red box in the second row of Fig. 10, DHMF-MLP without DSLA effectively reduces the under-segmentation of the lesion region compared with UNeXt [22]. This is due to the better ...

Through the above analysis, it can be seen that the quantitative results are consistent with the qualitative results. These experiments demonstrate the efficacy of our proposed method, which extracts rich multi-scale information to improve the segmentation accuracy for small lesions and multi-lesion regions. They also confirm the feasibility of ASAM for the adaptive learning of important features, as well as the necessity of Axial-mlp [21] for capturing global contextual information. When all of them are applied to the network, as shown in the last column of the qualitative analysis results, they compensate for each other's flaws, resulting in significant improvements in the segmentation effect.

Conclusion
We propose a new medical image segmentation framework called DHMF-MLP. HMSF is proposed as part of the encoder and serves three functions. First, the segmentation accuracy for small lesions and multi-lesion regions is improved by fusing features from each stage of the encoder to obtain rich semantic information and reduce the deviation between features. Second, the lightweight ASAM applies learnable parameters to calculate feature weights based on the foreground and background information of the feature map, and thereby adjusts the semantic conflicts arising from feature fusion. Third, Axial-mlp [21], which is introduced to establish global contextual information, compensates for the lack of global information in the baseline and allows the fused feature information to be propagated, improving the overall performance of the network. Extensive experiments on three medical segmentation datasets show that our proposed DHMF-MLP is competitive with current state-of-the-art methods. In the future, we will investigate the merits of the proposed DHMF-MLP on a wider range of datasets to improve the generalisation capability of the model.

Figure 4. Training loss variation curves for different datasets.

Figure 6 provides exemplary qualitative results generated by different methods for several challenging cases from the ISIC 2018 dataset. The red box in the first row shows that DHMF-MLP effectively balances background and foreground information and improves the segmentation effect. Because of the simple semantics of medical images, rich multi-scale information is beneficial for improving segmentation accuracy. Combined with the importance of global context information for segmentation performance, DHMF-MLP effectively reduces false negatives and preserves boundaries better than the other methods. The red boxes in the second and third rows of Fig. 6 confirm this view.
(1) UNeXt [22] framework; (2) DHMF-MLP without DSLA: our proposed DHMF-MLP framework without the DSLA module in its HMSF module; (3) DHMF-MLP without ASAM: our proposed DHMF-MLP framework with the DSLA module but without the ASAM block; (4) DHMF-MLP without (lp and Axial-mlp [21]): our proposed DHMF-MLP framework with the DSLA module but without the Axial-mlp [21] block and the learnable parameters; (5) DHMF-MLP without Axial-mlp [21]: our proposed DHMF-MLP framework with the DSLA module but without the Axial-mlp [21] block; (6) DHMF-MLP: the complete DHMF-MLP framework proposed by us. Tables 4, 5 and 6 show the quantitative analysis results of the ablation studies on the BUSI, ISIC 2018 and GlaS datasets, respectively. Figures 8, 9 and 10 show the qualitative analysis results of the ablation studies on the BUSI, ISIC 2018 and GlaS datasets, respectively.

Quantitative result analysis. From the quantitative results in Tables 4, 5 and 6, our proposed DHMF-MLP without DSLA outperforms UNeXt [22], verifying that multi-scale feature fusion contributes to better segmentation results. The superiority of DHMF-MLP without Axial-mlp [21] over DHMF-MLP without (lp and Axial-mlp [21]) indicates the importance of the learnable parameters. In addition, DHMF-MLP without Axial-mlp [21] improves the segmentation metrics with essentially no increase in the number of parameters, and DHMF-MLP without ASAM does so with only a small increase. This confirms that ASAM and Axial-mlp are lightweight and that applying them in the feature fusion process is necessary. When all of them are applied to the network, the IoU (%) on BUSI, ISIC 2018, and GlaS increases by 4.76%, 1.10%, and 3.87%, respectively.

Qualitative result analysis. Taking the first row of Fig. 8 as an example, DHMF-MLP without DSLA reduces redundant information through multi-scale feature aggregation and thus reduces over-segmentation. Compared with DHMF-MLP without DSLA, the addition of ASAM has a positive influence on the adjustment of the relationship between foreground and background information and brings the result closer to the ground truth. The addition of Axial-mlp [21] ...
Evaluation of the BUSI dataset. Quantitative result analysis. The quantitative comparison results of different methods on the BUSI dataset are shown in Table 1. Based on the traditional convolution methods, they still ...

Table 1. Comparison results of the proposed method on the BUSI dataset. Significant values are in bold.

Table 2. Comparison results of the proposed method on the ISIC 2018 dataset. Significant values are in bold.

Table 3. Comparison results of the proposed method on the GlaS dataset. Significant values are in bold.

Table 4. Ablation studies of the proposed blocks on the BUSI dataset.

Table 5. Ablation studies of the proposed blocks on the ISIC 2018 dataset.

Table 6. Ablation studies of the proposed blocks on the GlaS dataset.