Introduction

The third most common form of cancer worldwide is colorectal cancer, and its prevalence is increasing every year1. About the precursors of colon cancer, it is commonly accepted that most colorectal cancers evolve from adenomatous polyps2. Recent surveys and statistics underline polypoid lesions are precursors to most ( 85%) colorectal cancers3. Colonoscopy is the ‘gold standard’ method for examining colon and rectum4,5. Importantly, it has been assessed that the proportion of colon polyps missing during endoscopies could range from 20 to 47 percent6. A review article noted that an early detection of the CRC increases the 5-year survival rate from 18% when CRC is detected in the highest grade to 88.5% when it is detected in an initial grade due to symptoms7. Along with the development of artificial intelligence, semantic segmentation methods of AI assisted colonoscopy detection can significantly reduce the risk of misclassification and omission of polyp cancer, colorectal tumor lesions and colorectal cancer from early to late stages due to various reasons8. Therefore, the accuracy of semantic segmentation of colon polyps needs to be improved to achieve better support for colonoscopy detection.

Many networks for deep learning have achieved advanced performance in polyp-by-pixel segmentation9. The backbone of many of these excellent networks is the Pyramid Vision Transformer V2 (PVTv2)10 or the Mixed Transformer (MiT)11. High-level semantic features are more appropriate for the model to achieve a higher performance12. Feature fusion, a common technique in polyp segmentation tasks, has shown exceptional results13,14. However, there are still advanced semantic features that can further improve the segmentation accuracy through feature fusion. The process of discovery is to select the location of the feature maps according to Di et al15. and refer to Han WC et al16. to generate the feature maps of FCB-Former17 and ESFPNet18 in the form of heat maps, as shown in Fig. 1. The darker the warm color on the feature map indicates the more obvious features of the polyp or the background, while there is a clear change from warm to cold color at the polyp-background junction. It can be found that the features are not complete enough to lead to accurate segmentation results. In order to solve this problem, we propose a fusion strategy, Multi-Head Control Ensemble, which fuses complementary features step-by-step and integrates different feature results optimally to achieve efficient utilization of complementary features.

Colon polyps are thought to vary widely in size, orientation, color and texture19. It is difficult for a single network to produce accurate predictions in various situations20. Employing a multi-network ensemble strategy is anticipated to both enhance and stabilize performance. However, variability among polyps, coupled with the risk of networks converging to a local optimum during training, may result in a network that adversely affects the effectiveness of the ensemble at the end of training. The proactive identification and removal of such a network before the conclusion of training represent a challenge. A review article points out that generalisability studies are very limited in medical image analysis21. To solve the above problems, we propose a generalisability ensemble learning strategy that adaptively selects the most suitable network for the ensemble for different datasets, thus stabilizing the output of high-performance segmentation results.

The main contributions of this paper are as follows:

  1. (1)

    In order to maximize the use of complementary features, we propose Multi-Head Control Ensemble (MHC Ensemble), which can effectively supervise the network and output high-precision segmentation results.

  2. (2)

    In order to achieve stable and high-performance segmentation on discrepant data, we propose an improved Particle Swarm Optimization algorithm for optimizing sub-model weights in ensemble learning. And based on this, we propose a strategy SDBH-PSO Ensemble that can perform adaptive selection of sub-models under different datasets.

Figure 1
figure 1

Feature heat maps and polyp segmentation results under CVC-ColonDB and ETIS-LaribPolypDB datasets.

Related work

Ensemble learning

Ensemble learning methods are broadly categorized into: bagging, boosting, and stacking22. Bagging ensemble improves accuracy by training a single network with multiple copies of the dataset23. Boosting ensemble optimizes the ensemble results by assigning greater weights to the erroneous copies on top of the bagging ensemble by assigning greater weights to the erroneous copies to optimize the integration results24. Combining multiple models helps to improve and stabilize the results25. Stacking ensemble’s approach of integrating multiple networks as sub-models can provide strong robustness26. Stacking ensemble and ensemble multi-output is expected to solve the task of colon polyp segmentation, which is difficult for a single network. For the sake of simplicity, we will refer to “ensemble multi-output” as “ensemble” elsewhere in this paper.

Kang et al27. used ensemble learning to ensemble segmentation results from Mask R-CNN networks using ResNet50 and ResNet101 as the backbone. Thanh et al28. used ensemble learning to ensemble UNet segmentation results from EfficientNet B4 and EfficientNet B5. Nanni et al29. used the PvTv2 segmentation ensemble on other tasks and achieved excellent segmentation results. The sub-models used for the ensemble have changed as the model performance has improved. A review article on ensemble learning points out that a major challenge in deep ensemble learning is model selection for building the ensemble architecture30.

In the model selection problem, Zhang et al31. used neighborhood mutual information to select the models involved in the ensemble on carbon emission prediction. Djellali et al32. selected the models involved in the ensemble in a data mining task based on k-fold cross-validation. Both of the above methods perform sub-model selection at the end of training. Birman et al33. used reinforcement learning for sub-model selection during training in malware detection tasks. Labeling for colon polyp segmentation is more expensive, which limits the application of reinforcement learning in this area. To explore the application of model selection to colon polyp segmentation, we propose the SDBH-PSO Ensemble.

Ensemble learning with improved PSO optimization

The most critical aspect of the ensemble is the optimization of the ensemble weights and the selection of sub-models, and the Particle Swarm Optimization (PSO) algorithm34 is commonly used for the optimization of the ensemble weights35. PSO algorithms have been used to solve a variety of mathematical, engineering, design, network, robotics, and image processing optimization problems36. The solution in the PSO algorithm is represented as a particle, which holds a position vector and a velocity vector. PSO searches for the optimal solution by iteration. In each iteration, the velocity \(v_{id}(t)\) of each particle is updated based on its previous optimal position \(P_{pd}(t)\), the current optimal position \(P_{gd}(t)\) of all particles, random numbers \(r_1\) and \(r_2\) in [0, 1], adjustable inertia parameter \(\omega\), and adjustable learning parameter \(c_1\) and \(c_2\), while the position \(x_{id}(t)\) is varied as the velocity changes, as defined in Eqs. (1) and Eqs. (2).

$$\begin{aligned} v_{id}(t+1)&={\omega v}_{id}(t)+c_1r_1(P_{pd}(t)-x_{id}(t))+c_2r_2(P_{gd}(t)-x_{id}(t)) \end{aligned}$$
(1)
$$\begin{aligned} x_{id}(t+1)&=x_{id}(t)+v_{id}(t+1) \end{aligned}$$
(2)

The PSO algorithm is considered to continue to be dynamic in interdisciplinary research in the future37. In recent years, Gu et al38. proposed a resampling PSO algorithm for optimizing the scheduling of multi-star, large-area target observations. Subsequently, Song et al39. proposed a large-scale nonconvex joint optimization method based on PSO in order to solve the active control problem of wind farm layout and turbine yaw. Fontes et al40. proposed an improved PSO algorithm based on the job shop scheduling problem of transportation resources to be solved. Similarly, Qian et al41. proposed an improved PSO algorithm, which successfully realized the intelligent selection of the piston sealing groove for the designed domestic cylinder. Du et al42. proposed an improved PSO algorithm for ordered charging strategy, which can reduce the charging cost and peak variance of electric vehicles. Thus, on the problems that can be optimized by PSO algorithms, designing and improving PSO algorithms based on the problem to be solved or optimized is expected to solve the problem in a better way.

Image segmentation on colon polyps

On the task of semantic segmentation of colon polyps, this paper focuses on realizing high-precision and stable segmentation of polyps by building branches and feature fusion, and the relevant state of the art in this regard is as follows.

On branch building, Guo et al43. proposed a two-branch approach called ThresholdNet to collaborate segmentation and threshold learning in alternative training strategies. Fang et al44. proposed a new boundary-sensitive loss to model the interdependence between region branches and boundary branches. In order to better extracte the detail information, Zhang WC et al45. used to capture the local appearance details through the dual branch structure of Transformer and CNN. Chen et al46. built a depth feature extraction branch and depth bootstrapping for extracting the depth information between pixels. Wang et al47. built a new anchor-free instance segmentation framework by performing object detection branching for classification and localization with mask generation branching for generating instance-level masks. Fan et al48. achieved a more stable training process in federated learning by building a multi-branch network.

For feature fusion, Huang et al49. re-weighted encoder features in space and channel to enhance key features for segmentation task. To enhance the features on the boundary, Zhou et al50. merged the boundary information into the segmentation network to generate finer segmentation maps. Liu et al51. achieved adaptive feature fusion and selection for the network by channel attention. In addition, Chen et al52. utilized rich global context information to refine the fused features for informative feature representation. Patel et al53. improved the quality of features layer by layer, which in turn enhanced the final feature representation. Wang et al54. suggested that the region around the polyp has more detailed features that facilitate polyp segmentation.

Method

Overview

In order to fuse complementary features and perform stable high-performance segmentation on the disparate colon polyp dataset, we built the Dual Ensemble System, as shown in Fig. 2. Among them, in order to provide complementary features, we built Three-branch Architecture, which fuses complementary features through MHC Ensemble. In addition, in order to achieve stable and high-precision segmentation on different datasets, we choose FCB-Former and ESFPNet, which have complementary phenomena in the output results, and also take into account that there are differences in the adaptability of different datasets to the optimal network depth. The sub-models selected for the SDBH-PSO Ensemble range from sub-model 1 to sub-model 6 and include the following: Treble-Former-L(MiT-B4, PvTv2-B4), Treble-Former-S(MiT-B2, PvTv2-B2), FCB-Former-L(PvTv2-B4), FCB-Former-S(PvTv2-B2), ESFPNet-L(MiT-B4), ESFPNet-S(MiT-B2). Finally, the best real-time ensemble model and the best sub-model optimized by SDBH-PSO are again subjected to final ensemble. In addition, DET-Former is an ensemble structure that allows segmentation across multiple devices. It has an FPS of 3.9 for single-image input.

Figure 2
figure 2

The architecture of Dual Ensemble System with Treble Transformer.

Three-branch architecture

Mix Transformer Branch (MTB). In order to stabilize the performance during training so as to facilitate the integration with other branches, we constructed the MTB as shown in Fig. 3a. In order to improve the consistency of the convergence speed of the training parameters in each branch of Treble-Former, we add GroupNormal as a normal layer before the linear layer, which can stabilize the performance and reduce the effect of batch size on the model, and ultimately make it easier for MTB to integrate with the other branches to become a powerful network. In addition, both polyp and background features should be concerned in polyp segmentation. Therefore, we use SiLu as the activation function, which can better preserve both polyp and background features in each feature map.

Pyramid transformer branch (PTB). In order to maintain the complementarity of the features in Fig. 1, we retain some of the structures in FCB-Former. Since the Transformer Branch in FCB-Former uses PVTv2 as the backbone, and PVTv2 uses convolutional layers instead of the linear layers of the traditional Transformer, PVTv2 is able to capture the information of the polyp boundaries when sensing the global field of view as well as CNN. So we remove the full convolutional branch of FCB-Former and keep the Transformer Branch as the PTB in Treble-Former.

Swin transformer branch (STB). In order to make the STB output different features from the first two branches, we adopt DoubleUNet55 as the structure of the STB. DoubleUNet has good feature fusion capability on a network with UNet as the encoder. Since Swin Transformer56 does not use a convolutional layer, the improvement of the extraction ability of features on the details of colon polyps can be realized by combining VGG19 with a stacked 3\(\times\)3 convolution. Therefore, we fused the SwinUNetR57 equipped with Swin Transformer with the UNet equipped with VGG19 for coarse and fine features by using the structure of DoubleUNet.

Multi-head control ensemble

Multi-head control ensemble (MHC Ensemble). As shown in Fig. 3b, three branches output branching features. In order to fuse the complementary branch features, the branch outputs are cascaded step by step through the RB module and LB module of FCB-Former. In addition, multi-loss function supervises and multi-head output Ensemble are also performed on the results of multi-head output, and this whole process is collectively called Multi-Head Control Ensemble.

Multi-loss function supervises. In the problem of binary classification of polyp and background, we expect the deep model to learn the polyp and background features while paying more attention to the representative features of the polyp. Therefore, we then chose the combination of the cross entropy loss function (CE loss), which pays attention to the background and polyps, and the Dice loss function, which pays attention to the polyps only, as the loss function supervised training for each output header.

Multi-head output Ensemble. In order to complement the output results in the multi-head output, we first empirically categorize the five multi-head outputs in Fig. 3b into three categories according to performance from highest to lowest: (\(\alpha\)) MTB concatenated PTB concatenated STB’s output head; (\(\beta\)) MTB’s output header and the output header after PTB concatenate STB; and (\(\gamma\)) the output header of PTB and STB. In order to take into account, the performance differences of each output head in a specific case, when integrating the output results of multiple output heads, the definition of weights will be based on the specific division of weights according to the mDice coefficients, the evaluation indexes of each output head on the validation set.

$$\begin{aligned} W_{head}&=\textstyle \sum _{i}^{I}Softmax(d_j) \end{aligned}$$
(3)
$$\begin{aligned} Output_{Ensemble}&=\textstyle \sum _{i}^{I}(Output_{head\ i}\cdot W_{head\ i}) \end{aligned}$$
(4)

where \(J\in {\left\{ \alpha ,\ (\alpha +\beta ),(\alpha +\beta +\gamma ) \right\} }\). d denotes the evaluation index mDice corresponding to the corresponding output head. \(Output_{head\ i}\) denotes the output result of the output head. The mDice coefficients corresponding to each class in J are Softmaxed and then accumulated to generate the ensemble weights \(W_{head}\). The weights are weighted and summed with the outputs of the header \(Output_{head\ i}\) to generate the integrated prediction result \(Output_{Ens}\).

Figure 3
figure 3

(a) The architecture of Mix Transformer Branch. (b) The architecture of Multi-Head Control Ensemble.

SDBH-PSO

Since the global optimal solution before iteration in the real-time ensemble task is not necessarily the global optimal solution in this epochs, there is a need to prevent the optimal particle from being a local optimal solution. Therefore, it is necessary to initialize the particles that are too close to the optimal particles when the whole is too aggregated. The degree of proximity of each particle to the optimal particle is defined by Pearson’s correlation coefficient58, and the overall aggregation of particles is defined by Renxoa Wan’s aggregation coefficient C(k)59.

$$\begin{aligned} {sim}_i&=\frac{Cov(X_{id},P_{gd})}{\sqrt{Var(X_{id})}\sqrt{Var(P_{gd})}} \end{aligned}$$
(5)
$$\begin{aligned} C(k)&=\sum _{j=1}^{n}{\frac{{sim}_j}{\sum _{i=1}^{n}{sim}_i}{sim}_j} \end{aligned}$$
(6)

where C(k) is the aggregation degree of the particle population in the kth generation and n is the population size. Since the iterations are all relatively homogeneous and may lead to excessive oscillations in particle aggregation in later iterations, an adaptive function \(\theta (k)\) controlled by a nonlinear function is added to assess the degree of particle overlap59. Whether the particles have a tendency to fall into local optimum is judged by \(H \cdot \theta (k)<C(k) \cdot {sim}_j\), where H is a constant used for adjustment. Through many experiments, it is found that SDBH-PSO has the best effect on weight adjustment when H is taken around 3.

$$\begin{aligned} \theta (k)={\left( \frac{k_{max}-k}{k_{max}}\right) }^s \cdot (\lambda _{max}-\lambda _{min})+\lambda _{min} \end{aligned}$$
(7)

Where \(k_{max}\) is the maximum number of iterations and k is the current number of iterations. s is an exponential factor. We set \(s=1.0\), \(\lambda _{max}=0.9\), \(\lambda _{min}=0.4\) in our experiments, while the adjustable inertia parameter in Eqs. (1) is also set to \(0.9-0.5{({k}\setminus {k_{max}})}^2\) with reference to the nonlinear tuning method.

The potential global optimal point is usually within a certain distance from the current optimal point, and we found that in our task, the variation of the distance between the previous generation global optimal point and the next generation is roughly concentrated in the range of [0.04, 0.23] through analysis. The strategy of RBH-PSO60 to search for the potential global optimum is used to randomly select a point within a certain range as the location of the potential optimum \({\widetilde{x}}_{id}\). Compared to the RBH-PSO in which the radius \(\zeta =0.01\) is taken as the range, the global optimum solution of the real-time ensemble is subjected to the influence of the model training and has a large variation. So, the search for potential optimal solutions needs to be expanded. Therefore, the position of the particle to be reset is placed into the black hole combined with randomizing the initial velocity to \({\widetilde{v}}_{id_0}\) for resetting.

$$\begin{aligned} {\widetilde{x}}_{id}(t+1)&=P_{gd}+\zeta \end{aligned}$$
(8)
$$\begin{aligned} {\widetilde{v}}_{id_0}&=(v_{id_{max}}-v_{id_{min}}) \cdot P_{gd} \cdot rand \cdot Gaussian(\mu ,\sigma ^2) \end{aligned}$$
(9)
$$\begin{aligned} \sigma&=(\sigma _{max}-\sigma _{min}) \cdot \frac{k}{K} \end{aligned}$$
(10)

where \(\zeta\) is randomly derived from a uniform distribution over the interval \([-\xi , \xi ]\), and \(\xi\) is taken as 0.1. Rand is a random number in the range [0, 1], \(x_{id\ max}\) and \(x_{id\ min}\) are the upper and lower bounds of the search space, and \(Gaussian(\mu ,\sigma ^2)\) is a Gaussian function. We set \(\sigma _{max}=1.0\) and \(\sigma _{min}=0.1\).

Similarity degree black hole PSO can be summarized simply in Algorithm 1.

Algorithm 1
figure a

SDBH-PSO algorithm

SDBH-PSO ensemble

We use the ensemble weights as the positions of the particles in SDBH-PSO, and the mDice coefficients achieved by the Ensemble Learning outputs as the fitness of the particles in the validation set, and achieve the optimal allocation of the ensemble weights to the validation set through the SDBH-PSO algorithm: every ten epochs. In addition, the initial learning rate is 0.001, and the learning rate is adjusted with the strategy that the learning rate decreases to half if the Dice coefficient has not improved for five consecutive generations on the validation set. When the learning rate of all sub-models has decreased an equal number of times, the weights of Ensemble Learning are optimized by the SDBH-PSO Ensemble algorithm. The sub-models with weights less than or equal to 0 are marked. When the sub-model has been labeled 3 times, the learning rate corresponding to the learning rate adjustment strategy is already very low, so even if we continue to train this sub-model, it will not improve much, so we choose to eliminate it.

Regarding parameter configurations, the SDBH-PSO Ensemble performs five iterations for every ten epochs, utilizing 50 particles per iteration. Where the iterative process is the same. The running speed of the SDBH-PSO Ensemble, which optimizes parameters via validation sets, is contingent upon the size of this set. Specifically, for a validation set of 100 images, the iteration time per generation is approximately one minute. Conversely, if the validation set contains 61 images, the iteration time is reduced to around 35 s.

In contrast to other semantic segmentation that Ensemble Learning performs Ensemble by best sub-model, SDBH-PSO Ensemble performs real-time ensemble every ten epochs during the training process. The SDBH-PSO Ensemble’s best ensemble model’s checkpoint of the SDBH-PSO Ensemble is not necessarily the checkpoint of the best sub-model, we perform the final ensemble of the best ensemble model with the best sub-model, and the final ensemble is defined as follows.

$$\begin{aligned} L_{DET-Former}=W_{rt} \cdot L_{rt}+ {\textstyle \sum _{i}^{\widetilde{I}}}({W_{{sub}_i} \cdot L_{{sub}_i}}) \end{aligned}$$
(11)

where \(L_{rt}\) is the output of the best real-time ensemble model, \(L_{{sub}_i}\) is the output of the best sub-model, \(W_{rt}\) is the weight hyperparameter of the best real-time ensemble model, \(W_{rt}=0.5\), \(W_{{sub}_i}\) is the weight hyperparameter of each best sub-model, \(W_{{sub}_i}=0.5\setminus \widetilde{I}\), and \(\widetilde{I}\) is the weight hyperparameter of each sub-model filtered by SDBH-PSO according to different data sets. \(L_{DET-Former}\) is the final output of SDBH-PSO Ensemble.

Experiments

Dataset

The following datasets are used in this paper: the Kvasir61, CVC-ClinicDB62, CVC-ColonDB63, ETIS-LaribPolypDB1 and PolypGen64. The PolypGen comes from 6 unique centers suitable for Generalisation testing compared to other datasets. The information of the datasets is shown in Table 1. Due to the different image sizes of different datasets, we scale all the sizes to \(352\times 352\) and set the batch size to 4. In this paper, we combine ESFPNet, SS-Former, and an analytical paper illustrating the effect of polyp segmentation dataset enhancement on segmentation66, and choose random flip, scale, rotate, as well as random expansion and erosion as the data augmentation operations.

Table 1 The most commonly used public dataset for polyp segmentation.

Evaluation metrics

Almost all of the colon polyp segmentation papers adopt the mDice coefficient and the mIoU coefficient as the evaluation performance metrics to measure segmentation accuracy. Furthermore, we choose the 95th percentile of the asymmetric Hausdorff distance (HD95) as a performance metric for the boundary of interest. mDice, mIoU and HD95 are calculated using the following formulae\(:\)

$$\begin{aligned} mDice&=\frac{1}{2}\sum _{i}^{k\in (P,B)}\frac{{2\times n}_{ii}}{\sum _{j}^{k}{n_{ij}+\sum _{j}^{k}n_{ji}}} \end{aligned}$$
(12)
$$\begin{aligned} mIoU&=\frac{1}{2}\sum _{i}^{k\in (P,B)}\frac{n_{ii}}{\sum _{j}^{k}{n_{ij}+\sum _{j}^{k}{n_{ji}-n_{ii}}}} \end{aligned}$$
(13)
$$\begin{aligned} HD95&=\max _{k95\%}[d(X,Y),d(Y,X)] \end{aligned}$$
(14)

where \(n_{ii}\) denotes the number of real numbers and is predicted to be j and k is the category of polyp and background (polyp abbreviated as P and background abbreviated as B). \(n_{ii}\) is the number of correctly predicted values, and \(n_{ij}\) and \(n_{ji}\) denote the false positives and false negatives respectively. The one-way Hausdorff distances d(XY) measure how far the predicted results are from the actual results and d(YX) as well as vice versa.

Regarding the evaluation of the success of polyp categorization without calculating the background, we choose Dice, which is formulated as follows\(:\)

$$\begin{aligned} Dice=\frac{{2\times n}_{PP}}{\sum _{j}^{k}{n_{Pj}+\sum _{j}^{k}n_{jP}}} \end{aligned}$$
(15)

Compare experiment

In the compare experiment of DET-Former, we use UNet, UperNet and DoubleUNet as the base networks and SS-Former, FCB-Former, ESFPNet, HarDNet-DFUS65 amd Nanni’s Ens (Nanni et al67. proposed Ens1) as the comparison models. Experiments were conducted on five datasets: Kvasir, CVC-ClinicDB, CVC-ColonDB, ETIS-LaribPolypDB, and PolypGen. Each model was trained for 100 epochs, and the optimal values of the evaluation metrics were documented. The metrics of each model that differ most from DET-Former are taken for t-test statistical analysis and the p-value is generated. The results of tests using different datasets or data sources were more closely aligned with clinical scenarios and were selected to generate visual segmentation maps.

Table 2 The test results of the compare study in Kvasir and CVC-ClinicDB.
Table 3 The test results of the compare study in ETIS-LaribPolypDB and CVC-ColonDB.

An article on polyp segmentation pointed out that it is difficult for a single network to make accurate predictions in many situations20. As shown in Table 2, excluding DET-Former and Nanni’s Ens, no single network consistently emerged as optimal across different datasets, reinforcing the challenge of achieving robust performance in colon polyp segmentation when faced with dataset variability. The models’ learning abilities were evaluated by training and testing them on identical datasets. In experiments where Kvasir and CVC-ClinicDB were used for training and testing, the performance metrics of DET-Former exceeded those of the comparator models, highlighting its superior learning capabilities. However, the results of statistical analyses show that DET-Former cannot significantly outperform some networks in terms of learning ability. DET-Former and Nanni’s Ens outperformed individual networks regarding Dice, mDice and mIoU metrics. These results suggest that the strategy of multi-model ensemble is expected to solve the problem of unstable learning ability of a single network on different data.

Table 4 The test results of the compare study in polypGen.
Table 5 The test results of the compare study in polypGen.
Table 6 The results of comparison with other weight optimization algorithms in CVC-ClinicDB.
Table 7 The results of comparison with other weight optimization algorithms in Kvasir.
Figure 4
figure 4

Visual segmentation results. (From top to bottom, there are CVC-ColonDB, ETIS-LaribPolypDB, and PolypGen datasets).

As shown in Table 3, DET-Former’s evaluation metrics - Dice, mDice and mIoU - show superiority over other models on the ETIS-LaribPolypDB and CVC-ColonDB datasets, suggesting superior generalisation ability. Statistical analysis reveals that DET-Former significantly outperforms other models on the ETIS-LaribPolypDB dataset, except FCB-Former and Nanni’s Ensemble. However, this significant advantage is not evident when analysing the CVC-ColonDB dataset. Among them, CVC-ClinicDB and CVC-ColonDB were used to form CVC-EndoSceneStill68 in the MICCAI2015 challenge69. The significant difference between DET-Former and the comparison model is larger in ETIS-LaribPolypDB and CVC-ColonDB. To better analyse whether the generalisation ability of DET-Former can be significant due to the comparison model, we performed generalisation experiments on the multi-centre PolypGen.

Tables 4 and 5 shows the performance of DET-Former on the PolypGen multi-centre dataset. DET-Former shows a significant improvement over competing models in centres C1 and C4. However, its performance advantage is less pronounced in centres C2, C3, C5 and C6, where it outperforms only some models. Although DET-Former exhibits superior generalisation capabilities on the multi-centre PolypGen dataset, it does not achieve significant dominance in all centres. To further investigate the limitations of DET-Former’s generalisation ability in certain centres, we analyse this in conjunction with the visual segmentation results in Fig 4. Figure 4c shows a decrease in the segmentation accuracy of DET-Former, especially in cases where different networks produce different false-negative segmentations. This problem arises because the DET-Former ensemble has multiple sub-models, and the false negatives from these sub-models are variable, especially in complex scenarios such as the independent distribution of multiple polyps. The complexity of these divergent false-negative segmentations poses a significant challenge for ensemble learning in colon polyp segmentation. Conversely, false-positive segmentations are less frequent and tend to occur sporadically across models, as shown in Fig. 4a, b. DET-Former is an ensemble learning structure that can optimise the false-positive results of individual false-positive segmentation models through multiple sub-models. Thus, ensemble learning has advantages in false-positive segmentation. In the future, more effective balancing of false-negative and false-positive results in ensemble learning needs to be achieved to address the problem of false negatives in colon polyp segmentation.

Model ensemble experiment

In the model ensemble experiments, the optimal sub-model of the previous compare experiments of DET-Former on Kvasir and CVC-ClinicDB datasets is chosen as the sub-model of the model ensemble experiments. Then to measure the effectiveness of the improvement, we choose the RBH-PSO algorithm and CSPSO algorithm, which are the closest to SDBH-PSO, as well as the classical PSO algorithm and weight averaging as the BASELINE algorithm to adjust the weights of the ensemble model. The weight optimization results are shown in Table 6.

As shown in Table 6, under the CVC-ClinicDB data, the results of the outputs of various strategies are basically the same, and it can be seen that the method of improving the ensemble effect through weight optimization is not applicable to all cases. Even so, by comparing Table 2, it can be seen that compared with single network segmentation, the ensemble learning improves the segmentation accuracy greatly. As shown in Table 7, It is worth noting that the superiority of PSO over RBH-PSO and CSPSO under the Kvasir dataset also indicates that not all of the proposed improved algorithms based on the PSO algorithm are well suited for the ensemble task of the colon polyp segmentation network. While our ensemble model does not demonstrate statistically significant superiority in performance compared to the comparison method, the ensemble model of our method eliminates the sub-models and improves the performance instead of degrading it. Despite utilizing only five-sixths of the model usage compared to the comparative methods, our method maintains performance levels. This suggests that the proposed SDBH-PSO Ensemble based on the polyp segmentation task is better than PSO weight optimization with average weights, which verifies that our improvement is suitable for this task.

To further explore the effectiveness of the sub-modeling strategy of SDBH-PSO Ensemble eliminated sub-models, we engage the eliminated sub-models in real-time ensemble, and their ensemble weights computed by SDBH-PSO for every ten epochs are shown in Fig. 5.

Figure 5
figure 5

The green line indicates the ensemble weights of the sub-models retained during training under the SDBH-PSO calculation. The yellow line indicates that the training was stopped after setting the sub-model weights to zero, where the mean intersection and union set (mIoU) of the real-time ensemble models are shown around the corresponding points.

As shown in Tables 6 and 7, the 5th sub-model is eliminated on the Kvasir dataset and the 4th sub-model is eliminated on the CVC-ClinicDB dataset. As shown in Fig. 5, the sub-model eliminated on the Kvasir dataset is eliminated by the SDBH-PSO Ensemble strategy at epoch 60. The sub-model eliminated on the CVC-ClinicDB dataset is eliminated at epoch 70. On the Kvasir dataset, SDBH-PSO Ensemble improved performance by filtering out models that were not suitable for ensemble. On the CVC-ClinicDB dataset, the elimination of the sub-models filtered out by the SDBH-PSO Ensemble does not improve the performance, but also does not degrade the overall performance.

The experiments verify the ability of SDBH-PSO Ensemble to select sub-models and the effectiveness of sub-model selection by eliminating sub-model strategy. A review article pointed out that model selection is a major challenge for ensemble learning30. It is believed that the ensemble method is not only applicable to colon polyp segmentation, but also can realize adaptive sub-model selection for different datasets by pairing with suitable sub-models in other tasks.

Feature fusion experiments

In the feature fusion experiments, since the ability of feature extraction and fusion can be better demonstrated on datasets that have never been involved in training, we train and validate on Kvasir and CVC-ClinicDB datasets and test on CVC-ColonDB and ETIS-LaribPolypDB datasets, and the results of the tests are shown in Fig. 6 and Table 8.

Table 8 The performance of different branches in Treble-Former.

As shown in Table 8, the outputs of STB, PTB and MTB after fusion are better for STB+PTB and STB+PTB+MTB than STB, PTB and MTB before fusion, and the segmentation performance is further improved after integration. As shown in Fig. 6, the visualization of the features by heat map shows that the features extracted from different branches are quite different, and along with the feature fusion, the output results appear to be improved accordingly. Our ensemble strategy successfully compensates for the different branch segmentation defects. In this case, Treble-Former is used to fuse multiple branches and ensemble multiple output heads through MHC Ensemble, so the ensemble output results are the output results of Treble-Former.

Figure 6
figure 6

Feature heat maps of the five branches with the segmentation results of the corresponding branches. The results after multi-output ensemble by MHC Ensemble are shown as red clippings pointing to.

Through Fig. 6 and Table 8, it can be seen that no branch in STB, PTB and MTB can perform the segmentation ability stably, but through feature fusion, the boundary features of the polyps become more obvious, which further compensates for the deficiency of some branches after weighted ensemble. Statistical analyses reveal that the MHC Ensemble significantly surpasses the branch STB only after the fusion of multi-branch advanced features on the CVC-ColonDB dataset. However, it significantly outperforms all three branches prior to fusion on the ETIS-LaribPolypDB dataset. These findings underscore the MHC Ensemble’s capacity to leverage advanced semantic features effectively. High-level semantic features are more appropriate for the model to perform better12. Our research confirms that this improvement extends to colon polyp segmentation. As artificial intelligence progresses, we anticipate the introduction of more sophisticated backbones and networks that will surpass the performance of current models such as PvTv2, Mix Transformer, Double UNet, and FCB-Former. The MHC Ensemble, which performs layer-by-layer fusion of advanced semantic features on them, will continue to exist as a reference value.

Conclusion

In this study, we propose a novel Dual Ensemble System with Treble Transformer (DET-Former). The system first constructs a multi-branch ensemble network Treble-Former with three different Transformers. Then, to improve the stability under different datasets, we propose DET-Former with SDBH-PSO Ensemble structure. Among them, the Treble-Former’s approach, which employs a multi-branch, layer-by-layer fusion of high-level semantic features, represents a promising direction for developing more accurate segmentation models in the future. Meanwhile, DET-Former maintains stable, high-performance segmentation relative to other networks, suggesting that ensemble learning is expected to solve the problem of unstable performance of a single network on different colon polyp datasets. In addition, experimental evidence shows that the SDBH-PSO ensemble can adaptively select sub-models during training, providing valuable insights into model selection for ensemble learning.