Introduction

With the popularity of the Microsoft Kinect, Intel RealSense and some modern smartphones (e.g. the iPhone X and Samsung Galaxy S20), depth images can be obtained easily and conveniently. As a result, Salient Object Detection (SOD) using RGB images together with depth images (RGB-D images) has become a hot research topic. Benefiting from their stable geometry and additional contrast cues, depth images provide important complementary information for SOD. In particular, the emergence of Fully Convolutional Neural Networks (FCNs) makes it possible to capture multi-level and multi-scale features, thereby boosting the performance of RGB-D SOD1,2,3,4,5,6,7,8,9.

Most FCNs predict the salient object by assembling multi-level features. However, there is an object-part dilemma under the mechanism of FCNs, which is demonstrated in Fig. 1 with four representative examples. As shown in Fig. 1d, some parts of the salient object predicted by FCNs are immersed in or disturbed by the background, so they are easily mislabeled as non-salient regions, resulting in incomplete segmentation. In other words, the relationships between an object and its parts are not taken into consideration by existing FCNs. Ideally, a salient object is a complete entity composed of several associated parts. If a large proportion of the object is predicted as salient, the complete object should be determined as a salient object. As shown in Fig. 1e, the salient objects are segmented as a whole with a high probability when the object-part relationship is taken into consideration by the Capsule Network (CapsNet).

Figure 1

Examples of the object-part relationship for SOD. In the first row, the tail wing of an aircraft is not recognized as salient by the FCNs, while our proposed method predicts the aircraft as a whole. In the second row, there is an incoherence in the right arm of the cartoon figure predicted by the FCNs, while the proposed method regards the right arm and the cartoon figure as an integral whole. In the third row, the salient object predicted by the FCNs misses the stem of the plant, while both the flower and the stem are predicted by our CCNet. In the last row, the flame of the satellite is misidentified as salient by the FCNs; however, our proposed method suppresses this interference by identifying the satellite as a complete object. Note: Reproduced with permission of reference 25, Copyright ©2017 IEEE; reference 26, Copyright ©2016 IEEE; reference 27, Copyright ©2018 IEEE; reference 28, Copyright ©2015 IEEE.

Recently, the CapsNet10,11,12 has been proposed to preserve vector quantities, rather than scalar quantities, by replacing the max-pooling operation with convolutional strides and dynamic routing. The vector quantity, the basic element of a capsule, is capable of preserving object-part relationships. A capsule encapsulates a group of neurons whose outputs are vectors representing different properties of an entity, including position, size, shape and orientation, which preserve enough information to explore the object-part relationship. Furthermore, the associated parts of an object are represented by child capsules, which are then clustered by the dynamic routing algorithm to generate parent capsules. Unfortunately, despite its great performance, CapsNet is known for its high computational demand, both in terms of memory and run-time, even for a very simple image classification task. In particular, child capsules store all intermediate representations, and parent capsules are clustered by the dynamic routing algorithm, which determines coefficients between every child capsule and every parent capsule. A large amount of GPU memory is required whenever the dynamic routing algorithm runs. Therefore, it is impractical to apply the original CapsNet directly to SOD.

Inspired by these observations, we introduce the object-part relationship into RGB-D SOD in this paper, which provides a solution to incomplete salient object segmentation. A Convolutional Capsule Network based on Feature Extraction and Integration (CCNet) for RGB-D SOD is proposed to explore the object-part relationship with a low computational demand. Our system consists of two key parts. The first extracts and integrates features based on VGG, the Global Context Module (GCM)13, the attention mechanism14, 15 and the Feature Depth Module (FDM). The second is the Feature-integrated Convolutional Capsule Network (FiCaps), which is composed of a convolutional part and a deconvolutional part, similar to SegCaps16. Specifically, in our proposed FiCaps, child capsules are only routed to parent capsules within a defined local kernel. Besides, the transformation matrices are shared by each member of the grid within a capsule.

Our contributions are summarized as follows:

1. We introduce the object-part relationship into RGB-D SOD by using the CCNet. To the best of our knowledge, this is one of the earliest attempts to apply a CapsNet to explore object-part relationships for RGB-D SOD.

2. A novel FiCaps is proposed to integrate external multi-level features with internal capsules. As demonstrated in Fig. 1e, our proposed method can recognize and segment the salient object as a whole with a high probability, compared with methods based on FCNs.

3. We compare our approach with 23 state-of-the-art RGB-D SOD methods. The experimental results demonstrate that our CCNet outperforms the other state-of-the-art algorithms.

Related works

The utilization of RGB-D images for SOD based on FCNs has been extensively explored for years. In line with the goal of this paper, we review RGB-D SOD methods as well as the CapsNet and illustrate the differences between our proposed method and related works.

RGB-D salient object detection

The pioneering work was produced by Niu et al.17 based on traditional methods. After that, various hand-crafted features originally applied to RGB SOD were extended to RGB-D SOD, such as18, 19. In this paper, we pay particular attention to RGB-D SOD based on deep learning. For example, Xu et al.20 propose a lightweight SOD network for real-time localization, which is composed of a lightweight feature extraction network based on multi-scale attention, skip connections and a residual refinement module. Chen and Fu20 propose an alternate refinement strategy and combine it with a guided residual block to predict refined features and saliency maps simultaneously. Lei22 proposes SU2GE-Net, in which the CNN-based backbone is replaced by the transformer-based Swin-TransformerV2, and an edge-based loss and a training-only augmentation loss are introduced to enhance spatial stability. Zhao et al.23 build a real single-stream network by combining the RGB and depth images at the starting point, taking advantage of the potential contrast information provided by depth images. There are both similarities and differences between the above algorithms and ours. Regarding the similarities, the design idea of our proposed algorithm is the same as that of SU2GE-Net: both try to replace the CNN-based backbone with a novel backbone, such as Swin-TransformerV2 or CapsNet. Furthermore, the structure of the salient object detectors is the same, including the encoders and decoders. Regarding the differences, the CCNet predicts the salient object mainly based on the CapsNet, whose basic elements are vector quantities, whereas the basic elements of the other salient object detectors are scalar quantities.

Capsule network

Recently, a novel deep learning network, named CapsNet, was developed by Hinton et al.10. A capsule is essentially a group of neurons that represents specific properties of an entity, such as position, size, orientation, deformation and texture. The CapsNet differs from FCNs in two aspects. On the one hand, the neurons of FCNs output scalars while those of CapsNet output vectors. On the other hand, FCNs extract and integrate multi-level features through an encoder and a decoder, while the CapsNet matches associated active child capsules to parent capsules with a dynamic routing algorithm. Sabour et al.11 then proposed the vector CapsNet, in which an iterative dynamic routing algorithm assigns child capsules to the corresponding parent capsules via transformation weights. The spatial relationship between a part and an object is encoded and learned by the dynamic routing algorithm and the transformation weights. One year later, Hinton et al.12 consolidated the vector CapsNet by proposing a matrix CapsNet, whose capsule is composed of a pose matrix and an activation probability. The coefficients between child capsules and parent capsules are calculated by an iterative Expectation-Maximization (EM) algorithm, which finds the tightest clusters of capsules. Compared with the vector CapsNet, the transformation matrix of the matrix CapsNet has far fewer parameters. Furthermore, the matrix CapsNet uses the iterative EM algorithm to measure the similarities between capsules, while the vector CapsNet uses the cosine similarity. In view of these advances, some attempts have been made to apply the CapsNet to computer vision tasks, including object segmentation and SOD. To reduce the high computational demand, LaLonde and Bagci designed SegCaps16 based on vector capsules for object segmentation. It extends the idea of convolutional capsules with locally-connected routing and the concept of deconvolutional capsules. Liu et al.24 propose the Two-Stream Part-Object Relational Network (TSPORTNet), which implements the matrix CapsNet for SOD and uses the activation map as the final saliency map. Both methods try to reduce the computational demand, which makes it possible to use them in large-scale image tasks. In this paper, the structure of our proposed method is similar to that of SegCaps. Different from SegCaps, whose encoder and decoder form a closed environment, the encoder of our proposed method extracts capsules and concatenates them with the corresponding multi-level features. Furthermore, TSPORTNet explores the object-part relationship based on matrix capsules and uses the activation map as the saliency map, so the predicted saliency map is coarse and needs to be refined. On the contrary, our proposed FiCaps takes the features extracted by FCNs as input, and a refined saliency map is predicted directly by FiCaps, without post-processing.

Methodology

This section begins with the overall architecture of CCNet, which is depicted in Fig. 2. We then introduce the principles and details of each module.

Figure 2

The framework of CCNet. The features extracted by the VGG backbone are denoted as \(\left( {f_{0} ,f_{1} ,f_{2} ,f_{3} ,f_{4} } \right)\). The features \(\left( {g_{0} ,g_{1} ,g_{2} ,g_{3} ,g_{4} } \right)\) refer to the outputs of the GCM. The depth image is downsampled directly, which is labeled as \(\left( {d_{0} ,d_{1} ,d_{2} ,d_{3} ,d_{4} } \right)\). Then, the features from the GCM and the depth images are integrated by the FDM, whose outputs are \(\left( {fd_{0} ,fd_{1} ,fd_{2} ,fd_{3} ,fd_{4} } \right)\). In the next step, the outputs of the FDM are aggregated progressively by the Feature Fusion Module (FFM), denoted as \(\left( {ff_{01} ,ff_{12} ,ff_{23} ,ff_{34} } \right)\). In FiCaps, conv means a traditional convolution with a 1 × 1 kernel. convCaps means a convolutional capsule layer, whose stride and padding are equal to 1 or 2. deconvCaps refers to a deconvolutional capsule layer, implemented by a transposed convolution with stride 2 and padding 2. The concatenation indicates a series of operations, including concatenation, reshape and convolution, for integrating internal capsules and external features.

Overall architecture

Figure 2 shows the overall architecture of CCNet. Our system begins with a VGG backbone that extracts multi-level features. These features are then fed into the GCM for further exploitation. The depth image is downsampled by max pooling with factors of 2, 4, 8 and 16, respectively. In the next step, the depth images are directly integrated with the features from the GCM by the FDM, based on the attention mechanism14, 15. After that, the outputs of the FDM are integrated progressively by the FFM, whose outputs are further fed into FiCaps to fuse external multi-level features with capsules. The structure of FiCaps is similar to U-Net. In the encoder, the capsules are processed by convolutional capsule layers, which map child capsules to parent capsules via the dynamic routing algorithm within defined local connections. Besides, the concatenation module in the encoder integrates external features with internal capsules. In the decoder, the capsules are processed by deconvolutional capsule layers, which are mainly composed of transposed convolutions with stride 2. Finally, the output of FiCaps is the predicted saliency map.
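To make the data flow concrete, the following PyTorch-style sketch traces the pipeline of Fig. 2. The module containers (`vgg`, `gcm`, `fdm`, `ffm`, `ficaps`) and their call signatures are placeholders assumed for illustration, not the released implementation.

```python
import torch.nn.functional as F

def ccnet_forward(rgb, depth, vgg, gcm, fdm, ffm, ficaps):
    """Hypothetical forward pass following Fig. 2; interfaces are assumed."""
    # Multi-level features (f0, ..., f4) from the VGG backbone.
    f = vgg(rgb)
    # Global Context Module refines every level: (g0, ..., g4).
    g = [gcm[i](f[i]) for i in range(5)]
    # The depth map is max-pooled directly by factors 1, 2, 4, 8, 16: (d0, ..., d4).
    d = [depth if i == 0 else F.max_pool2d(depth, kernel_size=2 ** i)
         for i in range(5)]
    # The Feature Depth Module reweights GCM features with depth: (fd0, ..., fd4).
    fd = [fdm[i](g[i], d[i]) for i in range(5)]
    # The Feature Fusion Module merges adjacent levels, high to low: ff34, ..., ff01.
    ff = [ffm[i](fd[i], fd[i - 1]) for i in range(4, 0, -1)]
    # FiCaps fuses the external features with internal capsules and
    # directly outputs the predicted saliency map.
    return ficaps(ff)
```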

Feature depth module

The FDM reweights the features from the GCM based on the depth images. The structure of the GCM is introduced in13 in detail. As shown in Fig. 3, we multiply the features with the depth image, which is downsampled to the corresponding size. Then, the product is processed by two convolutions with batch normalization and ReLU. Subsequently, these features are concatenated with the downsampled depth images and processed by two convolutions. Next, we apply the attention mechanism, including channel attention and spatial attention, to generate a reweighted map, multiply it with the input features gi, and follow with convolutions. The procedure is formulated as follows:

$$gd_{i} = conv_{br} \left(conv_{br} \left( {g_{i} *d_{i} } \right)\right)$$
(1)
Figure 3

The structure of FDM. convbr refers to a 3 × 3 convolution with batch normalization and ReLU, while conv means a plain 3 × 3 convolution. The symbols × and [.] indicate element-wise multiplication and concatenation, respectively. pool refers to the pooling operation, whose downsampling factor is \(2^{i}\).

$$cd_{i} = conv\left( {conv\left( {cat\left( {gd_{i} ,d_{i} } \right)} \right)} \right)$$
(2)
$$fd_{i} = conv\left( {conv_{br} \left( {g_{i} *CA\left( {cd_{i} } \right)*SA\left( {cd_{i} } \right)} \right)} \right)$$
(3)

where \(g_{i}\) and \(d_{i}\) refer to the ith feature from the GCM and the depth image downsampled by a factor of \(2^{i}\), respectively. \(conv_{br}\) and \(conv\) indicate a 3 × 3 convolution with and without batch normalization and ReLU, respectively. CA and SA denote channel attention and spatial attention. The parameter i ranges from 1 to 4. The symbol \(*\) means element-wise multiplication.
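A minimal PyTorch sketch of Eqs. (1)–(3) is given below. The channel widths, the single-channel depth input and the interfaces of the channel-attention (`ca`) and spatial-attention (`sa`) sub-modules are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def conv_br(cin, cout):
    """3 x 3 convolution + batch normalization + ReLU (the conv_br of Eqs. (1)-(3))."""
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class FDM(nn.Module):
    """Sketch of the Feature Depth Module; channel counts are assumed."""
    def __init__(self, channels, ca, sa):
        super().__init__()
        self.ca, self.sa = ca, sa                     # channel / spatial attention
        self.eq1 = nn.Sequential(conv_br(channels, channels),
                                 conv_br(channels, channels))
        self.eq2 = nn.Sequential(nn.Conv2d(channels + 1, channels, 3, padding=1),
                                 nn.Conv2d(channels, channels, 3, padding=1))
        self.eq3 = nn.Sequential(conv_br(channels, channels),
                                 nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, g_i, d_i):                      # d_i: single-channel depth map
        gd = self.eq1(g_i * d_i)                          # Eq. (1)
        cd = self.eq2(torch.cat([gd, d_i], dim=1))        # Eq. (2)
        return self.eq3(g_i * self.ca(cd) * self.sa(cd))  # Eq. (3)
```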

Feature fusion module

The FFM integrates two adjacent features from the high level to the low level, generating the fused feature map. As shown in Eq. (4), the two input features first undergo convolution layers, respectively. Then, the relatively high-level feature is upsampled and concatenated with the low-level feature, and the result is further processed by two convolution layers.

$$ff_{i,i - 1} = conv \left(conv \left(cat \left(up \left(conv_{a} \left( {fd_{i} } \right)\right),conv_{b} \left( {fd_{i - 1} } \right)\right)\right)\right)$$
(4)

where \(fd_{i}\) and \(fd_{i - 1}\) refer to the relatively high-level and low-level features, respectively. \(conv_{a}\), \(conv_{b}\) and \(conv\) all indicate convolutions with batch normalization and ReLU. \(up\) indicates a 2-times upsampling operation and \(cat\) means the concatenation operation. The index i ranges from 1 to 4.
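A corresponding sketch of Eq. (4) is given below, reusing the `conv_br` helper from the FDM sketch above; the channel widths and the bilinear upsampling mode are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFM(nn.Module):
    """Sketch of the Feature Fusion Module in Eq. (4)."""
    def __init__(self, c_high, c_low, c_out):
        super().__init__()
        self.conv_a = conv_br(c_high, c_out)   # applied to the high-level feature fd_i
        self.conv_b = conv_br(c_low, c_out)    # applied to the low-level feature fd_{i-1}
        self.fuse = nn.Sequential(conv_br(2 * c_out, c_out), conv_br(c_out, c_out))

    def forward(self, fd_i, fd_im1):
        high = F.interpolate(self.conv_a(fd_i), scale_factor=2,
                             mode='bilinear', align_corners=False)       # up(conv_a(.))
        return self.fuse(torch.cat([high, self.conv_b(fd_im1)], dim=1))  # Eq. (4)
```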

Feature-integrated convolutional capsule network

Figure 4 shows the details of FiCaps. We first introduce the structure of FiCaps and then elaborate on its details, including the convolutional capsule layer, the deconvolutional capsule layer and the concatenation layer. FiCaps shares the same overall architecture as U-Net. The encoder contains two basic modules, the convolutional capsule layer and the concatenation layer, while the decoder is composed of the convolutional capsule layer and the deconvolutional capsule layer. First of all, the feature map from the FFM is transformed into a capsule. Then, the capsule (1 × 16 × 256 × 256) is downsampled by a convolutional capsule layer with stride 2 and further fed into a convolutional capsule layer with stride 1, which maps child capsules to parent capsules by the dynamic routing algorithm. Subsequently, the concatenation layer first transforms the capsule (4 × 16 × 128 × 128) back into a feature map (64 × 128 × 128) via a reshape operation. The transformed feature map is then concatenated with the corresponding external features (32 × 128 × 128), processed by a convolution and reshaped back into a capsule (4 × 16 × 128 × 128). This procedure is executed three times until the capsule (8 × 32 × 32 × 32) is obtained. In the decoder, the capsule is first upsampled by a deconvolutional capsule layer with stride 2. Then, the upsampled capsule (8 × 32 × 64 × 64) and the corresponding capsule in the encoder are concatenated by the bridge connection to generate a capsule (8 × 32 × 64 × 64), which is then processed by the convolutional capsule layer. Likewise, this procedure is repeated three times to predict the final saliency map.

Figure 4

The structure of FiCaps. conv0 represents a traditional 1 × 1 convolution. convCaps(2) indicates a convolutional capsule layer with stride 2, while convCaps(1) means a convolutional capsule layer with stride 1. deconvCaps represents a deconvolutional capsule layer based on the transposed convolution. The red dashed arrow refers to the bridge connection between a capsule in the encoder and the corresponding capsule in the decoder. The black arrows show the data flow, whose data sizes are described by the text near them. The concatenate layer means the concatenation operation for integrating the internal capsules with the external features. f01, f12, f23 and f34 denote the corresponding external features.

Convolutional and deconvolutional capsule layer

Both the convolutional and the deconvolutional capsule layer contain two parts: the transformation module of the capsule and the dynamic routing algorithm. A capsule layer has seven parameters and can be described as \(capsule\;layer\left( {in, inv, op, s, on, onv, rt} \right)\). in and inv denote the number of input capsules and the vector dimension of each input capsule, while on and onv denote the number of output capsules and the vector dimension of each output capsule, respectively. There are two options for op: 'conv' and 'deconv'. When op is 'conv', the layer is a convolutional capsule layer; when op is 'deconv', it is a deconvolutional capsule layer. s refers to the stride of the convolution, which cooperates with op to accomplish different operations. If op is 'conv' and s is 2, the layer performs a convolution with 2-times downsampling; if op is 'deconv' and s is 2, it performs a transposed convolution with 2-times upsampling. rt is the number of iterations of the dynamic routing algorithm, which is set to 3 in this paper.
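The sketch below illustrates one way such a seven-parameter layer could be declared in PyTorch; the kernel sizes and paddings of the shared transformation are placeholders, and the routing itself is deferred to a later sketch.

```python
import torch.nn as nn

class CapsuleLayer(nn.Module):
    """Sketch of capsule layer(in, inv, op, s, on, onv, rt); details are assumed."""
    def __init__(self, in_caps, in_dim, op, stride, out_caps, out_dim, routing_iters=3):
        super().__init__()
        assert op in ('conv', 'deconv')
        self.in_caps, self.in_dim = in_caps, in_dim          # in, inv
        self.out_caps, self.out_dim = out_caps, out_dim      # on, onv
        self.routing_iters = routing_iters                   # rt (3 in this paper)
        if op == 'conv':      # stride 2 -> 2-times downsampling
            self.transform = nn.Conv2d(in_caps * in_dim, out_caps * out_dim,
                                       kernel_size=3, stride=stride, padding=1)
        else:                 # 'deconv', stride 2 -> 2-times upsampling
            self.transform = nn.ConvTranspose2d(in_caps * in_dim, out_caps * out_dim,
                                                kernel_size=4, stride=stride, padding=1)
```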

The convolutional capsule layer decides how to assign active child capsules to parent capsules. This is similar to a clustering process: each parent capsule corresponds to a cluster center and each child capsule corresponds to a data point, which can be solved by an EM algorithm. The mapping is measured by a transformation matrix, called voting in EM routing, and defined as:

$$V_{ij}^{\left( l \right)} = c_{i}^{\left( l \right)} T_{ij}^{\left( l \right)}$$
(5)

where \(c_{i}^{\left( l \right)}\) and \(c_{j}^{\left( {l + 1} \right)}\) refer to the child capsule and the parent capsule, respectively. \(V_{ij}^{\left( l \right)}\) is the vote from capsule i at layer l for capsule j at layer l + 1, and \(T_{ij}^{\left( l \right)}\) is a transformation matrix. Next, a Gaussian mixture model is introduced. Suppose each Gaussian distribution \(N\left( {x; \mu_{j} , \Sigma_{j} } \right)\) has a diagonal covariance matrix \(diag\left( {\sigma_{j}^{2} } \right)\). The posterior probability of a vote \(V_{ij}^{\left( l \right)}\) belonging to the jth Gaussian is defined as:

$$R_{ij} = \frac{{a_{j} N\left( {V_{ij} ;\mu_{j} ,diag\left( {\sigma_{j}^{2} } \right)} \right)}}{{\mathop \sum \nolimits_{j} a_{j} N\left( {V_{ij} ;\mu_{j} ,diag\left( {\sigma_{j}^{2} } \right)} \right)}}$$
(6)

where the activation \(a_{j}\) of parent capsule j is a mixing coefficient of the Gaussian mixture model and \(V_{ij}\) is treated as a k·d'-dimensional vector. When the child capsules vote for parent capsule j, the contribution coefficient \(r_{ij}\) of capsule i to cluster center (capsule) j should also take the child activation \(a_{i}\) into account, as follows:

$$r_{ij} = \frac{{a_{i} R_{ij} }}{{\mathop \sum \nolimits_{i} a_{i} R_{ij} }}$$
(7)
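The two equations above can be written compactly as the soft-assignment step below, where `a_parent` plays the role of the mixing coefficients a_j and `a_child` of the child activations a_i; the tensor shapes are illustrative assumptions.

```python
import math
import torch

def routing_assignments(votes, mu, sigma_sq, a_parent, a_child):
    """Sketch of Eqs. (6)-(7). Assumed shapes:
    votes [Nc, Np, D], mu [Np, D], sigma_sq [Np, D], a_parent [Np], a_child [Nc]."""
    # log N(V_ij; mu_j, diag(sigma_j^2)) for every child/parent pair -> [Nc, Np]
    log_gauss = -0.5 * (((votes - mu) ** 2) / sigma_sq
                        + torch.log(sigma_sq) + math.log(2 * math.pi)).sum(-1)
    # Eq. (6): posterior R_ij, normalised over the parent capsules j
    R = torch.softmax(torch.log(a_parent) + log_gauss, dim=1)
    # Eq. (7): contribution r_ij, reweighted by a_i and normalised over the children i
    w = a_child[:, None] * R
    return w / w.sum(dim=0, keepdim=True)
```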

Finally, the procedure of the convolutional and deconvolutional capsule layers is as follows. First, the capsules are transformed into feature maps by reshaping the capsules [n, in, inv, h, w] into feature maps [n, in * inv, h, w], followed by the (de)convolution layer. Then, the feature maps are transformed back into capsules by reshaping [n, on * onv, h, w] into [n, on, onv, h, w]. Finally, the capsules execute dynamic routing using the EM algorithm for rt iterations.
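Continuing the `CapsuleLayer` sketch from above, this procedure might be expressed as the following forward pass; `em_routing` stands in for the locally-connected EM routing built on Eqs. (5)–(7) and is not shown.

```python
    # (method of the CapsuleLayer sketch above)
    def forward(self, caps):
        # capsules [n, in, inv, h, w] -> feature map [n, in * inv, h, w]
        n = caps.shape[0]
        x = self.transform(caps.flatten(1, 2))          # shared (de)convolution
        # feature map [n, on * onv, h', w'] -> capsules [n, on, onv, h', w']
        caps_out = x.view(n, self.out_caps, self.out_dim, x.shape[-2], x.shape[-1])
        # dynamic routing with the EM algorithm, iterated rt (= 3) times
        return self.em_routing(caps_out, iters=self.routing_iters)
```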

Concatenation layer

The concatenation layer includes two reshape operations, a concatenation operation and a convolution. Suppose the size of the capsule from the convolutional capsule layer is [b, c, v, h, w] and the size of the external feature map is [b, n, h, w]. The procedure of the concatenation layer is as follows. In the first stage, the capsules are transformed into feature maps by reshaping the capsules [b, c, v, h, w] into feature maps [b, c * v, h, w], so that the capsules and the feature maps have the same layout. Then, the transformed feature maps are concatenated with the external features and passed through a convolutional layer. Lastly, the concatenated result is transformed back into capsules by reshaping the feature maps [b, c * v, h, w] into capsules [b, c, v, h, w].
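A minimal sketch of this layer is shown below, using the tensor sizes quoted earlier (a 4 × 16 capsule grid at 128 × 128 and a 32-channel external feature map); the 1 × 1 fusing convolution is an assumption.

```python
import torch
import torch.nn as nn

def concat_layer(caps, ext_feat, fuse_conv):
    """Sketch of the concatenation layer: capsules -> feature map -> concat -> capsules."""
    b, c, v, h, w = caps.shape
    feat = caps.reshape(b, c * v, h, w)                 # [b, c, v, h, w] -> [b, c*v, h, w]
    fused = fuse_conv(torch.cat([feat, ext_feat], 1))   # concatenate external features
    return fused.reshape(b, c, v, h, w)                 # back to [b, c, v, h, w]

caps = torch.randn(1, 4, 16, 128, 128)                  # capsule (4 x 16 x 128 x 128)
ext = torch.randn(1, 32, 128, 128)                      # external feature (32 x 128 x 128)
fuse = nn.Conv2d(4 * 16 + 32, 4 * 16, kernel_size=1)    # assumed 1 x 1 fusing convolution
out = concat_layer(caps, ext, fuse)                     # -> [1, 4, 16, 128, 128]
```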

Loss function

The parameters of our proposed method are supervised by the cross-entropy loss and the margin loss, which are described as:

$$Loss = \alpha \cdot CE + \beta \cdot ML$$
(8)
$$CE = - \left( {gt \cdot log\left( {pred} \right) + \left( {1 - gt} \right) \cdot log\left( {1 - pred} \right)} \right)$$
(9)
$$ML = gt \cdot max\left( {0, m^{ + } - pred} \right) + 3 \cdot \left( {1 - gt} \right) \cdot max\left( {0,pred - m^{ - } } \right)$$
(10)

where CE and ML represent the cross-entropy loss and the margin loss, respectively. gt and pred indicate the ground truth and the predicted saliency map. m+ and m- are constants, set to 0.9 and 0.1 in this paper, respectively. \(\alpha\) and \(\beta\) are both set to 1.
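A direct transcription of Eqs. (8)–(10) is sketched below; the small epsilon and the averaging over pixels are our assumptions for numerical stability and reduction.

```python
import torch

def ccnet_loss(pred, gt, alpha=1.0, beta=1.0, m_pos=0.9, m_neg=0.1):
    """Sketch of Eqs. (8)-(10); pred and gt are saliency maps in [0, 1]."""
    eps = 1e-7                                                     # assumed, for stability
    ce = -(gt * torch.log(pred + eps)
           + (1 - gt) * torch.log(1 - pred + eps))                 # Eq. (9)
    ml = (gt * torch.clamp(m_pos - pred, min=0)
          + 3 * (1 - gt) * torch.clamp(pred - m_neg, min=0))       # Eq. (10)
    return (alpha * ce + beta * ml).mean()                         # Eq. (8), averaged
```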

Experiments and analysis

In this section, numerous experiments are conducted to verify the effectiveness and superiority of CCNet and its modules, evaluated by four evaluation metrics.

Benchmark datasets and evaluation metrics

We evaluate the performance of our model on four public RGB-D benchmark datasets: NJU2K25 (1985 samples), NLPR26 (1000 samples), STERE27 (1000 samples) and SIP28 (929 samples). Following the common split, 700 samples from NLPR and 1500 samples from NJU2K are chosen to train our algorithm. The remaining samples are used for testing.

Four widely-used metrics are adopted to evaluate the performance: Mean Absolute Error (MAE), F-measure (\({\text{F}}_{{{\upbeta } - {\text{max}}}}\))29, S-measure (\({\text{S}}_{{\upalpha }}\))30 and E-measure (\({\text{E}}_{{\upxi }}\))31.
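For reference, MAE and the per-threshold F-measure can be computed as below (β² = 0.3 is the usual SOD setting; Fβ-max takes the maximum over thresholds). This is a generic sketch, not the exact evaluation code used in the paper.

```python
import torch

def mae(pred, gt):
    """Mean Absolute Error between a saliency map and the ground truth, both in [0, 1]."""
    return torch.mean(torch.abs(pred - gt))

def f_measure(pred, gt, threshold, beta_sq=0.3):
    """F-measure at one binarisation threshold; F_beta-max is the maximum over thresholds."""
    binary = (pred >= threshold).float()
    tp = (binary * gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall + 1e-8)
```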

Implementation details

Our proposed CCNet is implemented in PyTorch and trained for 300 epochs on a single NVIDIA Tesla T4 GPU. The Adam optimizer is used with its default values, an initial learning rate of 1e-4 and a batch size of 10. The poly learning rate policy is adopted, with the power set to 0.9. For data augmentation, every input batch in the training session is resized to 256 × 256 with random flipping, rotation, color enhancement and random pepper noise. During training, the RGB images, depth images and ground truth are combined into data batches. During inference, the RGB-D images are fed into the trained model to predict the saliency map, without any post-processing.
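The stated hyper-parameters translate into a training loop roughly like the one below; `model`, `train_loader` and `ccnet_loss` (the loss sketch from the previous section) are placeholders, not the released training script.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # Adam with default betas
max_epoch = 300

def poly_lr(base_lr, epoch, max_epoch, power=0.9):
    """Poly learning-rate policy: base_lr * (1 - epoch / max_epoch) ** power."""
    return base_lr * (1 - epoch / max_epoch) ** power

for epoch in range(max_epoch):
    for g in optimizer.param_groups:
        g['lr'] = poly_lr(1e-4, epoch, max_epoch)
    for rgb, depth, gt in train_loader:          # batches of 10, resized to 256 x 256
        pred = model(rgb, depth)
        loss = ccnet_loss(pred, gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```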

Comparison with the state-of-the-art methods

In this section, we compare our proposed network with 23 state-of-the-art methods, including PCF5, MMCI32, CPFP33, DRMA8, D3Net9, UCNet4, SSF34, S2MA24, CoNet35, cmMS36, DANet23, A2dele37, PAGR20, DFM38, DSA2f39, HAINet40, SSL41, DisenFuse42, ICNet43, CMWNet44, BBSNet1, CDNet45 and DCF246. Quantitative and visual comparisons are both considered for a fair evaluation.

Quantitative comparisons

Table 1 shows quantitative comparisons with the 23 salient object detectors from three perspectives. First, the evaluation scores of all methods on the four benchmark datasets are presented column by column. Our model achieves top-3 performance on NLPR, STERE and SIP for all four evaluation metrics. More importantly, our proposed method attains the lowest MAE on NLPR and STERE, with approximately 8.7% and 2.7% improvement, respectively. Secondly, we count the number of top-3 scores of every method; the statistics are shown in the column named Top 3. It is remarkable that our proposed method occupies the largest number, with 11/16. Finally, we calculate the average value of each evaluation metric over the four datasets, listed in the row named "Average-Metric". Our model reaches top-3 performance on all datasets and ranks 1st in the average MAE.

Table 1 Quantitative comparisons.

Visual comparisons

Figure 5 shows visual comparisons. The examples reflect various scenarios, including complex scenes (1st and 2nd rows), multiple salient objects (3rd and 4th rows), small objects (5th and 6th rows) and low contrast between the salient object and the background (7th and 8th rows). All images are obtained either by downloading the experimental results released on GitHub or by training the publicly available source code and predicting the salient objects. For complex scenes, the compared approaches mostly predict a blurry salient object and recognize some non-salient parts around the salient object as salient. For multi-object detection, several methods miss some salient objects or predict them with noise. For small objects, the compared methods cannot predict a clear and complete salient object when its size in the image is very small. Lastly, for the low-contrast scenario, the existing salient object detectors mostly produce poor object smoothness and poor details, and some compared methods miss important parts of the salient object. To sum up, our proposed method consistently produces accurate and complete saliency maps with sharp edges in various cases.

Figure 5

Visual comparisons of different methods. The 1st and 2nd rows show complex scenes. Multiple salient objects are included in the 3rd and 4th rows. The 5th and 6th rows show scenes with small targets. Low contrast between the background and the object is displayed in the 7th and 8th rows. Note: Reproduced with permission of reference 25, Copyright ©2017 IEEE; reference 26, Copyright ©2016 IEEE; reference 27, Copyright ©2018 IEEE; reference 28, Copyright ©2015 IEEE.

Ablation study

In this section, we validate the effectiveness of the proposed structures. First, we evaluate the performance of our proposed FiCaps by comparing it with U-Net47. Furthermore, we test the strategy of integrating external features with internal capsules in FiCaps. Next, our proposed FDM is evaluated. Finally, the performance of the GCM is verified by replacing it with traditional convolutions. All experimental results are summarized in Table 2.

Table 2 Ablation study.

Effectiveness of FiCaps

We evaluate the performance of FiCaps from two aspects. On the one hand, we use U-Net as the baseline structure: FiCaps is replaced with U-Net while the other modules and parameters remain unchanged. The experimental results in Table 2 show that our FiCaps outperforms U-Net. On the other hand, we evaluate the effectiveness of integrating internal capsules with external features in FiCaps by training our method with and without the external-feature integration. As shown in Table 2, integrating external features is an effective way to improve the performance, with approximately 0.1–9.8% gains.

Effectiveness of the depth-image integration strategy

To evaluate the benefit of integrating depth images directly, in this section we instead integrate depth features extracted from the depth images by an additional independent backbone (VGG, MobileNet48 or ResNet1849), rather than the depth images themselves. This backbone extracts depth features from the depth images and predicts a saliency map based on them, and the extracted depth features are integrated with the features from the RGB images. The experimental results in Table 2 demonstrate that our direct integration is more effective, with about 15.2–50% improvement in MAE and approximately 0.8–2.2% improvement in the other evaluation metrics.

Effectiveness of GCM

To evaluate the contribution of the GCM, we replace it with traditional convolutions with batch normalization and ReLU. The experimental results in Table 2 demonstrate that the GCM is a more effective way to integrate features.

Conclusion

In this paper, we focus on solving the object-part relationship dilemma in SOD. We propose a novel CCNet based on the CapsNet with a low computational demand, which makes exploring the object-part relationship feasible and applicable. Our proposed method includes two main steps. In the first step, the RGB-D features are extracted and integrated. In the second step, the object-part relationship is fully explored by FiCaps, which then predicts the final saliency map. Extensive experiments on four datasets demonstrate that our proposed method outperforms 23 state-of-the-art methods.

More importantly, FiCaps is transferable to any RGB-D SOD method. It can be used as a complementary branch for any architecture in the area of SOD to explore the object-part relationship: a feature map is fed into FiCaps and an attention map that considers the object-part relationship is predicted, which can then be integrated with other features to predict the final map.

In the future, we may focus on two aspects to improve the performance of CCNet. On the one hand, FiCaps is a convolutional capsule network and, to some extent, not a pure capsule network; therefore, as discussed in the related work, the vector CapsNet or the matrix CapsNet may be introduced to explore the object-part relationship in the true sense. On the other hand, to reduce the computational demand of the CapsNet, mutual learning techniques such as knowledge distillation50, 51 may be introduced.