RGB-D salient object detection via convolutional capsule network based on feature extraction and integration

Fully convolutional neural networks (FCNs) have shown advantages in salient object detection (SOD) using RGB or RGB-D images. However, there is an object-part dilemma, since most FCNs inevitably produce an incomplete segmentation of the salient object. Although the capsule network (CapsNet) is capable of recognizing a complete object, it is computationally demanding and time consuming. In this paper, we propose a novel convolutional capsule network based on feature extraction and integration to handle the object-part relationship with less computational demand. First, RGB features are extracted and integrated by the VGG backbone and a feature extraction module. Then, these features are integrated with depth images by a feature depth module and upsampled progressively to produce a feature map. Next, the feature map is fed into the feature-integrated convolutional capsule network to explore the object-part relationship. The proposed capsule network extracts object-part information using convolutional capsules with locally-connected routing and predicts the final salient map with deconvolutional capsules. Experimental results on four RGB-D benchmark datasets show that our proposed method outperforms 23 state-of-the-art algorithms.


Related works
The use of RGB-D images for SOD based on FCNs has been explored extensively for years. In line with the goal of this paper, we review RGB-D SOD methods as well as CapsNet, and describe the differences between our proposed method and related works.

RGB-D salient object detection
The pioneering work was produced by Niu et al. 17 based on traditional methods. After that, various handcrafted features originally designed for RGB SOD were extended to RGB-D SOD, such as 18,19. In this paper, we focus on RGB-D SOD based on deep learning algorithms. For example, Xu et al. 20 propose a lightweight SOD method for real-time localization, which is composed of a lightweight feature extraction network based on multi-scale attention, skip connections and a residual refinement module. Chen and Fu 20 propose an alternate refinement strategy and combine it with a guided residual block to predict refined features and salient maps simultaneously. Lei 22 proposes SU2GE-Net. First, the CNN-based backbone is replaced by the transformer-based Swin-TransformerV2. In addition, an edge-based loss and a training-only augmentation loss are introduced to enhance spatial stability. Zhao et al. 23 build a real single-stream network by combining RGB-D images at the starting point, taking advantage of the potential contrast information provided by depth images.
Compared with the above algorithms, our method has both similarities and differences. As for the similarities, the design idea of our proposed algorithm is the same as that of SU2GE-Net: both replace the CNN-based backbone with a novel backbone, such as Swin-TransformerV2 or CapsNet. Furthermore, the structure of the salient object detectors is the same as well, consisting of encoders and decoders. As for the differences, CCNet predicts the salient object mainly based on CapsNet, whose basic elements are vectors, whereas the basic elements of the other salient detectors are scalars.

Capsule network
Recently, a novel deep learning network, named CapsNet, was developed by Hinton et al. 10. A capsule is essentially a group of neurons that represents the properties of a specific type of entity, such as position, size, orientation, deformation and texture. CapsNet differs from FCNs in two aspects. On the one hand, the neurons of FCNs are scalars while those of CapsNet are vectors. On the other hand, FCNs extract and integrate multi-level features with an encoder and a decoder, while CapsNet assigns associated active child capsules to parent capsules by a dynamic routing algorithm. Then, Sabour et al. 11 proposed the vector CapsNet, in which an iterative dynamic routing algorithm assigns child capsules to the corresponding parent capsules via transformation weights. The spatial relationship between a part and an object is encoded and learned by the dynamic routing algorithm and the transformation weights. One year later, Hinton et al. 12 consolidated the vector CapsNet by proposing a matrix CapsNet, whose capsule is composed of a pose matrix and an activation probability. The coefficients between the child capsules and the parent capsules are calculated by the iterative Expectation-Maximization (EM) algorithm, which finds the tightest clusters of capsules. Compared with the vector CapsNet, the transformation matrix of the matrix CapsNet has far fewer parameters. Furthermore, the matrix CapsNet uses the iterative EM algorithm to measure the similarities between capsules, while the vector CapsNet uses the cosine similarity.
In view of these advances, some attempts have been made to apply CapsNet to computer vision tasks, including object segmentation and SOD. To reduce the high computational demand, LaLonde and Bagci design SegCaps 16 based on vector capsules for object segmentation. It extends the idea of convolutional capsules with locally-connected routing and introduces the concept of deconvolutional capsules. Liu and colleagues 24 propose the Two-Stream Part-Object Relational Network (TSPORTNet), which implements the matrix CapsNet for SOD and uses the activation map as the final salient map. Both methods try to reduce the computational demand, which makes them applicable to large-scale image tasks.
In this paper, the structure of our proposed method is similar to that of SegCaps. Different from SegCaps, the encoder of our proposed method extracts capsules and concatenates them with the corresponding multi-level features, whereas the encoder and decoder of SegCaps form a closed environment. Furthermore, TSPORTNet explores the object-part relationship based on matrix capsules and uses the activation map as the salient map, so the predicted salient map is coarse and needs to be refined. In contrast, our proposed FiCaps uses features extracted by FCNs as the input and directly predicts a refined salient map without post-processing.

Methodology
This section first presents the overall architecture of CCNet, which is depicted in Fig. 2, and then introduces the principles and details of its modules.

Overall architecture
Figure 2 shows the overall architecture of CCNet. Our system begins with a VGG backbone that extracts multi-level features. These features are then fed into the GCM for further processing. The depth image is downsampled by max pooling by factors of 2, 4, 8 and 16 to match the corresponding feature sizes. In the next step, the depth images are integrated directly with the features from the GCM by the FDM, based on the attention mechanism 14,15. After that, the outputs of the FDM are integrated progressively by the FFM, whose outputs are in turn fed into FiCaps to further fuse the external multi-level features with capsules. The structure of FiCaps is similar to U-Net. In the encoder, the capsules are processed by convolutional capsule layers, which map child capsules to parent capsules by the dynamic routing algorithm within defined local connections. In addition, a concatenation module in the encoder integrates external features with internal capsules. In the decoder, the capsules are processed by deconvolutional capsule layers, which are mainly composed of transposed convolutions with stride 2. Finally, the output of FiCaps is the predicted salient map.
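As an illustration of the depth branch described above, the following minimal sketch downsamples a depth map by max pooling with a given factor. It is plain Python on nested lists rather than the authors' PyTorch code; the function names `max_pool2d` and `depth_pyramid`, and the assumption of non-overlapping pooling windows (kernel size equal to stride), are ours.

```python
def max_pool2d(image, factor):
    """Downsample a 2-D list by taking the max over factor x factor blocks.

    Assumes non-overlapping windows (kernel size == stride == factor) and
    that both image dimensions are divisible by factor.
    """
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the pooling factor")
    pooled = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [image[y][x]
                     for y in range(top, top + factor)
                     for x in range(left, left + factor)]
            row.append(max(block))
        pooled.append(row)
    return pooled


def depth_pyramid(depth, factors=(2, 4, 8, 16)):
    """Return one downsampled depth map per factor listed in the text."""
    return {f: max_pool2d(depth, f) for f in factors}
```

Under these assumptions, a 256 × 256 depth image yields maps of size 128, 64, 32 and 16, one per pooling factor.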

Feature depth module
The FDM is used to reweight features from GCM based on depth images. The structure of GCM is introduced in detail in 13. As shown in Fig. 3, we multiply the features with the depth image, which is downsampled to the corresponding size. Then, the product is processed by two convolutions with batch normalization and ReLU. Subsequently, these features are concatenated with the downsampled depth images and processed by two convolutions. Next, we apply the attention mechanism, including the channel attention and the spatial attention.

Figure 2. The framework of CCNet. The features are extracted by the VGG backbone. The features $g_0, g_1, g_2, g_3, g_4$ refer to the outputs of the GCM. The depth image is downsampled directly. Then, the features from GCM and the depth images are integrated by FDM, whose outputs are $fd_0, fd_1, fd_2, fd_3, fd_4$. In the next step, the outputs of FDM are aggregated progressively by the Feature Fusion Module (FFM), denoted as $ff_{01}, ff_{12}, ff_{23}, ff_{34}$. In FiCaps, conv means the traditional convolution with 1 × 1 kernel size. convCaps means the convolutional capsule layer, whose stride and padding are equal to 1 or 2. deconvCaps refers to the deconvolutional capsule layer, implemented by transposed convolution with stride 2 and padding 2. The concatenation indicates a series of operations, including concatenation, reshape and convolution, for integrating internal capsules and external features.

Feature fusion module
The FFM integrates two adjacent features from high level to low level, generating the feature map. As shown in Eq. (4), the two input features first pass through their own convolution layers. Then, the relatively high-level feature is upsampled and concatenated with the low-level feature, and the result is further processed by two convolution layers.
$$ff_{i,i-1} = \mathrm{conv}\big(\mathrm{conv}\big(\mathrm{cat}\big(\mathrm{up}(\mathrm{conv}_a(fd_i)),\ \mathrm{conv}_b(fd_{i-1})\big)\big)\big) \tag{4}$$
where $fd_i$ and $fd_{i-1}$ refer to the relatively high-level feature and the low-level feature, respectively. $\mathrm{conv}_a$, $\mathrm{conv}_b$ and $\mathrm{conv}$ all indicate the convolution with batch normalization and ReLU. up indicates the 2-times upsampling operation and cat means the concatenation operation. The index $i$ ranges from 1 to 4.

Feature-integrated convolutional capsule network
Figure 4 shows the details of FiCaps. We first introduce the structure of FiCaps and then elaborate on its components, including the convolutional capsule layer, the deconvolutional capsule layer and the concatenation layer. FiCaps shares the same architecture as U-Net. The encoder contains two basic modules, the convolutional capsule layer and the concatenation layer, while the decoder is composed of the convolutional capsule layer and the deconvolutional capsule layer.
First, the feature map from the FFM is transformed into a capsule. Then, the capsule (1 × 16 × 256 × 256) is downsampled by a convolutional capsule layer with stride 2 and passed to a convolutional capsule layer with stride 1, which maps the child capsules to the parent capsules using the dynamic routing algorithm. Subsequently, the concatenation layer first transforms the capsule (4 × 16 × 128 × 128) back to a feature map (64 × 128 × 128) via a reshape operation. The transformed feature map is then concatenated with the corresponding external features (32 × 128 × 128), processed by a convolution and reshaped into the capsule (4 × 16 × 128 × 128). This procedure is executed three times until the capsule (8 × 32 × 32 × 32) is obtained.
In the decoder, the capsule is first upsampled by a deconvolutional capsule layer with stride 2. Then, the upsampled capsule (8 × 32 × 64 × 64) and the corresponding capsule in the encoder are concatenated through the bridge connection to generate the capsule (8 × 32 × 64 × 64), which is then processed by the convolutional capsule layer. Likewise, this procedure is repeated three times to predict the final salient map.

Convolutional and deconvolutional capsule layer
Both the convolutional and the deconvolutional capsule layer contain two parts: the transformation module of the capsule and the dynamic routing algorithm. A capsule layer has seven parameters and can be described as capsulelayer(in, inv, op, s, on, onv, rt). in and inv are the number of input capsules and the vector length of each input capsule, while on and onv are the number of output capsules and the vector length of each output capsule, respectively. op has two options, 'conv' and 'deconv': 'conv' denotes the convolutional capsule layer and 'deconv' denotes the deconvolutional capsule layer. s is the stride of the convolution, which cooperates with op to accomplish different operations. If op is 'conv' and s is 2, the layer is a convolution with 2 times downsampling. Furthermore, if op is 'deconv' and s is 2, the layer is a transposed convolution with 2 times upsampling.
The convolutional capsule layer decides how to assign active child capsules to parent capsules. This is similar to the process of clustering: each parent capsule corresponds to a cluster center and each child capsule corresponds to a data point, so the assignment can be solved by an EM algorithm. This mapping is measured by a transformation matrix, called voting in EM routing, between the child capsule $c_i^{(l)}$ and its parent capsule. In the Gaussian mixture model used for routing, the activation $a_j$ of capsule $j$ is a mixture coefficient and $V_{ij}$ is treated as a k*d-dimensional vector. Since the child capsules vote for the parent capsule $j$, the contribution coefficient $r_{ij}$ of capsule $i$ when calculating the cluster center (capsule) $j$ should take its activation value $a_i$ into account.
Finally, the procedure of the convolutional and deconvolutional capsule layer is as follows. First, we transform capsules to feature maps by reshaping capsules [n, in, inv, h, w] into feature maps [n, in * inv, h, w], which are then passed through the convolution layer. Then, we transform the feature maps back to capsules by reshaping feature maps [n, on * onv, h, w] into capsules [n, on, onv, h, w]. Finally, the capsules execute the dynamic routing using the EM algorithm rt times.
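To make the routing step more concrete, the following is a minimal sketch of EM routing for a single spatial location, following the matrix-capsule formulation of reference 12 as it is commonly described. It is not the authors' implementation: it works on plain Python lists instead of convolutional feature maps, and the constants `beta_u`, `beta_a`, `lam` and `eps` are illustrative values chosen here, not values from the paper.

```python
import math


def _sigmoid(x):
    # Clamp the argument so math.exp cannot overflow.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def em_routing(votes, child_act, iters=3, beta_u=0.0, beta_a=0.0,
               lam=1.0, eps=1e-6):
    """Route child capsules to parent capsules with EM.

    votes[i][j] is the vote vector (list of floats) of child i for parent j,
    child_act[i] is the activation a_i of child i.
    Returns (r, mu, parent_act): r[i][j] is the assignment of child i to
    parent j, mu[j] the pose of parent j and parent_act[j] its activation.
    beta_u, beta_a and lam are illustrative constants, not paper values.
    """
    n_child, n_parent, dim = len(votes), len(votes[0]), len(votes[0][0])
    # Start from a uniform assignment of every child to all parents.
    r = [[1.0 / n_parent] * n_parent for _ in range(n_child)]
    mu, var, parent_act = [], [], []
    for _ in range(iters):
        # M-step: fit one diagonal Gaussian per parent from weighted votes.
        mu, var, parent_act = [], [], []
        for j in range(n_parent):
            w = [r[i][j] * child_act[i] for i in range(n_child)]
            w_sum = sum(w) + eps
            m = [sum(w[i] * votes[i][j][h] for i in range(n_child)) / w_sum
                 for h in range(dim)]
            v = [sum(w[i] * (votes[i][j][h] - m[h]) ** 2
                     for i in range(n_child)) / w_sum + eps
                 for h in range(dim)]
            cost = sum((beta_u + 0.5 * math.log(s)) * w_sum for s in v)
            mu.append(m)
            var.append(v)
            parent_act.append(_sigmoid(lam * (beta_a - cost)))
        # E-step: recompute assignments from the Gaussian posteriors.
        for i in range(n_child):
            log_p = []
            for j in range(n_parent):
                ll = sum(-0.5 * math.log(2.0 * math.pi * var[j][h])
                         - (votes[i][j][h] - mu[j][h]) ** 2 / (2.0 * var[j][h])
                         for h in range(dim))
                log_p.append(math.log(parent_act[j] + eps) + ll)
            top = max(log_p)
            weights = [math.exp(x - top) for x in log_p]
            total = sum(weights)
            r[i] = [x / total for x in weights]
    return r, mu, parent_act
```

In this sketch, children whose votes for a parent agree form a tight cluster, so that parent receives a high activation and attracts those children, which is the clustering view of routing described above.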

Concatenation layer
The concatenation layer includes two reshape operations, a concatenation operation and a convolution. Suppose the size of the capsule from the convolutional capsule layer is [b, c, v, h, w] and the size of the external feature map is [b, n, h, w]. The procedure of the concatenation layer is as follows. In the first stage, we transform the capsules to feature maps by reshaping the capsules [b, c, v, h, w] into feature maps [b, c * v, h, w], so that the capsules and the external feature maps have the same layout. Next, we concatenate the transformed feature maps with the external features and pass the result through the convolutional layer. Lastly, the result is transformed back to capsules by reshaping the feature maps [b, c * v, h, w] back into capsules [b, c, v, h, w].
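The shape bookkeeping of the concatenation layer can be sketched as follows. The helper `concat_layer_shapes` is a hypothetical name introduced here; it only traces tensor shapes and assumes that the convolution maps the concatenated channels back to c * v channels while preserving the spatial size.

```python
def concat_layer_shapes(capsule_shape, feature_shape):
    """Trace tensor shapes through the concatenation layer.

    capsule_shape is [b, c, v, h, w]; feature_shape is [b, n, h, w].
    Returns the shapes after the first reshape, the concatenation, the
    convolution and the final reshape. The convolution is assumed to map
    back to c * v channels and to keep the spatial size.
    """
    b, c, v, h, w = capsule_shape
    fb, n, fh, fw = feature_shape
    if (fb, fh, fw) != (b, h, w):
        raise ValueError("batch and spatial sizes must match")
    flattened = (b, c * v, h, w)          # capsules -> feature maps
    concatenated = (b, c * v + n, h, w)   # channel-wise concatenation
    convolved = (b, c * v, h, w)          # convolution restores c * v channels
    capsules = (b, c, v, h, w)            # feature maps -> capsules
    return flattened, concatenated, convolved, capsules
```

For the first encoder stage described in the text, a capsule of size 4 × 16 × 128 × 128 becomes a 64-channel feature map, is concatenated with the 32-channel external features, and is reshaped back to 4 × 16 × 128 × 128 after the convolution.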

Loss function
The parameters of our proposed method are supervised by the cross-entropy loss and the margin loss. Here, CE and ML denote the cross-entropy loss and the margin loss, respectively, and gt and pred indicate the ground truth and the prediction of the salient object. $m^+$ and $m^-$ are constants, which are set to 0.9 and 0.1, respectively. The coefficients α and β are both set to 1.
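A minimal sketch of such a combined loss is given below. It assumes that the total loss is the weighted sum α · CE + β · ML and that the margin loss takes the usual two-sided form without an extra down-weighting factor on the negative term; both are assumptions, since the display equations are not reproduced here. The function names are ours, and the inputs are flat lists of per-pixel labels and probabilities.

```python
import math


def cross_entropy(gt, pred, eps=1e-7):
    """Mean binary cross-entropy over flat lists of labels and probabilities."""
    total = 0.0
    for g, p in zip(gt, pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(g * math.log(p) + (1.0 - g) * math.log(1.0 - p))
    return total / len(gt)


def margin_loss(gt, pred, m_pos=0.9, m_neg=0.1):
    """Mean margin loss; the unweighted two-sided form is an assumption."""
    total = 0.0
    for g, p in zip(gt, pred):
        total += g * max(0.0, m_pos - p) ** 2          # salient pixels below m+
        total += (1.0 - g) * max(0.0, p - m_neg) ** 2  # background above m-
    return total / len(gt)


def total_loss(gt, pred, alpha=1.0, beta=1.0):
    """alpha * CE + beta * ML, with alpha = beta = 1 as stated in the text."""
    return alpha * cross_entropy(gt, pred) + beta * margin_loss(gt, pred)
```

With this form, the margin term vanishes once salient pixels are predicted above 0.9 and background pixels below 0.1, while the cross-entropy term keeps pushing predictions toward the labels.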

Experiments and analysis
In this section, extensive experiments are conducted to verify the effectiveness and superiority of CCNet and its modules, evaluated by four evaluation metrics.

Benchmark datasets and evaluation metrics
We evaluate the performance of our model on four public RGB-D benchmark datasets: NJU2K 25 (1985 samples), NLPR 26 (1000 samples), STERE 27 (1000 samples) and SIP 28 (929 samples). We choose the same 700 samples from NLPR and 1500 samples from NJU2K to train our algorithm. The remaining samples are used for testing.
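Of the four evaluation metrics, MAE is the simplest to state. As a reference point, the sketch below computes it as the mean absolute per-pixel difference between a predicted salient map and the ground truth, which is the conventional definition; the function name and the nested-list input format are assumptions made for this illustration.

```python
def mean_absolute_error(pred, gt):
    """Mean absolute per-pixel difference between two maps with values in [0, 1]."""
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("prediction and ground truth must have the same size")
    total = sum(abs(p - g)
                for pred_row, gt_row in zip(pred, gt)
                for p, g in zip(pred_row, gt_row))
    return total / (len(pred) * len(pred[0]))
```

A lower value means that the predicted map is closer to the ground truth on average.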

Quantitative comparisons
Table 1 shows quantitative comparisons with 23 salient detectors from three perspectives. First, the evaluation scores of all methods on the four benchmark datasets are presented as columns. Our model achieves top-3 performance on NLPR, STERE and SIP for all four evaluation metrics. More importantly, our proposed method attains the lowest MAE on NLPR and STERE, with improvements of approximately 8.7% and 2.7%, respectively. Second, we count the number of top-3 results of every method, which is reported in the column named Top 3. Our proposed method has the largest count, 11 out of 16. Finally, we calculate the average value of each evaluation metric over the four datasets, which is listed in the row named "Average-Metric". Our model reaches top-3 performance in every average and ranks 1st in the average MAE.

Visual comparisons
Figure 5 shows visual comparisons. The examples cover various scenarios, including complex scenes (1st and 2nd rows), multiple salient objects (3rd and 4th rows), small objects (5th and 6th rows) and low contrast between the salient object and the background (7th and 8th rows). The results of the compared methods are either downloaded directly from GitHub or produced by training the source code released on GitHub and predicting the salient objects. For complex scenes, the compared approaches mostly predict a blurry salient object and mark some non-salient regions around the salient object as salient. For multi-object detection, several methods miss some salient objects or predict salient objects with noise. For small objects, the compared methods cannot predict a clear and complete salient object when its size in the image is very small. Lastly, in the low-contrast scenario, the existing salient detectors mostly produce poor object smoothness and poor details of the salient object. In addition, some compared methods miss important parts of the salient object. To sum up, our proposed method consistently produces accurate and complete salient maps with sharp edges in various cases.

Ablation study
In this section, we validate the effectiveness of the proposed structures. First, we evaluate the performance of our proposed FiCaps by comparing it with U-Net 47. Furthermore, we test the strategy of integrating external features with internal capsules in FiCaps. Next, our proposed FDM is evaluated. Finally, the performance of GCM is verified by replacing it with traditional convolutions. All experimental results are reported in Table 2.

Effectiveness of FiCaps
We evaluate the performance of FiCaps in two aspects. On the one hand, we use U-Net as the compared structure: FiCaps is replaced with U-Net while the other modules and the parameters remain unchanged. The experimental results in Table 2, row ① and row 'ours', show that our FiCaps outperforms U-Net. On the other hand, we evaluate the effectiveness of integrating internal capsules with external features in FiCaps by training our method with and without integrating external features. Row ② and row 'ours' in Table 2 show that the integration of external features is an effective way to improve the performance, with improvements of approximately 0.1-9.8%.

Effectiveness of integration way of depth images
To evaluate the effect of integrating depth images directly, in this section we instead integrate depth features, which are extracted from depth images by VGG, MobileNet 48 or ResNet18 49. Specifically, another independent VGG backbone is used to extract depth features from the depth images and to predict the salient map based on the depth images. The extracted depth features are integrated with the features from the RGB images. The corresponding experimental result is listed as ③ in Table 2.

Effectiveness of GCM
To evaluate the contribution of GCM, we replace the GCM with traditional convolutions with batch normalization and ReLU. The experimental result ④ in Table 2 demonstrates that GCM is a more effective way to integrate features.

Table 1. Quantitative comparisons. For MAE, lower is better. For FM, SM and EM, higher is better. The second row from the bottom gives the scores of our proposed method and the last row gives the rank of our proposed method.

Conclusion
In this paper, we focus on solving the object-part relationship dilemma in SOD. To this end, we propose a novel CCNet based on CapsNet with less computational demand, which makes exploring the object-part relationship feasible and applicable. Our proposed method includes two main steps. In the first step, the RGB-D features are extracted and integrated. In the second step, the object-part relationship is explored by FiCaps, which then predicts the final salient map. Extensive experiments on four datasets demonstrate that our proposed method outperforms 23 SOTA methods.
Figure 1. Examples of the object-part relationship for SOD. In the first row, the tail wing of an aircraft is not recognized as part of the salient object by FCNs, while our proposed method predicts the aircraft as a whole. In the second row, the right arm of the cartoon figure predicted by the FCNs is incoherent, while the proposed method regards the right arm and the cartoon figure as an integral whole. In the third row, the salient object predicted by FCNs misses the stem of the plant, while both the flower and the stem are predicted by our CCNet. In the last row, the flame of the satellite is misidentified as the salient object by the FCNs, whereas our proposed method suppresses the interference by identifying the satellite as a complete object. Note: Reproduced with permission from reference 25, Copyright ©2017 IEEE; reference 26, Copyright ©2016 IEEE; reference 27, Copyright ©2018 IEEE; reference 28, Copyright ©2015 IEEE.
https://doi.org/10.1038/s41598-023-44698-z
www.nature.com/scientificreports/

In the FDM, the channel attention and the spatial attention are used to generate a reweighted map, which is multiplied with the input features $g_i$ and followed by convolutions. In this procedure, $g_i$ and $d_i$ refer to the $i$th feature from GCM and the depth image with $2^i$ times downsampling. $\mathrm{conv}_{br}$ and $\mathrm{conv}$ indicate the 3 × 3 convolution with and without batch normalization and ReLU. CA and SA denote the channel attention and the spatial attention. The parameter $i$ ranges from 1 to 4. The symbol * means pixel-wise multiplication.
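The display equation for this procedure is not present in this copy. Reading the textual description of the FDM step by step, one consistent way to write it is sketched below. The intermediate symbols $m_i$ and $n_i$ are introduced here for illustration only, and the order of CA and SA as well as the choice between $\mathrm{conv}_{br}$ and $\mathrm{conv}$ in the last two steps are assumptions rather than the authors' formula.

```latex
\begin{aligned}
m_i  &= \mathrm{conv}_{br}\big(\mathrm{conv}_{br}(g_i * d_i)\big), \\
n_i  &= \mathrm{conv}\big(\mathrm{conv}(\mathrm{cat}(m_i, d_i))\big), \\
fd_i &= \mathrm{conv}\big(\mathrm{SA}(\mathrm{CA}(n_i)) * g_i\big).
\end{aligned}
```

The first line is the depth-weighted feature, the second line fuses it with the downsampled depth image, and the third line applies the attention-based reweighting to the input features.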

Figure 3. The structure of FDM. $\mathrm{conv}_{br}$ refers to the 3 × 3 convolution with batch normalization and ReLU, while conv means the plain 3 × 3 convolution. The symbols X and [.] indicate pixel-wise multiplication and the concatenation operation, respectively. pool refers to the pooling operation, whose downsampling factor is $2^i$.

Figure 4. The structure of FiCaps. conv0 represents the traditional 1 × 1 convolution. convCaps(2) indicates the convolutional capsule layer with stride 2, while convCaps(1) means the convolutional capsule layer with stride 1. deconvCaps represents the deconvolutional capsule layer based on the transposed convolution. The red dashed arrow refers to the bridge connection between a capsule in the encoder and the corresponding capsule in the decoder. The black arrow is the data flow, whose data size is given by the text near it. The concatenate layer means the concatenation operation for integrating the internal capsules with the external features. $f_{01}$, $f_{12}$, $f_{23}$ and $f_{34}$ denote the corresponding external features.
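The capsule sizes along the data flow of FiCaps can be traced with a small helper. The sketch below is only shape bookkeeping under the assumption that padding is chosen so that the stride alone sets the spatial size; the function name `capsule_layer_shape` is ours, and the output capsule counts passed in the usage are taken from the sizes quoted in the text where available.

```python
def capsule_layer_shape(in_shape, op, s, on, onv):
    """Output shape of a capsule layer for an input [n, in, inv, h, w].

    Assumes padding is chosen so that stride alone sets the spatial size:
    'conv' divides h and w by s, 'deconv' multiplies them by s.
    """
    n, _num_in, _vec_in, h, w = in_shape
    if op == "conv":
        return (n, on, onv, h // s, w // s)
    if op == "deconv":
        return (n, on, onv, h * s, w * s)
    raise ValueError("op must be 'conv' or 'deconv'")


# First encoder stage: stride-2 downsampling followed by a stride-1 layer.
x = capsule_layer_shape((1, 1, 16, 256, 256), "conv", 2, 4, 16)
x = capsule_layer_shape(x, "conv", 1, 4, 16)
# First decoder stage: stride-2 upsampling of the bottleneck capsule.
y = capsule_layer_shape((1, 8, 32, 32, 32), "deconv", 2, 8, 32)
```

Here `x` ends at the 4 × 16 × 128 × 128 capsule and `y` at the 8 × 32 × 64 × 64 capsule mentioned in the description of FiCaps; whether the stride-2 encoder layer already outputs 4 × 16 capsules is an assumption of this sketch.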
In EM routing, the voting relates the child capsule and the parent capsule. $V_{ij}^{(l)}$ is the voting result from capsule $i$ at layer $l$ for capsule $j$ at layer $l+1$, and $T_{ij}^{(l)}$ is a transformation matrix. Next, a Gaussian mixture model is introduced. Suppose the Gaussian distribution $N(x; \mu, \Sigma)$ has a diagonal covariance matrix $\mathrm{diag}(\sigma^2)$. The posterior probability of a vote $V_{ij}^{(l)}$ belonging to the $j$th Gaussian is defined under this model.

Figure 5. Visual comparisons of different methods. The 1st and 2nd rows show complex scenes. The 3rd and 4th rows contain multiple salient objects. The 5th and 6th rows show scenes with small targets. The 7th and 8th rows show low contrast between the background and the object. Note: Reproduced with permission from reference 25, Copyright ©2017 IEEE; reference 26, Copyright ©2016 IEEE; reference 27, Copyright ©2018 IEEE; reference 28, Copyright ©2015 IEEE.
Our proposed CCNet is implemented in PyTorch and trained for 300 epochs on a single NVIDIA Tesla T4 GPU. The Adam optimizer is used with default values. The initial learning rate is set to 1e−4 and the batch size is 10. The poly learning rate policy is used, where the power is set to 0.9. For data augmentation, every input batch in the training session is resized to 256 × 256 and augmented with random flipping, rotation, color enhancement and random pepper noise. In the training session, the RGB images, depth images and ground truth (GT) are combined as a data batch. During inference, RGB-D images are fed into the trained model to predict the salient map without any post-processing.
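The poly learning rate policy mentioned above is conventionally a polynomial decay of the initial learning rate. The sketch below shows this schedule with the stated initial rate 1e−4 and power 0.9; whether the decay is applied per epoch or per iteration is not stated in the text, so the meaning of `step` is left to the caller, and the function name is ours.

```python
def poly_lr(step, total_steps, base_lr=1e-4, power=0.9):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return base_lr * (1.0 - step / total_steps) ** power
```

With 300 epochs as the schedule length, the rate starts at 1e−4, decreases monotonically and reaches zero at the end of training.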

Table 2. Ablation study. 'ours' in Table 2 means our proposed method. ① refers to the experimental results obtained by replacing FiCaps with U-Net. ② refers to the results in which FiCaps does not integrate external features. ③ refers to the results obtained by extracting and integrating the features of the depth image with the VGG backbone. ④ refers to the results obtained by replacing the GCM with traditional convolutions.