DA-CapsNet: dual attention mechanism capsule network

A capsule network (CapsNet) is a recently proposed neural network model with a new structure, whose purpose is to form activation capsules. In this paper, we propose a dual attention mechanism capsule network (DA-CapsNet). In DA-CapsNet, the first attention layer is added after the convolution layer and is referred to as Conv-Attention; the second is added after the PrimaryCaps and is referred to as Caps-Attention. The experimental results show that DA-CapsNet performs better than CapsNet. On the MNIST test set, the trained DA-CapsNet reaches an accuracy of 100% after 8 epochs, compared with 25 epochs for CapsNet. For SVHN, CIFAR10, FashionMNIST, smallNORB, and COIL-20, the highest accuracy of DA-CapsNet was 3.46%, 2.52%, 1.57%, 1.33%, and 1.16% higher than that of CapsNet, and the image-reconstruction results on COIL-20 show that DA-CapsNet performs more competitively than CapsNet.

Attention mechanism. The visual attention mechanism is a special brain signal processing mechanism in human vision 21 . When human vision captures an image, it automatically focuses on an interesting part, invests more energy in and obtains more information from the relevant areas, and suppresses irrelevant information. Researchers have incorporated the concept of visual attention into computer vision to improve model efficiency 22-25 . In essence, a neural network is a function approximator 26 , and its structure determines what kind of functions it can fit 27,28 . A typical neural network is implemented as a series of matrix multiplications and element-wise non-linearities, in which the elements of the input or feature vectors interact only by addition 29-31 . In theory, neural networks can fit any function, but in practice the functions they can fit are limited. Zhang et al. 32 proposed that spatial interaction in the human visual cortex requires a multiplication mechanism. Swindale et al. 33,34 proposed that forward information in cortical maps is directed to backward information by an attentional mechanism that controls the presence of multiplicative effects and multiplicative interactions. By introducing an attention mechanism, a mask is multiplied with the features of the input vectors, which extends the operations on input vectors to multiplication and reduces the limitations on the functions a neural network can fit.

Methods
Background. In deep learning, capsules are sets of embedded neurons, and a CapsNet comprises these capsules. The activity vector of a capsule represents the instantiation parameters of a specific type of entity, such as a target or a part of a target. Figure 1 shows the original CapsNet architecture, which achieves results comparable to a deep convolutional network. The length of the activation vector of each capsule in the DigitCaps layer represents the presence of each class instance and is used to calculate the classification loss. Figure 2 shows the decoding structure of the DigitCaps layer: the DigitCaps pass through two fully connected layers controlled by ReLU and tanh, and the Euclidean distance between the input image and the output of the sigmoid layer is minimized during training, with the correct label used as the reconstruction target.
The length of a capsule's output vector represents the probability that the entity represented by the capsule exists in the current input. Therefore, a nonlinear squashing function is used as the activation function to ensure that short vectors are compressed to a length close to 0 and long vectors to a length slightly less than 1. In the original paper 3 , the constant used in the squashing function was 1; in our experiments, the constant was changed to 0.5 to improve the scaling of the squashing function. Table 1 shows the accuracy of CapsNet on each dataset under different squashing constants. The squashing function is given by Eq. (1) and the scaling factors by Eqs. (2) and (3):

v_j = (||s_j||^2 / (0.5 + ||s_j||^2)) · (s_j / ||s_j||)    (1)

z_1 = ||s_j||^2 / (1 + ||s_j||^2)    (2)

z_0.5 = ||s_j||^2 / (0.5 + ||s_j||^2)    (3)

where v_j is the output vector of the j-th capsule, s_j is the input vector of the j-th capsule, and ||s_j|| is the norm of s_j. It can be seen from Fig. 3 that when the norm of the vector is small or large, the scaling factor of the z_1 function increases less, and when the norm is in the middle range, the scaling factor increases more. After the activation function is changed, the network's attention to image features is increased, achieving better results.
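As a minimal sketch of the modified squashing function, the snippet below implements Eq. (1) with the squashing constant exposed as a parameter k, so that the original setting (k = 1) and the paper's setting (k = 0.5) can be compared directly; the small epsilon added to the norm is an implementation detail, not part of the paper:

```python
import numpy as np

def squash(s, k=0.5):
    """Squashing non-linearity (Eq. 1) with a tunable constant k.

    k = 1 reproduces the original CapsNet squash; the paper's
    experiments use k = 0.5, which scales vectors more aggressively.
    """
    norm_sq = np.sum(s * s, axis=-1, keepdims=True)
    norm = np.sqrt(norm_sq + 1e-9)        # epsilon avoids division by zero
    scale = norm_sq / (k + norm_sq)       # scaling factor z_k (Eqs. 2-3)
    return scale * s / norm

# A short vector is squashed toward 0; a long one toward unit length.
short = squash(np.array([0.1, 0.0]))
long_ = squash(np.array([10.0, 0.0]))
```

For the same input, z_0.5 is always larger than z_1, which is the increased "scaling scale" the text refers to.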
The input of a capsule s_j is calculated with Eqs. (4) and (5):

s_j = Σ_i c_ij û_j|i    (4)

û_j|i = W_ij u_i    (5)

The total input s_j of a capsule is a weighted sum of all prediction vectors û_j|i from the capsules of the lower layer, where u_i is the output of a lower-layer capsule multiplied by a weight matrix W_ij. The coupling coefficient c_ij is determined by the iterative dynamic routing process and is calculated using Eq. (6):

c_ij = exp(b_ij) / Σ_k exp(b_ik)    (6)
To get c_ij, b_ij must first be found; b_ij is updated using Eq. (7):

b_ij ← b_ij + û_j|i · v_j    (7)

The initial value of b_ij is 0; from this we obtain c_ij, while u_i is the output of the previous layer of capsules. With these values, we can determine the next level's s_j. Hinton et al.'s 4 experiments on MNIST showed that CapsNet has a unique ability to process an image target or part of a target, which traditional CNNs cannot match.
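The routing loop over Eqs. (4)-(7) can be sketched in a few lines of numpy; the shapes below (100 lower capsules, 10 upper capsules, dimension 16) and the choice of 3 iterations are illustrative assumptions, and the prediction vectors û_j|i are taken as given rather than computed from W_ij:

```python
import numpy as np

def squash(s):
    n2 = np.sum(s * s, axis=-1, keepdims=True)
    return (n2 / (0.5 + n2)) * s / np.sqrt(n2 + 1e-9)

def dynamic_routing(u_hat, n_iter=3):
    """Routing-by-agreement over prediction vectors u_hat
    of shape [num_lower, num_upper, dim]."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))      # logits b_ij start at 0
    for _ in range(n_iter):
        # Eq. (6): softmax of b_ij over the upper capsules
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        # Eq. (4): weighted sum of predictions -> [num_upper, dim]
        s = (c[..., None] * u_hat).sum(axis=0)
        v = squash(s)
        # Eq. (7): agreement between predictions and outputs
        b = b + (u_hat * v[None]).sum(axis=-1)
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(100, 10, 16))   # e.g. 100 PrimaryCaps -> 10 DigitCaps
v = dynamic_routing(u_hat)
```

Each iteration sharpens the coupling coefficients toward upper capsules whose outputs agree with the lower capsules' predictions.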

DA-CapsNet.
Overall structure of DA-CapsNet. Figure 4 presents the architecture of DA-CapsNet. Unlike CapsNet, DA-CapsNet adds two attention layers: Conv-Attention between the ReLU convolution layer and the PrimaryCaps, and Caps-Attention between the PrimaryCaps and the DigitCaps.
In Fig. 4, the purpose of Part A is to turn an image into attention convolution features with a higher contribution. The input image has dimensions [32, 32, 3]. Two layers of 3 × 3 convolution kernels with stride 1 are used instead of a single 5 × 5 kernel to convolve the image and capture more detailed image information, with ReLU as the activation function; the result of the convolution is a 10 × 10 tensor with 128 channels ([10, 10, 128]). Conv-Attention then processes this tensor into attention convolution features that contribute more to the experimental results.
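The substitution of two stacked 3 × 3 kernels for one 5 × 5 kernel can be checked arithmetically: two stride-1 3 × 3 layers cover the same 5 × 5 receptive field while using fewer weights per input/output channel pair, as this small sketch shows:

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions:
    each layer with kernel size k adds (k - 1) to the field."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Weights per input/output channel pair (ignoring biases).
params_two_3x3 = 2 * 3 * 3   # 18
params_one_5x5 = 5 * 5       # 25
```

So the stacked design sees the same area with fewer parameters, plus an extra non-linearity between the two layers.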
The purpose of Part B is to transform the PrimaryCaps into more productive AttCaps and then generate DigitCaps at a higher level than in the original network. In this process, one-dimensional convolution is used instead of a fully connected operation to improve efficiency. The activation function is linear, which generates 100 PrimaryCaps of dimension [10, 16], with all capsules sharing weights. The PrimaryCaps are then processed by Caps-Attention to become attention capsules (AttCaps).
In Part C, the dynamic routing algorithm is used to change the AttCaps into DigitCaps. The length of the activation vector of each capsule in the DigitCaps layer represents the corresponding predicted probability of each class.
Attention mechanism in DA-CapsNet. The function of Conv-Attention is to transform the result of the first convolution into attention convolution features, and the role of Caps-Attention is to change the PrimaryCaps into AttCaps. The purpose of the two attention modules is to make the capsules focus on a wider area of the image, obtain more information, and improve classification accuracy. We use the model's reconstruction results to show that the dual attention mechanism makes the capsules pay more attention to the image information.
Conv-Attention module. Figure 5 shows the principle of Conv-Attention. After the image is processed by ReLU Conv, a global pooling operation gathers the plane information into point information; the resulting feature after global pooling is u_g, with shape [1, 1, Q]. The second step synthesizes and processes the features extracted in the first step to improve the nonlinear expressiveness of the model 36 . Here, two fully connected layers process u_g to obtain u_1 and u_2; the first activation function is ReLU and the second is tanh. These are calculated using Eqs. (9) and (10):

u_1 = ReLU(W_1 u_g + b_1)    (9)

u_2 = tanh(W_2 u_1 + b_2)    (10)

where u_1 and u_2 are the results of the two fully connected layers, W_1 and W_2 are the corresponding weight matrices, and b_1 and b_2 are the corresponding biases. After the two fully connected operations, the shape of u_2 is [1, 1, Q].
In the third step, after u_2 is obtained, u_2 is multiplied with u_pc to get u_3, and u_3 is then added to u_pc to obtain the attention convolution u_c-att; u_3 and u_c-att are calculated using Eqs. (11) and (12):

u_3 = u_pc ⊗ u_2    (11)

u_c-att = u_3 + u_pc    (12)

Caps-Attention module. Figure 6 shows the principle of Caps-Attention. AttCaps are obtained by reshaping the PrimaryCaps into a vector, passing it through a fully connected network controlled by the ReLU and tanh activation functions, and then multiplying and adding the result with the PrimaryCaps.
In the second step, u_pr passes through two fully connected layers to obtain u_p1 and u_p2 36 . The activation function of the first operation is ReLU and that of the second is tanh; u_p1 and u_p2 are calculated using Eqs. (13) and (14):

u_p1 = ReLU(W_3 u_pr + b_3)    (13)

u_p2 = tanh(W_4 u_p1 + b_4)    (14)

where u_p1 and u_p2 are the results of the two fully connected layers, W_3 and W_4 are the corresponding weight matrices, and b_3 and b_4 are the corresponding biases.
In the third step, after u_p2 is obtained, u_p2 is multiplied with u_p to get u_p3, and u_p3 is then added to u_p to obtain the attention capsules u_p-att; u_p3 and u_p-att are calculated using Eqs. (15) and (16):

u_p3 = u_p ⊗ u_p2    (15)

u_p-att = u_p3 + u_p    (16)

Experiments. Four configurations were tested on each dataset: the original CapsNet, CapsNet with only the Conv-Attention layer, CapsNet with only the Caps-Attention layer, and the two-layer attention mechanism (DA-CapsNet). On all six datasets, preprocessing and real-time data augmentation were carried out, increasing the number of image samples through transformations such as translation and flipping. In many attention mechanisms applied to image classification tasks, the last activation function is softmax or sigmoid, whereas the tanh function is used in our experiments; the value range of the tanh function is (−1, 1).
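The two attention modules described above (Eqs. 9-16) share the same pattern: squeeze the features, pass them through two fully connected layers (ReLU then tanh), gate the original features element-wise, and add the gated result back. The sketch below assumes the residual in Eq. (12) adds back u_pc, mirroring the Caps-Attention form of Eq. (16); the hidden widths of the fully connected layers are illustrative assumptions not specified in the paper:

```python
import numpy as np

def conv_attention(u_pc, w1, b1, w2, b2):
    """Conv-Attention sketch: u_pc is a feature map [H, W, Q]."""
    u_g = u_pc.mean(axis=(0, 1))              # global pooling -> [Q]
    u_1 = np.maximum(u_g @ w1 + b1, 0.0)      # Eq. (9): ReLU
    u_2 = np.tanh(u_1 @ w2 + b2)              # Eq. (10): tanh mask in (-1, 1)
    u_3 = u_pc * u_2                          # Eq. (11): gate each channel
    return u_3 + u_pc                         # Eq. (12): residual (assumed form)

def caps_attention(u_p, w3, b3, w4, b4):
    """Caps-Attention sketch: u_p holds N PrimaryCaps of dimension D."""
    u_pr = u_p.reshape(-1)                    # flatten capsules to a vector
    u_p1 = np.maximum(u_pr @ w3 + b3, 0.0)    # Eq. (13): ReLU
    u_p2 = np.tanh(u_p1 @ w4 + b4)            # Eq. (14): tanh mask
    u_p3 = u_p * u_p2.reshape(u_p.shape)      # Eq. (15): element-wise product
    return u_p3 + u_p                         # Eq. (16): residual addition

rng = np.random.default_rng(0)
Q, H = 128, 16                                # H: hidden width (assumption)
feat = rng.normal(size=(10, 10, Q))
out = conv_attention(feat,
                     rng.normal(size=(Q, H)), np.zeros(H),
                     rng.normal(size=(H, Q)), np.zeros(Q))

caps = rng.normal(size=(100, 16))             # 100 PrimaryCaps of dim 16
att = caps_attention(caps,
                     0.01 * rng.normal(size=(1600, 64)), np.zeros(64),
                     0.01 * rng.normal(size=(64, 1600)), np.zeros(1600))
```

Because the tanh mask lies in (−1, 1), the residual addition lets the modules suppress as well as amplify features, while preserving the input shape for the subsequent routing stage.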
MNIST results. Figure 10 shows the epochs needed for the four experiments to achieve 100% accuracy on the MNIST test dataset. For MNIST, the experiment was based on the network structure of Fig. 4. DA-CapsNet was trained on the training dataset and then evaluated on the test dataset, with 100 epochs run. As shown in Figs. 8 and 10, 8 epochs were needed for DA-CapsNet to reach an accuracy of 100%, whereas 25 epochs were needed for CapsNet, 16 for CapsNet with Conv-Attention, and 13 for CapsNet with Caps-Attention.
CIFAR10, SVHN and FashionMNIST results. Figures 11, 12 and 13 are line charts showing the accuracy of the four experiments on CIFAR10, SVHN and FashionMNIST, respectively. Table 2 shows the highest accuracy and improvement rate of the four experiments on each dataset. Before the attention mechanism is applied, the image is first convolved; Table 3 shows the specific convolution steps and settings. In the FashionMNIST experiment, the input and output tensor was [10, 10, 128] through Conv-Attention and [10, 100, 16] through Caps-Attention; the same shapes were used in the CIFAR10 and SVHN experiments.

smallNORB and COIL-20 results and image reconstruction. smallNORB and COIL-20 contain images of objects photographed from different angles, so they are of great significance for studying the unique spatial invariance of CapsNet. Compared with smallNORB, COIL-20 has more image categories, more features, such as texture and posture, and more differences between images. In the experiment, the image size of smallNORB and COIL-20 was set to 32 × 32 pixels. We chose the top ten categories of COIL-20 for the experiment, with a training-to-test ratio of 9:1; 100 epochs were run with a batch size of 16. At the same time, we trained a CNN as a baseline to compare with DA-CapsNet. The CNN has two convolution layers with 32 and 64 channels, respectively; both layers have a kernel size of 5 and a stride of 1 with 2 × 2 max pooling, followed by a fully connected layer with 1,024 units with dropout. On smallNORB, the CNN connects to a 5-way softmax output layer, while on COIL-20 it connects to a 20-way softmax output layer. Figure 14 shows the reconstruction results of DA-CapsNet and CapsNet on COIL-20. In Fig. 14a, the direction of the image reconstructed by CapsNet tends to be horizontal, while that of DA-CapsNet is inclined, and the image reconstructed by DA-CapsNet contains more texture; the images of DA-CapsNet in Fig. 14b,c likewise show more detail.

All results.
The results of the CNN baseline, CapsNet and DA-CapsNet on the six datasets are summarized in Table 4. On the MNIST test dataset, the training results were all 100%, which makes direct comparison difficult; therefore, the average values of the first five epochs were used for the comparison.

Discussion
The effect of CapsNet depends on the characteristics of the capsules. The higher the level of a capsule, the more attributes of the specific entity in the image it captures, such as location, size, and direction. Improving the characteristics and enriching the content of capsules are key goals of CapsNet research. On this basis, our study of DA-CapsNet focused on the full contents of the capsule: extracting the key content (enlarging the relevant parameters), discarding the non-key content (reducing the relevant parameters), and improving the level of the capsule, finally obtaining capsules with a larger proportion of key information.
In CapsNet, there is no uniform specification for the number of PrimaryCaps, which is determined artificially according to the convolution mode of the convolution layer, as can be seen from Table 3. The discreteness of PrimaryCaps formed by an artificial convolution mode is strong, and the fitted function is subject to great limitation. The attention mechanism can be regarded as a multiplier that increases the number of functions the neural network model can fit 32-34 . DA-CapsNet uses two levels of attention mechanism; in the network presented in this paper, the two attention mechanisms are in series, and their combined result can be regarded as a composite function.
As shown in Figs. 8-13, a composite function can fit more functions than a single multiplier, and the fitted result is better than for a single attention level. At the same time, different attention layers have different effects depending on the network: for example, in the SVHN experiment, results were better for Conv-Attention than for Caps-Attention, while in the FashionMNIST experiment, Caps-Attention gave better results. The attention mechanism enables the neural network to focus on only part of its input information and to select specific inputs. The entities in the images have many attributes, such as posture and texture; by adding two layers of attention mechanisms, the neural network can pay more attention to this information. The better CapsNet understands the entity characteristics of an image, the better its performance in the classification task.

Conclusion
In this paper, we proposed a CapsNet based on a dual attention mechanism to improve the hierarchy of capsules, which was verified on six open datasets. The experimental results showed that DA-CapsNet with two attention mechanisms outperforms both CapsNet and the single-attention variants in image classification. The image-reconstruction results show that DA-CapsNet attends to image information faster and more accurately and has a stronger ability to capture image information. For SVHN, CIFAR10, FashionMNIST, smallNORB and COIL-20, the accuracy of DA-CapsNet was 3.46%, 2.52%, 1.57%, 1.33%, and 1.16% higher than that of CapsNet.