Object detectors involving a NAS-gate convolutional module and capsule attention module

Several state-of-the-art object detectors have demonstrated outstanding performance by optimizing feature representation through modification of the backbone architecture and exploitation of a feature pyramid. To determine the effectiveness of this approach, we explore the modification of object detectors’ backbone and feature pyramid by utilizing Neural Architecture Search (NAS) and Capsule Network. We introduce two modules, namely, the NAS-gate convolutional module and the Capsule Attention module. The NAS-gate convolutional module optimizes the standard convolution in a backbone network using differentiable architecture search over multiple convolution conditions to overcome object scale variation problems. The Capsule Attention module exploits the strong spatial relationship encoding ability of the capsule network to generate a spatial attention mask, which emphasizes important features and suppresses unnecessary features in the feature pyramid, in order to optimize the feature representation and localization capability of the detectors. Experimental results indicate that the NAS-gate convolutional module can alleviate the object scale variation problem and the Capsule Attention module can help to avoid inaccurate localization. Next, we introduce NASGC-CapANet, which incorporates the two modules. Results of comparisons against state-of-the-art object detectors on the MS COCO val-2017 dataset demonstrate that NASGC-CapANet-based Faster R-CNN significantly outperforms the baseline Faster R-CNN with a ResNet-50 backbone and a ResNet-101 backbone by 2.7% and 2.0% mAP, respectively. Furthermore, the NASGC-CapANet-based Cascade R-CNN achieves a box mAP of 43.8% on the MS COCO test-dev dataset.

• We proposed the NAS-gate convolutional module, which applies a NAS operation based on differentiable architecture search (DARTS), with multiple kernel sizes and dilation rates, to the convolutional operations of the classification backbone network. This decreases the computation cost relative to NAS-based backbones and alleviates the issues arising from object scale variation.
• We introduced a capsule attention module, based on a capsule network, to improve the feature representation by mitigating the information loss problem of FPN using the strong spatial relation encoding ability of the capsule network.
• We evaluated both proposed modules, as well as NASGC-CapANet, which incorporates them into state-of-the-art object detectors, on MS COCO and PASCAL VOC. The experimental results show that NASGC-CapANet considerably improves the detection performance compared to state-of-the-art baseline object detectors.

Method and experiment
In this section, we describe the architecture design of the proposed NASGC-CapANet, which combines state-of-the-art object detectors with our proposed modules. In general, the NAS-gate convolutional module and capsule attention module can both be incorporated in one-stage as well as two-stage object detectors. However, most studies in this domain focus on incorporating these modules in two-stage detectors such as Faster R-CNN 28 and Cascade R-CNN 15. To mitigate the problems arising from object scale variation, we optimized the feature extraction ability of the backbone by replacing the standard convolution of the classification backbone network with the proposed NAS-gate convolutional module, which is based on the Neural Architecture Search method. This increases the detection performance on multiscale objects in the images at a smaller computation cost than NAS-based object detector backbones. To enhance the localization ability of the object detectors, we improve the feature representation of the feature fusion network, or neck, by using the capsule attention module to alleviate the information loss at the highest feature level. The capsule attention module was designed to be incorporated into feature fusion networks, i.e., FPN and FPN-based methods such as PAFPN from PANet, whose architectures are shown in Fig. 1a, c, respectively. The capsule attention module is adopted at the highest level of the FPN and PAFPN, as shown in Fig. 1b.
NAS-gate convolutional module.
The NAS-gate convolutional module was designed to explore the object detector's backbone architecture by optimizing the standard convolution using a Neural Architecture Search approach. Generally, NAS automatically finds an optimal network architecture for a certain task and dataset.
The NAS domain involves three key areas: reinforcement learning-based methods that train a recurrent neural network (RNN) controller to generate the cell structure and form the CNN architecture; evolutionary algorithm (EA)-based methods that update the architecture or network by mutating the current best architectures; and gradient-based methods, which were utilized in the NAS-gate convolutional module, that define an architecture parameter for the continuous relaxation of the search space, thereby allowing differentiable optimization in the architecture search to accelerate the search process. Specifically, the NAS-gate convolutional module uses the gradient-based NAS method named Differentiable ARchiTecture Search (DARTS) 41 to search for the optimal condition of the convolutional operation of the object detector's backbone. To mitigate the computational infeasibility and time consumption of NAS-based object detector backbones, we utilized the NAS method to search for the optimal convolutional operation of classification network backbones such as ResNet-50 and ResNet-101, instead of searching for new backbone architectures. In the NAS-gate convolutional module, each 3 × 3 convolutional operation was defined as the computation cell within which the NAS operation searched for the final backbone architecture. Each cell was regarded as a directed acyclic graph, which was formed by sequentially connecting $N$ nodes. Each node $y^{(i)}$ was a feature representation in convolutional networks, and each directed edge $(i, j)$ was associated with some operation $p^{(i,j)}$ that transformed $y^{(i)}$.
The output of each node was obtained by summing the predecessor nodes $y^{(i)}$ after transformation by the operations $p^{(i,j)}$:

$$y^{(j)} = \sum_{i<j} p^{(i,j)}\left(y^{(i)}\right)$$

To optimize the standard convolution of the object detector backbone, we define the node $y^{(i)}$ as the input feature representation of the 3 × 3 convolutional operations in ResNet-50 and ResNet-101, and $P$ as the set of candidate operations, where each operation is a function $p(\cdot)$ applied to $y^{(i)}$. Because objects of different scales require different kernel sizes or dilation rates for the convolutional operation to extract their features effectively, we utilized two different kernel sizes and two different dilation rates for the candidate operations in $P$. To make the search space continuous, the categorical choice of a particular operation was relaxed to a softmax over all possible operations:

$$\bar{p}^{(i,j)}(y) = \sum_{p \in P} \frac{\exp\left(\gamma_p^{(i,j)}\right)}{\sum_{p' \in P} \exp\left(\gamma_{p'}^{(i,j)}\right)} \, p(y)$$

where the operation mixing weights for each edge $(i,j)$ are parameterized by a vector $\gamma^{(i,j)}$ with one entry $\gamma_p^{(i,j)}$ per candidate operation, as shown in Fig. 2.
Then, the aim of the architecture search was to learn the set of variables $\gamma$. At the end of the DARTS search, a discrete architecture is obtained by replacing each mixed operation $\bar{p}^{(i,j)}(y)$ with the most likely operation (i.e., $p^{(i,j)} = \arg\max_{p \in P} \gamma_p^{(i,j)}$). However, selecting only one operation for each node can decrease the efficiency of feature extraction, because a single convolutional operation cannot extract features of scale-variant objects in the images as effectively as the mixing operation, which combines multiple convolutional operation options with weight parameters $\gamma$. Through this combination, the mixing operation can provide a feature representation that contains important features for objects of all scales. Therefore, we did not convert the final architecture to a discrete architecture; instead, we kept the architecture with the mixing operation for training after the search ended. Furthermore, in DARTS the variables $\gamma$ are updated through gradient descent and optimized using the validation loss. However, updating $\gamma$ by optimizing the validation loss is time-consuming, because learning must continue until the optimal architecture is obtained. Therefore, we updated the parameter $\gamma$ by optimizing the training loss, $L_{train}(w, \gamma)$:

$$L_{train}(w, \gamma) = L_{cls}(w, \gamma) + L_{loc}(w, \gamma)$$

where $L_{cls}(w, \gamma)$ is the loss function for object classification and $L_{loc}(w, \gamma)$ is the loss function for bounding box localization. Both losses are determined by the weights of the network $w$ and the operation mixing weights $\gamma$. Algorithm 1 gives the search procedure of the NAS-gate convolutional module.
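The difference between the mixing operation and the discretized DARTS choice can be illustrated with a minimal sketch. It is not the authors' implementation: the 1-D box filter `conv1d_same`, the function names, and the specific kernel sizes and dilation rates in `CANDIDATE_OPS` are assumptions chosen for illustration, since the text only states that two kernel sizes and two dilation rates were used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of mixing weights gamma."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def conv1d_same(y, kernel_size, dilation=1):
    """Toy 1-D box filter with zero padding; output length equals input length.

    Hypothetical stand-in for a 2-D convolution with a given kernel size
    and dilation rate.
    """
    half = (kernel_size // 2) * dilation
    weight = 1.0 / kernel_size
    out = []
    for pos in range(len(y)):
        acc = 0.0
        for tap in range(kernel_size):
            src = pos - half + tap * dilation
            if 0 <= src < len(y):
                acc += weight * y[src]
        out.append(acc)
    return out


# Assumed candidate set P: the concrete values are illustrative only.
CANDIDATE_OPS = [
    lambda y: conv1d_same(y, 3, 1),
    lambda y: conv1d_same(y, 5, 1),
    lambda y: conv1d_same(y, 3, 2),
    lambda y: conv1d_same(y, 3, 3),
]


def mixed_operation(y, gamma, ops=CANDIDATE_OPS):
    """Softmax-weighted sum of all candidate operations (kept after the search)."""
    weights = softmax(gamma)
    outputs = [op(y) for op in ops]
    return [
        sum(w * out[k] for w, out in zip(weights, outputs))
        for k in range(len(y))
    ]


def discretized_operation(y, gamma, ops=CANDIDATE_OPS):
    """Standard DARTS discretization: keep only the argmax operation."""
    best = max(range(len(gamma)), key=lambda idx: gamma[idx])
    return ops[best](y)
```

With equal mixing weights, every candidate contributes equally to the output, whereas the discretized variant discards all but one receptive field; this is the trade-off the paragraph above describes.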

Capsule attention module.
In this subsection, we present the details of our proposed capsule attention module. The capsule attention module has been designed based on the structure of the Capsule Network, or CapNet 42. CapNet was proposed to overcome the challenges faced by convolutional neural networks (CNNs), specifically, the loss of information via the pooling process, sensitivity to object orientation, and difficulty in transferring the understanding of the geometric relationship to new viewpoints. Consequently, its architecture and optimization process differ from those of a CNN. A capsule in a CapNet is a group of neurons that uses a vector to represent the instantiation parameters of a specific type of entity, such as an object or an object part. The length of a capsule vector represents the probability that the object exists in the image, while the direction of the vector represents the corresponding pose information. Therefore, CapNet is more robust to changes in the orientation and size of the input. Furthermore, CapNet can encode spatial information and account for the spatial relations between the parts of the image. Accordingly, in the proposed capsule attention module we exploited these abilities of CapNets to generate an attention mask, which improves the feature representation by emphasizing object-related features and suppressing unrelated ones. In addition, we utilized the capsule attention module in FPN and FPN-based methods in order to increase the performance of the detectors by alleviating the highest-level information loss problem and enhancing the feature representation with strong semantic information, especially the spatial information and spatial relationships between the objects in the images. The proposed capsule attention module consists of two layers of capsules, as illustrated in Fig. 3.
The first layer of capsules, named primary caps, reformulates the input feature representation, which is the feature representation of the highest level of the backbone, into $N_c$ channels of convolutional $D_1$-dimensional capsules $Z_i$, where $N_c$ is 12 and $D_1$ is 52. Each capsule in primary caps consists of 52 convolutional units with a 3 × 3 kernel and a stride of 1. The output of the primary caps layer therefore has [$N_c \times H \times W$] capsule outputs (each output is a 52-D vector), where $H$ and $W$ denote the height and width of the input feature representation, respectively. Each capsule in primary caps is transformed by a transformation matrix $W_{ij}$ to provide a vote:

$$\hat{Z}_{j|i} = W_{ij} Z_i$$

The second layer is the object caps (Obj caps) layer, which includes only one $D_2$-dimensional capsule with a single channel, where $D_2$ is 52. Each capsule in this layer receives the votes from the primary caps as input, and the vector outputs of this layer are computed through dynamic routing 42. The routing mechanism assigns a coefficient $r_{ij}$ to each vote $\hat{Z}_{j|i}$, determined by the iterative dynamic routing process, and computes the weighted sum over all votes as the output vector $t_j$:

$$t_j = \sum_i r_{ij} \hat{Z}_{j|i}$$

The coefficients $r_{ij}$ between capsule $i$ and all the capsules in the primary caps are determined by a "routing softmax", which enforces their probabilistic nature: each coefficient is non-negative, and their summation equals one. The routing softmax uses the log prior probabilities $b_{ij}$, which can be defined as network parameters and learned at the same time as all the other weights, to determine the coefficients $r_{ij}$:

$$r_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$$

As the length of the output vector of a capsule represents the probability that an object is present, the capsule uses a non-linear "squashing" function to ensure that each feature related to the object is represented by a length slightly less than one, while a background feature has a vector length of almost zero.
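The squashing function is defined as:

$$v_j = \frac{\lVert t_j \rVert^2}{1 + \lVert t_j \rVert^2} \, \frac{t_j}{\lVert t_j \rVert}$$

where $v_j$ is the vector output of capsule $j$ and $t_j$ is the weighted sum of the votes it receives.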
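The routing and squashing steps can be sketched in a few lines of plain Python. This is a minimal sketch that follows the standard routing-by-agreement procedure of the cited dynamic routing work 42, not the authors' code: the function names, the zero initialization of the log priors, the three routing iterations, and the choice to normalize the softmax over output capsules are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax; outputs are non-negative and sum to one."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def squash(t):
    """Shrink vector t to a length in [0, 1) while keeping its direction."""
    norm_sq = sum(c * c for c in t)
    if norm_sq == 0.0:
        return [0.0] * len(t)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * c for c in t]


def dynamic_routing(votes, num_iterations=3):
    """Route votes[i][j] (vote of input capsule i for output capsule j).

    Returns the squashed output vector v_j of every output capsule.
    """
    num_in = len(votes)
    num_out = len(votes[0])
    dim = len(votes[0][0])
    # Log priors b_ij start at zero here (an assumption for this sketch).
    b = [[0.0] * num_out for _ in range(num_in)]
    v = []
    for _ in range(num_iterations):
        # Routing softmax: coefficients r_ij for each input capsule i.
        r = [softmax(row) for row in b]
        v = []
        for j in range(num_out):
            # Weighted sum of votes, then squashing.
            t_j = [
                sum(r[i][j] * votes[i][j][d] for i in range(num_in))
                for d in range(dim)
            ]
            v.append(squash(t_j))
        # Agreement update: votes aligned with v_j raise b_ij.
        for i in range(num_in):
            for j in range(num_out):
                b[i][j] += sum(votes[i][j][d] * v[j][d] for d in range(dim))
    return v
```

For a single output capsule, as in the Obj caps layer, every coefficient under this normalization equals one, so the output is simply the squashed sum of the votes; the paper's module may normalize differently.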
A final attention mask is created by computing the length of the capsule vectors in the final layer, Obj caps. The attention mask is then multiplied with the input feature, which is the feature representation of the highest level of the backbone, to improve the feature representation. The capsule attention module is a new concept of the attention mechanism, designed to strengthen feature representation power by exploiting global context without losing spatial relations. For object detection, we adopt the capsule attention module at the highest level of FPN and FPN-based methods in order to improve feature representation and alleviate the information loss problem, which in turn improves the localization performance for objects.
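The mask computation described above amounts to a vector length per spatial location followed by an element-wise product. The sketch below assumes one object capsule vector per location and a mask that is shared across all feature channels; the function names and the nested-list layout (`features[c][h][w]`, `obj_caps[h][w]`) are hypothetical and chosen only to keep the example dependency-free.

```python
import math


def capsule_lengths(obj_caps):
    """Attention mask: the length of the object capsule vector at each (h, w)."""
    return [
        [math.sqrt(sum(c * c for c in vec)) for vec in row]
        for row in obj_caps
    ]


def apply_attention(features, mask):
    """Multiply every channel of features[c][h][w] by mask[h][w]."""
    return [
        [
            [value * mask[h][w] for w, value in enumerate(row)]
            for h, row in enumerate(channel)
        ]
        for channel in features
    ]
```

Because squashed capsule lengths lie between zero and one, locations with a confident object capsule keep their features nearly unchanged, while background locations are scaled toward zero.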
Dataset.
We evaluated the performance of the proposed NASGC-CapANet on two benchmark datasets: PASCAL VOC 43, the public dataset for the VOC2012 challenge, available at http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html, and MS COCO 44, the public dataset for the MS COCO challenge, available at https://cocodataset.org. Since it is nearly impossible to obtain informed consent for all persons present in the two Internet image datasets, the data were collected without consent. All methods on the data were performed in accordance with relevant guidelines and regulations. In order to remove privacy concerns, we cropped the head area from the image. PASCAL VOC contains 20 object classes. The union of VOC-2007 trainval and VOC-2012 trainval (10k images) was used for model training, and VOC-2007 test (4.9k images) was used for model evaluation. The performance on PASCAL VOC was evaluated using the mAP scores with an intersection over union (IoU) of 0.5. MS COCO 2017 contains 80 object classes with 118k and 5k images for training (train-2017) and evaluation (val-2017), respectively. In addition, the 20k images in test-dev do not have any disclosed labels. We conducted an ablation study and reported the final results on val-2017 and test-dev. Results for MS COCO were reported using mAP, $\text{mAP}_{50}$ (mAP scores with IoU of 0.5), and $\text{mAP}_{75}$ (mAP scores with IoU of 0.75). Here, $\text{mAP}_S$, $\text{mAP}_M$, and $\text{mAP}_L$ correspond to results on small, medium, and large scales, respectively.
Implementation details.
The NAS-gate convolutional module was designed such that it could be installed on all members of the ResNet backbone family, i.e., ResNet-50, ResNet-101, ResNeXt 45, and ResNeSt 46. However, owing to the limited GPU memory of our hardware environment, we conducted the experiments using only two backbones, i.e., ResNet-50 and ResNet-101.
Given sufficient GPU memory, however, extending the proposed methods to detector models that require more memory (e.g., Mask R-CNN and Cascade Mask R-CNN) will not be an issue, as the modules are installed in the same way. In our implementation, we replaced all the 3 × 3 convolutional operations of the ResNet-50 and ResNet-101 backbones with the NAS-gate convolutional module. Furthermore, the capsule attention module was adopted in the highest level of the FPN and PAFPN, as shown in Fig. 1b, d. We implemented our model using MMDetection 47, an open-source object detection toolbox based on PyTorch. Both of the proposed modules were implemented on two-stage detectors such as Faster R-CNN 28 and Cascade R-CNN 15 as well as one-stage detectors such as RetinaNet 6 and FCOS 7. In the experiments on PASCAL VOC, the models were trained for four epochs, with the training dataset repeated three times per epoch, and an initial learning rate of 0.01. The learning rate was multiplied by 0.1 every three epochs. Furthermore, we trained the model on MS COCO for 12 epochs with an initial learning rate of 0.02. After eight and 11 epochs, the learning rate was multiplied by 0.1. We used the SGD optimizer with a momentum of 0.9 to minimize the sum of the cross-entropy loss for the classification prediction head and the smooth L1 loss with beta = 1.0 for the bounding box prediction head. In addition, we resized the input images to the same size, i.e., 1333 × 800, and trained the model with a batch size of four images per GPU in an environment equipped with an NVIDIA Titan Xp GPU, CUDA version 10.2, and PyTorch 1.5.
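The learning-rate schedule and the box regression loss described above can be written out as a short sketch. It is not taken from the authors' MMDetection configuration: the helper names `step_lr` and `smooth_l1` and the 0-indexed epoch convention are assumptions, while the numeric values (base rates 0.01 and 0.02, decay factor 0.1, decay after epochs 8 and 11, beta = 1.0) come from the text.

```python
def step_lr(epoch, base_lr, milestones, decay=0.1):
    """Learning rate for a 0-indexed epoch.

    The rate is multiplied by `decay` once for every milestone (number of
    completed epochs) that has already been reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** drops


def smooth_l1(diff, beta=1.0):
    """Smooth L1 loss for one regression residual: quadratic near zero, linear beyond beta."""
    magnitude = abs(diff)
    if magnitude < beta:
        return 0.5 * magnitude * magnitude / beta
    return magnitude - 0.5 * beta


# MS COCO: 12 epochs, initial rate 0.02, decayed after 8 and 11 epochs.
coco_rates = [step_lr(e, 0.02, (8, 11)) for e in range(12)]

# PASCAL VOC: 4 epochs, initial rate 0.01, decayed after 3 epochs.
voc_rates = [step_lr(e, 0.01, (3,)) for e in range(4)]
```

Under this reading, the MS COCO run spends eight epochs at 0.02, three at 0.002, and the final epoch at 0.0002.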

Results
To evaluate the effectiveness of each proposed module, we compared it with existing methods based on similar concepts. For a fair comparison, each experiment used the same dataset for training and testing as well as the same software and hardware environment. In Tables 1-6, which show the experimental results, the best value for each metric is highlighted in bold.

NAS-gate convolutional module.
We examined the effectiveness of the proposed NAS-gate convolutional module on MS COCO test-dev. We evaluated the performance of the NAS-gate convolutional module incorporated in Cascade R-CNN with FPN and ResNet-101 against existing NAS backbone-based object detectors including DetNAS, AmoebaNet, and Hit-Detector. Table 1 indicates that the Cascade R-CNN with FPN with the proposed NAS-gate convolutional module implemented on ResNet-101 outperforms the other NAS backbones, with a mAP of 43.5%.

As shown in Table 3, adding the NAS-gate convolutional module improved the mAP by 2.5% and 1.7% on the ResNet-50 and ResNet-101 backbone, respectively. Furthermore, adding the capsule attention module improved the mAP by 0.6% and 0.4% on the ResNet-50 and ResNet-101 backbone, respectively. Combining the two proposed modules improved the mAP by 2.7% and 2.0% on the ResNet-50 and ResNet-101 backbone, respectively. To examine the impact of using both of the proposed modules, we implemented the proposed modules on one-stage as well as two-stage detectors. Table 4 presents the performance comparison on MS COCO test-dev between the baseline detectors and the baseline detectors equipped with the proposed modules. We tested our modules against two state-of-the-art one-stage object detectors using ResNet-101 as a backbone, i.e., RetinaNet 6 and FCOS with group normalization and without multi-scale training 7. The results indicated that using both the proposed modules could enhance the mAP by 2.0%. In the case of two-stage object detectors, we compared the performance of Faster R-CNN and Cascade R-CNN with and without our proposed modules (baseline). As indicated in Table 4, the proposed modules effectively improved the mAP by 5.5% and 1.0% when using the

Fig. 4; the baseline model could not detect the frisbee, which is a small object whose bounding box overlaps that of the dog. In contrast, the proposed modules could successfully detect the small bottle with a high probability.
In another case, as presented in the bottom rows of Fig. 4, the baseline Faster R-CNN could detect only one of the three small cars located in the background. In contrast, when only the NAS-gate convolutional module or the capsule attention module was incorporated in the baseline detector, the model could successfully recognize more of the small cars located in the background, with higher confidence than the baseline detector. As presented in Fig. 4, besides improving the detection of small objects in the images, the proposed modules enhanced the localization performance of the detector, which could predict high-quality bounding boxes that precisely cover the objects. As shown in the top row of Fig. 4, the baseline Faster R-CNN erroneously recognized one dog as two different objects (dog and human). However, after adding either one of the two proposed modules, the detector correctly detected it as a single object. When both modules were used in the baseline, the bounding box was set up correctly with higher confidence. Thus, it can be inferred that the capsule attention module and NAS-gate convolutional module can alleviate the scale variance problem and enhance object localization in the image.
Comparisons with state-of-the-art detectors.

Discussion
In this study, we proposed a new object detector, NASGC-CapANet, which combines state-of-the-art object detectors with two newly proposed modules: a NAS-gate convolutional module and a capsule attention module. The NAS-gate convolutional module replaces the standard convolutional operation of the classification network backbones and is designed to enhance the feature extraction ability of the backbone network using the gradient-based NAS method. It utilizes various conditions of convolutional operations, such as different kernel sizes and dilation rates, in order to improve the performance of the detector in recognizing objects of varied scales in the images. Furthermore, the NAS-gate convolutional module can optimize the object detector's backbone architecture at a lower computation cost than existing NAS-based object detectors. We also introduced a new concept for the attention mechanism, called the capsule attention module. The capsule attention module utilizes the global context to improve feature representation by concentrating on object-relevant features without losing spatial relationships. We adopt the capsule attention module in FPN and FPN-based methods in order to mitigate the information loss at their highest level as well as to enhance the localization of the detectors. We conducted experiments to evaluate the performance of both proposed modules and NASGC-CapANet on public object detection datasets, i.e., PASCAL VOC and MS COCO. The experimental results show that replacing the convolutional operations of the ResNet-50 and ResNet-101 backbones with the NAS-gate convolutional module outperforms NAS-based object detector backbones. In addition, adopting the capsule attention module at the highest level of FPN improves upon the performance of the existing attention mechanism. Furthermore, NASGC-CapANet, which combines both proposed modules with state-of-the-art object detectors, can significantly outperform baseline detectors.
NASGC-CapANet-based Faster R-CNN has a 5.8% higher mAP than the baseline Faster R-CNN on MS COCO test-dev. We also compared the qualitative performance of NASGC-CapANet with that of the baseline object detector. The results demonstrate that NASGC-CapANet is more accurate in terms of multiscale object recognition and localization.