Introduction

Object detection, a fundamental and challenging task in computer vision that has been widely adopted in real-world applications, aims to localize and classify multiple objects in an image. Typically, deep learning-based object detectors can be divided into two categories based on their architecture: one-stage methods1,2,3,4,5,6,7,8 such as YOLO9 and SSD10, which directly use convolutional neural networks (CNNs) to classify objects and predict their bounding boxes, and two-stage detectors11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27 such as Faster R-CNN28, which adopt a region proposal network (RPN) to extract region proposals from the CNN backbone feature map before classifying them and refining their bounding boxes. Generally, object detection systems in both categories involve three components: a backbone for basic feature extraction, a neck for fusing multi-level features, and a detection head for object classification and bounding box regression. Two-stage object detection systems have an additional component, the RPN, to propose candidate object bounding boxes. Owing to the architectural differences between the two categories, two-stage detectors achieve high localization and object recognition accuracy, whereas one-stage detectors achieve high inference speed.

Most backbone networks for detection are originally designed for classification, e.g., ResNet29 and VGG1630, with the last fully connected layers removed. For better detection accuracy, a deeper and more densely connected backbone is adopted to replace its shallower and sparsely connected counterpart. However, a classification network usually reduces the spatial resolution of the feature maps with a large downsampling factor, which is beneficial for visual classification but impedes the accurate localization of large objects and the recognition of small objects. There have been several attempts to alleviate the issues arising from scale variation and small object instances in object detection, such as proposing new backbone architectures that maintain a high spatial resolution in the deep layers31,32,33, modifying the convolution by utilizing atrous convolution26, and adopting an attention mechanism34. These approaches have achieved considerably higher detection performance. Nevertheless, they rely on hand-crafted network design, which requires expert knowledge and experience. To overcome this limitation, neural architecture search (NAS) frameworks, which automatically determine the optimal network architecture for a given task and dataset, have attracted attention, especially in computer vision tasks including object detection. For example, DetNAS33 and Hit-Detector35 have used NAS to search for a new backbone, while NAS-FPN36 and Auto-FPN37 have attempted to search the architecture of the neck (feature fusion network). However, NAS-based optimization of the object detector's backbone has difficulty in promptly evaluating the candidate models in the search space. Furthermore, the architecture of the backbone keeps changing during the search, which is computationally expensive and time-consuming. To reduce the computational cost, we propose a NAS-gate convolutional module, which optimizes only the standard convolutions of the classification network backbone by exploiting a gradient-based NAS method. We also utilize multiple kernel sizes and dilation rates as the candidate convolutional operations in the NAS search in order to improve detection performance under object scale variation. In other words, adopting the NAS-gate convolutional module in the classification network backbone can improve efficiency and overcome issues arising from object scale variation with a smaller computation load than previous NAS object detector backbones.

While a better feature extractor certainly plays an important role, considerable improvement also comes from a better design of the network architecture for the feature fusion network, or neck. The Feature Pyramid Network (FPN)36 is one of the representative architectures for the feature fusion neck and has achieved remarkable performance in object detection. FPN propagates features along a top-down path so that low-level features can be enriched with stronger semantic information from higher-level features. Furthermore, several works have optimized the FPN architecture36,37,38 to improve its feature representation. However, FPN and FPN-based methods suffer from information loss in the highest-level feature map. Although this information loss can be mitigated by combining a global context feature, this strategy loses the spatial relationships between objects in the image. Another effective approach to mitigate information loss and improve the feature representation is an attention mechanism such as SENet39 and CBAM40. An attention mechanism improves the feature representation by concentrating only on relevant features and ignoring others. However, previous attention mechanisms have exploited the global context, which again results in losing spatial relationships. In this work, we propose an attention mechanism, named the capsule attention module, which improves feature representation without losing spatial relationships. Our capsule attention module is based on a capsule network, which can encode spatial information and account for the spatial relationships between objects in an image. Using the strong spatial-relation encoding ability of the capsule network, the capsule attention module can identify stronger relationships between the underlying objects than existing attention mechanisms. Therefore, adopting the capsule attention module at the highest level of FPN or FPN-based methods can alleviate the information loss problem without losing spatial relationships, thereby improving localization ability.

We incorporated both proposed modules, i.e., the NAS-gate convolutional module and the capsule attention module, into state-of-the-art object detectors such as Faster R-CNN and Cascade R-CNN to create NASGC-CapANet. Experimental results demonstrate that NASGC-CapANet substantially improves the performance of the baseline object detectors. NASGC-CapANet-based Faster R-CNN with FPN increases mAP by 5.8%, and NASGC-CapANet-based Cascade R-CNN with PAFPN increases mAP by 1.0% on MS COCO test-dev. The main contributions of our work are summarized as follows:

  • We proposed the NAS-gate convolutional module, which applies a NAS operation based on differentiable architecture search (DARTS) with multiple kernel sizes and dilation rates to the convolutional operations of the classification backbone network, to decrease the computation cost of NAS-based backbones and alleviate issues arising from object scale variation.

  • We introduced a capsule attention module, based on a capsule network, to improve the feature representation by mitigating the information loss problem of FPN using the strong spatial-relation encoding ability of the capsule network.

  • We evaluated the performance of both proposed modules and their incorporation into state-of-the-art object detectors, NASGC-CapANet, on MS COCO and PASCAL VOC. The experimental results show that NASGC-CapANet considerably improves detection performance compared to state-of-the-art baseline object detectors.

Method and experiment

In this section, we describe the architecture design of the proposed NASGC-CapANet, which combines state-of-the-art object detectors with our proposed modules. In general, the NAS-gate convolutional module and the capsule attention module can both be incorporated into one-stage as well as two-stage object detectors. However, most studies in this domain focus on two-stage detectors such as Faster R-CNN28 and Cascade R-CNN15. To mitigate the problems arising from object scale variation, we optimize the feature extraction ability of the backbone by replacing the standard convolutions of the classification backbone network with the proposed NAS-gate convolutional module, which is based on a neural architecture search method; this increases detection performance on multiscale objects at a smaller computation cost than NAS-based object detector backbones. To enhance the localization ability of the object detectors, we improve the feature representation of the feature fusion network, or neck, by alleviating the information loss at the highest feature level with the capsule attention module. The capsule attention module was designed to be incorporated into feature fusion networks, i.e., FPN and FPN-based methods such as PAFPN from PANet, whose architectures are shown in Fig. 1a, c, respectively. The capsule attention module is adopted at the highest level of FPN and PAFPN, as shown in Fig. 1b, d, respectively.

Figure 1. Architecture of feature fusion networks with and without the capsule attention module—(a) FPN; (b) FPN with capsule attention module; (c) PAFPN; (d) PAFPN with capsule attention module.

NAS-gate convolutional module

The NAS-gate convolutional module was designed to explore the object detector's backbone architecture by optimizing the standard convolution using a neural architecture search approach. Generally, NAS automatically finds an optimal network architecture for a given task and dataset. The NAS domain involves three main approaches: reinforcement learning-based methods, which train a recurrent neural network (RNN) controller to generate the cell structure and form the CNN architecture; evolutionary algorithm (EA)-based methods, which update the architecture or network by mutating the current best architectures; and gradient-based methods, used in the NAS-gate convolutional module, which define architecture parameters for a continuous relaxation of the search space, thereby allowing differentiable optimization of the architecture search to accelerate the search process. Specifically, the NAS-gate convolutional module uses the gradient-based method named Differentiable ARchiTecture Search (DARTS)41 to search for the optimal configuration of the convolutional operations of the object detector's backbone.

In order to mitigate the computational infeasibility and time consumption of the NAS-based object detector backbone, we utilized the NAS method to search for the optimal convolutional operation of classification network backbones such as ResNet-50 and ResNet-101, instead of searching for new backbone architectures. In the NAS-gate convolutional module, each \(3 \times 3\) convolutional operation was defined as the computation cell within which the NAS operation searched for the final backbone architecture. Each cell was regarded as a directed acyclic graph, which was formed by sequentially connecting N nodes. Each node \(y^{(i)}\) was a feature representation in convolutional networks, and each directed edge (i, j) was associated with some operation \(p^{(i,j)}\) that transformed \(y^{(i)}\). The output of each node was obtained by the summation of the transformed \(y^{(i)}\) with operations \(p^{(i,j)}\):

$$\begin{aligned} {{\hat{y}}}^{(j)} = \sum _{i<j}p^{(i,j)}(y^{(i)}) \end{aligned}$$
(1)
Figure 2. Mixing operation of the NAS-gate convolutional module for generating the optimal convolutional operation of the final backbone architecture.

In order to optimize the standard convolution of the object detector backbone, we define the node \(y^{(i)}\) as the input feature representation of the \(3 \times 3\) convolutional operations in ResNet-50 and ResNet-101, and P as the set of candidate operations, where each operation represents a function \(p(\cdot )\) to be applied to \(y^{(i)}\). As objects of differing scales require different kernel sizes or dilation rates in the convolutional operation to effectively extract features of scale-variant objects in the images, we utilized two kernel sizes and two dilation rates for the candidate operations in P, as follows:

  • \(3\times 3\) dilated convolution with rate 1,

  • \(3\times 3\) dilated convolution with rate 2,

  • \(5\times 5\) dilated convolution with rate 1,

  • \(5\times 5\) dilated convolution with rate 2.

To make the search space continuous, the categorical choice of a particular operation was relaxed to a softmax over all possible operations:

$$\begin{aligned} p^{(i,j)}(y) = \sum _{p\in P } \dfrac{exp(\gamma ^p)}{\sum _{p'\in P } exp(\gamma ^{p'})} p(y), \end{aligned}$$
(2)

where the operation mixing weights for each edge (i, j) are parameterized by a vector \(\gamma\), as shown in Fig. 2. The aim of the architecture search is then to learn the set of variables \(\gamma\). At the end of the DARTS search, a discrete architecture is obtained by replacing each mixed operation \(p^{(i,j)}(y)\) with the most likely operation (i.e., \(p^{(i,j)} = argmax_{p\in P}\,\gamma ^p\)). However, selecting only one operation for each node can decrease the efficiency of feature extraction, because a single convolutional operation cannot extract features of scale-variant objects in the images as effectively as the mixing operation, which combines multiple convolutional operation options with the weight parameters \(\gamma\). The mixing operation can provide a feature representation that contains important features for objects of all scales via this combination process. Therefore, we did not convert the final architecture to a discrete one; instead, we retained the mixing operation at the end of the search and during training. Furthermore, in DARTS the variables \(\gamma\) are updated through gradient descent on the validation loss. However, updating the parameter \(\gamma\) by optimizing the validation loss is time-consuming before the optimal architecture is obtained. Therefore, we updated the parameter \(\gamma\) by optimizing the training loss, \(L_{train}(w,\gamma )\):

$$\begin{aligned} L_{train}(w,\gamma ) = L_{cls}(w,\gamma ) + L_{loc}(w,\gamma ), \end{aligned}$$
(3)

where \(L_{cls}(w,\gamma )\) represents the loss function for object classification and \(L_{loc}(w,\gamma )\) the loss function for bounding box localization. Both losses depend on the network weights w and the operation mixing weights \(\gamma\). Algorithm 1 summarizes the NAS-gate convolutional module search procedure.

Algorithm 1. NAS-gate convolutional module search algorithm.
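To make the search concrete, the following is a minimal PyTorch sketch of a NAS-gate convolutional module (class and parameter names are ours, not from the released code). It replaces a single \(3 \times 3\) convolution with a softmax-weighted mixture of the four candidate operations listed above; the mixing weights \(\gamma\) are learned jointly with the network weights, and the mixture is kept at the end of the search instead of being discretized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NASGateConv(nn.Module):
    """Mixture of the four candidate convolutions weighted by softmax(gamma) (Eq. 2)."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Candidate operations: two kernel sizes x two dilation rates.
        # Padding is chosen so every branch produces the same spatial size.
        self.ops = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, 3, stride, padding=1, dilation=1, bias=False),
            nn.Conv2d(in_channels, out_channels, 3, stride, padding=2, dilation=2, bias=False),
            nn.Conv2d(in_channels, out_channels, 5, stride, padding=2, dilation=1, bias=False),
            nn.Conv2d(in_channels, out_channels, 5, stride, padding=4, dilation=2, bias=False),
        ])
        # Architecture (mixing) parameters gamma, one scalar per candidate operation.
        self.gamma = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, y):
        # Softmax over the candidates, then a weighted sum of the branch outputs.
        weights = F.softmax(self.gamma, dim=0)
        return sum(w * op(y) for w, op in zip(weights, self.ops))
```

In the backbone, each original \(3 \times 3\) convolution (e.g., the middle convolution of a ResNet bottleneck) would be replaced by such a module with the same channel and stride settings.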

Capsule attention module

In this subsection, we present the details of our proposed capsule attention module. The capsule attention module is designed based on the structure of the capsule network, or CapNet42. CapNet was proposed to overcome challenges faced by convolutional neural networks (CNNs), specifically the loss of information via the pooling process, sensitivity to object orientation, and difficulty in transferring the understanding of geometric relationships to new viewpoints. Therefore, its architecture and optimization process differ from those of a CNN. A capsule in a CapNet is a group of neurons that uses a vector to represent the instantiation parameters of a specific type of entity, such as an object or object part. The length of a capsule vector represents the probability that the object exists in the image, while the direction of the vector represents the corresponding pose information. Therefore, CapNet is more robust to changes in the orientation and size of the input. Furthermore, CapNet can encode spatial information and account for the spatial relations between the parts of an image. Accordingly, in the proposed capsule attention module, we exploit these abilities of CapNet to generate an attention mask that improves the feature representation by emphasizing object-related features and suppressing unrelated ones. In addition, we utilize the capsule attention module in FPN and FPN-based methods in order to increase the performance of the detectors by alleviating the highest-level information loss problem and enhancing the feature representation with strong semantic information, especially the spatial information and the spatial relationships between the objects in the images.

Figure 3. Architecture of the capsule attention module.

The proposed capsule attention module consists of two layers of capsules, as illustrated in Fig. 3. The first layer of capsules, named primary caps, reformulates the input feature representation, i.e., the feature representation of the highest level of the backbone, into \(N_c\) channels of convolutional \(D_1\)-dimensional capsules \(\mathbf{Z }_i\), where \(N_c\) is set to 12 and \(D_1\) to 52. Each capsule in primary caps consists of 52 convolutional units with a \(3 \times 3\) kernel and a stride of 1. The output of the primary caps layer thus contains [\(N_c\times H \times W\)] capsule outputs (each a 52-D vector), where H and W denote the height and width of the input feature representation, respectively. Each capsule in primary caps is transformed with a transformation matrix \(W_{ij}\) to provide a vote:

$$\begin{aligned} \hat{\mathbf{Z }}_{j|i} = W_{ij}\mathbf{Z }_i \end{aligned}$$
(4)

The second layer is the object caps (Obj caps) layer, which includes only one \(D_2\)-dimensional capsule with a single channel, where \(D_2\) is set to 52. Each capsule in this layer receives the votes from the primary caps as input, and the vector outputs of this layer are computed through dynamic routing42. The routing mechanism assigns a coefficient \(r_{ij}\) to each vote \(\hat{\mathbf{Z }}_{j|i}\), determined by the iterative dynamic routing process, and computes the output vector \(\mathbf{t }_j\) as a weighted sum over all votes:

$$\begin{aligned} \mathbf{t }_j = \sum _{i} r_{ij}\hat{\mathbf{Z }}_{j|i} \end{aligned}$$
(5)

The coefficients \(r_{ij}\) between capsule i in the primary caps and the capsules in the Obj caps layer are determined by a "routing softmax", which enforces their probabilistic nature: each \(r_{ij}\) is non-negative and the coefficients sum to one. The routing softmax uses the log prior probabilities \(b_{ij}\), which can be defined as network parameters and learned at the same time as all the other weights, to determine the coefficients \(r_{ij}\):

$$\begin{aligned} r_{ij} = \dfrac{exp(b_{ij})}{\sum _{k} exp(b_{ik})} \end{aligned}$$
(6)

As the length of a capsule's output vector represents the probability that an object is present, the capsule uses a non-linear "squashing" function to ensure that each feature related to the object is represented by a vector of length slightly less than one, while a background feature has a vector of length close to zero. The squashing function is defined as:

$$\begin{aligned} \mathbf{v }_j = \dfrac{\left| \left| \mathbf{t }_j\right| \right| ^2}{1+ \left| \left| \mathbf{t }_j\right| \right| ^2}\dfrac{\mathbf{t }_j}{\left| \left| \mathbf{t }_j\right| \right| } \end{aligned}$$
(7)

where \(\mathbf{v }_j\) is the vector output of capsule j.

A final attention mask is created by computing the length of the capsule vectors in the final layer (Obj caps), and this mask is multiplied with the input feature, i.e., the highest-level feature representation of the backbone, to improve the feature representation.
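The following PyTorch sketch summarizes the computation described above, assuming one Obj caps capsule per spatial location and routing across the \(N_c\) primary capsule types; the class names, routing-iteration count, and per-location routing choice are our assumptions rather than details from the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(t, dim, eps=1e-8):
    # Non-linear squashing (Eq. 7): preserves direction, maps the length into [0, 1).
    norm_sq = (t ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * t / (norm_sq.sqrt() + eps)

class CapsuleAttention(nn.Module):
    """Sketch of the capsule attention module: primary caps -> votes -> routing -> mask."""

    def __init__(self, in_channels, num_caps=12, caps_dim=52, routing_iters=3):
        super().__init__()
        self.num_caps, self.caps_dim = num_caps, caps_dim
        # Primary caps: N_c = 12 channels of 52-D convolutional capsules (3x3, stride 1).
        self.primary = nn.Conv2d(in_channels, num_caps * caps_dim, 3, stride=1, padding=1)
        # Transformation matrices W_ij producing the votes (Eq. 4), one per capsule type.
        self.W = nn.Parameter(0.01 * torch.randn(num_caps, caps_dim, caps_dim))
        self.routing_iters = routing_iters

    def forward(self, x):                                    # x: [B, C, H, W]
        B, _, H, W = x.shape
        z = self.primary(x).view(B, self.num_caps, self.caps_dim, H, W)
        z = squash(z, dim=2)                                 # primary capsule vectors Z_i
        # Votes z_hat_{j|i} = W_ij Z_i, computed per capsule type and spatial position.
        z_hat = torch.einsum('idk,bikhw->bidhw', self.W, z)
        # Dynamic routing (Eqs. 5-6): coefficients from a softmax over the log priors b.
        b = torch.zeros(B, self.num_caps, 1, H, W, device=x.device)
        for _ in range(self.routing_iters):
            r = F.softmax(b, dim=1)                          # routing coefficients r_ij
            t = (r * z_hat).sum(dim=1, keepdim=True)         # weighted sum of votes (Eq. 5)
            v = squash(t, dim=2)                             # Obj caps output vector (Eq. 7)
            b = b + (z_hat * v).sum(dim=2, keepdim=True)     # agreement update of the priors
        mask = v.norm(dim=2)                                 # vector length -> attention mask
        return x * mask                                      # re-weighted input feature map
```

Because the squashed Obj caps vector already has length in [0, 1), its length at each position can be used directly as the attention value without extra normalization.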

The capsule attention module is a new type of attention mechanism, designed to strengthen the feature representation power by exploiting the global context without losing spatial relationships. For object detection, we adopt the capsule attention module at the highest level of FPN and FPN-based methods in order to improve the feature representation and alleviate the information loss problem, thereby improving object localization performance.
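As an illustration of how this is wired in, a minimal (hypothetical) wrapper that re-weights only the highest-level FPN output, as in Fig. 1b, could look like the following; in our experiments the same operation is implemented inside the MMDetection neck.

```python
# `fpn_outs` is assumed to be the tuple of FPN feature maps ordered from the
# lowest to the highest level (e.g., P2..P5 with 256 channels each).
cap_att = CapsuleAttention(in_channels=256)

def apply_capsule_attention(fpn_outs):
    outs = list(fpn_outs)
    outs[-1] = cap_att(outs[-1])   # capsule attention only on the highest-level map
    return tuple(outs)
```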

Dataset

We evaluated the performance of the proposed NASGC-CapANet on two benchmark datasets: PASCAL VOC43, the public dataset of the VOC2012 challenge, available at http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html, and MS COCO44, the public dataset of the MS COCO challenge, available at https://cocodataset.org. Since it is nearly impossible to obtain informed consent for all persons present in the two Internet image datasets, the data were collected without consent. All methods on the data were performed in accordance with relevant guidelines and regulations. In order to remove privacy concerns, we cropped the head area from the images. PASCAL VOC contains 20 object classes. The union of VOC-2007 trainval and VOC-2012 trainval (10k images) was used for model training, and VOC-2007 test (4.9k images) was used for model evaluation. The performance on PASCAL VOC was evaluated using mAP scores with an intersection over union (IoU) of 0.5. MS COCO 2017 contains 80 object classes with 118k and 5k images for training (train-2017) and evaluation (val-2017), respectively. In addition, the 20k images in test-dev do not have publicly disclosed labels. We conducted an ablation study and reported the final result for val-2017 and test-dev. Results for MS COCO were reported using mAP, mAP50 (mAP scores with IoU of 0.5), and mAP75 (mAP scores with IoU of 0.75). Here, mAPS, mAPM, and mAPL correspond to results on small, medium, and large scales, respectively.
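For reference, the COCO metrics listed above are the standard outputs of the official pycocotools evaluation; a typical evaluation call on val-2017 (file paths are placeholders) looks like this:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground-truth annotations and the detector's results in COCO JSON format.
coco_gt = COCO('annotations/instances_val2017.json')
coco_dt = coco_gt.loadRes('nasgc_capanet_val2017_results.json')

coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()   # prints mAP, mAP50, mAP75, mAPS, mAPM, and mAPL
```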

Implementation details

The NAS-gate convolutional module was designed such that it can be installed on all members of the ResNet backbone family, i.e., ResNet-50, ResNet-101, ResNeXt45, and ResNeSt46. However, owing to the limited GPU memory of our hardware environment, we conducted the experiments using only two backbones, i.e., ResNet-50 and ResNet-101. Provided that sufficient GPU memory is available, extending the proposed methods to detector models that require more memory (e.g., Mask R-CNN and Cascade Mask R-CNN) is not an issue, as the module installation is identical. In our implementation, we replaced all the \(3\times 3\) convolutional operations of the ResNet-50 and ResNet-101 backbones with the NAS-gate convolutional module. Furthermore, the capsule attention module was adopted at the highest level of FPN and PAFPN, as shown in Fig. 1b, d. We implemented our model using MMDetection47, an open-source object detection toolbox based on PyTorch. Both proposed modules were implemented on two-stage detectors such as Faster R-CNN28 and Cascade R-CNN15 as well as one-stage detectors such as RetinaNet6 and FCOS7. In the experiments on PASCAL VOC, the models were trained for four epochs, with the training dataset repeated three times per epoch, using an initial learning rate of 0.01 that was multiplied by 0.1 every three epochs. On MS COCO, we trained the models for 12 epochs with an initial learning rate of 0.02 that was multiplied by 0.1 after epochs eight and 11. We used the SGD optimizer with a momentum of 0.9 to minimize the sum of the cross-entropy loss for the classification head and the smooth L1 loss (beta = 1.0) for the bounding box regression head. In addition, we resized the input images to the same size, i.e., 1333 \(\times\) 800, and trained the models with a batch size of four images per GPU in an environment equipped with an NVIDIA Titan Xp GPU, CUDA version 10.2, and PyTorch 1.5.
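The MS COCO schedule above corresponds to the following MMDetection-style configuration fragment (a sketch assuming MMDetection 2.x conventions; the base config name is a placeholder, and fields not mentioned in the text, such as weight decay, are left at the toolbox defaults):

```python
_base_ = './faster_rcnn_r50_fpn_1x_coco.py'   # hypothetical baseline config

# SGD with momentum 0.9, initial learning rate 0.02 (MS COCO).
optimizer = dict(type='SGD', lr=0.02, momentum=0.9)

# 12-epoch schedule; multiply the learning rate by 0.1 after epochs 8 and 11.
lr_config = dict(policy='step', step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)

# Four images per GPU; inputs resized to 1333 x 800 by the default train pipeline.
data = dict(samples_per_gpu=4)
```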

Results

In order to evaluate the effectiveness of each proposed module, we conducted experiments comparing it against existing methods with similar concepts, using the same dataset for training and testing and the same software and hardware environment in each experiment for a fair comparison. Moreover, in Tables 1–6 showing the experimental results, the best value for each metric is highlighted in bold.

NAS-gate convolutional module

We examined the effectiveness of the proposed NAS-gate convolutional module on MS COCO test-dev. We evaluated the performance of the NAS-gate convolutional module incorporated in Cascade R-CNN with FPN and ResNet-101 against existing NAS backbone-based object detectors, including DetNAS, AmoebaNet, and Hit-Detector. Table 1 indicates that Cascade R-CNN with FPN and the proposed NAS-gate convolutional module implemented on ResNet-101 outperforms the other NAS backbones, with an mAP of 43.5%.

Table 1 Comparison of the performance of the NAS-gate convolutional module and other NAS backbone frameworks on MS COCO test-dev.
Table 2 Comparison of the performance of the capsule attention module with other attention modules on MS COCO val-2017.
Table 3 Effect of each proposed module on MS COCO val-2017.

Capsule attention module

To validate the effectiveness of the proposed capsule attention module, we evaluated the performance of a baseline Faster R-CNN with FPN and two different backbones (ResNet-50 and ResNet-101) equipped with various existing attention mechanisms (CBAM40 and the SE module from SENet39) as well as the proposed capsule attention module on MS COCO val-2017. The competing attention modules (CBAM and SE module) were adopted at the highest level of FPN, just like the capsule attention module in Fig. 1b. It can be observed from Table 2 that the capsule attention module enhanced the mAP of the baseline detectors on both the ResNet-50 and ResNet-101 backbones while outperforming the other attention modules in terms of mAP.

Impact of proposed modules on the quantitative performance

To evaluate the impact of the proposed NAS-gate convolutional module and capsule attention module on MS COCO val-2017, we compared the box mAP of the baseline detectors (Faster R-CNN with FPN using ResNet-50 and ResNet-101 as backbones) with that of the baseline detectors incorporating the proposed modules. As presented in Table 3, adding the NAS-gate convolutional module improved the mAP by 2.5% and 1.7% on the ResNet-50 and ResNet-101 backbones, respectively. Furthermore, adding the capsule attention module improved the mAP by 0.6% and 0.4% on the ResNet-50 and ResNet-101 backbones, respectively. Combining the two proposed modules improved the mAP by 2.7% and 2.0% on the ResNet-50 and ResNet-101 backbones, respectively.

Table 4 Comparison of the results obtained using the proposed module for MS COCO test-dev with different detectors.
Table 5 State-of-the-art comparison on PASCAL VOC - VOC-2007 test for bounding box object detection.
Table 6 State-of-the-art comparison on MS COCO test-dev for bounding box object detection.

To examine the impact of using both proposed modules together, we implemented them on one-stage as well as two-stage detectors. Table 4 presents the performance comparison on MS COCO test-dev between the baseline detectors and the baseline detectors with the proposed modules attached. We tested our modules on two state-of-the-art one-stage object detectors using ResNet-101 as a backbone, i.e., RetinaNet6 and FCOS with group normalization and without multi-scale training7. The results indicated that using both proposed modules could enhance the mAP by 2.0%. In the case of two-stage object detectors, we compared the performance of Faster R-CNN and Cascade R-CNN with and without our proposed modules (baseline). As indicated in Table 4, the proposed modules effectively improved the mAP by 5.5% and 1.0% when using Faster R-CNN and Cascade R-CNN, respectively.

Figure 4. Visualization of the impact of the proposed NAS-gate convolutional module and capsule attention module on detecting small objects in the image.

Figure 5. Visualization of the impact of the proposed NAS-gate convolutional module and capsule attention module.

Impact of proposed modules on the qualitative performance

For a qualitative analysis of the proposed modules, we visualized the impact of each module, as shown in Figs. 4 and 5. The result of the baseline Faster R-CNN is shown in the upper row of Fig. 4; the baseline model could not detect the frisbee, a small object whose bounding box overlaps with that of the dog. In contrast, the proposed modules could successfully detect this small object with high confidence. In another case, presented in the bottom rows of Fig. 4, the baseline Faster R-CNN could detect only one of the three small cars located in the background. In contrast, when only the NAS-gate convolutional module or the capsule attention module was incorporated in the baseline detector, the model could recognize more of the small background cars with higher confidence than the baseline detector. As presented in Fig. 4, besides improving the detection of small objects, the proposed modules also enhanced the localization performance of the detector, which could predict high-quality bounding boxes that precisely cover the objects. As shown in the top row of Fig. 4, the baseline Faster R-CNN erroneously recognized one dog as two different objects (dog and human). However, after adding only one of our two proposed modules, the detector correctly detected it as a single object, and when both modules were used, the bounding box was placed correctly with higher confidence. Thus, it can be inferred that the capsule attention module and NAS-gate convolutional module can alleviate the scale variance problem and enhance object localization in the image.

Comparisons with state-of-the-art detectors

The main results of NASGC-CapANet are summarized in Tables 5 and 6. We compared the performance of the proposed modules with that of state-of-the-art detectors on MS COCO test-dev and the PASCAL VOC-2007 test set. The performance figures of the state-of-the-art detectors were taken from the results reported in the original publications. To compare performance on PASCAL VOC, we implemented the proposed modules on state-of-the-art detectors, namely Faster R-CNN with FPN and Faster R-CNN with PAFPN. Similarly, to evaluate performance on MS COCO, we used Faster R-CNN with FPN, Faster R-CNN with PAFPN, and Cascade R-CNN with PAFPN. All the models utilized either ResNet-50 or ResNet-101 as a backbone. The presented results are divided into four categories, i.e., one-stage detectors; two-stage detectors; NAS-based backbone detectors, which are similar to the proposed backbone; and the proposed approach, NASGC-CapANet. As summarized in Tables 5 and 6, the proposed NASGC-CapANet achieves an mAP50 of 82.70% on the PASCAL VOC-2007 test set and an mAP of 43.8% on MS COCO test-dev.

Discussion

In this study, we proposed a new object detector, NASGC-CapANet, which combines state-of-the-art object detectors with two newly proposed modules: a NAS-gate convolutional module and a capsule attention module. The NAS-gate convolutional module replaces the standard convolutional operations of the classification network backbones and is designed to enhance the feature extraction ability of the backbone network using a gradient-based NAS method. It utilizes various convolutional operation settings, such as different kernel sizes and dilation rates, in order to improve the detector's ability to recognize objects of varied scales in the images. Furthermore, the NAS-gate convolutional module can optimize the object detector's backbone architecture with a lower computation cost than existing NAS-based object detector backbones. We also introduced a new type of attention mechanism, called the capsule attention module, which utilizes the global context to improve the feature representation by concentrating on object-relevant features without losing spatial relationships. We adopt the capsule attention module in FPN and FPN-based methods in order to mitigate the information loss at their highest level as well as enhance the localization ability of the detectors.

We conducted experiments to evaluate the performance of both proposed modules and NASGC-CapANet on public object detection datasets, i.e., PASCAL VOC and MS COCO. The experimental results show that replacing the convolutional operations of the ResNet-50 and ResNet-101 backbones with the NAS-gate convolutional module outperforms existing NAS-based object detector backbones. In addition, adopting the capsule attention module at the highest level of FPN outperforms existing attention mechanisms. Furthermore, NASGC-CapANet, which combines both proposed modules with state-of-the-art object detectors, significantly outperforms the baseline detectors; NASGC-CapANet-based Faster R-CNN achieves a 5.8% higher mAP than the baseline Faster R-CNN on MS COCO test-dev. We also compared the qualitative performance of NASGC-CapANet with that of the baseline object detectors. The results demonstrate that NASGC-CapANet is more accurate in terms of multiscale object recognition and localization.