Introduction

Designing convolutional neural network (CNN) models requires extensive experimentation. Besides exploring machine learning parameters such as the learning rate and momentum, different hyperparameters must be considered for each layer. For example, for a convolution layer, one needs to explore the filter size, the stride, the initialization approach, etc.; for dropout layers, the dropout rate can be varied; and for pooling layers, the pooling size and pooling operation, e.g., max, average, or global average, are considered. Exploring these parameters is time consuming and needs to be done systematically to ease the model construction process.

Besides exploring the design of each layer, one needs to find the right structure for the specific recognition task. Searching over all possible structures is computationally intensive and resource consuming. If the architectures are restricted to specific forms, searching for the right connections becomes feasible within acceptable accuracy and time.

In this research, we study a methodology for automatic exploration of micro-architecture types. Given a baseline architecture, our approach automatically searches for model modifications that obtain good accuracy. We demonstrate the approach using the vanilla SqueezeNet1 since it is a small architecture that can still be expanded. The proposed approach enables the neural network designer to explore models to answer typical design questions such as:

  • How many convolution blocks are needed?

  • How many input features should be used, and how should they be merged?

  • Should skip connections be applied, and where?

  • Should an optimization block be inserted, and where should it be placed?

A flexible design of SqueezeNet, called SqueezeNetSEMAuto, is presented as a prototype. It is the traditional SqueezeNet with a flexible length, selectable SE block insertion and skip connections, and a selectable merging operation for more than one input. It can also be integrated with hyperparameter exploration for each layer. The source of the prototype models is available at https://github.com/cchanra/SqueezeNetSEMAuto.

The structure of the paper is as follows. Next, we introduce the background and literature review. In Section “Design methodology”, the methodology of the flexible design is described. Section “Experiments” presents the experimental methods and results. Section “Other applications” applies the approach to other tasks. Finally, Section “Conclusion and future work” concludes the work and outlines future work.

Background

In this section, we introduce convolutional neural networks and some previous models. Next, the complexity of various levels of model exploration is explained. Finally, we present a literature review of model exploration.

Convolutional neural networks

Convolutional neural networks (CNNs)2 are designed around convolution operations. They have been applied to many high-impact image processing tasks such as medical imaging3,4 and self-driving cars5,6. A CNN consists of convolutional layers connected in a certain manner. Each convolutional layer applies a fixed-size filter to its input matrices or tensors; the operations involved are multiplications and additions.

A typical CNN model consists of the following layers: convolution, pooling, dropout, activation, fully connected, etc. These layers are connected in certain manners. Convolution layers have important parameters such as the filter, stride, and padding sizes. The filter and stride sizes determine the output feature map size, and the number of consecutive convolution layers determines the receptive field size.

Pooling layers are used to reduce the feature size. They can also help select outstanding features. Their typical parameters are the pooling and stride sizes, and pooling operations include max, average, and global average pooling. Different combinations of convolutions and pooling produce different feature map values and sizes.

The dropout layer is used to reduce overfitting. Its typical parameter is the dropout rate, ranging between 0 and 1, which needs to be examined as well.

The activation function is applied to transform the output values. Activation functions are selected depending on the task, e.g., ReLU, LeakyReLU, Sigmoid, Tanh, and many others. Fully connected (FC) layers are usually attached at the end to produce the classification outputs. There can be more than one fully connected layer depending on the design and the number of classification outputs.

Besides the above-mentioned aspects, which are called hyperparameters of the CNN, the learning rate and optimization function also affect the final accuracy. Common learning rate values are 0.1, 0.01, 0.001, or 0.0001. A very small learning rate leads to slow convergence and may get stuck in local minima, while a large learning rate can overshoot the globally optimal values. Sometimes the learning rate is scheduled as a function. The optimization functions can be Adam, RMSProp, Stochastic Gradient Descent (SGD), Adagrad, etc. Different optimizers yield different gradients and update directions toward the optimal weights.

State-of-the-art architectures

Various CNN modules have been proposed to deepen networks and increase accuracy. Some modules can reduce the computation and the model size while maintaining accuracy.

AlexNet7, one of the popular pre-trained models, was developed by Krizhevsky et al. It was trained on the ImageNet dataset of ILSVRC-2010 and ILSVRC-2012, with 1.2 million images in 1,000 categories. The architecture contains eight learnable layers: the first five are convolutional and the remaining three are fully connected (called fc6, fc7, fc8).

The reference CaffeNet8 is a variant of AlexNet, except that the max pooling layer precedes the local response normalization (LRN) to reduce memory usage. GoogLeNet9 is a deep convolutional neural network structure. It was used as a classification and detection tool in ILSVRC14 with the goal of working with small data sets and using little computing power and memory. It employs an inception module that simultaneously computes 1×1, 3×3, and 5×5 convolutions, enabling the proper filter sizes to be selected automatically. It was trained for the ILSVRC 2014 challenge to classify images into one of 1,000 leaf-node categories; the ImageNet dataset consists of over 1.2 million training images, 50,000 validation images, and 100,000 testing images.

SqueezeNet1 aims to improve AlexNet efficiency while holding the same level of accuracy. A minimized CNN has advantages: it saves communication time between the server and clients for over-the-air updates, and it is feasible for embedded-device deployment. SqueezeNet utilizes methods such as reducing filter sizes, reducing the number of input channels, and delaying downsampling. It was trained on ILSVRC-2012 ImageNet, and its design focuses on achieving a smaller model size while keeping the same accuracy as AlexNet.

VGG improved accuracy by adding more convolutional layers and removing LRN layers; it was trained on ImageNet for ILSVRC-201410. The model has variants with 11, 13, 16, and 19 layers, making the number of parameters vary between 133 and 144 million. Training used the ILSVRC-2012 ImageNet data (1.3 million training images, 50K validation images, and 100K testing images).

ResNet was one of the first models to contain very many layers. In particular, it consists of many consecutive convolutional blocks, each forming a residual block designed to mitigate the vanishing or exploding gradient problem. The network won the ILSVRC competition in 2015. It has variations such as ResNet50, ResNet101, and ResNet152, and it may be combined with modules from GoogLeNet, e.g., Inception-ResNet.

SENet11 is based on two subsequent operations, squeeze and excitation. The squeeze operation combines each feature map across the spatial dimensions \(H\times W\) to obtain a channel descriptor. The excitation operation captures channel dependency, learns the relationships between channels, and applies the resulting activation to each channel.

To utilize these pre-trained models, transfer learning is a common approach that transfers knowledge from a source model to a target model12. For image applications, image features such as edges and shapes are learned in the early layers; the later fully connected layers are then fine-tuned for the specific task. Transfer learning is useful when the target data set is smaller than the source data set and when the target images are similar in nature to the source images. The more similar they are, the fewer layers need to be fine-tuned. A small learning rate should be used with pre-trained models so as not to destroy the previously learned features.

CNN architecture design

The typical CNN architectures above need to be adjusted when they are applied to a new dataset. At a small scale, one fine-tunes the hyperparameters of each layer. At a medium scale, the architecture is adjusted, for example, by adding connections to merge features from different scales or by adding convolutional blocks to reduce the feature map size. At a large scale, the whole architecture can be changed, for example, by switching to a transformer or sequence-to-sequence architecture.

In this paper, the small-scale changes are called hyperparameter tuning, and the medium-scale connection structures are called the micro-architecture. Previous work13 showed that using transfer learning to fine-tune different conventional architectures to a new task does not lead to a significant change in model accuracy. Thus, we are interested in changes at the micro-architecture level. Adjustments at this level can lead to accuracy improvements and parameters suitable for a specific classification task. Hyperparameters can also be explored during the micro-architecture search; this is sometimes called CNN optimization.

At the micro-architecture level, the choices among different kinds of layers (e.g., convolutional, pooling, and classification layers) are explored. These layers may be combined into modules for certain purposes. Within a module, the convolutional layers play the dominant role in hierarchically extracting meaningful features. As a result, effective micro-architecture optimization primarily involves utilizing different types of modules to improve accuracy while maintaining the network size.

Inception module It is based on GoogLeNet9. The purpose of the module is to increase the model depth and make the network wider to allow parallel computation and increase accuracy. The module factorizes a large convolution into smaller ones to reduce the total amount of computation and the model size. For example, Inception-V1 computes 1x1, 3x3, and 5x5 convolutions and 3x3 max pooling simultaneously and concatenates their results. Figure 1 is an example of the first version of the Inception module. The simultaneous computation of these convolutions can speed up training significantly even though the network is very deep, and the use of various filter sizes enables the proper filter sizes to be selected automatically. In Inception-V2, the module is made wider with several small filters, 3x3, 1x3, 3x1, etc. The 1x1 filters are used to reshape the feature maps and change the dimensionality. The variations of the module differ in how convolutions are factored to increase depth and accuracy. GoogLeNet's training time is faster than that of previous networks.

Figure 1. Inception module example9.
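
To make the parallel-branch idea concrete, the following is a minimal TensorFlow/Keras sketch of an Inception-V1-style module as described above; the filter counts are illustrative assumptions, and the 1x1 reduction convolutions placed before the larger filters in the full design are omitted for brevity.

```python
# Minimal Inception-V1-style module sketch (illustrative filter counts).
from tensorflow.keras import layers

def inception_module(x, f1=64, f3=128, f5=32, f_pool=32):
    # Parallel branches with different receptive fields.
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(x)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(f_pool, 1, padding="same", activation="relu")(bp)
    # Concatenate the branch outputs along the channel dimension.
    return layers.Concatenate(axis=-1)([b1, b3, b5, bp])
```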

Residual block The residual block, used in ResNet14 and shown in Fig. 2, was proposed to make the network deeper. In particular, the mapping \(F(x)+x\) is formed, where the identity shortcut allows the feature output from previous layers to be transferred directly. F(x) is called the residual function, which may consist of several convolution layers; during learning, the network approximates the residual F(x) rather than the full mapping directly. The idea is to solve the problem of accuracy degradation when increasing the depth of the network. The shortcut is depicted by the edge shown in Fig. 2.

Making the network deeper this way helps prevent accuracy degradation and exposes more opportunities to improve accuracy by gradually nudging the model toward the underlying function instead of only skipping unneeded layers. The depth of the model can be maintained by having subsequent layers perform an exact identity mapping, i.e., by driving the weights of their residual functions to zero. Otherwise, the residual function corrects the small remaining prediction error by finding an optimal function close to the identity mapping.

Figure 2. Residual block14.
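
As an illustration of the identity mapping \(F(x)+x\), here is a minimal Keras sketch of a residual block; the two-convolution residual function and the filter count are assumptions for the example, and the input is assumed to already have the same number of channels so the addition is valid.

```python
# Minimal residual block sketch: output = F(x) + x (identity shortcut).
from tensorflow.keras import layers

def residual_block(x, filters=64):
    # F(x): the residual function, here two 3x3 convolution layers.
    f = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    f = layers.Conv2D(filters, 3, padding="same")(f)
    # Shortcut edge: add the input back, then apply the activation.
    return layers.Activation("relu")(layers.Add()([f, x]))
```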

Skip connection Highway networks15 enable the flow of intermediate data from previous layers by using skip connections across a sequence of following layers instead of only layer-by-layer forwarding. The work, motivated by recurrent neural networks, uses learned gating units to control the rate of information flow, which allows the network to respond better to each individual input.

Fire module SqueezeNet1 introduces the Fire module, shown in Fig. 3, which splits a regular convolution layer into two sub-layers called the squeeze and expand layers. The squeeze layer uses 1x1 convolution filters to decrease the number of input channels fed into the following convolutions. The expand layer minimizes the number of model parameters while preserving accuracy by using a given ratio of 1x1 and 3x3 convolution filters, instead of only 3x3 filters, to extract features from the output of the squeeze layer. The results of both filter types are concatenated as the output of the module. According to the empirical results, SqueezeNet predicts at the same level of accuracy as AlexNet7 while using 50x fewer parameters.

Figure 3. Fire module1.
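
A minimal Keras sketch of the Fire module as described above follows; the default filter counts mirror the squeeze/expand sizes of the early fire modules in the original design, but they are shown here only as an example.

```python
# Fire module sketch: a 1x1 "squeeze" convolution followed by parallel
# 1x1 and 3x3 "expand" convolutions whose outputs are concatenated.
from tensorflow.keras import layers

def fire_module(x, squeeze_filters=16, expand_filters=64):
    s = layers.Conv2D(squeeze_filters, 1, activation="relu")(x)
    e1 = layers.Conv2D(expand_filters, 1, activation="relu")(s)
    e3 = layers.Conv2D(expand_filters, 3, padding="same", activation="relu")(s)
    return layers.Concatenate(axis=-1)([e1, e3])
```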

SE block The Squeeze-and-Excitation Network (SENet)11 proposed the SE block, which can be attached to a convolutional layer as shown in Fig. 4. The SE block is used to investigate the importance of and relationships among the output feature channels. It applies global average pooling to each feature map to derive a channel descriptor (the squeeze operation), which is then fed into two fully connected layers to learn the feature importance (the excitation operation). Thus, the block rescales the original feature maps, strengthening the significant ones and suppressing the less important ones.

Figure 4. Squeeze and excitation operations in the SE block11.

Attaching SE blocks requires additional parameters compared to the original model. Our work shows that the considerable accuracy improvement is worth the small additional cost in memory and computation. In other words, attaching SE blocks, rather than adding convolution layers, can help improve the accuracy of a deep model with only a small increase in parameters.
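
A minimal Keras sketch of the SE block described above is shown below; it assumes a channels-last 4D feature map and uses the standard squeeze (global average pooling) and excitation (two FC layers with a reduction ratio) steps before rescaling the input channel-wise.

```python
# SE block sketch: squeeze (global average pooling) and excitation
# (two FC layers) produce per-channel weights that rescale the input.
from tensorflow.keras import layers

def se_block(x, ratio=16):
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                      # squeeze
    s = layers.Dense(channels // ratio, activation="relu")(s)   # excitation
    s = layers.Dense(channels, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                            # rescaling
```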

Related works

The concept of neural architecture search (NAS) was established years ago. The goal is to automatically discover the optimal architecture for a specific task. The area has recently become active in deep learning research. Various techniques have been applied to NAS and demonstrated great success on a variety of deep learning tasks, such as image recognition and natural language processing16.

In earlier years, reinforcement learning (RL)-based NAS methods were used to search for architectures. A controller generates candidate neural networks, which are trained so that their performance serves as a reward for evolving the controller with a reinforcement learning algorithm. An early example is17, where whole networks are searched. NASNet18 proposed a search space design that discovers only an architectural building block, instead of the entire network architecture, on a proxy dataset; the learned block is then transferred to the target dataset by stacking a scaled number of copies. This saves computational time and resources during the search and enables transferability to related tasks.

ENAS19 improved search efficiency by sharing weights among all possible architectures in the search space: a large computational graph gathers all possible models, and every subgraph, i.e., any sampled architecture, reuses the corresponding weights instead of being retrained from scratch.

While NAS-style algorithms perform an automatic search over the whole architecture, searching only the micro-architecture types is also possible. For some classification tasks, adjusting a portion of the network can lead to better performance, whereas exhaustively searching all possible architectures consumes time, resources, and effort.

With the rise of AutoML frameworks, it has become easy to implement micro-architecture search. AutoKeras20 is one of the AutoML tools that can automate model finding. The tool relies on a NAS algorithm with three steps: update, generate, and observe. In the update phase, it trains a Gaussian process model on existing architectures. During the generation phase, it creates the next model to train based on an acquisition function (UCB)21; several optimizations, such as edit distance and tree optimization, are applied to limit the search space. In the observation step, the generated model is trained and its accuracy is observed. Keras Tuner22 is a hyperparameter optimization framework for hyperparameter search. It contains built-in search algorithms such as Bayesian optimization; the algorithms involve randomness, but their search time and results are acceptable. The framework also provides other recent search algorithms, such as Hyperband23 and traditional random search, and allows new search algorithms to be implemented.

In recent work24, the Auto-PyTorch framework utilized a multi-fidelity approach to optimize the hyperparameters25. The approach reduces the search cost and uses meta-learning.

The Squeeze-and-Excitation block (SE block)11 is a kind of attention mechanism, a concept previously used in Seq2Seq networks. The idea is to concentrate on the more useful feature channels rather than the less useful ones; the SE block in Fig. 4 focuses on channel attention. Such mechanisms are increasingly popular among researchers working at the micro-architecture level.

To learn the channel importance of an intermediate result from any CNN module, the SE block used in SENet learns and performs feature re-calibration to highlight the informative feature channels. It aggregates the spatial information of each channel using a global average pooling (GAP) operation, and the importance of these channel representations is then learned through a bottleneck of two fully connected (FC) layers.

While SENet uses only the GAP operation for embedding the information of each channel, there are other variations. CBAM26 uses additional information from a max-pooling operation; both global embeddings are passed through shared FC layers, and the resulting channel importances are combined by element-wise summation. Besides the channel attention module, CBAM also presents a spatial attention module to refine the feature map along the spatial dimensions.

The above-mentioned channel attention modules moderately increase model complexity. Although the proposed bottleneck of two FC layers uses fewer parameters than a non-bottleneck version, it causes channel dimensionality reduction during learning. ECA-Net27 uses a 1D convolution operation instead of the two FC layers to perform local cross-channel interaction when calculating the channel importance. This local cross-channel interaction strategy avoids the channel dimensionality reduction and still preserves performance while significantly decreasing model complexity.

In this paper, we demonstrate the proper use of SE blocks by exploring their attachment positions and variations. The idea can be adapted to explore more options, as in CBAM and ECA-Net.

Design methodology

In this section, we explain how we add flexibility to the models to facilitate the exploration process. The variability is divided into two levels. The first level covers the machine learning parameters and the hyperparameters of the layers; both are handled directly by the AutoKeras hyperparameter packages.

Second, at the micro-architecture level, the user may explore, for example, the possible use of residual blocks, the number of inserted blocks, and the insertion locations.

Figure 5. (a) SqueezeNet and (b) SqueezeNet V1.1 baselines.

Baseline architecture

In Fig. 5, we present the baseline architectures used in the methodology. The SqueezeNet architecture consists of 8 fire modules (fire2–fire9); the fire module itself is depicted in Fig. 3. The input size is 224\(\times\) 224. The implementation is adopted from28.

In Fig. 5a, the original SqueezeNet contains 3 fire modules followed by a max pooling layer, then 4 fire modules and a max pooling layer again. The two final layers are the 9th fire module (fire9) and the 10th convolution (conv10) for classification (instead of a fully connected layer). In Fig. 5b, the max pooling layers are inserted differently; in particular, one is inserted after every two fire modules. Both variants have eight fire modules (fire2–fire9) and one convolution layer.

In reality, the number of fire modules can be varied, as demonstrated in the original work. Thus, we add the first flexibility by creating a network with a flexible length.

Figure 6. (a) SqueezeNet with auto-bypass (b) SqueezeNet with auto-length and more hyperparameters.

Adding variable length and bypass

In the original SqueezeNet paper, the authors proposed bypass connections. The bypass connections skip only the odd fire modules because the dimension sizes must be compatible at the combining layer. Figure 6a shows the bypass configuration. In the first modification, we put a flag on each bypass connection. In Fig. 6, a bypass connection is shown as a dashed line, i.e., the connection can either be inserted or not. After the last fire module, a dropout layer is added; its dropout rate is a hyperparameter with possible values in [0, 0.5, 0.8].

Since the fire modules are used in pairs, we propose to vary the number of fire module pairs. Figure 6b shows two dashed rectangles highlighting the groups of two fire modules. In each group, the second fire module can still be coupled with a skip connection. The first group has a flexible length of 1 or 2 pairs, since at least one fire module is required; in the second group, the possible number of pairs is 0, 1, or 2. This leads to up to 4 pairs in total, or equivalently up to 8 fire modules. There are also choices of pooling operation, either average or max pooling.
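
The flexible length and optional bypass can be expressed as Keras Tuner hyperparameters. The sketch below, which assumes the fire_module helper from the Background section and uses illustrative hyperparameter names (not the exact code in the repository), shows how the number of fire module pairs, the bypass flags, the pooling choice, and the dropout rate are declared:

```python
# Sketch of the flexible-length trunk; hp is a keras_tuner.HyperParameters
# instance supplied by the tuner.
from tensorflow.keras import layers

def flexible_trunk(x, hp):
    # First group: 1 or 2 fire module pairs; second group: 0 to 2 pairs.
    for g, (low, high) in enumerate([(1, 2), (0, 2)]):
        for p in range(hp.Int(f"pairs_group{g}", low, high)):
            x = fire_module(x)
            y = fire_module(x)
            # Optional bypass (skip connection) around the second fire module.
            if hp.Boolean(f"bypass_g{g}_p{p}"):
                y = layers.Add()([y, x])
            x = y
        # Choice of pooling operation after each group.
        if hp.Choice(f"pool_group{g}", ["max", "avg"]) == "max":
            x = layers.MaxPooling2D(2)(x)
        else:
            x = layers.AveragePooling2D(2)(x)
    # Dropout after the last fire module, rate in [0, 0.5, 0.8].
    return layers.Dropout(hp.Choice("dropout", [0.0, 0.5, 0.8]))(x)
```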

Figure 7. SqueezeNetSE29.

Figure 8. SqueezeNetSEAuto.

Module insertions

Adding SE blocks may yield accuracy improvements11. The SE block is a channel-based attention module and can easily be attached to the baseline network, in a similar way as the residual operation. However, there are many possible insertion points in a deep network, and considering each one by one is time-consuming29. Figure 7 presents SqueezeNet with SE block insertions; it can be seen that there are many possible insertion points after the fire modules.

In Fig. 8, we can attach the blocks at various positions in SqueezeNet. To add more flexibility, we take the network from Fig. 6b and add a possible SE block connection after each fire module. The group in the dashed rectangle now contains the SE insertion. The hyperparameter for the SE block is the squeeze ratio, with values in [8, 16, 32]. Also, a variable skip connection is added that can bypass the SE block.
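
The conditional SE block insertion, its squeeze ratio, and the optional skip connection around it can likewise be declared as hyperparameters. This sketch assumes the se_block helper from the Background section; the names are illustrative:

```python
from tensorflow.keras import layers

def maybe_attach_se(x, hp, name):
    # Conditionally attach an SE block after a fire module.
    if hp.Boolean(f"use_se_{name}"):
        y = se_block(x, ratio=hp.Choice(f"se_ratio_{name}", [8, 16, 32]))
        # Optional skip connection that bypasses the SE block.
        if hp.Boolean(f"se_skip_{name}"):
            y = layers.Add()([y, x])
        return y
    return x
```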

Figure 9. SqueezeNetSE1Auto.

Figure 10. SqueezeNetAutoSEM.

Multiple input merging

Finally, following the concept of neural networks, adding domain knowledge can yield higher accuracy; thus, adding input features can improve accuracy. However, adding too many features can lead to overfitting or high computation without any accuracy improvement.

Figure 9 shows the typical approach of merging two inputs using the addition operation. Some networks, such as the Siamese network30, use two models: each input is fed into its own model, and the outputs are merged at the last stage. In our methodology, we consider adding the inputs at different layers, which allows multiple features to be combined at a flexible merging layer. Figure 9 improves on Fig. 8 by merging after the first layer. Finally, Fig. 10 shows the variable merging points. Note that there exists only one merge point; after the two paths are merged, only one path remains.
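
The variable merge point can be implemented by treating the index of the merging block as a hyperparameter, so that the second input path exists only up to the chosen block. The sketch below is a simplified, assumption-laden version (it reuses the fire_module helper, fixes three blocks, and omits pooling and SE blocks) rather than the full SqueezeNetSEMAuto builder:

```python
from tensorflow.keras import layers, Model

def two_input_model(hp, input_shape=(224, 224, 3), n_classes=8):
    img = layers.Input(shape=input_shape, name="image")
    dep = layers.Input(shape=input_shape, name="depth_map")
    # The merge point itself is a hyperparameter: after which block
    # the two paths are added together (only one merge point exists).
    merge_after = hp.Choice("merge_after_block", [0, 1, 2])
    xa, xb = img, dep
    for i in range(3):                       # illustrative number of blocks
        xa = fire_module(xa)
        if i <= merge_after:
            xb = fire_module(xb)             # second path exists until merged
        if i == merge_after:
            xa = layers.Add()([xa, xb])      # merge; one path remains after
    x = layers.GlobalAveragePooling2D()(xa)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return Model([img, dep], out)
```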

In Fig. 11, the selected architecture contains a first block of two fire modules, after which the two input paths are merged. This is followed by a pooling layer and two consecutive pairs of fire modules. For each layer, the hyperparameters are selected. The concept can be extended to merge any number of inputs.

Figure 11. SqueezeNetSEMAuto (example).

The proposed methodology adds variability for architecture exploration in these three aspects. In the next section, we conduct experiments using the methodology to find model architectures with high accuracy. The framework facilitates the model architecture exploration process.

Experiments

The experiments compare the results of four types of micro-architecture exploration based on the previous section:

  1. The variable number of fire modules,

  2. The addition of skip connections,

  3. The insertion of SE blocks at various places, and

  4. The multiple-input merge points.

Two data sets are used for the experiments: CIFAR-1031 and the Tsinghua Facial Expression dataset32. Three search strategies are executed to perform the model search, and the validation accuracy is reported. We measure the model sizes of the top-10 best solutions found. Some of the found solutions yield better accuracy than the baselines with smaller model sizes.

We divide the section as follows. First, we report the results on the CIFAR-10 benchmark; the goal is to find a suitable architecture by varying the micro-architecture connections and hyperparameters. Second, the results for a recent facial expression dataset are reported. For this data set, we also extract a landmark feature, which is combined with the default inputs while the merging locations are explored. The main goal is to find out whether adding input features yields a model with better accuracy; this demonstrates the need to combine inputs to the model at possible merging layers.

Hypothesis

In the experiments, the following assumptions are set up for training. Table 1 presents the machine learning parameters and hyperparameters.

Table 1 Machine learning parameters and hyperparameters.

From Table 1, the variable learning rates are 0.001 and 0.0001. Two optimizers are considered: SGD and RMSprop. We fix the batch size to 32. Variable hyperparameters are assumed for the dropout layer: the dropout rate ranges between 0 and 0.8 with a default value of 0.4. For the pooling layers, there are two choices of operation: max pooling and average pooling.

For SqueezeNet, there is another parameter, the compression ratio, which we set to 1. For the SE block, the squeeze ratio is chosen among the values 8, 16, and 32. The maximum number of epochs is 100. Early stopping is applied if the accuracy does not improve for more than 5 epochs.

Three search algorithms from Keras Tuner22 are tested: Random, Hyperband, and Bayesian. Hyperband takes the maximum number of epochs instead of a trial budget; we set it to 100. For the Bayesian search algorithm, the alpha value is 0.001, the beta is 2.6, the number of initial points is 2, and the number of trials is 100. For Random search, the number of trials is also 100. With these search schemes, the parameters are sampled, the model is built with those parameters, and the input dataset is fed in batches for training. Training runs for the full number of epochs and then validation is performed. If resources permit, a large-scale dataset can be used, since it is split into batches and can utilize distributed training, which Keras Tuner also supports.
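
A sketch of the corresponding Keras Tuner setup is given below, assuming a build_model(hp) function that constructs and compiles one of the template models above; the placeholder arrays x_train, y_train, x_val, and y_val stand for the prepared dataset.

```python
import keras_tuner as kt
from tensorflow.keras.callbacks import EarlyStopping

# Three tuners with the settings described above.
random_tuner = kt.RandomSearch(build_model, objective="val_accuracy",
                               max_trials=100)
hyperband_tuner = kt.Hyperband(build_model, objective="val_accuracy",
                               max_epochs=100)
bayes_tuner = kt.BayesianOptimization(build_model, objective="val_accuracy",
                                      max_trials=100, num_initial_points=2,
                                      alpha=1e-3, beta=2.6)

# Early stopping after 5 epochs without improvement in validation accuracy.
stop = EarlyStopping(monitor="val_accuracy", patience=5)
random_tuner.search(x_train, y_train, epochs=100, batch_size=32,
                    validation_data=(x_val, y_val), callbacks=[stop])
```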

CIFAR-10 dataset

For CIFAR-10, the total data set contains 60,000 images, each of size 32\(\times\)32. It has 10 classes with 6,000 images per class. The data set is pre-divided into 50,000 training images and 10,000 testing images.

Table 2 shows the accuracy of each model for the three search algorithms, using the parameters in Table 1. The first two models are SqueezeNet and SqueezeNet V1.1, presented in Section “Baseline architecture”. SqueezeNetAuto is the model from Section “Adding variable length and bypass”, and SqueezeNetSEAuto is the model from Section “Module insertions”. We vary the sample size of the data set used for training and testing to see the effect on accuracy. The sampling sizes are 10%, 40%, and 100%; the purpose is to see how the sampling size affects the maximum accuracy obtained while minimizing the training time. SqueezeNetSEAuto-0.1 is SqueezeNetSEAuto trained with only 10% of the training and testing data. Similarly, SqueezeNetSEAuto-0.4 and SqueezeNetSEAuto-1 utilize 40% and 100% of the whole data set respectively. The search strategy does not affect the solutions much: Random search performs well enough compared to Bayesian and Hyperband and runs a little faster than the other two. Since we use early stopping, the experiments may not train for the maximum number of epochs.

Table 2 Accuracy results for CIFAR-10.

From Table 2, using SqueezeNetAuto mostly yields models with better accuracy compared to the baseline. When adding SE blocks, the accuracy is improved further. The model architectures are reported in Table 3 for SqueezeNetAuto for the three search algorithms.

The row “# fire module blocks in (1)” presents the number of repeated blocks used in the first dashed block in Fig. 8, while “# fire module blocks in (2)” is the number of repeated blocks in the second one. Each dashed block contains two fire modules. The row “use bypass” shows whether the skip connection drawn as a red dashed line in Fig. 8 is needed.

The results show that the three selected models have similar architectures. In the table, only one fire module block is needed for each part by all three algorithms; thus, all models use only fire2, fire3, fire4, and fire5. The small difference is the use of a bypass in each section. For example, the Random and Bayesian algorithms select no bypass connection for fire3, while the Hyperband algorithm does; for fire5, the Random and Bayesian algorithms select the bypass connection while the Hyperband algorithm does not. All use average pooling after fire3 and max pooling after fire5. The Random and Bayesian algorithms select the SGD optimizer, while Hyperband chooses RMSprop. Thus, the three models are about the same size, smaller than the baseline in which all of fire2–fire9 are used, while achieving similar accuracy.

Table 3 Models obtained by various searches.

Consider the solution obtained by SqueezeNetSEAuto-0.4 in Table 4. The row “use SE” indicates whether the SE block (shown in yellow in Fig. 8) is needed. Note that each dashed block of fire modules has two SE blocks inside, and each SE block may have a skip connection; this is shown as a Boolean in the row “SE Skip”.

All the selected models contain the same number of fire modules. In Table 4, the models selected by the Random and Hyperband algorithms use only one block for each part, while the Bayesian algorithm selects two blocks for the first part and zero blocks for the second part. Thus, a total of four fire modules (fire2, fire3, fire4, fire5) are used. The Random algorithm selects SE blocks at fire2, fire3, fire4, and fire5 (two per part).

Table 4 SqueezeNetSEAuto (0.4) for three algorithms.

Table 5 shows the results when all the data are used to train the model. For the Random and Bayesian algorithms, the selected models contain three fire blocks, while for the Hyperband algorithm the model contains two fire blocks. The number of SE blocks used is 5, inserted at fire3–fire7, for the Random algorithm (Fig. 12); 1, at fire5, for the Hyperband algorithm (Fig. 13); and 3, at fire2, fire5, and fire7, for the Bayesian algorithm (Fig. 14). Thus, the Hyperband algorithm selects a better model than the others, since it uses only 4 fire modules and 1 SE block while achieving the same accuracy as the model obtained by the Random algorithm. The results imply that we can achieve a less complicated model at the same accuracy level.

Table 5 SqueezeNetSEAuto (1.0) for three algorithms.
Figure 12. Model from SqueezeNetAuto (Random).

Figure 13. Model from SqueezeNetAuto (Hyperband).

Figure 14. Model from SqueezeNetAuto (Bayes).

Tsinghua facial expressions

The data set contains 110 subjects, and each subject has 8 classes of expressions: anger, disgust, fear, happy, neutral, sad, surprise, and content. Each image is 2,000\(\times\)1,500 pixels. The data set is publicly available at32. Example images are shown in Fig. 15. The data set is divided 80:20 into training and testing sets.

Figure 15. (a) Anger (b) Disgust (c) Fear (d) Happy (e) Neutral (f) Sad (g) Surprise (h) Content.

To utilize SqueezeNet, the input images are resized to 224x224. We first experiment with model searching; Table 6 reports the numerical results. This data set is not as large as CIFAR-10, but the image size is large, so we have to reduce it to speed up feeding the input to the network.

Figure 16. Facial landmarks extracted.

The auxiliary feature used is the set of 3D landmarks extracted by Mediapipe33; the face detector is based on BlazeFace34. The landmarks contain 468 points, as shown in Fig. 16. Each point has x, y, z coordinates. The x and y values are normalized to [0, 1] with respect to the image width and height respectively. The z value is the landmark depth, with the depth of the head center treated as the origin; a smaller value means the position is closer to the camera. z uses roughly the same scale as x.
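
A sketch of the landmark extraction step is shown below. It uses the MediaPipe Face Mesh solution, which returns 468 landmarks with normalized x, y and relative depth z; option names follow the mediapipe solutions API and may need adjusting for other library versions.

```python
# Extract 468 facial landmarks (x, y, z) from an image with MediaPipe.
import cv2
import mediapipe as mp

def extract_landmarks(image_path):
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                         max_num_faces=1) as face_mesh:
        result = face_mesh.process(image)
    if not result.multi_face_landmarks:
        return None                      # no face detected
    # x, y are normalized to [0, 1]; z is the relative depth.
    return [(lm.x, lm.y, lm.z)
            for lm in result.multi_face_landmarks[0].landmark]
```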

Figure 17a presents the landmarks plotted in 3D space. From these points, we render a depth map image, shown in Fig. 17b, which is used as another input in the SqueezeNetSEM experiments.

Figure 17. (a) Landmarks in 3D space (b) Depth interpolation.

Table 6 Accuracy results for Tsinghua facial expression data set.

Table 6 presents the validation accuracy for the baseline models and for the models proposed by our algorithm. In the table, SqueezeNet and SqueezeNet V1.1 are the same models as in Table 3, with hyperparameters selected from Table 1.

MLPAuto is a multi-layer perceptron. We use the 468 landmark points (x, y, z) from Mediapipe to create a classic MLP model. The number of dense layers is varied in [4, 10] and the number of hidden nodes is selected among [16, 32, \(\ldots\), 512]. In the last two rows, SqueezeNetSE1Auto takes two inputs, the image and the depth image, and merges them using the add operation after the first convolution and pooling layers. SqueezeNetSEMAuto takes the same two inputs, but the merge point is also a parameter, as in Fig. 10.
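
A minimal sketch of the MLPAuto builder follows; the hidden-layer widths are assumed to be powers of two from 16 to 512 (the list above is elided), and the optimizer choice follows Table 1.

```python
from tensorflow.keras import layers, Model

def build_mlp(hp):
    # 468 landmarks x 3 coordinates flattened into one input vector.
    inp = layers.Input(shape=(468 * 3,))
    x = inp
    for i in range(hp.Int("num_dense", 4, 10)):
        # Hidden widths assumed to be powers of two between 16 and 512.
        x = layers.Dense(hp.Choice(f"units_{i}", [16, 32, 64, 128, 256, 512]),
                         activation="relu")(x)
    out = layers.Dense(8, activation="softmax")(x)   # 8 expression classes
    model = Model(inp, out)
    model.compile(optimizer=hp.Choice("optimizer", ["sgd", "rmsprop"]),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```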

For the first three networks, only low accuracy is achieved even with the varied hyperparameters; such a fixed structure limits the search exploration. When augmented with SE blocks, the accuracy is significantly improved. This is the benefit of the SE block's channel re-calibration.

Next, the question is where to insert the SE blocks and how many are needed. We explore the SE block insertion with SqueezeNetSEAuto, presented in Section “Module insertions”. The search leads to superior architectures found by the Random, Hyperband, and Bayesian algorithms.

Tables 7, 8, 9 and 10 show the model sizes (in bytes) obtained from the three algorithms. Table 7 shows the model sizes of the top-5 models from SqueezeNetAuto; the model sizes are quite small but the accuracy is low. Table 8 shows the model sizes of the top-5 models of SqueezeNetSEAuto. For the Random algorithm, the model size is three times larger than that of SqueezeNetAuto, while the accuracy is about 4 times higher and similar to the solutions from the Hyperband algorithm. Interestingly, the solution offered by the Bayesian algorithm has the highest accuracy (0.674) while its model size (145,032) is only a little larger than that of SqueezeNetAuto (123,880). This model architecture is depicted in Fig. 19.

Table 7 Model size and accuracy for SqueezeNetAuto.

Table 9 presents the model sizes of the top-5 models obtained by SqueezeNetSE1Auto. At the first rank, the Random algorithm yields a model with higher accuracy (0.691) and a model size of 182,936, compared to the model from SqueezeNetAuto (123,880).

Table 8 Model size and accuracy for SqueezeNetSEAuto.
Table 9 Model size and accuracy for SqueezeNetSE1Auto.

In Table 10, there is a solution found by the Random algorithm achieving accuracy 0.691 with a model size of 182,936. The highest accuracy, 0.714, is obtained by the model selected by the Bayesian algorithm; its model size (884,688) is about 7 times larger than that obtained by SqueezeNetAuto (123,880).

Table 10 Model size and accuracy for SqueezeNetSEMAuto.

Figure 18 compares the sizes and accuracy of all cases in Tables 7, 8, 9 and 10. The highlighted boxes show the best-accuracy cases with small model sizes; they are derived from SqueezeNetSEAuto (Bayesian), SqueezeNetSE1Auto (Random), SqueezeNetSEMAuto (Hyperband), and SqueezeNetSEMAuto (Random). Figures 19, 20 and 21 present the selected top models from Tables 8, 9 and 10.

Figure 18. Comparing model sizes and accuracy for all tests.

Figure 19. Model obtained by SqueezeNetSEAuto for Tsinghua data set (Bayesian), acc=0.67, size=14M.

Figure 20. Model obtained by SqueezeNetSE1Auto for Tsinghua data set (Random), acc=0.69, size=14M.

Figure 21. Model obtained by SqueezeNetSEMAuto for Tsinghua data set (Bayesian), acc=0.71, size=88M.

The experiments demonstrate that the proposed framework enables effective model exploration. It helps the designer explore possible micro-architectures while simultaneously exploring the hyperparameters of the layers. In the future, we will customize the search algorithms to suit specific micro-architecture types.

Other applications

The proposed approach can be applied to explore micro-architecture changes in any baseline architecture that requires similar tuning. In this section, we demonstrate the application to other tasks, namely image segmentation and object detection, with backbone architectures that have a similar micro-architecture style.

Image segmentation

For the image segmentation task, a popular architecture is UNet, which was originally used in biomedical image segmentation35. The architecture contains a collection of convolution blocks for downsampling and upsampling, where each downsampling layer and its corresponding upsampling layer are connected by skip connections (the grey lines in Fig. 22).

Figure 22. UNet architecture36.

Figure 23. UNet-Auto template architecture modified from36.

From the original architecture, the micro-architecture can be customized as follows (Fig. 23): (1) the starting depth of the convolutional layers, (2) the number of convolutional blocks to/from the bottom (the number of purple/green dashed boxes), (3) the number of layers in each down-block, (4) the number of layers in each up-block, divided into (4.1) the number of layers in each up-block (part 1) and (4.2) the number of layers in each up-block (part 2), and (5) the number of convolutional layers in the bottom block.

The hyperparameters explored are dropout rate, optimization method, learning rate, etc. Table 11 presents the hyperparameters and micro-architecture parameters along with machine learning parameters that are integrated into our template design for Fig. 23. The default values are chosen in the same manner as the original architecture.
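
The micro-architecture parameters listed above can be declared together with the machine learning parameters in one hyperparameter function. The sketch below uses illustrative names and search ranges (the actual values are those in Table 11, which is not reproduced here); the defaults are meant to mirror the original UNet-style configuration.

```python
def unet_auto_hparams(hp):
    # Micro-architecture and learning parameters of the UNet-Auto template.
    # Names and ranges are illustrative; see Table 11 for the actual values.
    return {
        "start_depth":   hp.Choice("start_depth", [8, 16, 32, 64]),
        "num_blocks":    hp.Int("num_down_up_blocks", 2, 4),
        "down_layers":   hp.Int("layers_per_down_block", 1, 3),
        "up_layers_p1":  hp.Int("layers_per_up_block_part1", 1, 3),
        "up_layers_p2":  hp.Int("layers_per_up_block_part2", 1, 3),
        "bottom_layers": hp.Int("layers_in_bottom_block", 1, 3),
        "dropout":       hp.Choice("dropout", [0.0, 0.5, 0.8]),
        "learning_rate": hp.Choice("learning_rate", [1e-3, 1e-4]),
        "optimizer":     hp.Choice("optimizer", ["sgd", "rmsprop"]),
    }
```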

Table 11 Hyperparameters and micro-architecture parameters for UNet-Auto.

The segmentation task was applied to the weed dataset37 (https://github.com/cwfid/dataset/archive/v1.0.tar.gz). The dataset contains 59 images with annotations and masks, as in Fig. 24.

Figure 24. Weed dataset.

Table 12 presents the top-5 discovered models for each algorithm based on validation accuracy. Training was done for 150 epochs and 100 trials. Compared to the traditional model in Fig. 22, which achieves 0.964 accuracy, UNet-Auto achieves higher accuracy (0.977, 0.973, 0.974) for all search algorithms. The model sizes vary due to the number of convolutional blocks and the number of convolutional layers inside each block. From the exploration, one can choose the proper model configuration with acceptable accuracy, showing the effectiveness of finding a proper configuration from the baseline network.

Table 12 Model size and accuracy for UNet-Auto.

Table 13 presents the configurations of the top-3 solutions (r1, r2, r3) for each search algorithm. Among them, the best model has more parameters than the baseline; however, if one prefers a small model, the model from the Bayesian algorithm (r1) gives an equivalently sized solution with higher accuracy. The minimum convolutional depth (8) can be used to achieve equivalent or better accuracy while increasing the number of convolutional blocks. All results show that RMSprop is a suitable optimizer for this task. Such a template can help designers explore micro-architecture options along with hyperparameter choices effectively.

Table 13 Model configuration for UNet-Auto (various search algorithms).

Object detection

We adopt the code from38, which deploys MobileNetV239 as the backbone with a Single Shot Detector (SSD)40, and implement the micro-architecture variation on it. The original MobileNetV2 architecture is composed of 7 bottleneck blocks (B1_1 to B7_1) with a total of 2.2M parameters, as shown in Table 14. The SSD40 part consists of 4 convolutional layers and utilizes MobileNetV2 to extract features.

Table 14 MobileNetV2 architecture38.

We duplicate the convolutional layers of the bottleneck blocks in MobileNetV2, e.g., B3_3, B4_3, B5_3, and B6_3. Since the bottleneck block is based on depthwise separable convolution39, duplicating the convolutional layers can lead to a deeper network without enlarging the number of parameters41. The template architecture is shown in Fig. 25, where the cycles on the four blocks represent the replications. The number of possible replications is set within [1, 2, 3] with a default value of 1. Other than that, we vary the learning rate parameters in the same way as in the previous experiments.
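
The replication counts can be expressed as hyperparameters in the same style as before; the sketch below uses hypothetical names for the four duplicated bottleneck blocks.

```python
def bottleneck_replications(hp):
    # Replication count for each duplicated bottleneck block
    # (default 1, searched within [1, 2, 3]).
    return {name: hp.Choice(f"rep_{name}", [1, 2, 3], default=1)
            for name in ["B3_3", "B4_3", "B5_3", "B6_3"]}
```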

Figure 25. MobileNetV2-Auto.

Figure 26. MNIST dataset with bounding box.

An MNIST dataset with object detection bounding boxes is generated, as in the example in Fig. 26. Confidence loss and SmoothL1 loss are used as metrics. As in the original code, the training data size is 600 and the testing data size is 100.

Table 15 shows the validation loss for each model found by the different search algorithms. We show only the validation loss since all model sizes are the same as the original. The original model yields a loss value of 0.01, while all our cases lead to a smaller loss.

Table 15 Validation loss for various MobileNetV2-SSD-Auto.

Table 16 lists the replication (rep) configurations found for the top three loss values. Varying the number of convolutional layers can lead to higher accuracy without increasing the model size. Thus, utilizing such a template model enables the exploration of micro-architecture choices, which can improve the model's effectiveness.

Table 16 Model configuration for MobileNetV2-SSD-Auto (various search algorithms).

Conclusion and future work

In this paper, we propose a framework for exploring model choices based on AutoModel. Modules such as SE blocks and skip connections can be attached to the baseline model, the proper number of components is selected, and the model hyperparameters can be searched at the same time. Better model architectures can be found with smaller network sizes. First, we demonstrate the approach on the SqueezeNet model, attaching SE blocks with skip connections at variable positions; such a model is treated as a template model. The variable lengths and attachment points can be explored with the standard search algorithms: Random, Hyperband, and Bayesian. Models with higher accuracy are obtained when a small number of SE blocks and skip connections are attached automatically.

We also present alternatives for applying micro-architecture variation to other models and tasks, namely image segmentation and object detection. For the segmentation task, the UNet-Auto template can vary many parts, such as the number of convolutional blocks, the number of convolutional layers, and the depth. For object detection, the MobileNetV2-SSD-Auto template can vary the number of convolutional layers in each bottleneck. With such templates, a model structure with good accuracy and size can be obtained conveniently.

Future work will include mechanisms for customized search schemes.