Complex hybrid weighted pruning method for accelerating convolutional neural networks

The increasing interest in filter pruning of convolutional neural networks stems from its inherent ability to effectively compress and accelerate these networks. Currently, filter pruning is mainly divided into two schools: norm-based and relation-based. These methods aim to selectively remove the least important filters according to predefined rules. However, they give inadequate consideration to filter diversity and to the impact of batch normalization (BN) layers on the input of the next layer, which may lead to performance degradation. To address these limitations of norm-based and similarity-based methods, this study conducts empirical analyses to reveal their drawbacks and subsequently introduces a complex hybrid weighted pruning method. By evaluating the correlations between individual filters and their norms, as well as the parameters of the BN layer, our method identifies and prunes the most redundant filters in a robust manner, thereby avoiding significant decreases in network performance. We conducted direct pruning experiments on ResNets of different depths using the publicly available image classification datasets ImageNet and CIFAR-10. The results demonstrate the efficacy of our approach. In particular, when applied to ResNet-50 on the ImageNet dataset, our method reduces floating-point operations by 53.5% with a performance loss of only 0.6%.

method, known as Filter Pruning via Geometric Median (FPGM) 24, is employed to accurately trim redundant filters within the model. By introducing the Weighted Hybrid Criterion (WHC) 25, a data-independent scheme robustly identifies the most redundant filters, taking into account factors such as filter size and linear correlations between filters, thus facilitating their targeted and precise pruning.
Structured pruning allows the use of computational acceleration libraries, while unstructured pruning, although capable of achieving the maximum pruning rate, cannot utilize computational acceleration libraries. Therefore, researchers prefer structured pruning. Regardless of the pruning strategy, an evaluation of the redundancy of filters must be conducted first, with evaluation criteria categorized into norm-based criteria and similarity-based criteria.
Filters with small norms are considered less crucial, while redundant filters exhibit similarity. However, these investigations predominantly focus on the convolutional layer alone. In contemporary neural networks, a BN layer is often introduced following the convolutional layer during training, aimed at stabilizing the input data for the subsequent convolutional layer 26. This addition modifies the data distribution. The fundamental condition for the safe removal of a filter is a minimal impact on the subsequent convolutional layer. Notably, the data pipeline encompasses not only convolutional layers but also BN layers. Consequently, when undertaking pruning, it becomes imperative to account for the data transformation introduced by the BN layer.
To mitigate the influence on the input of the subsequent convolutional layer, we present a novel method termed Complex Hybrid Weighted Pruning (CHWP). This approach accounts for both the convolutional layer and the BN layer, merging the norm-based and similarity-based criteria. Specifically, we employ a weighted allocation approach to distribute the parameters of the BN layer among filters, and use this allocation to recalculate the norms of the filters. Additionally, we utilize the norms of other filters as weights for the similarity of those filters. Subsequently, a score is computed for each filter, assigning higher scores to filters with larger norms and notable dissimilarities from other filters. Filters with lower scores are then identified and removed. It is noteworthy that CHWP differs from filter selection criteria based on norms or similarity alone. Even when the conditions required by these criteria are not met (norm-based criteria require a large variance in the filter norms, while the similarity-based criterion performs poorly when all filters are dissimilar), its performance remains unaffected, as shown in Fig. 1.
The criteria based on norm and similarity are complementary. The norm-based criterion performs poorly when the norm distribution is concentrated, while the similarity-based criterion excels in such cases. However, the similarity-based criterion has a limitation analogous to that of the norm-based criterion: it struggles to identify redundant filters when all filters are dissimilar. These two methods assess filter redundancy from different perspectives. Consequently, we combine the two criteria and propose the CHWP method. Following the principle of minimizing the impact on the input of the next layer, CHWP aims to better identify redundant filters even when the norm distribution is concentrated and the similarity is low. We calculate a score for each filter using the CHWP method and consider filters with low scores redundant. Extensive experiments on two benchmark datasets validate the effectiveness and efficiency of the proposed method.

Preliminaries
In this subsection, we introduce the symbols and notations used to describe neural networks. We assume a neural network with $L$ convolutional layers and BN layers. We use $N_l$ and $N_{l+1}$ to denote the number of input and output channels of the $l$-th convolutional layer, and $F_{li}$ to denote the $i$-th filter of this layer, where $F_{li} \in \mathbb{R}^{N_l \times K \times K}$, $1 \le l \le L$, $1 \le i \le N_{l+1}$, and $K$ denotes the size of the convolution kernel. $\gamma_{li}$ and $\beta_{li}$ represent the $i$-th parameter pair of the $l$-th BN layer.

Analysis of norm-based and similarity-based criterion
Several approaches mentioned earlier have demonstrated the utilization of norm-based and similarity-based criteria. However, in certain models, these criteria may not be well-suited, leading to unpredictable outcomes. This is illustrated in Fig. 2, where the blue dashed line and orange solid line represent the ideal distribution and the actual distribution of filter norms or similarity, respectively.
As depicted in Fig. 2a, the deviation of the filter norm distribution may be too small, indicating that the norm values are highly concentrated within a narrow range. This makes it challenging to identify suitable thresholds for selecting filters to be pruned. In the case shown in Fig. 2b, where the smallest filter norm is relatively large, filters that are considered irrelevant by norm-based criteria may still have a significant impact on the network. This implies that pruning these filters could result in severe negative consequences. Similar to norm-based criteria, the distribution of filter similarity scores depicted in Fig. 2c exhibits excessive concentration, where the narrow range of scores makes it challenging to select an appropriate threshold for filter pruning. In Fig. 2d, the highest cosine similarity scores among the filters in the model remain notably low. In other words, the filters demonstrate significant dissimilarity. For example, similarity-based criteria would treat (0, 0.1) and (1, 0) as equally important. Under such circumstances, criteria based on similarity cannot effectively accomplish the intended purpose.
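The (0, 0.1) and (1, 0) example can be checked numerically. The sketch below uses a hypothetical helper, `cosine_similarity`, that is not part of the paper. It illustrates that the two vectors are orthogonal, so each receives the same pairwise similarity, even though their norms differ by a factor of ten.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

small = (0.0, 0.1)   # filter with a tiny norm
large = (1.0, 0.0)   # filter with a unit norm

# The similarity is symmetric, so both filters get the same similarity score.
sim_small_to_large = cosine_similarity(small, large)
sim_large_to_small = cosine_similarity(large, small)

# The norms, which a similarity-only criterion ignores, differ by 10x.
norm_small = math.hypot(*small)
norm_large = math.hypot(*large)

print(sim_small_to_large, sim_large_to_small, norm_small, norm_large)
```

A criterion that looks only at the angle therefore cannot tell which of the two filters contributes less to the next layer; the norm carries that information.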
The statistical data obtained from ResNet-18 pre-trained on ImageNet 27, presented in Fig. 3, substantiate the previous rule-based analysis. The norm or similarity distribution is plotted as a kernel density estimation curve, a non-parametric technique for estimating the probability density of a random variable.
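For readers unfamiliar with kernel density estimation, the following is a minimal sketch assuming a Gaussian kernel and a fixed, hand-picked bandwidth. The function name `gaussian_kde`, the bandwidth, and the sample norms are illustrative assumptions; the paper does not state which kernel or bandwidth was used for Fig. 3.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from samples with a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centred on each sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical filter norms concentrated in a narrow range.
norms = [1.0, 1.1, 1.2, 1.3, 1.5]
density = gaussian_kde(norms, bandwidth=0.1)

# Riemann-sum check that the estimate integrates to roughly 1 over [0, 3].
step = 0.01
area = sum(density(i * step) for i in range(301)) * step
print(round(area, 3))
```

A narrow, tall curve from such an estimate corresponds to the concentrated distributions discussed here, for which a pruning threshold is hard to place.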
In the case of the first convolutional layer in ResNet-18, as shown in Fig. 3a, a large number of filter norms are distributed between roughly 0 and 3, which is close to a uniform distribution, making it suitable for norm-based criteria. Conversely, as illustrated in Fig. 3c, the norms in the second convolutional layer of ResNet-18 are concentrated in the range of 1 to 1.5, close to a normal distribution. Compared to the observed range of norms in the first layer, this distribution is noticeably narrower, making it challenging to set an appropriate threshold to distinguish the importance of filters.
For the first convolutional layer of ResNet-18, the majority of the scores based on similarity criteria, as shown in Fig. 3b, fall within the interval [−2, 2]. This dense distribution presents a challenge in selecting the optimal threshold for differentiating critical filters, because similarity criteria consider filters with lower scores (greater dissimilarity) as more critical, but there are few filters in the low-score range of [−4, −2]. Regarding the second convolutional layer of ResNet-18, as depicted in Fig. 3d, the scores for these filters approximate an ideal distribution, making similarity-based criteria suitable.
By analyzing and comparing, it is determined that the first convolutional layer is more suited to norm-based criteria, while the second convolutional layer is better suited to similarity-based criteria. In practice, calculating scores for these filters based on both criteria and manually selecting the appropriate criterion can be time-consuming and labor-intensive. Therefore, this paper combines both methods using a weighted approach, eliminating the need for manual analysis and criterion selection.

Complex hybrid weighted pruning
Pruning aims to remove redundant filters that have the least impact on the next layer (convolutional or fully connected layer). The computation process from the current layer to the next layer is illustrated in Fig. 4, where data not only undergo convolutional operations but also pass through BN layers for scaling and shifting. When pruning redundant filters, the corresponding BN layer parameters need to be removed as well. Therefore, pruning requires simultaneous consideration of both convolutional and BN layer parameters. The computation of the BN layer is described by Eq. (1), where µ and σ are the mean and variance of all feature maps in the $l$-th layer. $x_i$ represents the feature map output of the $i$-th channel in the convolutional layer, and $y_i$ is the corresponding output of the BN layer. ε is a small positive constant added to prevent division by zero. $\gamma_i$ and $\beta_i$ are learnable parameters used for scaling and shifting the normalized values. They are trained through backpropagation to enable the network to adapt to the distribution of the data. These computations are performed independently for each feature channel. They demonstrate that the BN layer performs learnable scaling and shifting on the feature maps of the convolutional layer before they are input to the next convolutional layer. Therefore, we believe that when pruning, the learnable parameters of the BN layer should also be taken into consideration.
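As a concrete reading of the per-channel transform described above, the following minimal sketch assumes the standard BN form $y_i = \gamma_i (x_i - \mu) / \sqrt{\sigma^2 + \varepsilon} + \beta_i$ with the biased batch variance. The function name `batch_norm_channel` and the sample values are illustrative, not taken from the paper.

```python
import math

def batch_norm_channel(x, gamma, beta, eps=1e-5):
    """Normalize one channel's activations, then scale by gamma and shift by beta."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance (assumption)
    scale = gamma / math.sqrt(var + eps)
    return [scale * (v - mean) + beta for v in x]

# Hypothetical activations of a single channel.
x = [1.0, 2.0, 3.0, 4.0]
y = batch_norm_channel(x, gamma=2.0, beta=0.5)

# After BN the channel has mean close to beta and standard deviation close to |gamma|.
y_mean = sum(y) / len(y)
y_std = math.sqrt(sum((v - y_mean) ** 2 for v in y) / len(y))
print(y_mean, y_std)
```

Under this form the magnitude the next layer sees is set by γ and β rather than by the convolution output alone, which is the motivation for including the BN parameters in the pruning score.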
We propose a complex hybrid weighted pruning method to robustly prune redundant filters while minimizing their impact on subsequent network layers. In CHWP, there are two instances of weighting. The first instance involves weighting the filter norms using the parameters of the BN layer. The second instance involves weighting the dissimilarity using the filter norms. This method takes into account not only the norms and similarities of filters in the convolutional layers but also the parameters of the BN layers. The importance score for the $i$-th filter $F_{li}$ in the $l$-th layer is given by Eq. (2), where $\psi_{(l,i)}$ is defined in Eq. (3) and $\|F_{li}\|_2$ represents the $\ell_2$ norm of the filter parameters $F_{li}$. In Eq. (2), the first part, $\psi_{(l,i)}$, represents the norm-based significance of filter $F_{li}$ after being weighted by the parameters of the BN layer, while the remaining part (excluding $\psi_{(l,i)}$) indicates the cumulative dissimilarity between filter $F_{li}$ and the other filters.
To justify theoretically how CHWP in Eq. (2) works, we first discuss the $\psi_{(l,i)}$ component. Following the prevalent design of CNN-based models, the forward computation involves convolution operations followed by the BN layer. As shown in Eq. (3), because the BN layer scales and shifts the feature maps, we apply the corresponding scaling and shifting to the L2 norm of the filter that produces them, denoted $\|F_{li}\|_2$. Here, α is a hyperparameter that balances the influence of γ and β.
In Eq. (2), the dissimilarity metric is defined as $1 - |\cos\theta_{i,j}| \in [0, 1]$, with ψ as the weighting parameter. This metric effectively enhances the relationship between filter norms and dissimilarity, addressing the challenge of norm-based criteria losing effectiveness when norms are close. Additionally, unlike traditional Euclidean distance or angle-based distance 28, CHWP selects filters that are more orthogonal to other filters. This is because their projection lengths onto other filters are relatively short, making it advantageous for removing more redundant features.
In CHWP, we directly use filter norms and BN layer parameters as weights, effectively eliminating blind spots associated with norm-based and similarity-based criteria. When dealing with filters that exhibit minimal norm discrepancies, CHWP adeptly utilizes dissimilarity information to evaluate filters and identify those with the highest redundancy. When facing filters with relatively high angular similarity, it can select critical filters based on norm information. There is a scenario in which CHWP's efficiency may decrease, which is when the scores computed by CHWP for various filters are close to each other. However, this situation implies the absence of redundancy, thereby negating the need to prune the corresponding model.
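The two weightings described in this section can be sketched in a few lines. This is an illustrative stand-in rather than the paper's exact formulation: the BN-weighted norm below is ASSUMED to be $\psi_i = \|F_i\|_2 \, (\alpha|\gamma_i| + (1-\alpha)|\beta_i|)$, the score is assumed to be $\psi_i \sum_{j \ne i} \psi_j (1 - |\cos\theta_{i,j}|)$, and all function names and toy values are hypothetical.

```python
import math

def l2_norm(f):
    return math.sqrt(sum(w * w for w in f))

def abs_cosine(f, g):
    """Absolute cosine similarity between two flattened filters."""
    dot = sum(a * b for a, b in zip(f, g))
    return abs(dot) / (l2_norm(f) * l2_norm(g))

def bn_weighted_norm(f, gamma, beta, alpha=0.5):
    # ASSUMPTION: one plausible way to fold BN scale/shift into the filter norm.
    return l2_norm(f) * (alpha * abs(gamma) + (1.0 - alpha) * abs(beta))

def hybrid_scores(filters, gammas, betas, alpha=0.5):
    """Score each filter: its weighted norm times the weighted dissimilarity to all others."""
    psi = [bn_weighted_norm(f, g, b, alpha) for f, g, b in zip(filters, gammas, betas)]
    scores = []
    for i, f_i in enumerate(filters):
        dissimilarity = sum(
            psi[j] * (1.0 - abs_cosine(f_i, f_j))
            for j, f_j in enumerate(filters)
            if j != i
        )
        scores.append(psi[i] * dissimilarity)
    return scores

# Toy layer: filter 0 is parallel to filter 1 but has a smaller norm.
filters = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
betas = [0.0, 0.0, 0.0, 0.0]

scores_equal_bn = hybrid_scores(filters, [1.0, 1.0, 1.0, 1.0], betas, alpha=1.0)
scores_small_gamma = hybrid_scores(filters, [1.0, 1.0, 0.1, 1.0], betas, alpha=1.0)
print(scores_equal_bn, scores_small_gamma)
```

With equal BN parameters the redundant, smaller copy (filter 0) scores lowest. Once the BN scale of filter 2 is made small, that filter becomes the lowest-scoring one, which is the behavior the BN weighting is meant to capture.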

Algorithm description
As described in Algorithm 1, we employ CHWP to execute filter pruning following the common "Pretrain-Prune-Finetune" pipeline mechanism (as shown in Fig. 5), whereby pruning is conducted at the pruning rate set for each layer. Although iterative mechanisms 29, knowledge distillation 20,30, sensitivity analysis for determining layered pruning rates, and certain fine-tuning techniques have been demonstrated to enhance the performance of pruned CNNs, we have refrained from utilizing these methods for ease of presentation and validation.
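The prune step of this pipeline can be illustrated as follows. This is a minimal sketch, not a reproduction of Algorithm 1: it assumes per-filter scores are already available and that the lowest-scoring fraction of filters in a layer is removed. The function name `select_filters_to_prune` and the example scores are hypothetical.

```python
def select_filters_to_prune(scores, pruning_rate):
    """Return the sorted indices of the lowest-scoring filters in one layer."""
    # Small epsilon guards against float products such as 2.9999999 truncating to 2.
    num_prune = int(len(scores) * pruning_rate + 1e-9)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:num_prune])

# Hypothetical scores for a five-filter layer, pruned at a 40% rate.
layer_scores = [2.0, 4.0, 0.4, 3.1, 5.0]
pruned = select_filters_to_prune(layer_scores, pruning_rate=0.4)
kept = [i for i in range(len(layer_scores)) if i not in pruned]
print(pruned, kept)
```

In a real network, the BN parameters of the pruned channels and the matching input channels of the next layer would be removed as well, and the smaller model would then be fine-tuned.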

Ethical and informed consent for data used
The data used in this study were publicly available datasets on the Internet. No animal or human subjects were involved.

Experimental settings
Following SFP and FPGM, we utilized several ResNet models of different depths for experiments conducted on both the CIFAR-10 (Canadian Institute for Advanced Research, 10 classes) 31 and ImageNet 27 datasets. We use these datasets and models for ease of comparison with other pruning methods, as they are widely adopted. We assess CHWP on ResNet models of various depths with pruning rates set at 40%, 50%, and 60% for those datasets.
The CIFAR-10 dataset is a subset of the Tiny Images dataset, comprising 60,000 32 × 32 color images. Each image is assigned to one of 10 mutually exclusive classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. For each class, there are 6,000 images in total, with 5,000 images designated for training and 1,000 for testing. The relatively low resolution of the images, coupled with the small size of the objects within them, imposes higher performance requirements on the algorithms being evaluated. The CIFAR-10 dataset is widely used in the development, testing, and comparison of various machine learning and deep learning models within the computer vision domain.
Figure 5. "Pretrain-prune-finetune" pipeline mechanism flow chart.
The ImageNet dataset is a large-scale visual recognition dataset containing over 1.2 million training images and 50K validation images spanning 1000 distinct classes. The classes cover a wide range of object categories, encompassing animals and objects. This dataset is a fundamental resource for training and evaluating computer vision models, particularly those designed for image classification tasks. ImageNet has played a crucial role in advancing the field of deep learning, serving as the basis for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which has been pivotal in benchmarking state-of-the-art image classification algorithms.
We conducted experiments using the Python programming language on the PyTorch deep learning platform. We maintained the experimental settings outlined in FPGM and WHC, including data augmentation strategies, pruning configurations, and fine-tuning. We use the accuracy of the unpruned pretrained model as the baseline. Because we pretrain the network in a different deep learning framework than a few other methods, there is a slight discrepancy (less than 0.5%) between our baseline and theirs. Therefore, our primary focus lies on the relationship between the reduction in FLOPs (floating-point operations) and the corresponding drop in accuracy. During the pruning phase, for a clearer comparison, we adopted the same pruning strategy as SFP and FPGM. This also implies that the resulting reduction rates in FLOPs are identical. CHWP was compared against a selection of well-established methodologies, including data-independent norm-based PFEC 20, SFP 21, relation-based FPGM 24, WHC 25, ASFP 32, as well as various data-dependent techniques such as HRank 33, GAL 34, LFPC 35, CP 36, NISP 37, ThiNet 38 and ABC 39.
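An identical pruning strategy yields an identical FLOPs reduction because the operation count of a convolution depends only on the layer shapes, not on which filters are removed. The sketch below is a rough illustration under stated assumptions: FLOPs are counted as the multiply-accumulates of the convolutions only (BN, bias, and shortcut costs are ignored), and the layer sizes and the helper names `conv_flops` and `block_flops` are hypothetical.

```python
def conv_flops(c_in, c_out, kernel, h_out, w_out):
    """Multiply-accumulate count of one convolution layer."""
    return c_in * c_out * kernel * kernel * h_out * w_out

def block_flops(c_in, c_mid, c_out, kernel, h, w):
    """Two stacked convolutions; pruning the first changes c_mid for both."""
    return conv_flops(c_in, c_mid, kernel, h, w) + conv_flops(c_mid, c_out, kernel, h, w)

# Hypothetical block: 64 -> 64 -> 64 channels, 3x3 kernels, 32x32 feature maps.
dense = block_flops(64, 64, 64, 3, 32, 32)

# Prune 40% of the filters in the first convolution (38 of 64 filters remain).
kept = int(64 * (1 - 0.4))
pruned = block_flops(64, kept, 64, 3, 32, 32)

reduction = 1 - pruned / dense
print(kept, reduction)
```

Any criterion that removes the same number of filters from the same layers produces the same `reduction`, so the methods can be compared on accuracy alone.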

Evaluation on CIFAR-10
In order to reduce experimental errors, we conducted three repeated experiments on the CIFAR-10 dataset and averaged the results. Table 1 reports the average accuracy achieved after fine-tuning and shows that the proposed CHWP method outperforms several pruning methods proposed in recent years. Specifically, in the case of ResNet-110, CHWP reduces FLOPs by 65.8% while maintaining minimal impact on average accuracy. In contrast, under the same experimental conditions, the rule-based SFP method experiences a notable decrease of 0.78% in accuracy. Furthermore, when compared to the pioneering WHC method, CHWP exhibits competitive performance. These results suggest that CHWP, when applied at a moderate pruning ratio, effectively mitigates model overfitting and removes redundant filters without compromising overall model performance.
When compared to iterative ASFP, data-driven HRank, AutoML-based ABC, and LFPC, CHWP achieves a greater reduction in FLOPs in both ResNet-56 and ResNet-110. Remarkably, in terms of accuracy, CHWP surpasses LFPC by 0.42% and 0.75% for ResNet-56 and ResNet-110, respectively. This underscores CHWP's effectiveness in identifying the most redundant filters and the importance of considering BN layers during the pruning process. Furthermore, when compared to the aforementioned methods at similar pruning rates, CHWP demonstrates a smaller decline in performance for the pruned models as the depth of the CNN increases. This phenomenon can be attributed to the fact that deeper CNNs inherently contain more redundancy, which CHWP robustly eliminates without significantly compromising the CNN's capacity.
In experiments on the CIFAR-10 dataset, it can be observed that as the depth of the network increases, the redundancy of CNN parameters gradually increases. These redundant parameters interfere with the decision-making of the CNN. For a ResNet with a depth of 20, when the FLOPs are reduced by 42.2%, the accuracy decreases by 0.16%. Interestingly, for a depth of 110, when the FLOPs decrease by 40.8%, the accuracy actually increases by 0.67%. As the pruning rate increases, the number of redundant parameters decreases. When FLOPs decrease by 65.8%, the accuracy increases by 0.16%. This result indicates that training larger CNN models on small datasets is prone to overfitting. Proper pruning can reduce computational load, alleviate overfitting, and maintain model performance.

Evaluation on ImageNet
Alongside top-1 accuracy, we incorporate top-5 accuracy as a metric because the ImageNet dataset contains an extensive collection of images, many of which contain multiple objects. Each image is assigned only one true label. Since the algorithm's classification result may correspond to an object in the image that does not match the provided true label, we deem a prediction correct if the ground-truth label is among its five highest-scoring classes.
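The top-k rule described above can be written compactly. The following is a minimal sketch with a hypothetical helper `top_k_accuracy` and made-up class scores for three images over three classes; with real ImageNet outputs the same logic would be applied with k = 1 and k = 5.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    correct = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            correct += 1
    return correct / len(labels)

# Hypothetical class scores for three images and their true labels.
scores = [
    [0.1, 0.7, 0.2],  # true label 1: ranked first
    [0.5, 0.3, 0.2],  # true label 1: ranked second
    [0.2, 0.3, 0.5],  # true label 0: ranked last
]
labels = [1, 1, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```

The second image counts as an error under top-1 but as correct under top-2, which is the effect that top-5 accuracy captures for images containing several objects.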
Three experiments were conducted using the ImageNet dataset, and the results are comprehensively presented in Table 2. As expected, CHWP not only achieved the highest top-1 and top-5 accuracies, surpassing several state-of-the-art approaches, but also exhibited the least degradation in performance. Specifically, in the case of ResNet-50, CHWP effectively reduced FLOPs by over 40% while experiencing minimal compromises in both top-1 and top-5 accuracies. In contrast, the norm-based SFP method encountered a significant decline of 14% in top-1 accuracy, surpassing the 1% threshold observed in other methods.
For ResNet-50, with the pruning rate set at 50%, our pruned model outperforms FPGM by 0.7% and 0.2% in top-1 and top-5 accuracy, respectively. Additionally, for the pruned pre-trained ResNet-101, CHWP reduces model FLOPs by 42.2%. Surprisingly, top-5 accuracy improves by 0.31%, and top-1 accuracy increases by 0.42%. At this point, FPGM experiences a performance decline of 0.02%, while WHC sees an improvement of 0.38%. Compared with norm-based and relation-based criteria, CHWP's superior performance can be attributed to its synergistic utilization of both filter norm and similarity information, in conjunction with BN layer parameter pairs. This approach yields more robust and resilient results.

Ablation study
To further validate the efficacy of CHWP, ablation experiments were conducted to gradually decouple CHWP into distinct sub-components, as depicted in Table 3. To facilitate a comprehensive comparison, the results of the cosine criterion were also included. We performed three rounds of 40% filter pruning on ResNet-32 and ResNet-56 and reported the average accuracy drop after fine-tuning. Compared to the cosine similarity criterion 40, the dissimilarity metric (DM) exhibited less accuracy degradation.

Visualization
This section presents the application of filter pruning with a 40% pruning rate on the shallow layer (first convolutional layer), intermediate layer (22nd convolutional layer), and deep layer (final layer) of ResNet-50 using CHWP, followed by the visualization of the corresponding output feature maps (Fig. 6). Figure 6a represents the input image, while (b), (c), and (d) depict the output feature maps of various filters in convolutional layers of different depths in ResNet-50. Many filters with high similarity or low norms have been removed, as filters pruned in simplifying the network are considered ineffective in extracting valuable features.
In Fig. 1, (a) is an example of a simple convolutional layer that does not fully satisfy the norm-based criterion or the similarity-based criterion. (b) and (c) are the scores for each filter in (a) based on the norm criterion and the similarity criterion, respectively. The score distributions in (b) and (c) are quite concentrated, with standard deviations of 0.08 and 0.06 respectively, which makes identifying redundant filters challenging. (d) applies our scoring method to each filter in (a) and yields a standard deviation of 0.32, which separates redundant filters from the rest more clearly and thereby achieves robust performance.

Figure 1. Score of filter (a) under different methods (b-d).

Figure 2. Ideal and reality based on norms and similarity criteria. The blue dashed curve represents the ideal distribution, while the orange solid curve represents the distribution that might occur in practical situations. v and s respectively denote the values of norm and similarity.

Figure 3. The distribution of filter norms and similarity scores for the first two convolutional layers of ResNet-18.

Figure 4. The operation that data flows from the current layer to the next layer.

Figure 6. Visualization of different depth convolutional layers of ResNet-50 output feature maps. The feature map in the red box corresponds to the removed filter.
Taking into account the norms and dissimilarities, WHC based on a hybrid rule achieved favorable outcomes.In contrast to other methods presented in the table, CHWP yielded the most promising experimental results.In comparison to WHC, our devised CHWP demonstrated performance improvement in both ResNet-32 and ResNet-56, indicating the significance of employing a hybrid rule and considering the influence of BN layers.As the considered factors in the criteria become more comprehensive, the precision of removing redundant filters increases.The improvements in accuracy for HC, WHC, and CHWP (0.19% and 0.22%, respectively) demonstrate the equal significance of norm-based, relation-based criteria, and the introduction of BN layer.