Enhancing the accuracies by performing pooling decisions adjacent to the output layer

Learning classification tasks of (2^n × 2^n) inputs typically consists of ≤ n (2 × 2) max-pooling (MP) operators along the entire feedforward deep architecture. Here we show, using the CIFAR-10 database, that pooling decisions adjacent to the last convolutional layer significantly enhance accuracies. In particular, average accuracies of the advanced-VGG with m layers (A-VGGm) architectures are 0.936, 0.940, 0.954, 0.955, and 0.955 for m = 6, 8, 14, 13, and 16, respectively. The results indicate that A-VGG8's accuracy is superior to VGG16's, and that the accuracies of A-VGG13 and A-VGG16 are equal and comparable to that of Wide-ResNet16. In addition, replacing the three fully connected (FC) layers with one FC layer, A-VGG6 and A-VGG14, or with several linear-activation FC layers, yielded similar accuracies. These significantly enhanced accuracies stem from training the most influential input-output routes, in comparison to the inferior routes selected following multiple MP decisions along the deep architecture. In addition, accuracies are sensitive to the order of the non-commutative MP and average-pooling operators adjacent to the output layer, which varies the number and location of training routes. The results call for the reexamination of previously proposed deep architectures and their accuracies by utilizing the proposed pooling strategy adjacent to the output layer.


Introduction:
Classification tasks are typically solved using deep feedforward architectures 1-6. These architectures are based on consecutive convolutional layers (CLs) and terminate with a few fully connected (FC) layers, in which the output layer size is equal to the number of input object labels. The first CL functions as a filter revealing a local feature in the input, whereas consecutive CLs are expected to expose complex, large-scale features that finally characterize a class of inputs 1,7-10.
The deep learning (DL) strategy is efficient only if each CL consists of many parallel filters, the layer's depth, which differ by their initial convolutional weights. The depth typically increases along the deep architecture, resulting in enhanced accuracy. In addition, given a deep architecture and the ratios between the depths of consecutive CLs, accuracies increase as a function of the first CL depth 11. The deep learning strategy results in several practical difficulties, including the following. First, although the depth increases along the deep architecture, the input size of the layers remains fixed. The second difficulty is that the last CL output size, depth × layer input size, becomes very large, and it serves as the first FC layer input, which consists of a large number of tunable parameters. These computational complexities overload even powerful GPUs, limiting the utilization of a large number of filters and of large FC layers.
One way to circumvent these difficulties is to embed pooling layers along the CLs 1. Each pooling operation reduces the output dimension of a CL by combining a cluster of outputs, e.g., 2 × 2, into one, and k such operations along the deep architecture reduce the CL dimension by a factor of 4^k. The most popular pooling operators are max-pooling (MP) 12, which implements the maximal value of each cluster, and average pooling (AP) 13,14, which implements the average value of each cluster; however, more types of pooling operators exist 12,15-17.
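As a concrete illustration of these two operators, the following minimal PyTorch sketch (our own, not code from this study) applies a (2 × 2) MP and a (2 × 2) AP to the same feature map; each reduces the spatial dimension by a factor of 4.

```python
# Minimal illustration: (2 x 2) max-pooling and (2 x 2) average pooling in PyTorch.
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)          # one feature map of size 4 x 4
mp = nn.MaxPool2d(kernel_size=2)     # keeps the maximal value of each 2 x 2 cluster
ap = nn.AvgPool2d(kernel_size=2)     # keeps the average value of each 2 x 2 cluster

print(mp(x).shape, ap(x).shape)      # both outputs are 1 x 1 x 2 x 2: dimension reduced by a factor of 4
```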
The core question in this work is whether accuracies can be enhanced depending on the location of the pooling operators along the CLs of a given deep architecture. For instance, VGG16 consists of 13 CLs, three FC layers, and five (2 × 2) MP operators located along the CLs 2 (Fig. 1A). The results indicate that accuracies can be significantly increased by a smaller number of pooling operators adjacent to the last CL, optionally with larger pooling sizes, for example, the advanced VGG16 (Fig. 1B). The optimized accuracies of these types of advanced VGG architectures with m layers (A-VGGm) are first presented for selected m values (6 ≤ m ≤ 16). Next, the underlying mechanism of the enhanced A-VGGm accuracies is discussed.
Note that the replacement of the pair of pooling operators [(4 × 4), (2 × 2)] along A-VGG16 (Fig. 1B) by several other options, for example, [(2 × 2), (4 × 4)] and [(2 × 2), (8 × 8)], also yielded average accuracies > 0.95, indicating the superior robustness of A-VGG16 accuracies over VGG16. Removing the last three CLs (the fifth block of A-VGG16) results in A-VGG13, with an average accuracy of 0.955, identical to that of A-VGG16 up to the first three leading digits (Table 1). One possible explanation for the identical accuracies is that the receptive field 19 of the last three CLs of A-VGG13 is 7 × 7, saturating the 8 × 8 layer input size. It also suggests that accuracies are only mildly affected by m > 13.
The A-VGG8 architecture, which consists of only 8 layers, achieves an average accuracy of 0.940, exceeding the optimized accuracy of 0.915 of VGG8, which consists of five (2 × 2) MP operators, one after each CL 2,20, and also exceeding the average accuracy of VGG16. Here again, the (2 × 2) and (4 × 4) pooling operators were placed after the 3rd and the 5th CLs, respectively (Table 1). This result indicates that a shallow architecture with fewer pooling operators, adjacent to the output, can imitate the accuracies of a deeper architecture with double the number of layers, even though the receptive field covers only a small portion of the layer input size.
Using only one FC layer reduces the number of layers by two, from A-VGG16 to A-VGG14 and from A-VGG8 to A-VGG6 (Table 1). The results indicate that accuracies are only mildly affected by such modifications: A-VGG6 achieves an average accuracy of 0.936, which slightly exceeds that of VGG16, and A-VGG14 achieves 0.954 (Table 1). We note that this type of architecture, with only one FC layer, consists of fewer parameters and can be mapped onto tree architectures 21.
Similarly, the A-VGG13 and A-VGG16 architectures with linear activation functions for the FC layers achieved similar average accuracies of 0.954 and 0.955, respectively, both with small standard deviations (Supplementary Information). The three linear FC layers can be folded into one in the test procedure 22, minimizing its latency; however, training must be performed with three separate FC layers.
The gap between the average accuracies of A-VGG8 and A-VGG6 (~0.004) was slightly greater than that between A-VGG16 and A-VGG14 (Table 1), indicating that the gap decreases with m.

Nevertheless, the comparable average accuracies of A-VGG13 and A-VGG14 with A-VGG16 indicate that removing two out of three FC layers, or removing three out of the thirteen CLs, does not affect accuracies. Hence, it is interesting to examine the average accuracies of an A-VGG11, where two FC layers as well as the last three CLs are removed.

Table 1: Architectures and accuracies of A-VGGm.
A-VGGm architectures, m = 6, 8, 13, 14, and 16, and their maximized average accuracies, obtained from 10 samples (detailed parameters and the standard deviations of the accuracies are presented in the Supplementary Information).

Optimized learning gain using pooling operators:
The backpropagation learning step 23 updates the weights towards the correct output values for a given input. Typically, such a learning step can add noise and is destructive to a fraction of the training set 24-27. However, the average accuracy increases with epochs and asymptotically saturates at a value that identifies the quality of the learning algorithm for a given architecture and database.
One important ingredient of DL is downsizing the input as the layers progress. This can be done either by pooling operators or by using the stride of the CLs. Although both reduce the size of the input, the pooling operators transfer specific output fields, such as the maximal field in the MP operator. The aim is to select the most influential field from a small cluster, for example (2 × 2) in MP, onto a node in the successive layer. Its underlying logic is to maximize the learning-step gain for the current input while minimizing the added noise by zeroing the other routes; maximize learning with minimal side-effect damage. However, this local maximization does not ensure a global one.
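The zeroing of the non-selected routes can be seen directly in the gradients. The following sketch (our illustration; the tensor sizes are arbitrary) shows that a backward pass through a (2 × 2) MP leaves a non-zero gradient at exactly one entry per cluster, i.e., only the locally maximal route is updated by the learning step.

```python
# Gradient routing through max-pooling: only the maximal entry of each cluster is updated.
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4, requires_grad=True)
y = nn.MaxPool2d(2)(x).sum()
y.backward()
print(x.grad)          # exactly one non-zero entry per 2 x 2 cluster; all other routes are zeroed
```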
Commonly, several MP operators are placed among the CLs, for example, five times in the case of VGG16 (Fig. 1A), apparently solving the following two difficulties simultaneously. First, although the depth increases along the CLs (Fig. 1A), the input size of the layers shrinks accordingly, such that the output sizes of the CLs, depth × input size, do not grow linearly with depth. Second, successive MP operators appear to select the most influential routes onto the first FC layer, which is adjacent to the output layer. However, these local decisions following consecutive MP operators do not necessarily result in the most influential routes onto the first FC layer, as elaborated below using a toy model. Assume a binary tree whose random nodal values are low, medium, or high (Fig. 2A). The tree output is equal to the branch with the maximal field, calculated as the product of its three nodal values. The first strategy is based on local decisions, similar to MP operators: for each node, the maximal child among the two is selected (gray circles in Fig. 2A), and the selected route is the one composed of gray nodes only (the brown branch in Fig. 2A). However, a global decision among the eight branches selects the branch with the maximal product of nodal values (the green branch in Fig. 2A). This toy model indicates that a global decision can differ from local decisions; however, the probability of such an event is unclear.
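The following short sketch (our illustration of the toy model, with the assumed example values 1, 10, and 1000 for low, medium, and high) estimates how often the greedy, MP-like strategy misses the branch with the globally maximal product; the helper names random_tree, local_route, and global_route are ours.

```python
# Toy-model sketch: greedy (MP-like) branch selection versus a global decision on a
# binary tree with eight branches of three random nodal values each (as in Fig. 2A).
import random

LOW, MED, HIGH = 1, 10, 1000

def random_tree(depth=3):
    # a node is (value, left_subtree, right_subtree); leaves have no children
    if depth == 0:
        return None
    return (random.choice([LOW, MED, HIGH]), random_tree(depth - 1), random_tree(depth - 1))

def local_route(node):
    # greedy: at every node, descend to the child with the larger value
    prod = 1
    while node is not None:
        value, left, right = node
        prod *= value
        if left is None:
            break
        node = left if left[0] >= right[0] else right
    return prod

def global_route(node):
    # exhaustive: maximal product over all branches
    if node is None:
        return 1
    value, left, right = node
    return value * max(global_route(left), global_route(right))

def sample():
    # an unvalued root (value 1) above eight branches of three valued nodes each
    return (1, random_tree(3), random_tree(3))

n = 10_000
suboptimal = sum(local_route(t) < global_route(t) for t in (sample() for _ in range(n)))
print(f"local decisions miss the globally maximal branch in {suboptimal / n:.2%} of trees")
```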
A more realistic model, imitating deep architectures (Fig. 1), is Gaussian random (1024 × 1024) inputs followed by ten (3 × 3) CLs with unity depth (Fig. 2B). Two scenarios, local decisions and a global decision, are compared. In the first, (2 × 2) MP operators are placed after each of the first k CLs (Fig. 2B top, exemplified for k = 4), whereas in the second a single (2^k × 2^k) MP operator is placed after the ten CLs (Fig. 2B bottom, exemplified for k = 4). In both scenarios there are (2^(10-k) × 2^(10-k)) non-negative (ReLU) outputs, denoted O_SP (sequence pooling), representing local decisions, and O_TP (top pooling), representing a global decision. For a given k, the 2^(10-k) × 2^(10-k) ratios O_SP/O_TP were calculated and averaged over many Gaussian random inputs and several sets of ten randomly selected convolutional filters, which were identical for both scenarios. The probability P(O_SP/O_TP > 1) indicates how often local decisions, k consecutive MPs, result in a larger output than a global decision, a single (2^k × 2^k) MP (Supplementary Information). This probability rapidly decreased with k, possibly exponentially (Fig. 2C), and even for k = 2 it was below 0.1. Increasing the CL depth beyond unity does not qualitatively affect the probability P(O_SP/O_TP > 1), as indicated by simulations of VGG8 with five consecutive (2 × 2) MP operators, one after each CL, compared with the same architecture with a single (32 × 32) MP after the five CLs. The same five random (3 × 3) convolutions were used for both architectures, and the 512 ratios O_SP/O_TP for the single output of each filter were calculated. Averaging over CIFAR10 training inputs and several selected sets of fixed random convolutions results in a probability P(O_SP/O_TP > 1) of the order of 10^-3. The results clearly indicate that a global decision selects the most influential routes to the first FC layer. Hence, pooling adjacent to the output layer is superior to the selection following consecutive local pooling decisions. This supports that using larger pooling operators adjacent to the output of the CLs enhances accuracies (Table 1). It is expected that using pooling operators solely after the entire stack of CLs will enhance accuracies even further; however, its validation in simulations of A-VGGm architectures is difficult: the running time per epoch of such large depth × input-size deep architectures is several times longer, and the optimization of accuracies is currently beyond our computational capabilities.
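A reduced-scale sketch of this comparison is given below (our re-implementation, not the original simulation code): a smaller Gaussian input, ten (3 × 3) convolutions of unit depth, and k = 2, with identical filters for the sequence-pooling and top-pooling scenarios. The estimated probability should only qualitatively reproduce the reported trend.

```python
# Small-scale sketch of the Fig. 2B comparison: O_sp uses a (2 x 2) MP after each of
# the first k CLs; O_tp uses a single (2^k x 2^k) MP after all ten CLs.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
k, n_cl = 2, 10
filters = [torch.randn(1, 1, 3, 3) for _ in range(n_cl)]   # identical filters in both scenarios

def run(x, sequence_pooling):
    for i, w in enumerate(filters):
        x = F.relu(F.conv2d(x, w, padding=1))
        if sequence_pooling and i < k:
            x = F.max_pool2d(x, 2)           # local decision after each of the first k CLs
    if not sequence_pooling:
        x = F.max_pool2d(x, 2 ** k)          # single global decision after all CLs
    return x

trials, wins = 200, 0.0
for _ in range(trials):
    x = torch.randn(1, 1, 64, 64)            # reduced input size for a quick estimate
    o_sp, o_tp = run(x, True), run(x, False)
    wins += (o_sp > o_tp).float().mean().item()
print(f"estimated P(O_sp / O_tp > 1) ≈ {wins / trials:.3f}")
```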
A simpler architecture is LeNet5 28,29, with a much lower depth and total number of CLs, consisting of two CLs, each followed by a (2 × 2) MP, and three FC layers (Fig. 3A). Its optimized average accuracy on the CIFAR10 database is 0.77 11. Advanced LeNet5 (A-LeNet5) architectures consist of pooling operators only after the second CL (Fig. 3A). In particular, two pooling options composed of consecutive (2 × 2) MP and AP operators were examined (Fig. 3A), imitating the dimensions of the two (2 × 2) MP operators of LeNet5. Indeed, these A-LeNet5 architectures enhance average accuracies by up to ~0.02 in comparison to LeNet5 (Fig. 3B), as predicted by the abovementioned argument. Similarly, using either a (4 × 4) MP or a (4 × 4) AP after the second CL resulted in ~0.79 maximized average accuracies (not shown). The shift of the MP by only one CL, from the first to the second, improves the accuracies, and an enhanced effect might be expected by skipping over more CLs in deeper architectures (Fig. 2C). An interesting aspect of A-LeNet5 is that accuracies improved although the receptive field covers only a small portion of the input, in contrast to A-VGG16.
Another type of A-LeNet5 is a combination of a pair of (2 × 2) and (3 × 3) pooling operators after the two CLs (Fig. 3A). Although the input size of the first FC layer decreased from 400 in LeNet5 to 256, the average accuracy was enhanced by ~0.011 (Fig. 3B). This result exemplifies the improved A-LeNet5 accuracies even when the input size of the first FC layer decreases. Two of the examples (Fig. 3A) consist of the same pooling operators, a (2 × 2) and a (3 × 3), but in exchanged order. Their average accuracies differ by ~0.016 (Fig. 3B), indicating that these pooling operators do not commute under exchange of their order. Another possible class of commutation is exchanging the type of operation (MP or AP, denoted by color in Fig. 3A) while maintaining the size. Average accuracies indicate that pooling operators do not necessarily commute under exchange of type either.
The two non-commutative classes, order and type of operations, stem from different numbers and locations of the backpropagation-active routes in the lower layers (Fig. 3C). The number of locally active backpropagation routes in a (6 × 6) window is 9 for (3 × 3)_A ∘ (2 × 2)_M, whereas for (3 × 3)_M ∘ (2 × 2)_A it is 4. For the exchanged order of these operators (Fig. 3A), the number of active backpropagation routes is the same, 4, in both cases; however, these 4 routes are localized within a (2 × 2) window in one order (Fig. 3B and Fig. 3C) but delocalized over the (6 × 6) window in the other (Fig. 3B and Fig. 3C). Hence, the non-commutation of pooling operators can stem either from different numbers of active backpropagation routes or from their different locations.
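These route counts can be reproduced with a short gradient check on a single (6 × 6) window (our sketch; the composition notation assumes the rightmost operator acts first, i.e., closest to the input).

```python
# Counting backpropagation-active routes in a (6 x 6) window: non-zero input gradients
# mark the routes that a learning step would update.
import torch
import torch.nn as nn

def active_routes(first, second):
    x = torch.randn(1, 1, 6, 6, requires_grad=True)
    second(first(x)).sum().backward()
    return int((x.grad != 0).sum())

mp2, ap2 = nn.MaxPool2d(2), nn.AvgPool2d(2)
mp3, ap3 = nn.MaxPool2d(3), nn.AvgPool2d(3)

print(active_routes(mp2, ap3))   # (3x3)_A o (2x2)_M : 9 active routes
print(active_routes(ap2, mp3))   # (3x3)_M o (2x2)_A : 4 active routes, localized in a 2x2 patch
print(active_routes(mp3, ap2))   # exchanged order   : 4 active routes, delocalized over the 6x6 window
```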

Discussion:
The aim of pooling operators is to downsize the input as the layers progress while transferring specific output fields, such as the maximal field in the MP operator. This selects the most influential local field, but does not ensure the most influential global field on the output. The proposed enhanced learning strategy is based on updating the most influential routes, that is, the maximal fields, on the output units. This is supported by the A-VGGm and A-LeNet5 simulations, where the average accuracies are enhanced using pooling operators placed closer to the output layer (Fig. 1, Table 1, and Fig. 3). Its underlying mechanism aims to maximize the learning gain for the current input while simultaneously minimizing the average damage to the current learning of the entire training set. Each learning step for a given input induces noise on the learning of other inputs. Hence, increasing the signal-to-noise ratio (SNR) of a learning step, averaged over the training set, requires updating the most influential routes of the current input; maximize learning with minimal side-effect damage.
The realization of the proposed advanced learning strategy entails a discussion of the following three theoretical and practical aspects. First, the selection of the most influential routes onto the first FC layer is not necessarily equivalent to the selection of the most influential routes onto the output units. However, a backpropagation step initiated at the most influential input weight of an output unit updates all the CL routes, since the spatial structure disappears within the one-dimensional FC layers. Hence, the proposed strategy only approximates the most influential routes on the outputs. The exceptions are A-VGG6 and A-VGG14 (Table 1), which consist of one FC layer and demonstrate accuracies that were only slightly below those of A-VGG8 and A-VGG16, respectively.
The second aspect concerns the computational complexity of the proposed advanced learning strategy. Selecting the most influential routes after all CLs, with their fixed depth, overloads even advanced GPUs, since the depth increases while the layer's dimension does not decrease. For instance, the running time per epoch with a (32 × 32) MP placed after all CLs of A-VGG16 was slowed down by a factor of ~10. To circumvent this difficulty, the advanced learning strategy was approximated by placing the first pooling operator before the CLs with maximal depth and the second operator after all CLs (Fig. 1 and Table 1). Nevertheless, it is interesting to examine, using advanced GPUs, whether placing pooling operators only after all CLs further advances accuracies.
The third aspect is the selection of the types, dimensions, and locations of the pooling operators along the deep architecture that maximize accuracies. For a given A-VGGm, several pooling arrangements result in similar accuracies, and we report only the one that maximizes the average accuracies under a given number of epochs. Nevertheless, the maximized average A-VGGm accuracies hint at preferred combinations in which the AP is placed before the CLs with maximal depth and the MP operates after all CLs (Table 1 and Fig. 1), which might stem from the following insight. An MP after all CLs carefully selects only one significant backpropagation route among a cluster of routes, whereas an AP close to the input layer spreads its incoming backpropagation signal to multiple routes. This arrangement was found to maximize accuracies for several A-VGGm architectures (Table 1). However, A-LeNet5 indicated an opposite trend, where an AP at the top of two adjacent pooling operators maximized accuracies (Fig. 3). The role of the pooling arrangement is not yet clear and may depend on the database and on the details of the trained architecture.
We present an argument indicating that pooling decisions adjacent to the output layer enhance accuracy (Table 1). However, one might attribute this improvement to the increase in the number of parameters in the first FC layer, whereas the number of parameters in the remaining CLs and FC layers remains the same. In order to pinpoint the accuracy improvement to the location of the pooling operators, we obtained ~0.954 for A-VGG16 with a (4 × 4) AP after the 7th CL and an (8 × 8) MP operator after the 13th CL. In this architecture, the size of the first FC layer is the same as in VGG16, and therefore the number of parameters in both remains the same; nonetheless, there is a clear improvement in the accuracy.
The non-commutative features of the pooling operators exemplify the sensitivity of the maximal average accuracies to their order and type, and significantly enrich the possible number of pooling operators of a given dimension. For an (8 × 8) pooling dimension, for instance, one can find 8 possible pooling operators, (2 × 2)_x ∘ (2 × 2)_y ∘ (2 × 2)_z, where x, y, and z equal either M (max) or A (average). Similarly, the number of such pooling operators of dimension (2^k × 2^k) is 2^k, and it increases exponentially when more than two types of (2 × 2) pooling operators are allowed. The results for A-LeNet5 indicate that enhanced accuracies can be achieved using combinations of consecutive pooling operators after the second CL (Fig. 3). However, the identification of preferred combinations that maximize the accuracies in general deep architectures deserves further research.
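A sketch of this enumeration (ours; the two-letter codes for max and average are our notation) is given below for the eight (8 × 8) compositions; applying them to the same window generally yields eight different outputs, illustrating the non-commutativity.

```python
# Enumerating the 2^k pooling operators of total dimension (2^k x 2^k) obtained by
# composing k (2 x 2) operators, each either max ("M") or average ("A").
from itertools import product
import torch
import torch.nn as nn

def compositions(k):
    return {"".join(c): nn.Sequential(*[nn.MaxPool2d(2) if t == "M" else nn.AvgPool2d(2) for t in c])
            for c in product("MA", repeat=k)}

pools = compositions(3)                                  # 8 distinct (8 x 8) pooling operators
x = torch.randn(1, 1, 8, 8)
print({name: round(p(x).item(), 3) for name, p in pools.items()})   # generally 8 different values
```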
The non-commutative features of pooling operators also stem from their different numbers of backpropagation downstream updated routes (Fig. 3C). For instance, A-VGG16 with a (32 × 32) MP before the first FC layer consists of a single backpropagation downstream updated route per filter, whereas for a (32 × 32) AP there are 1024 routes. Nevertheless, the preferred pooling operators that maximize accuracies remain to be determined. The most influential route is favored to correct the output of the current input; however, it induces output noise on other training inputs, resulting in a low SNR. Similarly, updating 1024 downstream routes using AP, including the weak ones, increases the correct output of the current input in comparison to MP, however with an enhanced side effect, noise on other training inputs, and a possibly low SNR. Hence, for a given architecture and dataset, the selection of pooling operators that maximizes the averaged SNR per epoch is yet unclear.
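This count of downstream updated routes can be checked directly (our sketch, with a single filter):

```python
# Number of backpropagation-active routes per filter for a (32 x 32) MP versus a (32 x 32) AP.
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32, requires_grad=True)
for pool in (nn.MaxPool2d(32), nn.AvgPool2d(32)):
    x.grad = None
    pool(x).sum().backward()
    print(pool.__class__.__name__, int((x.grad != 0).sum()))   # 1 route for MP, 1024 routes for AP
```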
The accuracies of A-VGG6 and A-VGG14, with only one FC layer, were only slightly below those of the three-FC-layer architectures, A-VGG8 and A-VGG16, respectively (Table 1). Architectures with only one FC layer are characterized by a lower learning complexity and a smaller number of tunable parameters. In addition, these architectures can be mapped onto tree architectures 30,31, generalizing the recent mapping of LeNet onto a tree architecture without affecting accuracies but with lower computational learning complexity 31. Tree mapping of architectures comprising more than two CLs, inspired by dendritic tree learning 30-35, is beyond the scope of the presented work and will be discussed elsewhere.

Advanced VGGm architectures. The examined advanced VGGm (A-VGGm) architectures consist of m layers, 6 ≤ m ≤ 16 (Fig. 1A,B exemplifies m = 16 for VGG16 1 and A-VGG16). For m = 6 and 8, the architecture is similar to VGG8 1, with an initial depth of 64 for the first CL, doubling depth for the next three CLs, and a single zero-padding around the input of each CL. For m = 6, a (2 × 2) average pooling (AP) is applied after the third CL and an (8 × 8) max-pooling (MP) after the fifth CL. For m = 8, a (2 × 2) AP is applied after the third CL and a (4 × 4) MP after the fifth CL. m = 6 terminates with one FC layer consisting of 2048 hidden units, and m = 8 with three FC layers of 8192 hidden units each.
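A minimal skeleton of the m = 6 architecture under this reading is sketched below (our interpretation; the exact CL depths, the batch-normalization placement, and the FC sizing are assumptions and may differ from the authors' implementation):

```python
# Minimal A-VGG6 skeleton for (32 x 32) CIFAR-10 inputs, following the description above.
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3 x 3 convolution with single zero-padding, batch normalization, and ReLU
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU())

a_vgg6 = nn.Sequential(
    conv_block(3, 64), conv_block(64, 128), conv_block(128, 256),
    nn.AvgPool2d(2),                    # (2 x 2) AP after the third CL
    conv_block(256, 512), conv_block(512, 512),
    nn.MaxPool2d(8),                    # (8 x 8) MP after the fifth CL
    nn.Flatten(),                       # 512 x 2 x 2 = 2048 inputs to the single FC layer
    nn.Linear(2048, 10),
)
```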
For m = 14 and 16, there are 13 CLs with doubling depth (except for the last three CLs) and a single zero-padding around the input of each CL, followed by one FC layer with 8192 hidden units for m = 14 and by three FC layers with 4096 hidden units each for m = 16. For both m = 14 and 16, a (4 × 4) AP is applied after the 7th CL and a (2 × 2) MP after the 13th CL.
For m = 13, the last three CLs are withdrawn, resulting in ten CLs, where a (4 × 4) AP is also applied after the 7th CL and a (4 × 4) MP after the 10th CL, terminating with three FC layers consisting of 2048 hidden units each.
After each CL, a batch-normalization layer was applied. The softmax function was applied to the ten outputs. The ReLU activation function was assigned to each hidden unit (not including the ten output units and the pooling operators), and all weights were initialized using a uniform distribution with a zero mean and unity standard deviation (Std) according to the He normal initialization 3. For A-VGG13 and A-VGG16 with linear activation functions for the FC layers, the architectures remain the same.

Advanced LeNet5 architectures. The advanced LeNet5 (A-LeNet5) architectures consist of two consecutive (5 × 5) CLs with depths d1 = 6 and d2 = 16 and three FC layers (Fig. 3A). These architectures are similar to LeNet5 2; however, the pooling operators are applied only after the second CL (Fig. 3A). The ReLU activation function was assigned to each hidden unit, and the softmax function was applied to the ten output units. All weights were initialized using a uniform distribution with a zero mean and unity Std according to the He normal initialization 3.

Data preprocessing. Each input pixel of a (32 × 32) image from the CIFAR-10 database was divided by the maximal pixel value, 255, multiplied by 2, and subtracted by 1, such that its range was [-1, 1]. In all simulations, data augmentation derived from the original images was used, by random horizontal flipping and translations of up to four pixels in each direction.
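A sketch of this preprocessing and augmentation using standard torchvision transforms (our re-implementation, not the authors' code) is:

```python
# CIFAR-10 preprocessing and augmentation: pixels mapped to [-1, 1], random horizontal
# flips, and translations of up to four pixels in each direction (via padded random crops).
import torchvision.transforms as T
from torchvision.datasets import CIFAR10

train_transform = T.Compose([
    T.RandomHorizontalFlip(),                        # random horizontal flipping
    T.RandomCrop(32, padding=4),                     # up to four-pixel translation in each direction
    T.ToTensor(),                                    # pixels scaled to [0, 1]
    T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),   # maps [0, 1] to [-1, 1], i.e., 2x/255 - 1
])

train_set = CIFAR10(root="./data", train=True, download=True, transform=train_transform)
```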
Optimization. The cross-entropy cost function was selected for the classification task and was minimized using the stochastic gradient descent algorithm 4,5. The maximal accuracy was determined by searching through the hyper-parameters (see below). Cross-validation was confirmed using several validation databases, each consisting of 10,000 random examples from the training set, as in the test set. The averaged validation results were within the same Std as the reported average success rates. The Nesterov momentum 3 and the L2 regularization method 4 were applied.
Hyper-parameters. The hyper-parameters η (learning rate), μ (momentum constant 3), and α (L2 regularization 4) were optimized for offline learning, using a mini-batch size of 100 inputs. The learning-rate decay schedule 5,6 was also optimized, such that the learning rate was multiplied by the decay factor q every Δt epochs; it is denoted below as (q, Δt).
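The corresponding optimizer setup can be sketched as follows (our illustration; the numerical values of η, μ, and α are placeholders rather than the tuned ones, and the model is a stand-in):

```python
# Cross-entropy loss minimized by SGD with Nesterov momentum and L2 regularization,
# mini-batch size of 100 inputs.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # stand-in for an A-VGGm network
eta, mu, alpha = 0.01, 0.9, 5e-4        # placeholder learning rate, momentum, and L2 coefficient
criterion = nn.CrossEntropyLoss()       # cross-entropy cost function
optimizer = torch.optim.SGD(model.parameters(), lr=eta, momentum=mu,
                            nesterov=True, weight_decay=alpha)
loader = torch.utils.data.DataLoader(train_set, batch_size=100, shuffle=True)  # train_set from the sketch above
```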
Out-of-phase scheduling. For A-VGG16 and A-VGG8, the decay schedules 5,6 of the FC-layer and CL learning rates had a phase of 10 epochs between them. The decay scheduling for the FC layers starts at epoch = 10, while for the CLs it starts at epoch = 20; specifically, the learning-rate decay of the FC layers occurs at epochs [10, 30, 50, …], while for the CLs it occurs at epochs [20, 40, 60, …]. The decay schedule for the learning rate is defined as follows. For the CLs: (q, Δt) = (0.65, 20) for epoch ≤ 140 and (0.55, 20) for epoch > 140. For the FC layers, with 10 epochs out of phase: (q, Δt) = (0.65, 20) for epoch < 150 and (0.5, 20) for epoch ≥ 150. The accuracies' Std is 0.0015.
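A sketch of this out-of-phase schedule (our illustration; the CL/FC parameter split, the stand-in model, and the epoch loop are ours) is:

```python
# Out-of-phase step decay: FC learning rate decays at epochs 10, 30, 50, ...;
# CL learning rate decays at epochs 20, 40, 60, ..., with the (q, dt) values above.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.Flatten(), nn.Linear(64 * 32 * 32, 10))
optimizer = torch.optim.SGD([{"params": model[0].parameters(), "lr": 0.01},   # CL group
                             {"params": model[2].parameters(), "lr": 0.01}],  # FC group
                            momentum=0.9, nesterov=True)

def decay_now(epoch, start, dt=20):
    # decay events at epochs start, start + dt, start + 2*dt, ...
    return epoch >= start and (epoch - start) % dt == 0

for epoch in range(1, 201):
    # ... one training epoch over the mini-batches ...
    if decay_now(epoch, start=20):                              # CLs
        optimizer.param_groups[0]["lr"] *= 0.65 if epoch <= 140 else 0.55
    if decay_now(epoch, start=10):                              # FC layers, 10 epochs out of phase
        optimizer.param_groups[1]["lr"] *= 0.65 if epoch < 150 else 0.5
```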

Statistics. Statistics for each architecture were obtained using 10 samples.
Hardware and software. We used Google Colab Pro and its available GPUs, and PyTorch for all the programming.


Figure 2: Comparison between several small MP operators along the CLs and a single large one at their end. (A) A binary tree whose random nodal values are low (L), medium (M), or high (H), e.g., 1, 10, and 1000. A local decision selects the path to the maximal nodal child (gray), resulting in the brown route connecting three gray nodes. A global decision selects the green route, maximizing the product of its nodal values. (B) A Gaussian random (1024 × 1024) input followed by ten (3 × 3) CLs, where a (2 × 2) MP is placed after each of the first four CLs (top), and a similar architecture where a single (16 × 16) MP is placed after the ten CLs (bottom). The (64 × 64) output values are denoted O_SP (top) and O_TP (bottom) (Supplementary Information). (C) The probability P(O_SP/O_TP > 1) as a function of the number of consecutive (2 × 2) MP operators, k.

Fig. 2. In Fig. 2, two architectures were compared, each consisting of ten CLs with unity depth and the same ten (3 × 3) filters. Random inputs of size (1024 × 1024), with values drawn from a Gaussian distribution with zero mean and unity Std, were tested. The ReLU activation function was assigned to all the hidden and output units.