Towards a universal mechanism for successful deep learning

Recently, the underlying mechanism for successful deep learning (DL) was presented based on a quantitative method that measures the quality of a single filter in each layer of a DL model, particularly VGG-16 trained on CIFAR-10. This method demonstrates that each filter identifies small clusters of possible output labels, with additional noise represented by labels selected outside the clusters. This feature is progressively sharpened with each layer, resulting in an enhanced signal-to-noise ratio (SNR), which leads to an increase in the accuracy of the DL network. In this study, this mechanism is verified for VGG-16 and EfficientNet-B0 trained on the CIFAR-100 and ImageNet datasets, and the main results are as follows. First, the accuracy and SNR progressively increase with the layers. Second, for a given deep architecture, the maximal error rate increases approximately linearly with the number of output labels. Third, similar trends were obtained for numbers of dataset labels in the range [3, 1,000], thus supporting the universality of this mechanism. Understanding the performance of a single filter and its dominating features paves the way to highly dilute the deep architecture without affecting its overall accuracy, and this can be achieved by applying the filter's cluster connections (AFCC).


Introduction
A prototypical supervised learning task involves object classification, which is realized using deep architectures [1][2][3]. These architectures consist of up to hundreds of convolutional layers (CLs) [4][5][6], each of which consists of tens or hundreds of filters, and several additional fully connected (FC) hidden layers. As the classification task becomes more complex, for example with a small training dataset or distant objects that belong to the same class, deeper architectures are typically required to achieve enhanced accuracies. The training of their enormous number of weights requires nonlocal training techniques such as backpropagation (BP) [7][8][9], which are implemented on advanced GPUs and can guarantee convergence only to a suboptimal solution.
The current knowledge of the underlying mechanism of successful deep learning (DL) is vague 1,[10][11][12][13]. The common assumption is that the first CL reveals local features of an input object, whereas large-scale features and features of features, which characterize a class of inputs, are progressively revealed in the subsequent CLs 1,[14][15][16][17]. The terminologies of features and features of features, and the possible hierarchy among them, have not been quantitatively well defined. In addition, the existence of a universal underlying mechanism of successful DL remains unclear. Is the realization of a classification task using deep and shallower architectures with different accuracies based on the same set of features? Similarly, is the realization of different classification tasks using a given deep architecture based on the same type of features?
A quantitative method to explain the underlying mechanism of successful DL 18 was recently presented and exemplified using a limited deep architecture and dataset, namely VGG-16 10 on CIFAR-10 14 and advanced variants thereof 10,19. This method enables the quantification of the progressive accuracies with the layers and the functionality of each filter in a layer, and consists of the following three main stages.
In the first stage, the entire deep architecture is trained using optimized parameters to minimize the loss function. In the second stage, the weights of the first m trained layers remain unchanged, and their outputs are FC with random initial weights to the output layer, which represents the labels. The output of the first m layers represents the preprocessing of an input using the partial deep architecture, and the FC layer is trained to minimize the loss, which is a relatively simple computational task. The results indicate that the test accuracy 20 increases progressively with the number of layers, m, towards the output.
In the third stage, the trained weights of the FC layer are used to quantify the functionality of each filter constituting its input layer. The single-filter performance is calculated with all weights of the FC layer silenced except for the specific weights that emerge from a single filter. At this point, the test inputs are presented and preprocessed by the first m layers, but influence the output units only through the small aperture of one filter. The results demonstrate that each filter essentially identifies a small subset among the ten possible output labels, which is a feature that is progressively sharpened with the layers, thereby resulting in enhanced signal-to-noise ratios (SNRs) and accuracies 18. These three stages, which constitute the method by which the performance of a single filter is calculated, are presented in Fig. 1.
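As a concrete illustration, the silencing step of the third stage can be sketched in a few lines of NumPy; the array shapes and names below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def single_filter_fields(W, features, filter_idx, units_per_filter):
    """Output fields when only the weights emerging from one filter are kept.

    W        : (n_labels, n_features) trained FC weight matrix
    features : (n_samples, n_features) outputs of the first m layers for some inputs
    """
    mask = np.zeros(W.shape[1], dtype=bool)
    start = filter_idx * units_per_filter
    mask[start:start + units_per_filter] = True   # keep this filter's aperture only
    W_silenced = W * mask[np.newaxis, :]          # silence all other FC weights
    return features @ W_silenced.T                # fields on the n_labels output units

# toy example: 3 filters of 4 units each, 5 output labels, 2 test inputs
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 12))
x = rng.normal(size=(2, 12))
fields = single_filter_fields(W, x, filter_idx=1, units_per_filter=4)
```

The returned fields are identical to projecting the inputs through only the chosen filter's columns of the FC matrix.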
As the method for the underlying mechanism of successful DL was tested on only one deep architecture and one dataset composed of small images 18, its generality is questionable. In this study, we investigate its universality by training EfficientNet-B0 21 and VGG-16 on extended datasets in which the number of output labels is in the range [3, 1,000], taken from CIFAR-10 14, CIFAR-100 14 and ImageNet 15,22. The results strongly suggest the universality of the proposed DL mechanism, which is verified for numbers of output labels varying over three orders of magnitude, small (32 × 32) and large (224 × 224) images, and state-of-the-art deep architectures.
In the following section, the underlying mechanism of DL is explained using the results for VGG-16 on CIFAR-100. Thereafter, the results are extended to EfficientNet-B0 on CIFAR-100 and ImageNet. Next, the case of training VGG-16 and EfficientNet-B0 on a varying number of labels taken from CIFAR-100, as well as VGG-16 on CIFAR-10, is discussed. Finally, a summary and several suggested techniques for improving the computational complexity and accuracy of deep architectures are briefly presented in the discussion section.

A. Results of VGG-16 on CIFAR-100
The training of VGG-16 on CIFAR-100 (Fig. 2A) with optimized parameters yielded a test accuracy of approximately 0.75 (Table 1 and Supplementary Information), which was slightly higher than the previously obtained accuracy 23. Next, the weights of the first m trained layers were held unchanged, and their outputs were FC with random initial weights to the output layer. The selected layers were those that terminated with max-pooling, m = 2, 4, 7, 10, and 13. The training of these FC layers indicates that the accuracy increased progressively with the number of layers and saturated at m = 10 (Table 1), which is a result of the small image inputs of 32 × 32. The three CLs (3 × 3), layers 8 − 10, generate a 7 × 7 receptive field 24 covering a filter size of 4 × 4; hence, layers 11 − 13 are redundant for small images. The performance of a single filter is represented by a 100 × 100 matrix and is exemplified for layer 10 (Fig. 3, left). The element (i, j) represents the average of the fields generated by the label-i test inputs on output unit j, where the matrix elements are normalized by their maximal element. Next, its Boolean clipped matrix following a specified threshold is calculated (Fig. 3, middle), as well as its permuted version, which forms diagonal clusters (Fig. 3, right, Supplementary Information). The above-threshold elements out of the diagonal clusters are defined as the filter noise N (yellow elements in Fig. 3, right).
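The construction of such a single-filter matrix can be sketched as follows; a minimal NumPy illustration with toy data, where the threshold value is an illustrative assumption rather than the value used in the paper:

```python
import numpy as np

def filter_matrix(fields, labels, n_labels, threshold=0.2):
    """Build the (n_labels x n_labels) average-field matrix of one filter and
    its Boolean clipped version (threshold is an illustrative choice)."""
    M = np.zeros((n_labels, n_labels))
    for lbl in range(n_labels):
        M[lbl] = fields[labels == lbl].mean(axis=0)  # average fields of label lbl
    M = M / np.abs(M).max()                          # normalize by the maximal element
    B = (M > threshold).astype(int)                  # above-threshold Boolean matrix
    return M, B

# toy data: 3 labels, 2 samples each; single-filter fields on the 3 output units
labels = np.array([0, 0, 1, 1, 2, 2])
fields = np.array([[10., 0., 0.], [10., 0., 0.],
                   [0., 10., 0.], [0., 10., 0.],
                   [0., 0., 1.], [0., 0., 1.]])
M, B = filter_matrix(fields, labels, n_labels=3)
```

In this toy case, labels 0 and 1 survive the clipping, whereas the weak diagonal element of label 2 falls below the threshold.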
The performance of each filter was calculated using test inputs, with all weights of the trained FC layer silenced except for those that emerged from the filter. The estimated main averaged properties of the D(m) filters belonging to the m-th layer are the cluster size C_s(m), the number of clusters per filter C_n(m), and the number of noise elements out of the clusters, N(m) (Table 1). The results clearly indicate that N(m) decreases with m until the accuracy is saturated at m = 10, where the average cluster size is small, 2 out of 100 labels. In addition, the average number of cluster elements per filter is very small, C_n ⋅ C_s² = 2.6 × 2² = 10.4 out of the 10,000 matrix elements (Table 1).
The estimation of the SNR using the following quantities is required to understand the mechanism underlying DL. The average number of appearances of each of the K labels among the diagonal elements in the clusters of the layer is

Signal ≈ D(m) ⋅ C_n(m) ⋅ C_s(m) / K, (1)

which represents the Signal under the assumption of a uniform number of appearances of each diagonal element over all clusters. The average expected Signal that emerges from the 10th layer is approximately 26.6 (Table 1 and Eq. (1)), which fluctuates among the 100 labels (Fig. 4a). The average internal cluster noise, N_in, is equal to the average number of appearances of other labels in the clusters forming the Signal of a given label, which results in an average N_in of approximately 0.27 for the 10th layer, with relatively small fluctuations among the labels (Fig. 4a). Furthermore, the internal SNR obeys

SNR_in = Signal / N_in ≫ 1, (2)

provided that the average cluster size remains small. The second type of noise stems from the above-threshold matrix elements out of the clusters, which is the external noise N. Using the assumption of uniform noise over the off-diagonal matrix elements, the average external noise per label is approximated as

N_ext ≈ D(m) ⋅ N(m) / K, (3)

where the average number of elements that belong to the clusters of each filter is negligible compared to K². As N_ext ∝ N, the external SNR,

SNR_ext = Signal / N_ext, (4)

increases with a decrease in N. This is the origin of the DL mechanism, where N decreases progressively with the number of layers, thereby enhancing the accuracy (Eq. (4)). For example, N_ext is approximately 0.83 for the 10th layer, whereas it is approximately 29 for the 4th layer, where the Signal is only 18 (Table 1 and Eqs. (1)-(3)).
Note that the above calculations neglect the subthreshold elements; however, they are typically several orders of magnitude smaller than the above-threshold elements and are frequently negative 18 (Fig. 3).
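Under one plausible reading of these averaged quantities, the quoted layer-10 numbers can be reproduced directly from the Table 1 averages; the formula below is an assumption inferred from the uniform-appearance argument in the text, with D = 512 filters and K = 100 labels:

```python
# Assumed reading of Eq. (1): Signal ≈ D(m)·C_n(m)·C_s(m)/K, i.e., the total number
# of diagonal cluster elements over all filters, divided evenly among the K labels.
D, K = 512, 100        # filters in VGG-16 layer 10 and CIFAR-100 output labels
C_n, C_s = 2.6, 2.0    # average clusters per filter and cluster size (Table 1)

signal = D * C_n * C_s / K       # ≈ 26.6, the quoted layer-10 Signal
elements = C_n * C_s ** 2        # ≈ 10.4 above-threshold cluster elements per filter
```

Both values match the figures quoted in the text, which supports this reading of the averaged estimate.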
Although the above estimations of SNR_in and SNR_ext, Eqs. (1)-(4), were expected to fluctuate among the labels, they were found to be much greater than unity per label (Fig. 4a). In addition, these SNRs may be far from reality because the matrix (Fig. 3, left) was first normalized by its maximal value, which varied significantly among the filters, following which the above-threshold elements were defined to form a Boolean matrix. Nevertheless, the summation of the fields of the above-threshold elements, instead of their Boolean summations, indicates that SNR_in and SNR_ext for each label were much greater than unity (Fig. 4b), and their averaged values are comparable to the estimated values based on the Boolean filters.
The progressive decrease in N_ext with the layers of a given trained deep architecture is the underlying mechanism for successful DL (Eq. (4)). Nevertheless, a large estimated SNR does not necessarily ensure an accuracy that approaches unity, because it is based only on averaged quantities (Eqs. (1)-(4)), where large fluctuations around the average values are expected, particularly for large K. In addition, a positive field of a cluster element cannot exclude negative fields for a large fraction of the corresponding input label. In Fig. 4b, the average signal (dashed blue horizontal line), internal noise (red), and internal plus external noise (orange) values are 301, 4.7, and 9.7, respectively.

B. Results of EfficientNet-B0 on CIFAR-100
The training of the expanded 224 × 224 images 25 of CIFAR-100 on EfficientNet-B0 was performed using transfer learning 26,27 (Supplementary Information) and yielded an improved accuracy of 0.867 (Table 2). This architecture does not include max-pooling operators; instead, a decrease by a factor of two in the layer dimensions is achieved using stride-2 at specific CLs. Hence, similar to the case of VGG-16, the accuracies and average filter properties were estimated at the end of the stages containing stride-2, namely stages 1, 3, 4, 5, 7, and 9. The outputs of these stages were first sampled by 7 × 7 average pooling, as built into stage 9, followed by a layer that was FC to the 100 output units and trained to minimize the loss (Supplementary Information). The results indicate that the accuracy almost always increased with the number of stages and the noise per filter decreased (Table 2), thereby supporting the proposed universal mechanism underlying DL. The semi-plateau of the accuracies of stages 4 and 5 was common to all examined datasets using EfficientNet-B0, which suggests that this architecture might be simplified without affecting its accuracy, for example by removing some layers around stage 5 (see Discussion section).
The progressive decrease in the noise N with the layers or stages of a particular deep architecture is the underlying mechanism of DL. However, a comparison of the SNRs of two deep architectures does not necessarily correlate with their accuracies. For instance, the improved EfficientNet-B0 accuracy of 0.867, in comparison with ~0.75 for VGG-16 (Tables 1-2), could not be simply deduced from their SNRs (Eq. (4)), because N was doubled for EfficientNet-B0, whereas C_n ⋅ C_s was reduced from 5.2 in VGG-16 to only approximately 4. The accuracy improvement of EfficientNet-B0 probably stems from the enhanced Signal of approximately 64, whereas it was only approximately 27 for VGG-16 (Eq. (1)), as well as from the distribution of their output fields for the test inputs.

Table 2. Accuracy per stage and statistical features of their filters for EfficientNet-B0 trained on CIFAR-100. The presented results were obtained at the end of stages consisting of stride-2 only, which reduce the size of the output layer by a factor of two, similar to the max-pooling operator in VGG-16 (Table 1).

C. Results of EfficientNet-B0 on ImageNet
The presented underlying mechanism of DL was extended to a dataset consisting of 1,000 labels and 224 × 224 input images, with the pre-trained EfficientNet-B0 on the ImageNet dataset 15,22 (Fig. 2B) constituting the initial stage of the following procedure. The output layer of stages 1, 3, 4, 5, 7, and 9 was FC with random initial weights to the 1,000 outputs (Table 3). Next, these FC weights were trained to minimize the loss, with all remaining weights of the trained EfficientNet-B0 kept fixed. Finally, the accuracy of the different stages and the statistical properties of their filters were estimated (Table 3).
As training of these FC layers using the large ImageNet dataset (1.4 million images) was beyond our computational capability, we divided the 50,000 images of the validation set into 40,000 images for training and 10,000 for testing. This training of the stage-9 FC layer was similar to transfer learning 26,27 and yielded an accuracy of approximately 0.75, where the original accuracy of the entire pre-trained EfficientNet-B0 was approximately 0.78 (Supplementary Information).
The accuracy increases with the stages, whereas the noise N typically decreases (Table 3), which supports the universal underlying mechanism of DL. Interestingly, the average cluster size, C_s, and the number of clusters per filter, C_n, measured at the last stage or layer that saturated the accuracy, increased only slightly while K increased from 100 to 1,000 (Tables 1-3). The exception of stage 3, in which N was non-monotonic (Table 3), may stem from the small D = 24, resulting in D ⋅ C_n ⋅ C_s ∼ 601 < 1,000, whereas this product was greater than 1,000 for the other stages. For stage 3, a large fraction of the labels (∼500) did not appear in any of the clusters, and their estimated signal was zero. For all other stages, D was larger and D ⋅ C_n ⋅ C_s > 1,000, resulting in a significantly lower number of labels with zero signal. Note that this anomaly of stage 3 was indeed absent in CIFAR-100 (Table 1).
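The stage-3 anomaly follows from a simple coverage argument: on average, the clusters of a layer can mention at most D ⋅ C_n ⋅ C_s labels, so when this product falls below K, some labels receive zero signal. A small sketch, where the C_n and C_s values are illustrative stand-ins chosen to reproduce the quoted product of ~601:

```python
def label_slots(D, C_n, C_s):
    """Average number of diagonal (label) appearances offered by a layer:
    D filters, each with C_n clusters of C_s labels on average."""
    return D * C_n * C_s

K = 1000  # ImageNet output labels
# stage 3 of EfficientNet-B0 has D = 24 filters; the text quotes D·C_n·C_s ~ 601 < 1,000,
# so roughly half of the labels cannot appear in any cluster (illustrative C_n, C_s here)
stage3_slots = label_slots(24, C_n=5.0, C_s=5.0)
```

Whenever `label_slots` falls below K, at least K − D ⋅ C_n ⋅ C_s labels are guaranteed to have zero estimated signal.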
Similar trends are expected for VGG-16 on ImageNet, with much lower accuracy and higher noise than EfficientNet-B0. In this case, the image dimension is greater by a factor of 7; hence, the FC layer sizes become significantly larger, and the optimization of those layers is currently beyond our computational capabilities.

D1. CIFAR-100 with varying number of labels

The universal mechanism of DL was also examined for VGG-16 trained on a varying number of labels, K, taken from CIFAR-100 (Table 4). The accuracy increased progressively with the number of layers until saturation at the 10th layer, and the out-of-cluster noise N decreased progressively with the number of layers. Interestingly, C_s and C_n were only slightly affected by K at the 10th layer (Tables 1 and 4). The test error, ϵ = 1 − Accuracy, is expected to increase with K, since the classification task is more complex; the results indicate that this increase is approximately linear with K (Fig. 5). Nevertheless, the extrapolation of the linear fit to a smaller K approaching unity indicates that a limited crossover is expected, as ϵ is expected to vanish for K = 1.

Similar results were obtained for EfficientNet-B0 trained on K labels taken from CIFAR-100 (Table 5). Again, the accuracy increased progressively with the stages (except for stage 5 at K = 60) and N decreased progressively with the stages, thereby exemplifying the universality of the mechanism underlying DL. Similar to the case of VGG-16, the test error ϵ also increased approximately linearly with K and almost vanished, as expected, at K = 1 (Fig. 6). Note that the slope of the approximated linear fit fluctuated slightly among the samples (Supplementary Information). In addition, the average cluster size C_s increased slightly from 1.6 for K = 10 to 3 for K = 100, whereas the number of clusters per filter C_n was approximately 1.1 and independent of K (Table 5).

Table 5. Accuracy per stage and statistical features of their filters for EfficientNet-B0 trained on K labels from CIFAR-100. The results are similar to those of Table 2.

D2. CIFAR-10 with varying number of labels
The universal mechanism of DL was also verified for VGG-16 trained on CIFAR-10 with varying K = 3, 6, 8, and 10 (Table 6). The accuracy increased progressively with the number of layers until saturation at the 10th layer, and N decreased progressively with the number of layers. Similar to the case of CIFAR-100, the test error increased approximately linearly with K (Fig. 7), where in the extrapolation to K = 1, ϵ approaches zero, as expected.

E. Applying Filter Cluster Connections (AFCC)
The new comprehensive understanding of how the filters function in a trained deep architecture can promote improved technological implementation methods by applying the filter's cluster connections (AFCC) (Fig. 1). As each filter consists of only several small clusters, thereby generating a significant output signal for a small set of labels, its output for any other label can be neglected while the same accuracy is achieved. To test the AFCC hypothesis, a trained VGG-16 on CIFAR-100 was examined, where the accuracy of approximately 0.752 is saturated at the 10th layer (Table 1). The number of weights of the FC layer is 204,800; 512 × 2 × 2 input units emerging from the 512 filters, multiplied by 100 output units. All the weights that did not belong to a cluster in a specific filter were set to zero, resulting in approximately 194,000 zeroed weights out of 204,800 (a 95% reduction). The number of remaining weights, 10,800, is well approximated by 512 ⋅ 2 ⋅ 2 ⋅ C_n ⋅ C_s ≈ 10,600 (Table 1). After only a few training epochs, while maintaining the ~194,000 zeroed weights at zero, a similar accuracy, ~0.752, was recovered, which indicates that the FC layer can be significantly reduced while yielding similar results (Supplementary Information).
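A minimal NumPy sketch of this AFCC dilution of the FC layer follows; the shapes match the VGG-16 layer-10 case described above, but the cluster assignments are random stand-ins for the measured ones:

```python
import numpy as np

n_filters, units_per_filter, n_labels = 512, 4, 100   # VGG-16 layer 10 on CIFAR-100
rng = np.random.default_rng(1)
W = rng.normal(size=(n_labels, n_filters * units_per_filter))  # 204,800 FC weights

# stand-in cluster labels: a small random label subset per filter
# (5 labels per filter roughly mimics the measured C_n·C_s ≈ 5.2)
clusters = [rng.choice(n_labels, size=5, replace=False) for _ in range(n_filters)]

mask = np.zeros_like(W, dtype=bool)
for f, labels in enumerate(clusters):
    cols = slice(f * units_per_filter, (f + 1) * units_per_filter)
    mask[labels, cols] = True    # keep only weights toward this filter's cluster labels

W_diluted = W * mask             # ~95% of the FC weights are zeroed
kept = int(mask.sum())           # here 512·5·4 = 10,240 weights survive
```

Retraining would then update only the surviving weights, keeping the zeroed entries clamped at zero.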
Note that the same filter clusters were detected for both the training and test sets 18. Performing the same classification tasks with a significantly smaller number of FC-layer weights can improve the test computational complexity, as well as reduce the memory usage. Thus, the expansion of the AFCC method to include several layers can significantly reduce the complexity and deserves further research.
A similar effect was observed for EfficientNet-B0 trained on CIFAR-100, with an accuracy of 0.867 (Table 2). The number of weights of the FC layer is 128,000; 1,280 × 1 input units emerging from the 1,280 filters, multiplied by 100 output units. All of the weights that did not belong to a cluster in a specific filter were set to zero, resulting in only 4,900 (~1,280 ⋅ C_n ⋅ C_s) non-zero weights (a ~96% reduction). After retraining the entire network with the same parameters, while including only the 4,900 non-zeroed weights, the accuracy increased to ~0.873, indicating that the FC layer can be significantly reduced and still yield similar or even increased accuracy (Supplementary Information). One cannot exclude a similar increase in accuracy without pruning the FC layer and using different training parameters; however, AFCC training is more efficient. This gain in the test computational complexity is expected to be enhanced further in datasets with a higher number of labels, such as ImageNet, and in larger classification tasks.
The training of EfficientNet-B0 on CIFAR-100 indicates almost identical accuracies for stages 4 and 5 (Tables 2 and 5), whereas the noise, N, is non-monotonic between stages 3 and 4 for EfficientNet-B0 trained on ImageNet (Table 3). These results hint that stages 3 − 5 of EfficientNet-B0 might be further optimized. Indeed, reducing the number of layers constituting stages 3 and 4 to one and training this modified EfficientNet-B0 on CIFAR-100 using transfer learning 26,27 resulted in an accuracy of ~0.864, which approached the original accuracy (Table 2). Similarly, reducing the number of layers in stage 5 from 3 to 2 resulted in an accuracy of at least 0.862 (Supplementary Information). Hence, following the proposed method, the latency of EfficientNet-B0 can be reduced without practically affecting its performance, at least for the CIFAR-100 dataset. Another simplification is the removal of stage 9 from the construction of EfficientNet-B0 and connecting stage 8, with only 320 filters, to the output layer using the AFCC method. In this case, the obtained accuracy is at least 0.868, which slightly exceeds the accuracy of the entire model terminating with 1,280 filters for the classification of CIFAR-100 (Supplementary Information).

Discussion
The underlying mechanism of DL was quantitatively examined for two deep architectures, namely VGG-16 and EfficientNet-B0, trained on the CIFAR-10, CIFAR-100, and ImageNet datasets. These examinations enabled the verification of the suggested underlying mechanism of DL with different architectures consisting of 16 to over 150 layers, as well as with numbers of output labels ranging over three orders of magnitude, [3, 1,000].
The first step of the proposed method involves quantifying the accuracy of each CL of a trained deep architecture using the following procedure, which has relatively low computational complexity. The entire deep architecture is trained to minimize the loss. The weights of the first specified number of trained layers are held unchanged, and their output units are FC to the output layer. These output units of an intermediate hidden layer represent the preprocessing of an input using a partial deep architecture, and the FC layer is trained to minimize the loss. The test set results indicate that the accuracy increases progressively with the number of layers towards the output (Tables 1-6).
The trained FC layer weights are used to quantify the functionality of each filter that belongs to its input layer.The single-filter performance is calculated when all weights of the FC layer are silenced, except for the specific weights that emerge from the single filter.
At this point, the test inputs are preprocessed by the first given number of trained layers, but influence the K output units, representing the labels, only through the small aperture of one filter. This procedure generates a (K, K) matrix, where element (i, j) represents the average field generated by the label-i test inputs on output unit j. This matrix is normalized by its maximal element, following which a Boolean clipped matrix is formed following a given threshold. Its permuted version forms diagonal clusters (Fig. 3), the sizes of which increase only slightly when a deep architecture is trained on a dataset with an increasing number of labels (Tables 2 and 3). The diagonal elements of the clusters represent the signal, whereas their off-diagonal elements represent the internal noise, resulting in uncertainty regarding the input label given an above-threshold output. The second type of noise, namely the external noise, stems from the above-threshold elements out of the diagonal clusters. This noise progressively decreases with the number of layers and forms the underlying mechanism of DL.
The proposed method suggests quantitative measures and building blocks to describe the underlying mechanism of DL. The vocabulary is the preferred subset of labels of each filter's clusters, which competes with the filter's noise. In addition to the contribution of this method to the understanding of how DL works, it provides insight into several practical aspects, including the following two: the possibility of improving the computational complexity and accuracy of deep architectures, and the identification of weak stages in the construction of pre-existing deep architectures.
Using the single-filter performance can lead to an efficient way to dilute the system without affecting its performance, as demonstrated by the AFCC method. Its expansion to include several layers can significantly reduce the complexity and deserves further research. This insightful dilution technique should be explored further on other datasets and deep architectures. In addition, its efficiency should be compared with that of other methods, which primarily rely on random dilution processes [28][29][30][31], to assess their effectiveness in reducing complexity.
The presented universal underlying mechanism of DL may suggest an estimation method for the necessary number of filters in each layer. Each label must appear at least once in the clusters of the layer; hence, 1,280 filters in stage 9 of EfficientNet-B0 appear to be insufficient to classify, for example, 100,000 labels. Nevertheless, the results indicate that the number of diagonal elements per filter, C_n ⋅ C_s, increases from 3.6 for CIFAR-100 to 15.4 for ImageNet (Tables 2 and 3). Therefore, one cannot exclude the scenario in which the filters constitute many relatively small clusters when the number of labels increases further. In addition, the information that is embedded in a single filter, namely clusters and noise, suggests procedures for pruning or retraining inefficient filters, such as highly noisy or low output-field filters. These procedures may improve the accuracy with reduced computational complexity and latency in the test phase; however, their investigation requires further research.
Architectures and training of the fully connected layer. Two different architectures were examined: VGG-16 1 and EfficientNet-B0 2. Both architectures were trained to classify the CIFAR-10 and CIFAR-100 datasets, as well as subclasses of their labels. In addition, EfficientNet-B0 was trained to classify the ImageNet dataset. Both architectures were trained with no biases on the output units. This was done to ensure that each filter's effect on the output fields would be exemplified and would not be overshadowed by the much larger biases. Removing the biases of the output layer did not affect the architectures' average accuracies, in comparison to architectures trained with output biases.
The examination process was performed by taking each architecture at designated layers and training a fully connected (FC) layer between the output of that specific layer and the output layer, corresponding to the labels. During training, only the FC layer was trained, while the weights and biases of the rest of the architecture remained fixed. For VGG-16, the input units to the FC layers were selected after the max-pooling operations adjacent to layers 2, 4, 7, 10, and 13. For EfficientNet-B0, stages 1, 3, 4, 5, 7, and 9, which reduce the input size due to the stride-2, were examined.
For each examined layer, m, the output of the training set for the m-th layer was used as a preprocessed dataset to train the FC layer. For each architecture, optimized hyper-parameters were used for the examined layers.
Data preprocessing. For VGG-16, each input pixel of an image (32 × 32) from the CIFAR-10 and CIFAR-100 databases was divided by the maximal pixel value, 255, multiplied by 2, and subtracted by 1, such that its range was [−1, 1]. In all simulations, data augmentation derived from the original images was used, consisting of random horizontal flipping and translations of up to four pixels in each direction.
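This pixel scaling can be written as a one-line transform; a minimal sketch of the mapping described above:

```python
import numpy as np

def preprocess_vgg(img_uint8):
    """Scale CIFAR pixels from [0, 255] to [-1, 1], as described in the text."""
    return img_uint8.astype(np.float32) / 255.0 * 2.0 - 1.0

img = np.array([[0, 128, 255]], dtype=np.uint8)
x = preprocess_vgg(img)   # 0 -> -1.0, 255 -> 1.0, mid-gray -> ~0.0
```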
For EfficientNet-B0, the images were normalized by subtracting the average value of each color and dividing by its standard deviation. These statistics vary with the size of the training set, which changes based on the number of different trained labels, K. For CIFAR-K/10 and CIFAR-K/100, the images were also expanded from their initial size of (32 × 32) to (224 × 224) 3. For all datasets, data augmentation was also used, which included a random horizontal flip, a random rotation of up to two degrees, a random translation of the image of up to four pixels in each direction, and a shear of up to two degrees.
Optimization. The cross-entropy cost function was selected for the classification task and was minimized using the stochastic gradient descent algorithm 4,5. The maximal accuracy was determined by searching through the hyper-parameters (see below). Cross-validation was confirmed using several validation databases, each consisting of a fifth of the training set examples, randomly selected. The averaged results were within the same standard deviation (Std) as the reported average success rates. The Nesterov momentum 3 and the L2 regularization method 4 were applied.
Hyper-parameters. The hyper-parameters η (learning rate), μ (momentum constant 6), and α (L2 regularization 4) were optimized for offline learning, using a mini-batch size of 100 inputs. The learning rate decay schedule 5,7 was also optimized. A linear scheduler was used such that the learning rate was multiplied by the decay factor, q, every Δt epochs, denoted below as (q, Δt). Different hyper-parameters were used for each of the architectures on each classification task. For ImageNet, 10,000 images, 10 images per label, were selected out of the validation set as the test dataset, and the remaining 40,000 were used as training images.
VGG-16 was trained using optimized hyper-parameters to reach maximal accuracies, where the decay schedule for the learning rate was (q, Δt) = (0.65, 20). For EfficientNet-B0, the decay schedule for the learning rate was (q, Δt) = (0.975, 1); for the first seven stages, the learning rate η was multiplied by a factor of 0.1, and for the last stage by a factor of 0.2.

In the right column of the filter matrices (Fig. 3), the axes are permuted such that all labels belonging to a cluster are grouped together consecutively, thereby displaying the clusters in an adjacent fashion, where they appear as diagonal blocks of elements with value 1. Each cluster is defined as a subset S of indices where, for each i, j ∈ S, the elements (i, j) have the value 1.
The size of the cluster is defined as n², where n is the number of labels whose all possible pair permutations form the cluster; the minimal n is 1, that is, one element on the diagonal, and the maximal is 100, that is, the entire matrix. The elements that are equal to 1 and belong to a cluster in the filter are colored white, while non-cluster cells with the value 1 are classified as above-threshold external noise and are colored yellow.
The calculation of the clusters was performed by running along the diagonal, from index (0, 0) to (99, 99), where the first diagonal element (i, i) to have a value of 1 is initially designated as a cluster of size 1. The next diagonal element (j, j), j ≠ i, to have a value of 1 is then checked to see whether it can complete a cluster with (i, i); if yes, it is added to the cluster, and the next diagonal element with a value of 1 is checked to see whether it completes a cluster with i and j; if yes, it is appended to the cluster, and if not, the process continues to the next cell. This process is repeated for all value-1 cells on the diagonal as long as there are elements that do not belong to a cluster. Note that this process is not uniquely defined; the order by which the indices are iterated can change the outcome of the clustering process. For example, a filter with two clusters of n = 3 and n = 1 retrieved by iterating from 0 to 99 can, in certain very rare scenarios, yield two clusters of n = 2. While possibly altering the results of a single filter, the overall averaged results remain the same when performing the cluster creation while iterating in reversed order, since those scenarios are very rare and occur in a negligible number of filters.
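This greedy diagonal pass can be sketched in pure Python; the Boolean matrices below are toy examples, not measured filters, and the all-pairs check enforces that every pair permutation within a cluster has value 1:

```python
def greedy_clusters(B):
    """Greedy cluster construction over a Boolean matrix B, walking the diagonal;
    a label j joins a cluster only if every pair permutation (j, k) and (k, j)
    with the cluster's current members k has the value 1."""
    n = len(B)
    clusters = []
    assigned = set()
    for i in range(n):
        if i in assigned or B[i][i] != 1:
            continue
        cluster = [i]                        # start a new cluster at the diagonal
        for j in range(i + 1, n):
            if j in assigned or B[j][j] != 1:
                continue
            if all(B[j][k] == 1 and B[k][j] == 1 for k in cluster):
                cluster.append(j)            # j completes a cluster with all members
        clusters.append(cluster)
        assigned.update(cluster)
    return clusters

# toy example: labels 0 and 1 form one cluster; label 2 is a singleton cluster
B = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]
found = greedy_clusters(B)
```

Iterating the diagonal in reversed order may, in rare cases, split or merge clusters differently, as noted in the text.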
The external noise is calculated for each filter as the elements with value 1 that do not belong to any cluster. They are shown in yellow in the right column.
Explanations for Figure 2. A. The clipped binary signal per label was obtained from the diagonal of the summation of all binary clipped matrices of the filters. The average internal noise of each label, N_int, is equal to the sum of all non-diagonal elements belonging to a cluster on that label's row. The external noise of each label, N_ext, is equal to the sum of all non-diagonal elements not belonging to a cluster on that label's row. B. Similar to A, but the signal, internal noise, and external noise were calculated from the original accumulated fields of the filters rather than from the clipped binary fields.
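The per-label decomposition into signal, internal noise, and external noise can be sketched as follows. This is a simplified illustration, not the authors' code: it takes a single matrix `M` (either a clipped Boolean matrix, as in panel A, or accumulated fields, as in panel B) together with one precomputed cluster partition:

```python
import numpy as np

def per_label_signal_and_noise(M, clusters):
    """Per-label signal (the diagonal element), internal noise (off-diagonal
    row elements inside the label's cluster), and external noise (all other
    off-diagonal row elements)."""
    n = M.shape[0]
    label_to_cluster = {label: set(c) for c in clusters for label in c}
    signal = np.diag(M).astype(float)
    n_int = np.zeros(n)
    n_ext = np.zeros(n)
    for i in range(n):
        members = label_to_cluster.get(i, {i})
        for j in range(n):
            if j == i:
                continue
            if j in members:
                n_int[i] += M[i, j]   # off-diagonal, same cluster
            else:
                n_ext[i] += M[i, j]   # off-diagonal, outside all clusters
    return signal, n_int, n_ext
```

In the figure, clusters are determined per filter before the matrices are summed; here a single partition is assumed for brevity.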
The internal and external noise were summed according to their unit indices obtained from the binary clipped matrix. This connectivity, in which each output unit receives a signal only from filters whose clusters contain that output label, was applied, and the entire network was then further trained, yielding an accuracy of ~0.873. The hyper-parameters used were a learning rate of 0.005, momentum of 0.975, and weight decay of 1 × 10⁻⁴. For the first seven stages, the learning rate was multiplied by a factor of 0.1, and for the last stage by a factor of 0.2. The decay schedule for the learning rate was (q, Δt) = (0.975, 1).
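A connectivity pattern in which an output unit keeps only the connections from filters whose clusters contain its label can be expressed as a Boolean mask over the final fully connected weights. The following is an illustrative sketch (the function name `afcc_mask` and the weight layout are assumptions, not the authors' implementation):

```python
import numpy as np

def afcc_mask(clusters_per_filter, num_labels):
    """Boolean connection mask: entry (f, l) is True iff label l appears in
    one of filter f's clusters, so output unit l keeps only connections
    from such filters."""
    num_filters = len(clusters_per_filter)
    mask = np.zeros((num_filters, num_labels), dtype=bool)
    for f, clusters in enumerate(clusters_per_filter):
        for cluster in clusters:
            mask[f, list(cluster)] = True
    return mask

# Diluting an FC weight matrix W of assumed shape (num_filters, num_labels):
# W_diluted = W * afcc_mask(clusters_per_filter, num_labels)
```

Connections outside the mask are zeroed, diluting the architecture while each label still receives its cluster-based signal.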

Figure 1. Flowchart of the three stages for calculating the performance of a single filter. The entire deep network is trained to minimize the loss function (Stage 1). The  th

Figure 2. Image samples of the datasets. (A) Eight image samples with different labels from the CIFAR-100 dataset. (B) Eight image samples with different labels from the ImageNet dataset.
Table 1: Accuracy per layer and statistical features of their filters for VGG-16 trained on CIFAR-100: the number of filters in layers terminating with max-pooling, the filter sizes, the size of the trained FC layer connected to the output units, the average noise per filter, the average number of clusters per filter, and the average cluster size.

Figure 3: Single filter performance. Left: the matrix element (i, j) of a filter belonging to layer 10 of VGG-16 trained on CIFAR-100 represents the averaged field generated by label-i test inputs on output j, where the matrix elements were normalized by their maximal element. Middle: the Boolean clipped matrix (0/1 represented by black/white pixels) following a given threshold. Right: permutations of the clipped matrix labels resulting in three diagonal clusters, two 2 × 2 and one 3 × 3 (magnified in the upper-left corner red box), where above-threshold elements outside the clusters are noise elements, denoted in yellow.

Figure 4: Comparison of SNRs obtained from above-threshold Boolean filters and their fields. A. The signal per label (blue), N_int per label (red), and N_int + N_ext per label (orange) (Eqs. (1)-(4)) obtained from the above-threshold clipped Boolean fields of the 512 filters of the 10th layer of VGG-16 trained on CIFAR-100. The average signal (dashed blue horizontal line), N_int (red), and N_int + N_ext (orange) are 26.95, 0.46, and 1.3, respectively, which are similar to the estimated values obtained from Eqs. (1)-(3). B. Similar to A, using the fields of the above-threshold elements of the filters.
the training of the FC layer. Each layer m was fully connected to the K outputs via an FC layer of size (output dimension of layer m) ⋅ K. The FC layer was trained using the hyper-parameters: learning rate η = 0.005, momentum = 0.975, and weight decay = 1.5 × 10⁻⁵, with a learning-rate scheduler of q = 0.65 every 20 epochs, while the rest of the system's weight values and biases remained fixed. VGG-16 was trained using the following hyper-parameters to reach maximal accuracies on CIFAR-K/10 for the FC layer: η = 0.02, momentum = 0.995, and weight decay = 1 × 10⁻⁷, with a learning-rate scheduler of q = 0.6 every 20 epochs, while the rest of the architecture's weight values and biases remained fixed. EfficientNet-B0 hyper-parameters. EfficientNet-B0 was trained on the CIFAR-K/100 and ImageNet datasets using transfer learning [8] from the pre-trained EfficientNet-B0 on the ImageNet dataset. The transfer learning was done using the following hyper-parameters and learning-rate scheduler:
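The decay schedules of the form (q, Δt) used throughout can be written as a small helper, assuming they mean the learning rate is multiplied by q once every Δt epochs:

```python
def lr_schedule(lr0, q, dt, epoch):
    """Learning rate after `epoch` epochs under a decay schedule (q, dt):
    the initial rate lr0 is multiplied by q once every dt epochs."""
    return lr0 * q ** (epoch // dt)
```

For example, the FC-layer schedule quoted above, η = 0.005 with (q, Δt) = (0.65, 20), gives 0.005 for epochs 0-19 and 0.005 ⋅ 0.65 from epoch 20 on. In PyTorch this corresponds to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=Δt, gamma=q)`.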

2. Explanations for Figure 1.
The output of each layer m was sampled by a 7 × 7 average-pooling and then fully connected to the K outputs via an FC layer of size (output dimension of layer m) ⋅ K. The FC layer was trained using the hyper-parameters: learning rate η = 0.005, momentum = 0.975, and weight decay = 1.5 × 10⁻⁵, with a learning-rate scheduler of q = 0.975 every epoch, while the rest of the architecture's weight values and biases remained fixed. The training of EfficientNet-B0 with the number of layers in stages 3 and 4 reduced to 1 was done using the hyper-parameters: η = 0.002, momentum = 0.98, and weight decay = 1 × 10⁻⁴. The training of EfficientNet-B0 with the number of layers in stage 5 reduced from 3 to 2 was done using the hyper-parameters: η = 0.01, momentum = 0.965, and weight decay = 5 × 10⁻⁴. In the left column of Fig. 1, for VGG-16 on CIFAR-100, the 100 output fields of each filter were summed over all 10,000 inputs of the test set, resulting in a 100 × 100 matrix where each cell (i, j) represents the summed field of output j over all test-set inputs of label i. The matrix was then normalized by dividing by its maximal value, so that each matrix has a maximal value of 1. The center column displays the clipped Boolean output-field matrix, where each element whose value is above a threshold (0.3) is set to 1 and all others are zeroed.
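The construction of the per-filter label matrix and its clipping can be sketched as follows. Shapes are illustrative: `fields` is assumed to hold one filter's output fields per test input (for CIFAR-100, 10,000 × 100), and `labels` the true labels:

```python
import numpy as np

def field_matrix(fields, labels, num_labels=100, threshold=0.3):
    """Sum a filter's output fields over all test inputs by true label,
    normalize by the maximal element, and clip at the threshold."""
    M = np.zeros((num_labels, num_labels))
    for f, l in zip(fields, labels):
        M[l] += f                      # row i: summed field for inputs of label i
    M /= M.max()                       # normalize so the maximal value is 1
    B = (M > threshold).astype(int)    # clipped Boolean matrix (center column)
    return M, B
```

The left column of Fig. 1 corresponds to `M`, the center column to `B` with threshold 0.3.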

Figure 3. Test error for VGG-16 trained on CIFAR-K/100. The error rate of VGG-16 on CIFAR-K/100 was tested with K = 20, 40, 60, and 100, where the subset for the lowest value of K was randomly chosen and then, for progressively increasing K, the previous K labels were included, e.g., the labels chosen for K = 20 are included in K = 40. The K labels were chosen uniformly from the 20 super-classes of the dataset.
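Nested label subsets drawn uniformly across super-classes can be built, for example, as follows (an illustrative sketch, not the authors' selection code; it assumes every super-class contains the same number of labels, as in CIFAR-100):

```python
import random

def nested_label_subsets(superclass_to_labels, ks, seed=0):
    """Build nested label subsets: labels are interleaved super-class by
    super-class, so every prefix samples the super-classes uniformly, and
    each subset for a larger K contains the subset for a smaller K."""
    rng = random.Random(seed)
    shuffled = [rng.sample(v, len(v)) for v in superclass_to_labels.values()]
    # Take one label per super-class, then a second per super-class, etc.
    order = [label for group in zip(*shuffled) for label in group]
    return {k: order[:k] for k in ks}
```

For CIFAR-100 (20 super-classes of 5 labels each), the prefix of length 20 covers each super-class once, length 40 twice, and so on, and K = 20 is contained in K = 40 by construction.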

Figure 4. Test error for EfficientNet-B0 trained on CIFAR-K/100. The error rate of EfficientNet-B0 on CIFAR-K/100 was tested with K = 20, 40, 60, and 100, as done in Figure 3. The slope of the fitted line was 0.0013. This process was repeated for 5 different subsets, and the slopes fluctuated in the range [0.0012, 0.0013].
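The reported slope is a linear fit of error rate versus K, which can be reproduced with `np.polyfit`; the error values below are placeholders for illustration, not the paper's data:

```python
import numpy as np

# Hypothetical error rates for K = 20, 40, 60, 100 (placeholder values).
K = np.array([20, 40, 60, 100])
error = np.array([0.075, 0.10, 0.13, 0.18])

# Degree-1 polynomial fit: slope measures how the maximal error rate
# grows approximately linearly with the number of output labels.
slope, intercept = np.polyfit(K, error, 1)
```

With the paper's data, repeating the fit over several random label subsets yields slopes in the range [0.0012, 0.0013].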

Figure 5. Test error for VGG-16 trained on CIFAR-K/10. The error rate of VGG-16 on CIFAR-K/10 was tested with K = 3, 6, 8, and 10, where the subset for the lowest value of K was randomly chosen and then, for progressively increasing K, the previous K labels were included, e.g., the labels chosen for K = 3 are included in K = 6.

Table 3: Accuracy per stage and statistical features of their filters for EfficientNet-B0 trained on ImageNet.
The presented results were obtained at the end of the stages consisting of stride-2 only, similar to Table 2.

D. Datasets with varying number of labels
D1. CIFAR-100 with varying number of labels
The proposed universal mechanism for DL was extended by varying the number of output labels K out of the 100 in CIFAR-100, where K = 10, 20, 40, and 60. The results for VGG-16 are summarized in Table 4 and indicate trends similar to those observed for K = 100 (Table 1).

Table 4: Accuracy per layer and statistical features of their filters for VGG-16 trained on K labels from CIFAR-100. The results are similar to those of Table 1, where VGG-16 was trained on K = 10, 20, 30, and 60 labels out of 100, namely CIFAR-K/100 (Supplementary Information).