Introduction

Traditionally, artificial neural networks have been derived from brain dynamics, where synaptic plasticity modifies the connection strength between two neurons in response to their relative activities1,2. The earliest artificial neural network was the Perceptron3,4, introduced approximately 65 years ago: a feedforward classifier with many inputs and a single Boolean output unit. Solving complex, practical classification tasks requires more structured feedforward architectures with numerous convolutional and fully connected hidden layers, which can number in the hundreds5,6, as well as non-local training techniques such as backpropagation (BP)7,8. These are essential components of current implementations of deep learning (DL) algorithms. The underlying rationale of DL algorithms is that the first convolutional layer is sensitive to the appearance of a given pattern or symmetry in limited areas of the input, whereas the subsequent convolutional layers are expected to reveal large-scale features characterizing a class of inputs9,10.

In a supervised learning scenario, a feedforward step is initially performed, in which the distance between the current and desired outputs for a given input is computed using a given error function. In the next step, the BP procedure updates the weights to locally minimize the error function7,11. Graphics processing units (GPUs) are used to accelerate this time-consuming computation of large matrix and vector multiplications, and the procedure is repeated over the training set for many epochs until the desired test error is achieved. Architectures with an increasing number of hidden layers enable learning to be efficiently optimized for complex classification tasks, which goes hand in hand with the advancement of powerful GPU technology.

However, the brain’s architecture differs significantly from that of DL and consists of very few feedforward layers12,13,14, only one of which approximates convolutional wiring, mainly from the retinal input to the first hidden layer12,15. The key question driving our research is whether learning non-trivial classification tasks using brain-inspired shallow feedforward networks can achieve the same error rates as DL, while potentially requiring less computational complexity. A positive answer would call into question the necessity of DL architectures and might direct the development of dedicated hardware for the efficient and fast implementation of shallow learning. Additionally, it would demonstrate how brain-inspired shallow learning can advance computational capability while reducing complexity and energy consumption16,17.

Results

LeNet18,19, a five-layer prototype of a shallow feedforward architecture, has two convolutional layers with max-pooling operations and three successive fully connected layers (Fig. 1A). The first and second convolutional layers have \({d}_{1}=6\) and \({d}_{2}=16\) filters, respectively, representing the depth of each layer, and their convolutional layer sizes after max-pooling, \({m}_{i}\times {m}_{i}\; \left(height\times width\right),\) are \(14\times 14\) and \(5\times 5\), respectively. One can notice that

$$\frac{{d}_{2}}{{d}_{1}}=\frac{16}{6} \simeq \frac{14}{5}=\frac{{m}_{1}}{{m}_{2}}=2.8,$$
(1)

which hints at the following conservation law along the convolutional layers:

$$depth_{i}\times {m}_{i}=constant,$$
(2)

where \({m}_{i}^{2}\) and \(depth_{i}\) represent the ith convolutional layer size and the number of filters, respectively. We minimized the LeNet error rates for the CIFAR-10 database20 as a function of \({d}_{1}\), while maintaining the ratio \({d}_{2}/{d}_{1}\) constant, using the stochastic gradient descent (SGD) algorithm21,22 (Fig. 1B, Supplementary Information). The results indicate that the error rate, \(\epsilon\), decays with increasing \({d}_{1}\) as a power law

$$\epsilon \left({d}_{1}\right)=\frac{A}{{\left({d}_{1}\right)}^{\rho }},$$
(3)

with an exponent \(\rho \sim 0.41\), even for small \({d}_{1}\). Although the error rate of the original LeNet, \({d}_{1}=6\), is \(\epsilon \simeq 0.23\), it can be further minimized by increasing \({d}_{1}\). Based on the power-law extrapolation to large \({d}_{1}\) values, any small \(\epsilon\) can be achieved on the test set using generalized LeNet, a shallow architecture. However, this minimization for a given large \({d}_{1}\) is a heavy computational task that requires an exhaustive search in a hyper-parameter space whose values vary among layers, with an increasing number of epochs and complex scheduling. For instance, preliminary results of an incomplete optimization for \({d}_{1}=27\) and \({d}_{2}=72\) using at least \(500\) epochs indicate \(\epsilon \sim 0.137\), which is close to the expected result of the extrapolated power law (Fig. 1B).
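
For concreteness, the sketch below shows one way to build the generalized LeNet of Fig. 1A in PyTorch. It assumes 5×5 kernels, 2×2 max pooling, ReLU activations, and the standard 120-84-10 fully connected head of the original LeNet; the actual training hyper-parameters are given in the Supplementary Information.

```python
import torch.nn as nn

class GeneralizedLeNet(nn.Module):
    """Generalized LeNet: d1 first-layer filters, d2 ~ (16/6)*d1 second-layer filters."""

    def __init__(self, d1=6, num_classes=10):
        super().__init__()
        d2 = round(d1 * 16 / 6)                    # keep d2/d1 ~ 16/6 (Eq. 1)
        self.features = nn.Sequential(
            nn.Conv2d(3, d1, kernel_size=5),       # 32x32x3 -> 28x28xd1
            nn.ReLU(),
            nn.MaxPool2d(2),                       # -> 14x14xd1
            nn.Conv2d(d1, d2, kernel_size=5),      # -> 10x10xd2
            nn.ReLU(),
            nn.MaxPool2d(2),                       # -> 5x5xd2
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(d2 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

The original LeNet corresponds to \({d}_{1}=6\); the curve in Fig. 1B is obtained by training such models with increasing \({d}_{1}\).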

Figure 1

Learning in the generalized LeNet architecture. (A) Generalized LeNet architecture for the CIFAR-10 database (input size \(32\times 32\times 3\) pixels) consisting of five layers: two convolutional layers with max pooling and three fully connected layers. The first and second convolutional layers consist of \({d}_{1}\) and \({d}_{2}\) filters, respectively, where \(\frac{{d}_{1}}{{d}_{2}}\simeq \frac{6}{16}\). (B) The test error, \(\epsilon\), as a function of \({d}_{1}\) on a log–log scale, indicating power-law scaling with exponent \(\rho \sim 0.41\), Eq. (3) (Supplementary Information). The activation function of the nodes is ReLU.
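
The power-law fit of Fig. 1B reduces to a linear regression of \(\log \epsilon\) against \(\log {d}_{1}\). The sketch below illustrates the fit and the extrapolation using synthetic points generated from the reported scaling (\(\rho \simeq 0.41\), \(\epsilon \simeq 0.23\) at \({d}_{1}=6\)); the measured values behind Fig. 1B are given in the Supplementary Information.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (d1, epsilon) points generated from the reported scaling, for illustration only.
d1 = np.array([6, 12, 24, 48, 96], dtype=float)
eps = (0.23 * 6**0.41) / d1**0.41 * np.exp(rng.normal(0.0, 0.02, d1.size))

# Fit log(eps) = log(A) - rho*log(d1), i.e., eps = A / d1**rho (Eq. 3).
slope, intercept = np.polyfit(np.log(d1), np.log(eps), 1)
rho, A = -slope, np.exp(intercept)

# Extrapolate the d1 required for a target test error.
target = 0.05
print(f"rho ~ {rho:.2f}, A ~ {A:.2f}, d1 needed for eps={target}: ~{(A / target) ** (1 / rho):.0f}")
```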

The conservation law (Eq. 2) was found to govern the convolutional layer sizes of the original VGG-16 architecture, which consists of 16 layers23, except for the fifth convolution set, where the number of filters is bounded by \(512\) (Fig. 2A with \(d=64\)). The nth \((n\le 4)\) convolution set has \(d\cdot {2}^{n-1}\) filters, and its convolutional layer size is \(\frac{m}{{2}^{n-1}} \;\; (n\le 5)\). The minimization of \(\epsilon\) for VGG-16 and the CIFAR-10 database (Fig. 2A with \(m=32\)) as a function of \(d\) results in a power law with an exponent similar to that of LeNet (Fig. 1B), \(\rho \sim 0.4\) (Fig. 2B, Supplementary Information). These results point to a universal power-law scaling (Eq. 3) that is independent of the architecture details, where \(d\) is the number of filters in the first convolutional layer. Additionally, the exponent, \(\rho ,\) does not necessarily increase with the number of convolutional or hidden layers. Interestingly, the standard VGG-16 network (\(d=64\) in Fig. 2A), with batch normalization but without dropouts, results in \(\epsilon \sim 0.065\) (Supplementary Information), which is identical to the reported test error with significant dropouts24. Hence, the advantage of dropouts in minimizing \(\epsilon\) might be questionable in this case.
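
The filter schedule described above, and the extent to which it satisfies Eq. (2), can be made explicit with a few lines of Python. This is a sketch under the assumptions stated in the text; the fifth-set depth is taken to saturate at that of the fourth set, as in the original \(d=64\) network.

```python
def vgg16_conv_schedule(d=64, m=32):
    """Depth and linear feature-map size per convolution set of the generalized VGG-16."""
    sets = []
    for n in range(1, 6):
        depth = d * 2 ** min(n - 1, 3)   # d, 2d, 4d, 8d, 8d (fifth set saturates)
        size = m // 2 ** (n - 1)         # linear size of the nth convolutional layer
        sets.append((n, depth, size, depth * size))
    return sets

for n, depth, size, product in vgg16_conv_schedule():
    print(f"set {n}: depth={depth:4d}, size={size:2d}, depth*size={product}")
# depth*size = d*m for the first four sets (Eq. 2); the fifth set deviates, as noted above.
```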

Figure 2

Learning in the generalized VGG-16 architecture. (A) Generalized VGG-16 architecture consisting of 16 layers, where the number of filters in the nth convolution set is \({d\cdot 2}^{n-1}\) (\(n\le 4)\) and the linear size of its convolutional layer is \(m\cdot {2}^{-(n-1)}\) (\(n\le 5),\) where \(m\times m\times 3\) is the size of each input (\(d=64\) in the original VGG-16 architecture). (B) The test error, \(\epsilon\), as a function of \(d\) on a log–log scale, for the CIFAR-10 database (\(m=32\)), indicating power-law scaling with exponent \(\rho \sim 0.4\), Eq. (3) (Supplementary Information). The activation function of the nodes is ReLU.

A shallow network’s ability to achieve any small \(\epsilon\), based on the extrapolation of the power-law scaling (Fig. 1B), is accompanied by a significant reduction in computational complexity per epoch compared with a DL architecture (Fig. 2A). Complexity is measured as the number of multiplication-add (MAdd) operations per input during a forward and BP step25,26. It is calculated as a function of the number of filters, \({d}_{1}\) and \(d\) in Figs. 1 and 2, respectively (Fig. 3A, Supplementary Information). In both cases, the number of operations per step scales as a quadratic polynomial in the number of filters, which follows from the following argument: when the number of filters in a convolutional layer is doubled, the computational complexity of the subsequent convolutional layer increases by a factor of four, because both its input depth and its own number of filters are doubled (the ratio between consecutive depths is kept fixed). The linear terms in the quadratic polynomials (Fig. 3A) are mainly attributed to the input size of the first fully connected layer, which increases linearly with the number of filters. Hence, the number of weights in this layer increases linearly with the number of filters, whereas the number of weights in the successive fully connected layers remains constant and is independent of the number of filters (Figs. 1A, 2A).
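
As an illustration of this scaling, the sketch below counts the forward-pass MAdd operations of the generalized LeNet of Fig. 1A as a function of \({d}_{1}\): each convolutional layer contributes output area × kernel area × input channels × output channels, and each fully connected layer contributes inputs × outputs. It neglects biases, pooling, and activation costs, and the BP step adds roughly a constant factor on top of the forward count; the measured values of Fig. 3A are given in the Supplementary Information.

```python
def lenet_forward_madds(d1, in_size=32, in_ch=3, k=5, num_classes=10):
    """Approximate forward-pass multiply-add count of the generalized LeNet (Fig. 1A)."""
    d2 = round(d1 * 16 / 6)
    s1 = in_size - k + 1                                 # 28: conv1 output size
    madds = s1 * s1 * k * k * in_ch * d1                 # conv1: linear in d1
    s2 = s1 // 2 - k + 1                                 # 10: conv2 output size (after 2x2 pooling)
    madds += s2 * s2 * k * k * d1 * d2                   # conv2: quadratic in the number of filters
    flat = (s2 // 2) ** 2 * d2                           # 5*5*d2 inputs to the first FC layer
    madds += flat * 120 + 120 * 84 + 84 * num_classes    # FC head: only the first term grows with d2
    return madds

for d1 in (6, 12, 24, 48):
    print(d1, f"{lenet_forward_madds(d1) / 1e6:.2f} MMAdd")
# Doubling d1 roughly quadruples the conv2 term, which dominates for large d1.
```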

Figure 3

A comparison of learning complexity between the generalized LeNet and VGG-16 architectures. (A) Complexity of a feedforward and BP step for a single input of LeNet (green) and VGG-16 (red), measured for several \(d ({d}_{1})\) values (open circles), with quadratic polynomial fits. Complexity is measured in Giga multiplication-add operations (GMAdd). (B) Complexity as a function of \(\epsilon (d)\) for LeNet (green) and VGG-16 (red) for several values of small \(\epsilon\) (open circles) obtained from the power-law scaling (Figs. 1B, 2B) and the fitted power-law scaling (in dashed boxes), obtained from the last three small values of \(\epsilon\) (dashed lines). (C) The ratio between the complexity of LeNet and VGG-16 for several values of \(\epsilon\) (open circles connected by a dashed line), obtained from the extrapolated \(\epsilon ({d}_{1})\) and \(\epsilon (d)\) for LeNet and VGG-16 (Figs. 1B, 2B), respectively, and a direct measure of the complexity25,26 (Supplementary Information).

The computational complexities as a function of the error rates were calculated using the power-law extrapolation of \(\epsilon \left(d\right)\) (Figs. 1B, 2B). The results indicate that the complexity increases with \(1/\epsilon\) as a power law with an exponent close to \(5\): \(\sim 4.85\) for LeNet and \(\sim 4.94\) for VGG-16 (Fig. 3B). Since the error rates in both cases (Figs. 1B, 2B) are approximated by

$$\epsilon \propto \frac{1}{{d}^{0.4}} ,$$

it follows that \(d\propto {\epsilon }^{-2.5}\), and to leading order (Fig. 3A)

$$Complexity\propto {d}^{2}\propto {\epsilon }^{-5}.$$
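
The leading-order exponent can be checked numerically: inserting \(d\propto {\epsilon }^{-1/\rho }\) with \(\rho =0.4\) into the quadratic leading term of the complexity yields a slope of \(2/\rho =5\) on a log–log plot. A minimal sketch:

```python
import numpy as np

rho = 0.4
eps = np.logspace(-1, -3, 20)                 # target error rates
d = eps ** (-1.0 / rho)                       # required number of filters, up to the pre-factor A
complexity = d ** 2                           # leading (quadratic) term of the MAdd count
slope = np.polyfit(np.log(1.0 / eps), np.log(complexity), 1)[0]
print(f"complexity ~ (1/eps)^{slope:.2f}")    # ~ 5, consistent with Fig. 3B
```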

A direct calculation of the computational complexity ratio per step between LeNet and VGG-16, based on \(\epsilon \left(d\right)\) (Figs. 1B, 2B), indicates that it is less than \(0.6\) for at least \(\epsilon \ge 0.005\) (Fig. 3C). Because this ratio is extremely sensitive to the similar estimated values of \(\rho\) for LeNet and VGG-16 (Figs. 1B, 2B), further extrapolation toward vanishing \(\epsilon\) is uncertain. Nevertheless, the lower complexity per epoch of shallow architectures serves as an example of the potential advantages of brain-inspired architectures. We note that the entire computational learning complexity is proportional to the number of training epochs, and that the classification of an input depends on a forward step only.

Under parallel computation, the required number of clock steps in a feedforward or BP realization is bounded from below by the number of layers. Decreasing this lower bound using a mechanism similar to carry-lookahead27, developed for the addition and multiplication of large numbers, is practically inapplicable for such complex architectures. This is another expected advantage of learning based on brain-inspired shallow architectures.

The power-law behavior (Eq. 3) was demonstrated to govern both shallow and DL architectures in which the number of filters obeys the conservation law (Eq. 2). The following two questions were examined: the first concerns the robustness of the power law (Eq. 3) for architectures that deviate from the conservation law (Eq. 2); the second is whether Eq. (2), which controls the number of filters in the convolutional layers, is indeed the optimal choice for minimizing \(\epsilon .\)

The power-law scaling was first examined for LeNet architectures that deviate from the conservation law (Eq. 2) by fixing

$$\frac{{d}_{2}}{{d}_{1}}=constant,$$
(4)

with a constant that differs from \(\frac{16}{6}\). For a smaller constant, \(\frac{4}{3}\), the error rates increased, corresponding to a larger pre-factor \(A\) in Eq. (3); however, \(\rho\) remained similar, \(\sim 0.4\) (Fig. 4A). For a larger constant, \(\frac{16}{3},\) the slope decreased, \(\rho \sim 0.35\) (Fig. 4A, Supplementary Information). The results first indicate the robustness of the power law for various constants (Eq. 4), which alludes to its universal behavior. Second, for a smaller constant and any given \({d}_{1},\) the error rates increased. For a larger constant (Eq. 4) and sufficiently large \({d}_{1}\), the error rates also increased because \(\rho\) decreased, whereas for small \({d}_{1}\) values the error rates decreased. The results indicate that the conservation law (Eq. 2), with a constant expected to be approximately \(\frac{16}{6}\), asymptotically minimizes \(\epsilon\) for large \({d}_{1}\). Similar trends were obtained for VGG-16, where the number of filters in the \({n}{\text{th}}\) convolution set (\(n\le 4)\) increased as \(d\cdot constan{t}^{\left(n-1\right)}\), whereas in the original architecture \(constant=2\) (Fig. 2A). For \(constant=1.5\), the error rates increased with a larger pre-factor, \(A,\) while \(\rho\) remained similar, \(\sim 0.4\) (Fig. 4B, Supplementary Information). For \(constant=2.5\), \(\rho \sim 0.32\), indicating once more that the error rates increased asymptotically compared with \(constant=2,\) whereas for small \(d\) the error rates decreased. The results for VGG-16 indicate the robustness of the universal power-law behavior for various constants in Eq. (4), where a \(constant\) close to \(2\) minimizes \(\epsilon .\)

Figure 4

Conservation law indicating the optimal ratio between the depth of the filters and their convolutional layer size. (A) Success rates and their standard deviation as a function of \({d}_{1}\) for the generalized LeNet architecture, where \(\frac{{d}_{2}}{{d}_{1}}=\frac{16}{3}\) (green) and \(\frac{{d}_{2}}{{d}_{1}}=\frac{4}{3}\) (blue), with fitted (color-coded) power laws (Supplementary Information); the results for \(\frac{{d}_{2}}{{d}_{1}}=\frac{16}{6}\) (yellow, as shown in Fig. 1B) are presented for reference. (B) Similar to (A), for the generalized VGG-16 architecture, where the number of filters in the nth convolution set (\(n\le 4)\) is \(d\cdot {2.5}^{n-1}\) (green) and \(d\cdot {1.5}^{n-1}\) (blue), with fitted (color-coded) power laws (Supplementary Information); the results for \(d\cdot {2}^{n-1}\) (yellow, as shown in Fig. 2B) are presented for reference.

The following theoretical justification may explain why the conservation law (Eq. 2) leads to the minimization of error rates: its purpose is to preserve the signal-to-noise ratio (SNR) along the feedforward convolutional layers, such that the signal is repeatedly amplified. The noise of each convolutional feature map of size \(m\times m\) is expected to be proportional to the square root of its size, i.e., to \(m,\) and its signal to \({m}^{2}\). Consequently, the SNR of a single map is proportional to \(m\), and for the entire convolutional layer of depth \(d\) it is proportional to \(m\cdot d\). Hence, to compensate for the shrinking of the convolutional layer size along the feedforward architecture, its depth must be increased accordingly. Indeed, preliminary results indicate that doubling the number of filters in the fifth convolution set of VGG-16 (with \(d=16\)), such that the number of filters in all convolution sets (\(n\le 5)\) is \(16\cdot {2}^{n-1}\), decreased \(\epsilon\) by \(\sim 0.015\) compared with the standard VGG-16 architecture (Fig. 2, Supplementary Information for enhanced VGG-16). This supports the argument that maintaining the same SNR along the entire deep architecture enhances success rates. Nevertheless, further extended simulations on various architectures and databases are required to support the accuracy of the suggested conservation law, particularly because the convolutional layer sizes are small and far from the thermodynamic limit. Additionally, it is important to examine how the sensitivity of \(\rho\) and the conservation law are related to the properties of the cost function and the details of the BP dynamics.
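
The per-map part of this scaling argument can be verified with a short Monte Carlo sketch, assuming each \(m\times m\) feature map carries a constant signal plus i.i.d. unit-variance noise in every entry: the summed signal grows as \({m}^{2}\), the noise standard deviation as \(m\), and the SNR of a single map therefore grows as \(m\).

```python
import numpy as np

rng = np.random.default_rng(0)
signal_level = 1.0

for m in (4, 8, 16, 32, 64):
    # Sum an m x m map of constant signal plus i.i.d. unit-variance noise, many times.
    sums = np.array([(signal_level + rng.standard_normal((m, m))).sum() for _ in range(2000)])
    signal = signal_level * m * m      # grows as m**2
    noise = sums.std()                 # grows as m (square root of the number of entries)
    print(f"m={m:2d}: signal={signal:7.0f}  noise~{noise:5.1f}  SNR~{signal / noise:6.1f}")
# Keeping depth*m constant (Eq. 2) then keeps the layer-level product m*d, and hence the
# SNR of the argument above, fixed along the network.
```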

Discussion

Minimizing error rates for a particular classification task and database has been one of the primary goals of machine learning research over the past few decades. As a result, more structured DL architectures consisting of various combinations of concatenated convolutional and densely connected layers have been developed. Typically, further significant minimization of error rates requires deeper architectures, where such an architecture, with some modifications, can achieve reasonably high success rates for several other databases and classification tasks. This study suggests that, using the extrapolation of the power-law scaling (Eq. 3), traditional shallow architectures can achieve the same error rates as state-of-the-art DL architectures. The preferred architecture can reduce the space–time complexity for a specific training algorithm on a given database and hardware implementation. A theoretical framework is presented for constructing a hierarchy of complexity among families of artificial feedforward neural network architectures, based on their power-law scaling, exponent \(\rho\), and pre-factor \(A\). It is possible that the optimal architecture among several candidates depends on the desired error rate (Fig. 4). Contrary to common knowledge, shallow feedforward brain-inspired architectures are not inferior, and they do not represent, as previously thought, an additional biological limitation28. They can achieve error rates as low as those of DL algorithms, with significantly lower computational complexity for complex classification tasks (Fig. 3). We note that the presented power law, as a function of the filters’ depth, differs from the power-law behavior of success rates (SRs) as a function of the dataset size29,30,31,32,33.

Architectures that maximize \(\rho\), and its upper bound, are not yet known. Preliminary results indicate that, for a specific architecture, \(\rho\) may increase when the number of weights grows super-linearly with the number of filters. This can be achieved using a fully connected layer in which the numbers of input and output units are both proportional to the number of filters. Another possible mechanism is the addition of a super-linear number of cross-weights to the filters. This has a plausible biological realization, because cross-weights emerge as a byproduct of dendritic nonlinear amplification17,29,34,35. Nevertheless, these possible mechanisms for enhancing \(\rho\) significantly increase the computational complexity and are mentioned here for their potential biological relevance, given the brain's limited number of layers and the natural emergence of many cross-weights.

Advanced GPU technology is used to minimize the running time of DL algorithms. Indeed, our single-epoch running time using the CIFAR-10 database and VGG-16 with \(d=4\) is only a factor of \(\sim 1.5\) longer than that of LeNet with \({d}_{1}=6\), where both cases have similar success rates. However, shallow architectures with the same error rates as advanced deep architectures require more filters per convolutional layer and, consequently, a significantly increased number of fully connected weights. Above a critical number of filters, which depends on the GPU properties, an epoch’s running time slows down significantly and can even increase by a few orders of magnitude. For example, in our case the running time of VGG-16 with \(d=400\) is \(\sim 60\) times slower than with \(d=8\), and LeNet with \({d}_{1}=2304\) is \(\sim 900\) times slower than with \({d}_{1}=6\). Hence, an efficient realization of shallow architectures with error rates competitive with advanced DL architectures requires a shift in the properties of GPU technology. Such a realization is also expected to achieve a significant reduction in computational complexity for a desired error rate and a specific database (Fig. 3).

The power-law behavior is presented in this work only for the CIFAR-10 database, and its universality must be confirmed in further research on other datasets. We note that this task is difficult for the MNIST and CIFAR-100 datasets. For MNIST, SRs exceed 0.99 even with LeNet, and a power-law extrapolation toward unity, including error bars, is impractical. On the other hand, for CIFAR-100 the reported VGG-16 SRs36 are around \(\sim 0.74\); hence, observing the power law requires higher SRs such that finite-size effects are minimized. However, extending the initial depth above \(d=64\) is beyond our computational capabilities.

The observation of the power law as a function of the filters’ depth must also be generalized to other architectures beyond LeNet and VGG-16. We note that this task requires careful optimization of the hyper-parameter space independently for each initial depth, which demands high computational power. In addition, results are presented only for the stochastic gradient descent algorithm21,22, and the robustness of the power-law behavior needs to be verified with other, more advanced optimizers37, as does its possible extension to other deep learning tasks, e.g., segmentation38,39.

Additionally, for large-dimensional image inputs, shallow networks exhibit degraded SRs, and it is commonly accepted that deep architectures are required to enhance them. The presented results suggest that increasing the depth (number of filters) of a shallow network will enhance SRs for large-dimensional image inputs as well, where the computational complexity is proportional to the input size, as for deep architectures. However, verifying this requires further research.

Finally, the theoretical origin of the universal power-law scaling (Eq. 3), governing shallow and DL architectures, has not yet been discovered. The following theoretical framework may provide a starting point for investigating this general phenomenon. The teacher–student online scenario is one of the analytically solvable cases exemplifying power-law behavior40. In the prototypical realizable scenario, the teacher and student have the same feedforward architecture, for example, a binary or a soft committee machine41,42,43, but different initial weights. The teacher supplies the student with random input–output examples, and the student updates its weights based on this information and its current set of weights. The generalization error (test error) decays as a power law with the number of input–output examples, normalized by the size of the input. This work differs from online learning because the size of the non-random training dataset is limited and each training example is presented repeatedly, unlike the online scenario. However, assuming a power-law scaling (Eq. 3), an architecture with an infinite number of filters, \(d\to \infty\), exists such that the test error vanishes. This architecture plays the role of the teacher and represents the learning rule of the online scenario. A student with fewer filters attempts to imitate the teacher, resulting in a generalization error that is expected to decrease with an increasing number of filters. An analytical solution for the presented shallow and deep architectures as a function of the number of filters is currently out of reach. The question is whether a toy model, in which a filter is represented by a perceptron with a nonlinear output unit, can be solved analytically to show that the generalization error decays as a power law with the number of filters.
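
As a numerical illustration of the online teacher–student power law mentioned above, the sketch below trains a student perceptron on random examples labeled by a teacher perceptron, using a simple Hebbian online rule, and estimates the generalization error from the angle between the two weight vectors. This is a toy illustration of the power-law decay with the number of examples, approaching an inverse-square-root decay for this rule at large sample sizes; it is not the filter-based model proposed in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200                                        # input dimension
teacher = rng.standard_normal(N)
teacher /= np.linalg.norm(teacher)
student = np.zeros(N)

def gen_error(w, w_star):
    """Disagreement probability of two sign perceptrons = angle between them / pi."""
    c = w @ w_star / (np.linalg.norm(w) * np.linalg.norm(w_star) + 1e-12)
    return np.arccos(np.clip(c, -1.0, 1.0)) / np.pi

checkpoints, errors = [100, 1_000, 10_000, 100_000], []
for t in range(1, checkpoints[-1] + 1):
    x = rng.standard_normal(N)
    student += np.sign(teacher @ x) * x        # Hebbian online update with the teacher's label
    if t in checkpoints:
        errors.append(gen_error(student, teacher))

slope = np.polyfit(np.log(checkpoints), np.log(errors), 1)[0]
print(errors, f"exponent ~ {slope:.2f}")       # power-law decay with the number of examples
```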