Power-law scaling to assist with key challenges in artificial intelligence

Power-law scaling, a central concept in critical phenomena, is found to be useful in deep learning, where optimized test errors on handwritten digit examples converge as a power-law to zero with database size. For rapid decision making with one training epoch, in which each example is presented only once to the trained network, the power-law exponent increased with the number of hidden layers. For the largest dataset, the obtained test error was estimated to be in the proximity of state-of-the-art algorithms for large epoch numbers. Power-law scaling assists with key challenges found in current artificial intelligence applications and facilitates an a priori dataset size estimation to achieve a desired test accuracy. It establishes a benchmark for measuring training complexity and a quantitative hierarchy of machine learning tasks and algorithms.


Introduction
Phase transitions and critical phenomena have been a central focus of statistical mechanics since the beginning of the second half of the 20th century. The thermodynamic properties near the critical point of second-order phase transitions were explained using power-law scaling and hyperscaling relations, depending on the dimensionality of the system 1,2. The concept of a power law implies a linear relationship between the logarithms of two quantities, that is, a straight line on a log-log plot. It arises in diverse phenomena including the timing and magnitude of earthquakes 3, internet topology and social networks 4-6, turbulence 7, stock price fluctuations 8, word frequencies in linguistics 9 and signal amplitudes in brain activity 10.
Deep learning algorithms are found to be useful in an ever-increasing number of applications, including the analysis of experimental data in physics, ranging from classification problems in astrophysics 11 and high-energy physics data analysis 12 to imaging in noise optics 13 and learning properties of phase transitions 14. This work indicates that deep learning algorithms behave asymptotically similarly to critical physical systems. A basic task in deep learning is supervised learning, where a multilayer network (e.g., Fig. 1a) learns to produce the correct output labels for the input data based on a training database of examples, that is, input-output pairs. A simple example of this is the large Modified National Institute of Standards and Technology (MNIST) database consisting of 60,000 training handwritten digits and 10,000 test digits 15, without any data extension 16,17.
The weights of the selected feedforward network are adjusted using the back-propagation algorithm, a gradient-descent-based algorithm, to minimize the cost function, which quantifies the mismatch between the current and desired outputs 15.
The performance of the algorithm is estimated using the test error, measured on a dataset that was not observed during the training. The test error is expected to decrease with increasing information and increasing dataset size, and to vanish asymptotically in a sufficiently complex network, e.g., one with a sufficient number of weights, hidden layers and units.
The disappearance of the test error with a power-law scaling is the focus of our study, which enables an a priori estimation of the required dataset size to achieve the desired test accuracies. The robustness of the power-law scaling phenomenon is examined for training with one and many epochs, that is, for the number of times each example is presented to the trained network, as well as for several feedforward network architectures consisting of a few hidden layers and hyper-weights 18, that is, input crosses. The optimized test errors with one training epoch are in the proximity of those of state-of-the-art algorithms using a large number of epochs. This has an important implication for rapid decision making under limited numbers of examples 19,20, which is representative of many aspects of human activity, robotic control 21, and network optimization 22. The current applicability of the asymptotic test accuracy to such realities using an extremely large number of epochs is questionable. This large gap between advanced learning algorithms and their real-time implementation can be addressed by achieving optimal performance based on only one epoch. Finally, the comparison of the power-law scaling, exponents and constant factors, stemming from various learning tasks, datasets, and algorithms is expected to establish a benchmark for a quantitative theoretical framework to measure their complexity 23.
The first trained network that is employed comprises 784 inputs representing 28×28 pixels of a handwritten digit in the range [0, 255] with additional 10,000 input crosses per hidden unit (see Supplementary Appendix A), two hidden layers comprising 100 units each, and 10 outputs representing the labels (Fig. 1a).

Momentum strategy: Power-law with many epochs
The commonly used learning approach is the backpropagation (BP) strategy given by:

$$W_{t+1} = W_{t} - \eta \cdot \nabla_{W} C \qquad (1)$$

where a weight at discrete time-step t, W_t, is modified with a step-size η towards the minus sign of the gradient of the cross entropy cost function, C. Here we used the momentum strategy 26 (eq. (3)), where the friction, μ, and the regularization of the weights, 1-α, are global constants in the region [0, 1] and η is a constant representing the learning rate. In addition, there are biases per node associated with the induced field on each node (eq. (4)).

We minimize the test error, ε, for each dataset size over the five parameters of the algorithm. The power-law scaling (eq. (5)) is obtained with c_0 ~0.65 and ρ ~0.50 (Fig. 1b), and its extrapolation to the maximal dataset, 6,000 examples per label, indicates a test error of ~0.008. Note that the saturation of the minimal test error is achieved after at least 150 epochs (see Supplementary Appendix B).
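The extrapolation quoted above can be checked by evaluating the power law directly. The following is a minimal sketch, assuming the scaling has the form ε = c_0 / D^ρ with D the number of examples per label; the function name `power_law_error` is illustrative and does not come from the original work.

```python
def power_law_error(examples_per_label: float, c0: float, rho: float) -> float:
    """Test error predicted by eps = c0 / D**rho (assumed form of the power law)."""
    if examples_per_label <= 0:
        raise ValueError("examples_per_label must be positive")
    return c0 / examples_per_label ** rho


# Constants quoted in the text for the momentum strategy with many epochs (Fig. 1b).
eps_max = power_law_error(6000, c0=0.65, rho=0.50)
print(f"extrapolated test error at 6,000 examples/label: {eps_max:.4f}")
```

With the quoted constants the result rounds to the ~0.008 reported in the text.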

Accelerated strategy: Power-law with many epochs
An accelerated BP method is based on a recently established bridge between experimental neuroscience and advanced artificial intelligence learning algorithms, in which an increased training frequency has been able to significantly accelerate neuronal adaptation processes 24. This accelerated brain-inspired mechanism involves a time-dependent step-size, η_t, associated with each weight, such that coherent consecutive gradients of a weight, that is, gradients with the same sign, increase the conjugate η. The discrete-time BP of this accelerated method is summarized for each weight by eqs. (6) and (7), where A and β are constants, different for each layer, representing the amplitude and gain, respectively. In addition, there are biases per node similar to eq. (4). The test error of this accelerated method is minimized over its 11 parameters.

In the soft committee decision (eq. (8)), a_j^{L,s} stands for the value of the output label j in output layer L and in replica s (j = 0, 1, …, 9). The minimized test error of the soft committee of the momentum strategy is ~0.007 with ρ ~0.52 (Fig. 1c), which is in close agreement with state-of-the-art achievements obtained using deep neural networks 27.
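The soft committee decision can be illustrated with a short sketch. The decision rule itself is not reproduced in this text, so the code assumes one common form: the outputs of each label are summed over all replicas and the label with the largest sum is selected. The function name and the toy outputs are hypothetical.

```python
from typing import Sequence


def soft_committee_label(replica_outputs: Sequence[Sequence[float]]) -> int:
    """Return the label whose output, summed over all replicas, is largest.

    replica_outputs[s][j] is the output of unit j in replica s.
    Summing and taking the argmax is an assumption made for this sketch.
    """
    if not replica_outputs:
        raise ValueError("need at least one replica")
    n_labels = len(replica_outputs[0])
    totals = [0.0] * n_labels
    for outputs in replica_outputs:
        if len(outputs) != n_labels:
            raise ValueError("all replicas must have the same number of outputs")
        for j, value in enumerate(outputs):
            totals[j] += value
    return max(range(n_labels), key=lambda j: totals[j])


# Toy example with three replicas and three labels (hypothetical outputs).
toy_outputs = [
    [0.60, 0.50, 0.10],
    [0.10, 0.70, 0.20],
    [0.55, 0.50, 0.10],
]
committee_label = soft_committee_label(toy_outputs)
```

In this toy case two of the three replicas individually prefer label 0, yet the summed outputs favor label 1, which is the difference between a soft decision and a simple majority vote.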

Power-law with one epoch
A similar minimization of the test error, ε, is repeated for one epoch, where each example in the training set is presented only once as an input to the feedforward network (Fig. 1a). For the momentum strategy it is found that ρ ~0.49, and its extrapolation to the maximal dataset (i.e., 6,000 examples per label) results in ε ~0.021 (Fig. 2a), and for the brain-inspired accelerated strategy in ε ~0.017 and ρ ~0.49 (Fig. 2b). For the soft committee of the momentum strategy it is found that ε ~0.015 with slope ρ ~0.48 (Fig. 2a). The test error is reduced even further using the soft committee of the accelerated strategy, where ε ~0.013 with slope ρ ~0.49 for 6,000 examples per label (Fig. 2b).
Results of one epoch are in the proximity of the test error using many epochs, where the best test error for many epochs, ε ~0.007, has to be compared with ε ~0.013 for one epoch.
These results strongly indicate that rapid decision making, which is representative of many aspects of human activity, robotic control 28 , and network optimization 22 , is feasible.
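Exponents such as those quoted in this section can be estimated from a straight-line fit on a log-log plot. The sketch below assumes ordinary least squares on (log D, log ε) and uses synthetic, noise-free points; it is not the authors' fitting procedure, and the data are not their measurements.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(sizes: Sequence[float], errors: Sequence[float]) -> Tuple[float, float]:
    """Fit eps = c0 / D**rho by least squares on log-log axes; returns (c0, rho)."""
    if len(sizes) != len(errors) or len(sizes) < 2:
        raise ValueError("need at least two matching (size, error) points")
    xs = [math.log(d) for d in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log eps = log c0 - rho * log D, so rho is minus the slope.
    return math.exp(intercept), -slope


# Synthetic points generated from c0 = 1.5 and rho = 0.49 (illustrative values only).
sizes = [9, 15, 30, 60, 120]
errors = [1.5 / d ** 0.49 for d in sizes]
c0_fit, rho_fit = fit_power_law(sizes, errors)
```

On real measurements each point would carry a standard deviation, and a weighted fit may be more appropriate than the unweighted one shown here.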

Power-law with several hidden layers
The robustness of the power-law phenomenon for the test error as a function of dataset size (Figs. 1 and 2) is examined for similar feedforward networks without input crosses, and with up to three hidden layers with 100 hidden units each (Fig. 3a). For one hidden layer, the minimization of ε for one epoch and for the momentum strategy indicates ρ ~0.3, and its extrapolation to 6,000 examples per label results in ε = 0.053 (Fig. 3b). Using two layers the exponent increases to ρ ~0.34 with ε = 0.049 (Fig. 3c), and for three layers to ρ = 0.385 with ε = 0.048 (Fig. 3d). These results confirm the existence of the power-law phenomenon in a larger class of feedforward networks and different learning rules as well as the possible increase of the power-law exponent with increasing number of hidden layers (Fig. 3b-d). Asymptotically, for very large datasets, increasing the number of hidden layers is expected to minimize ε, since ρ increases. However, for a limited number of examples, one layer minimizes ε (Fig. 3b-d), as the constant c_0 in eq. (5) is smaller for one layer. In particular, the power-law scaling indicates that the crossing of ε between one and two layers occurs at ~480 examples per label, whereas the crossing between two and three layers occurs at ~4100 examples per label. This trend stems from the limit of small training datasets and one training epoch, which prevents enhanced optimization of the many more weights of networks with more hidden layers.

The asymptotic test error, ε = 0.049, of a network with two hidden layers (Fig. 3c) has to be compared with ε ~0.021, which is achieved for the same architecture with additional input crosses (Fig. 2a). The significant improvement of ~0.028 is attributed to the additional input crosses. This gap also remains under soft committee decision, where for two layers without input crosses and the maximal dataset, 6,000 examples per label, ε ~0.039 (Fig. 4a), which is much greater than ε ~0.007 (Fig. 1c). We note that ρ ~0.31 (Fig. 4a) is expected to slightly increase beyond ρ ~0.34 (Fig. 3c) using better statistics.
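The crossing between two power laws follows from equating them: c_a / D^ρ_a = c_b / D^ρ_b gives D = (c_b / c_a)^(1 / (ρ_b - ρ_a)). The sketch below assumes the form ε = c_0 / D^ρ for both architectures; the constants are hypothetical, chosen only to illustrate the calculation, and are not the fitted values behind the crossings reported above.

```python
def crossing_size(c0_a: float, rho_a: float, c0_b: float, rho_b: float) -> float:
    """Dataset size D at which c0_a / D**rho_a equals c0_b / D**rho_b."""
    if rho_a == rho_b:
        raise ValueError("equal exponents: the two power laws never cross")
    return (c0_b / c0_a) ** (1.0 / (rho_b - rho_a))


# Hypothetical constants: architecture A has the smaller prefactor,
# architecture B the larger exponent.
C0_A, RHO_A = 0.70, 0.30
C0_B, RHO_B = 0.90, 0.34
d_cross = crossing_size(C0_A, RHO_A, C0_B, RHO_B)

# Below the crossing A gives the smaller error; above it B does.
error_a_small = C0_A / 100 ** RHO_A
error_b_small = C0_B / 100 ** RHO_B
```

Because the difference between the exponents is small, the crossing point is very sensitive to small changes in the fitted constants.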

Discussion
The power-law scaling provides an initial step toward a theoretical framework for deep learning by feedforward neural networks. A classification task that is characterized by a much smaller power-law exponent, ρ, is categorized as a much harder classification problem. It demands a much larger dataset size to achieve the same test error, as long as the constant c_0 (eq. (5)) is similar. Similarly, one can compare the efficiency of the optimal learning strategy of two different architectures for the same dataset and number of epochs (Figs. 2 and 3), or compare two different BP strategies for the same architecture (Fig. 1). Our work calls for the extension and the confirmation of the power-law scaling phenomenon in other datasets 23,29-32, which will make it possible to build a hierarchy among their learning complexities. It is especially interesting to observe whether the power-law scaling will lead to a test error in the proximity of state-of-the-art algorithms for other classification and decision problems as well.
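The dependence of the required dataset size on the exponent can be made concrete by inverting the power law. This is a minimal sketch, assuming ε = c_0 / D^ρ, so that D = (c_0 / ε)^(1/ρ); the function name and the target error of 0.01 are illustrative, and the second exponent is a hypothetical harder task with the same c_0.

```python
def required_examples_per_label(target_error: float, c0: float, rho: float) -> float:
    """Invert eps = c0 / D**rho to obtain the dataset size D for a target error."""
    if target_error <= 0 or c0 <= 0 or rho <= 0:
        raise ValueError("target_error, c0 and rho must be positive")
    return (c0 / target_error) ** (1.0 / rho)


# Same prefactor, two exponents: rho = 0.50 as quoted for Fig. 1b,
# and rho = 0.30 as a hypothetical harder task.
size_easy = required_examples_per_label(0.01, c0=0.65, rho=0.50)
size_hard = required_examples_per_label(0.01, c0=0.65, rho=0.30)
print(f"rho = 0.50: about {size_easy:,.0f} examples per label")
print(f"rho = 0.30: about {size_hard:,.0f} examples per label")
```

Under these assumptions, lowering the exponent from 0.50 to 0.30 raises the required dataset size by more than two orders of magnitude.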
The observation that the test error with one training epoch is in the proximity of the minimized test error using a very large number of epochs paves the way for the realization of deep learning algorithms in real-time environments, such as tasks in robotics and network control. A relatively small test error, for instance less than 0.1, can be achieved for small datasets consisting of only a few tens of examples per label.
Finally, under the momentum strategy and many training epochs, the minimal saturated test errors of one, two, and three hidden layers without input crosses are found to be very similar (Fig. 4b). The test error, ε ~0.017, at the maximal dataset size and ρ ~0.4 has to be compared to ε ~0.008 with additional input crosses and ρ ~0.5 (Fig. 1b). For three layers, ε is slightly greater than for one or two layers, but within the error bars. This gap diminishes when the optimized test error for the three layers is obtained under an increased number of epochs, and through the construction of weights one can show that ε of two layers is achievable with three layers (see Supplementary Appendix F).
Furthermore, the similarity of ε, independent of the number of hidden layers and for many training epochs (Fig. 4b), is supported by our preliminary results, wherein the average ε of one hidden layer with input crosses and many training epochs is comparable with the one obtained with two hidden layers (Fig. 1b). These results may question the advantage of deep learning based on many hidden layers in comparison to shallow architectures. It is possible that this similarity in the test errors, independent of the number of hidden layers, is either an exceptional case or that a larger number of hidden layers enables an easier search in the BP parameter space, which achieves solutions in the proximity of the minimal test error. However, for the same examined architectures and for one epoch only, the test error and the exponent of the power-law are strongly dependent on the number of hidden layers (Fig. 3).

(For details of the parameters, see Supplementary Appendix B)
(For details of the parameters, see Supplementary Appendix C)

In the forward propagation, W^2_ij is the weight from the i-th unit in the first hidden layer to the j-th unit in the second hidden layer, and b^2_j is the bias induced on the j-th unit in the second hidden layer. z^2_{j,m} represents the field for the second layer. Each time we calculate the field, z^2_{j,m}, we subtract the accumulative average field for the second layer of the previous m-1 examples, where Amp_2 is a constant representing the amplitude of reduction. Note that z^2_{j,m} is not modified for m = 1.
The output of the j-th unit in the output layer, a^3_j, is calculated from W^3_ij, the weight from the i-th unit in the second hidden layer to the j-th output unit, and b^3_j, the bias induced on the j-th output unit. The backpropagation method computes the gradient of the cost function with respect to each weight. The weights and biases are updated according to the advanced acceleration method 1, where t is the discrete time-step, W are the weights, 1-α is a regularization constant and η is defined for each weight.

Optimization:
Selection of the optimized parameters. For a given architecture and number of epochs, the optimization procedure first evaluates the test error over a rough grid of the adjustable parameters, followed by fine-tuning grids with higher resolutions. For example, the α parameter in the range (0, 1) was first estimated under a rough grid, Δα = 0.1. Next, the selected range for further optimization, (0, 0.1) for instance, was estimated under a resolution of Δα = 0.01, and finally under a resolution of Δα = 0.0001. The maximal resolution was selected such that the test error for a desired resolution was unaffected by selecting a higher resolution. All other tunable parameters were optimized similarly. Note that the training error practically vanishes. For the momentum strategy and small dataset sizes, a search over the entire selected grid was possible. However, for large dataset sizes and for the acceleration strategy, which consists of 11 parameters, an optimization of the test accuracy over a grid was beyond our computational capabilities. We note that in order to obtain a meaningful optimization procedure, each measured point needs to be averaged over 20-50 different samples; otherwise, the optimization procedure is dominated by stochastic fluctuations.
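The coarse-to-fine procedure described above can be sketched for a single parameter as follows. This is an illustration under stated assumptions, not the authors' code: the objective is a deterministic toy function standing in for the averaged test error, and the function name, the resolutions and the refinement window of one coarse step on each side of the best point are all hypothetical choices.

```python
from typing import Callable, Sequence


def coarse_to_fine_search(
    objective: Callable[[float], float],
    low: float,
    high: float,
    resolutions: Sequence[float] = (0.1, 0.01, 0.001),
) -> float:
    """Minimize objective on [low, high] with successively finer grids."""
    lo, hi = low, high
    best = lo
    for step in resolutions:
        n_points = int(round((hi - lo) / step)) + 1
        grid = [lo + i * step for i in range(n_points)]
        best = min(grid, key=objective)
        # Refine around the best point, staying inside the original range.
        lo = max(low, best - step)
        hi = min(high, best + step)
    return best


def toy_error(alpha: float) -> float:
    """Stand-in for the averaged test error, with its minimum at alpha = 0.037."""
    return (alpha - 0.037) ** 2


best_alpha = coarse_to_fine_search(toy_error, 0.0, 1.0)
```

With a measured test error in place of `toy_error`, each evaluation would itself be an average over many samples, as noted in the text, so that grid points are not ranked by stochastic fluctuations.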
In cases where a complete optimization over a grid was impossible, we optimized each parameter sequentially over its grid. Nevertheless, we confirmed that a few different sequential orders of the optimized parameters result in the same optimized test accuracy and set of parameters.
The optimization is performed independently for each examined dataset size, number of examples and number of epochs. Results for the committee systems are based on the optimized selected parameters for a single system. The optimized parameters are summarized in the following tables.
We note that cross validation was confirmed using several validation databases, each consisting of 10,000 random examples with the same statistics for each label as in the test set.
The parameters used in this figure are the same as in Figure 1b.
The presented dataset of examples for the algorithm involves the following initial preprocessing and subsequent steps (see Supplementary Appendix A):
(a) Balanced set of examples: The small dataset consists of an equal number of random examples per label 24.
(b) Input bias: The bias of each example is subtracted and the standard deviation of its 784 pixels is normalized to unity.
(c) Fixed order of trained labels: In each epoch, examples are ordered at random, conditioned on the fixed order of the labels.
(d) Microcanonical set of input crosses: Each hidden unit in the first layer receives the same number of input crosses, in which each cross comprises two input pixels.
(e) Forward propagation: A standard sigmoid activation function is attributed to each node 25, and in the forward propagation the accumulative average field is dynamically subtracted from the induced field on each node of the hidden layers.
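Step (b) above can be sketched in a few lines. The code assumes that the "bias" of an example is the mean of its pixel values and that the population standard deviation (division by the number of pixels) is used; both are assumptions of this sketch, and the function name is illustrative.

```python
import math
from typing import List, Sequence


def normalize_example(pixels: Sequence[float]) -> List[float]:
    """Subtract the mean of an example and scale it to unit standard deviation."""
    n = len(pixels)
    if n == 0:
        raise ValueError("empty example")
    mean = sum(pixels) / n
    centered = [p - mean for p in pixels]
    std = math.sqrt(sum(c * c for c in centered) / n)
    if std == 0:
        raise ValueError("a constant image cannot be normalized")
    return [c / std for c in centered]


# A tiny four-pixel stand-in for a 784-pixel digit with values in [0, 255].
normalized = normalize_example([0, 0, 255, 255])
```

After this step every example has zero mean and unit standard deviation, regardless of its overall brightness or contrast.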

In the cross entropy cost function, C, y_m stands for the desired labels of the m-th example, a_m stands for the current 10 outputs of the output layer L, and the first summation is over all M training examples. The second summation is over all weights of the network, and η and α are constants defined in eqs. (1) and (3), respectively.

The five parameters of the momentum strategy are (η, μ, α, Amp_1, Amp_2), where Amp_i are the amplitudes associated with each hidden layer in the forward propagation (see Supplementary Appendix A). The minimized averaged test error, ε, for a number of examples per label in the range [9, 120] indicates a power-law scaling

$$\epsilon = \frac{c_0}{(\text{examples/label})^{\rho}} \qquad (5)$$

Figure 1. Power-law scaling for the test error with many epochs. (a) Scheme of an MNIST handwritten digit, which is digitized and fed into the trained network including input crosses (red background). (b) Optimized test error, ε, using the architecture in (a), for limited datasets comprising 9, 15, 30 and 60 examples/label and their standard deviations obtained from 50 samples. Momentum strategy (light-blue circles) and acceleration strategy (black triangles). (c) Test error for soft committee decision with N_c = 50 (eq. 8).

Figure 2. Power-law scaling for the test error with one epoch. (a) Test error and its standard deviation as a function of the number of examples per label for one epoch only, where the trained network is the same as in Fig. 1a. Results for the momentum strategy (orange) and for the soft committee, N_c = 50 (blue), where each point is averaged over at least 100 samples. (b) Similar to (a) using the accelerated BP strategy, eqs. (6) and (7).

Figure 3. Power-law scaling for the test error with several hidden layers and one epoch. (a) Scheme of the trained network on the MNIST examples, consisting of three hidden layers each having 100 units and an output layer. In the case of one/two hidden layers only, two/one hidden layers are removed. (b) Minimized test error for 30, 60, 120, and 240 examples/label for one hidden layer in (a) using the momentum strategy and one epoch only. The average of each point and its standard deviation are obtained from at least 100 samples. (c) Similar to (b) with two hidden layers in (a). (d) Similar to (b) with three hidden layers in (a). (For details of the parameters, see Supplementary Appendix D)

Back propagation:
We use the cross entropy cost function

$$C = -\frac{1}{M} \sum_{m} \left[ y_m \cdot \log(a_m) + (1 - y_m) \cdot \log(1 - a_m) \right]$$

where y_m stands for the desired labels and a_m stands for the current 10 output units of the output layer, and η and α are constants defined in eqs. (1) and (3) in the main text, respectively. The summation is over all M training examples. The second summation is over all weights of the network. Note that for the accelerated strategy, η = η_t in the above cost function.
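The cross entropy term written above can be evaluated as in the following sketch. It covers only the sum over training examples; the sum over the weights mentioned in the text is not included because its form is not given here. Treating the product y_m ⋅ log(a_m) as a sum over the output units, and clamping the outputs away from 0 and 1 for numerical safety, are assumptions of this sketch.

```python
import math
from typing import Sequence


def cross_entropy_cost(
    targets: Sequence[Sequence[float]],
    outputs: Sequence[Sequence[float]],
    tiny: float = 1e-12,
) -> float:
    """C = -(1/M) * sum over examples and output units of
    y*log(a) + (1-y)*log(1-a), without any weight regularization term."""
    if len(targets) != len(outputs) or not targets:
        raise ValueError("targets and outputs must be non-empty and of equal length")
    total = 0.0
    for y_vec, a_vec in zip(targets, outputs):
        for y, a in zip(y_vec, a_vec):
            a = min(max(a, tiny), 1.0 - tiny)  # numerical safeguard, not in the source
            total += y * math.log(a) + (1.0 - y) * math.log(1.0 - a)
    return -total / len(targets)


# One example with two output units, both at 0.5: C = 2 * ln(2).
cost_example = cross_entropy_cost([[1.0, 0.0]], [[0.5, 0.5]])
```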
A_d and β_d are constants representing the amplitude and the gain between the d-th and (d+1)-th layers, d = 1, 2 and 3. η is initialized as:

$$\eta_0 = A_d \cdot \tanh(\beta_d \cdot \nabla_W C_{\text{first}})$$

where ∇_W C_first is the first computed gradient. V is initialized as:

$$V_0 = -|\eta_0| \cdot \nabla_W C_{\text{first}}$$

Test accuracy:
The network test accuracy is calculated based on the MNIST dataset for testing, containing 10,000 input examples. The test examples are modified in the same way as the examples in the training dataset.
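The initialization of η and V given above can be written for a single weight as the following sketch. Only the two initialization formulas are taken from the text; the function name and the numerical values are illustrative, and the subsequent update rule of the accelerated strategy is not shown.

```python
import math
from typing import Tuple


def init_step_and_velocity(first_gradient: float, amplitude: float, gain: float) -> Tuple[float, float]:
    """Return (eta_0, V_0) for one weight:
    eta_0 = A * tanh(beta * grad), V_0 = -|eta_0| * grad."""
    eta0 = amplitude * math.tanh(gain * first_gradient)
    v0 = -abs(eta0) * first_gradient
    return eta0, v0


# Illustrative values: A = 2.0, beta = 1.0, first gradients of +0.5 and -0.5.
eta_pos, v_pos = init_step_and_velocity(0.5, amplitude=2.0, gain=1.0)
eta_neg, v_neg = init_step_and_velocity(-0.5, amplitude=2.0, gain=1.0)
```

Because of the absolute value, V_0 always points against the first gradient, while the sign of η_0 follows the sign of that gradient.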
The 11 parameters of the accelerated strategy include the amplitudes A_1, A_2, A_3 and the gains β_1, β_2, β_3, as well as the forward-propagation amplitudes Amp_1 and Amp_2 (see Supplementary Appendix A). The test error is further minimized using a soft committee decision based on several replicas, N_c, of the network, which are trained on the same set of examples but with different initial weights. The result label, j, for the test accuracy is given by eq. (8).

Figure 1c optimized parameters: Momentum strategy -Classifications Examples/ label Epoch N c Committee Success rate committees Std committee
For validation databases of 10,000 random examples with the same statistics for each label as in the test set, the averaged results were within the standard deviation of the reported test errors. In addition, preliminary results indicate that databases consisting of randomly selected examples, with different fluctuations for each label, also result in similar test errors.