Introducing principles of synaptic integration in the optimization of deep neural networks

Plasticity circuits in the brain are known to be influenced by the distribution of the synaptic weights through the mechanisms of synaptic integration and local regulation of synaptic strength. However, the complex interplay of stimulation-dependent plasticity with local learning signals is disregarded by most artificial neural network training algorithms devised so far. Here, we propose a novel biologically inspired optimizer for artificial and spiking neural networks that incorporates key principles of synaptic plasticity observed in cortical dendrites: GRAPES (Group Responsibility for Adjusting the Propagation of Error Signals). GRAPES implements a weight-distribution-dependent modulation of the error signal at each node of the network. We show that this biologically inspired mechanism leads to a substantial improvement in the performance of artificial and spiking networks with feedforward, convolutional, and recurrent architectures, mitigates catastrophic forgetting, and is optimally suited for dedicated hardware implementations. Overall, our work indicates that reconciling neurophysiology insights with machine intelligence is key to boosting the performance of neural networks.

By design, the modulation factor for GRAPES is bounded in the interval [1,2]. This range has been empirically determined by varying the upper and lower bounds and choosing the interval that allows for the most significant improvements with respect to standard SGD. Supplementary Table 1 shows the test accuracy for different ranges of the modulation factor.
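Because the modulation factor is bounded by design, one simple way to obtain values in the chosen interval is to rescale a per-node score into it. The sketch below is a minimal, hypothetical illustration of such a rescaling into [1,2]; it is not the GRAPES formula and does not show how GRAPES actually derives the per-node responsibility from the weight distribution.

```python
# Illustrative only: mapping an arbitrary per-node "responsibility" score into the
# bounded modulation interval [1, 2]. How GRAPES actually computes the score of
# each node from the weight distribution is not shown here.
import numpy as np

def bounded_modulation(responsibility, low=1.0, high=2.0):
    """Min-max rescale a vector of per-node scores into [low, high]."""
    r = np.asarray(responsibility, dtype=float)
    spread = r.max() - r.min()
    if spread == 0.0:                      # all nodes equally responsible
        return np.full_like(r, low)
    return low + (high - low) * (r - r.min()) / spread

# Hypothetical usage: one score per node of a layer.
scores = np.array([0.2, 1.3, 0.7, 2.4])
print(bounded_modulation(scores))          # values lie in [1, 2]
```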

Range of modulation    Optimized learning rate    Test accuracy [%]
[1,2]                  0.05                       98.53 ± 0.06
[1,3]                  0.01                       98.53 ± 0.06
[1,5]                  0

Convergence analysis

We have shown that GRAPES implements a dynamic learning schedule for each weight. Here, we demonstrate the stability of GRAPES by analytically proving its convergence properties. The proof relies on the online learning framework proposed in [1], similarly to the investigation of the convergence of the Adam optimizer in [2]. In online convex programming, an algorithm addresses a sequence of convex programming problems, each consisting of a convex feasible set $W \subseteq \mathbb{R}^n$, which is the same for all problems, and a convex cost function $f_t(w) : W \to \mathbb{R}$, which in principle is different for each problem. Given an arbitrary, unknown sequence of $T$ such convex cost functions $f_1(w), f_2(w), \ldots, f_T(w)$, at each time step $t$ the algorithm must predict the parameter vector $w_t$ before observing the cost function $f_t$. After the vector $w_t$ is selected, it is evaluated on $f_t$.

Since the convex functions can be unrelated to one another and the nature of the sequence is unknown in advance, we evaluate the convergence of the GRAPES modulation using the regret function [1]. The regret function compares the proposed online algorithm with an "offline" algorithm. In the online algorithm, a decision is made for each convex function before the cost is known. The "offline" algorithm, on the other hand, knows the sequence of convex cost functions in advance and makes a single choice that minimizes the total cost $f(w) = \sum_{t=1}^{T} f_t(w)$. The regret is defined as the difference between the cost of the online algorithm and the cost of the offline algorithm, i.e., the sum over time of the differences between the online prediction $f_t(w_t)$ and the cost at the best fixed parameter, $f_t(w^*)$:
\[
  R(T) \;=\; \sum_{t=1}^{T} \big( f_t(w_t) - f_t(w^*) \big), \qquad w^* = \arg\min_{w \in W} \sum_{t=1}^{T} f_t(w).
\]
We show that the GRAPES modulation has an $O(\sqrt{T})$ regret bound, similarly to standard SGD.

Theorem. We express the learning rule obtained by applying the local GRAPES modulation to SGD as
\[
  w_{t+1} \;=\; w_t - \eta_t M_t g_t,
\]
where:
• $g_t = \nabla f_t(w_t)$ is the gradient of the cost function with respect to the parameter $w_t$ at time step $t$;
• $\eta_t$ is the learning rate at time step $t$;
• $M_t$ is the modulation factor at time step $t$. We define $M_{\max}$ as the maximum value attained by the modulation factor; due to the constraints on the modulation factor, $M_{\max} \leq 2$.

Assume that the GRAPES modulation applied to SGD during training can select only parameters $w_i$ belonging to a convex feasible set $W$, and that the programming problem consists of an arbitrary, unknown sequence of convex cost functions $f_1(w), f_2(w), \ldots, f_T(w)$ with $f_i(w) : W \to \mathbb{R}$. Assume that:
1. The feasible set is bounded, i.e., $\exists k \in \mathbb{R} : \forall w_i, w_j \in W,\ \lVert w_i - w_j \rVert \leq k$. We define $\lVert W \rVert = \max_{w_i, w_j \in W} \lVert w_i - w_j \rVert$.
2. The feasible set is closed, i.e., every convergent sequence of points of $W$ has its limit in $W$.
3. The feasible set is non-empty, i.e., $\exists w \in W$.
4. The cost functions are differentiable, i.e., $\forall t$, $f_t$ is differentiable. We define the gradient of the cost function as $g_t = \nabla f_t(w_t)$.
5. The cost functions have bounded gradients, i.e., $\exists k \in \mathbb{R} : \forall t, \forall w \in W,\ \lVert \nabla f_t(w) \rVert \leq k$. We define $\lVert \nabla f \rVert = \max_{w \in W,\, t \in \{1,2,\ldots\}} \lVert \nabla f_t(w) \rVert$.
6. $\forall t, \forall w \in \mathbb{R}^n$, there exists an algorithm $A$ that, given $w$ and $\nabla f_t(w)$, can produce the projection onto the feasible set, $\arg\min_{w' \in W} \lVert w' - w \rVert$.

Assume furthermore that the learning rate follows $\eta_t = t^{-1/2}$. Under these assumptions, the GRAPES modulation applied to SGD achieves the following guarantee, for all $T \geq 1$.
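For orientation, the classical guarantee from the online convex programming framework of [1], for unmodulated projected gradient descent with learning rate $\eta_t = t^{-1/2}$ and restated in the notation used here, is sketched below. The GRAPES Theorem is stated to achieve the same $O(\sqrt{T})$ scaling, with constants that additionally involve $M_{\max} \leq 2$; the restatement follows [1] and should not be read as the exact constants of the GRAPES bound.

```latex
% Reference bound from the online convex programming framework of [1]
% (projected gradient descent with \eta_t = t^{-1/2}), restated in the
% notation of this section; given for orientation only.
\[
  R_{\mathrm{SGD}}(T) \;\leq\; \frac{\lVert W \rVert^{2}\sqrt{T}}{2}
  \;+\; \left(\sqrt{T} - \frac{1}{2}\right)\lVert \nabla f \rVert^{2}
  \;=\; O\!\left(\sqrt{T}\right).
\]
```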
Corollary. Under the same conditions as the Theorem, the GRAPES modulation applied to SGD achieves the following guarantee, for all T ≥ 1.
Under the same conditions, standard SGD achieves the following guarantee, for all T ≥ 1.
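To make the online setting and the modulated update concrete, the following self-contained sketch (not from the paper) runs $w_{t+1} = w_t - \eta_t M_t g_t$ with $\eta_t = t^{-1/2}$ on a toy sequence of convex quadratic costs and measures the regret against the best fixed parameter in hindsight. The modulation factors are drawn arbitrarily from [1, 2), since the weight-distribution-dependent rule that GRAPES uses to compute them is not part of this derivation; all names are illustrative.

```python
# Illustrative (not the authors' code): GRAPES-style modulated online gradient
# descent on a toy sequence of convex costs f_t(w) = 0.5 * ||w - c_t||^2,
# tracking the regret R(T) = sum_t [f_t(w_t) - f_t(w_star)].
import numpy as np

rng = np.random.default_rng(0)
T, dim = 2000, 5
targets = rng.normal(size=(T, dim))          # c_t defining each convex cost
radius = 3.0                                 # feasible set W = {w : ||w|| <= radius}

def cost(w, c):
    return 0.5 * np.sum((w - c) ** 2)

def project(w):
    """Projection onto the bounded feasible set (assumption 6)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

w = np.zeros(dim)
losses = []
for t in range(1, T + 1):
    c_t = targets[t - 1]
    losses.append(cost(w, c_t))              # cost is revealed after predicting w_t
    g_t = w - c_t                            # gradient of f_t at w_t
    eta_t = t ** -0.5                        # learning-rate schedule of the Theorem
    M_t = rng.uniform(1.0, 2.0)              # placeholder modulation factor in [1, 2)
    w = project(w - eta_t * M_t * g_t)       # modulated SGD step, then projection

# Best fixed parameter in hindsight: the mean of the targets minimizes the summed
# cost (projected onto W for consistency with the constrained setting).
w_star = project(targets.mean(axis=0))
regret = sum(losses) - sum(cost(w_star, c) for c in targets)
print(f"R(T) = {regret:.2f},  R(T)/sqrt(T) = {regret / np.sqrt(T):.2f}")
```

Varying T in such a sketch gives a quick check of how the empirical regret scales. In general, a regret that grows only as $O(\sqrt{T})$ implies that the average regret $R(T)/T$ vanishes as $T \to \infty$.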

Main steps of the convergence analysis
First we show that, since the $f_t$ are convex functions, the regret function admits an upper bound in terms of the gradients.

Definition 1. For a convex feasible set $W$, a function $f : W \to \mathbb{R}$ is convex if, for all $w_i, w_j \in W$ and all $\lambda \in [0, 1]$,
\[
  f\big(\lambda w_i + (1 - \lambda) w_j\big) \;\leq\; \lambda f(w_i) + (1 - \lambda) f(w_j).
\]

Using Lemma 2 we have
\[
  f_t(w_t) - f_t(w^*) \;\leq\; g_t \cdot (w_t - w^*) \;=\; \sum_{q} g_{t,q}\,\big(w_{t,q} - w^*_{q}\big),
\]
where $g_{t,q}$, $w_{t,q}$ and $w^*_{q}$ are the $q$-th components of $g_t$, $w_t$ and $w^*$, respectively. This leads to
\[
  R(T) \;\leq\; \sum_{t=1}^{T} g_t \cdot (w_t - w^*).
\]
Now we use the weight update rule $w_{t+1} = w_t - \eta_t M_t g_t$ and the learning-rate dynamics $\eta_t = t^{-1/2}$ to bound this sum. We start by manipulating the weight update rule:
\[
  \lVert w_{t+1} - w^* \rVert^2
  \;=\; \lVert w_t - w^* \rVert^2 - 2\,\eta_t M_t\, g_t \cdot (w_t - w^*) + \eta_t^2 M_t^2 \lVert g_t \rVert^2 .
\]
Since, by assumption, $\lVert g_t \rVert = \lVert \nabla f_t \rVert \leq \lVert \nabla f \rVert$, we can upper bound the expression above as
\[
  \lVert w_{t+1} - w^* \rVert^2
  \;\leq\; \lVert w_t - w^* \rVert^2 - 2\,\eta_t M_t\, g_t \cdot (w_t - w^*) + \eta_t^2 M_t^2 \lVert \nabla f \rVert^2 .
\]
By rearranging we obtain
\[
  g_t \cdot (w_t - w^*)
  \;\leq\; \frac{\lVert w_t - w^* \rVert^2 - \lVert w_{t+1} - w^* \rVert^2}{2\,\eta_t M_t}
  \;+\; \frac{\eta_t M_t}{2}\,\lVert \nabla f \rVert^2 .
\]
By substituting this inequality into the regret bound above we can write
\[
  R(T) \;\leq\; \sum_{t=1}^{T} \frac{\lVert w_t - w^* \rVert^2 - \lVert w_{t+1} - w^* \rVert^2}{2\,\eta_t M_t}
  \;+\; \frac{\lVert \nabla f \rVert^2}{2} \sum_{t=1}^{T} \eta_t M_t .
\]
We apply the change of variable $t' = t + 1$ to the second term of the first sum and insert the result back into the bound, so that the squared-distance terms telescope. We then use the assumption that the feasible set $W$ is bounded, i.e., $\forall t,\ \lVert w_t - w^* \rVert \leq \lVert W \rVert$, together with the learning-rate schedule $\eta_t = t^{-1/2}$, the bound $\sum_{t=1}^{T} t^{-1/2} \leq 2\sqrt{T} - 1$, and the condition $\forall t,\ M_t \in [1, 2)$, to obtain the guarantee stated in the Theorem.

Supplementary Note 3: Model performance with smaller learning rates than the optimized ones

Figures 4 and 5 show the performance of SGD and GRAPES on models trained with a small learning rate, η = 0.001. In Figure 4 we compare the performance of GRAPES and SGD on models of increasing complexity in terms of depth and layer size. In Figure 5

A further possible origin of the benefits of GRAPES lies in the adjustment of the error signal. The inhomogeneous distribution of the local modulation factor, combined with its propagation to upstream layers in the propagating version, allows GRAPES to greatly enhance a subset of synaptic updates during training. Hence, small groups of synapses are enabled to strengthen or weaken their weights to a much larger extent than with SGD. We compared the layer-wise weight distributions of networks trained with SGD and with GRAPES, both initialized with a normal distribution. After training with SGD, the weight distribution is still close to a Gaussian. By contrast, after GRAPES optimization we observed that, particularly in the first hidden layers of deep networks, the weight distribution no longer follows a Gaussian shape: it is wider and characterized by long tails.
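A simple way to quantify this qualitative observation is to compare tail statistics of the layer-wise weights. The sketch below is illustrative only: it assumes the trained weight matrices are available as NumPy arrays and uses synthetic stand-ins; the function and variable names are hypothetical.

```python
# Illustrative tail statistics for layer-wise weight distributions, e.g. to
# compare weights trained with SGD vs. GRAPES (here replaced by synthetic data).
import numpy as np

def tail_stats(weights):
    """Return (std, excess kurtosis, fraction of weights beyond 3 std)."""
    w = np.asarray(weights, dtype=float).ravel()
    mu, sigma = w.mean(), w.std()
    z = (w - mu) / sigma
    excess_kurtosis = np.mean(z ** 4) - 3.0      # 0 for a Gaussian
    tail_fraction = np.mean(np.abs(z) > 3.0)     # ~0.0027 for a Gaussian
    return sigma, excess_kurtosis, tail_fraction

# Synthetic stand-ins: a Gaussian-like layer vs. a heavier-tailed one.
rng = np.random.default_rng(0)
gaussian_like = rng.normal(0.0, 0.05, size=100_000)
heavy_tailed = rng.standard_t(df=5, size=100_000) * 0.05
for name, w in [("gaussian-like", gaussian_like), ("heavy-tailed", heavy_tailed)]:
    sigma, kurt, tail = tail_stats(w)
    print(f"{name:13s}  std={sigma:.3f}  excess_kurtosis={kurt:.2f}  |z|>3: {tail:.4f}")
```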
Recent electrophysiological studies have revealed that the amplitudes of EPSPs between cortical neurons are not distributed as Gaussians but follow a long-tailed pattern, typically lognormal [3,4]. Such a distribution implies that a few synapses are very strong while many synapses are weak ("strong-sparse and weak-dense" networks) [5]. Heavy-tailed distributions were shown to lead to important network properties: faster transient responses, a higher dynamic range, and lower sensitivity to random fluctuations in synaptic activity [6]. This suggests that the long-tailed distributions found at many physiological and anatomical levels in the brain are fundamental to structural and functional brain organization [3]. From preliminary investigations, GRAPES appears to drive the network weights toward a more biologically plausible distribution; Figure 6 illustrates this phenomenon. This implies that the properties of faster learning and greater robustness to noise exhibited by networks trained with GRAPES could stem from the modulation of the gradients adjusting the weights toward a long-tailed distribution.

Training ANNs on hardware accelerators entails a certain performance degradation due to a number of hardware-specific constraints, such as noisy synaptic updates, the limited range of synaptic weights [7], and their update frequency and resolution [8]. We empirically demonstrate that GRAPES mitigates this accuracy degradation. Inspired by the approach in [9], we investigate the effect of granularity and stochasticity of the weight updates. First, we establish a reference performance for a 1-hidden-layer network trained on the MNIST data set with full-precision (FP) arithmetic (64 bits) and no noise. Specifically, after FP training, the network achieves a test accuracy of 97.08% with SGD and 97.23% with GRAPES. Then, we apply fixed n-bit granularity and stochasticity to the weight updates, as described below in Simulation details. Figure 7(a) reports the classification accuracy for different granularity levels and noise amplitudes. In the absence of noise, the accuracy is close to the reference FP value for 4-, 6-, and 8-bit precision, while for 2-bit granularity an accuracy drop of 1.5% is observed. As noise with increasing amplitude is added, the model accuracy progressively deteriorates. This degradation is robustly mitigated when GRAPES is applied. Figure 7(b) shows the test curves for a noise standard deviation of σ = 1.5ϵ. For all weight granularities, GRAPES leads to higher classification accuracy over the entire training period.
We remark that the implemented hardware constraints share many aspects with biological circuits: synaptic transmission is affected by noise, the signal is quantized, and neurons have a limited fan-in/fan-out. Interestingly, GRAPES shares many similarities with biological processes such as synaptic integration, synaptic scaling, and heterosynaptic plasticity. We therefore envision that the brain might exploit such mechanisms to overcome the limitations imposed by these constraints and, indeed, to endow its networks with the ability to exploit noise to improve performance. Our analysis is consistent with previous work providing evidence that synaptic integration, combined with the intrinsic stochasticity of Poisson spike trains, enhances the computational capabilities of spiking networks on pattern classification tasks [10]. In conclusion, our findings suggest that incorporating GRAPES in on-chip training algorithms could pave the way for pivotal progress in learning algorithms for bio-inspired hardware and, in particular, for neuromorphic chips.
Simulation details

We train 1-hidden-layer networks on the MNIST data set. The hidden layer has 250 sigmoid neurons, the output activation is softmax, and the loss is cross-entropy. No dropout is introduced. The networks are trained for 10 epochs with a fixed learning rate η = 0.4. The weight distribution of the FP network after training lies in the range [-1,1]. To investigate the effect of fixed n-bit granularity, we assume a final weight distribution range similar to that of the floating-point simulation. Hence, we cover the weight range [-1,1] in 2^n − 2 steps, with n ∈ {2, 4, 6, 8}. We use the n-bit granularity for the forward pass. We perform the backward pass on a floating-point copy of the network and apply the granularity after the update. The stochasticity is applied to the weight update as white noise with mean zero and standard deviation σ = kϵ, where ϵ = 1/(2^n − 2) and k ∈ {0.0, 0.5, 1.0, 1.5}. Both the final test accuracy and the test curves are averaged over five runs.

Test accuracy and convergence rate on the MNIST dataset for networks trained with BP and the SGD optimizer, comparing the results for the local and propagating versions of GRAPES. The reported result is the average and standard deviation of the best test accuracy over five runs. "DO" stands for dropout; η is the learning rate. Both the local and the propagating versions of GRAPES always outperform classic SGD, both in terms of accuracy (acc) and slowness (s). The propagating version yields the largest improvements.
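As a concrete illustration of the scheme described in the Simulation details above, the following sketch (illustrative, not the authors' code) builds the n-bit weight grid over [-1, 1] with 2^n − 2 steps, keeps a full-precision copy of the weights for the backward pass, quantizes the updated copy for the forward pass, and adds zero-mean white noise with standard deviation σ = kϵ to each weight update. Function and variable names are assumptions.

```python
# Illustrative n-bit weight granularity and stochastic updates, following the
# scheme in "Simulation details": weights in [-1, 1] on a grid of 2**n - 2 steps,
# a full-precision copy used for the backward pass, and white noise of standard
# deviation sigma = k * eps added to each weight update.
import numpy as np

def quantize(weights, n_bits):
    """Snap weights to the grid covering [-1, 1] in 2**n_bits - 2 steps."""
    step = 2.0 / (2 ** n_bits - 2)                    # grid spacing over a width-2 range
    return np.clip(np.round(weights / step) * step, -1.0, 1.0)

def noisy_update(fp_weights, gradient, lr, n_bits, k, rng):
    """One update on the full-precision copy, with additive white noise of std k*eps."""
    eps = 1.0 / (2 ** n_bits - 2)                     # as defined in Simulation details
    noise = rng.normal(0.0, k * eps, size=fp_weights.shape)
    fp_weights = fp_weights - lr * gradient + noise
    quantized = quantize(fp_weights, n_bits)          # n-bit copy used for the forward pass
    return fp_weights, quantized

# Toy usage with hypothetical shapes and values (250 hidden units, 784 inputs).
rng = np.random.default_rng(0)
fp_w = rng.normal(0.0, 0.1, size=(250, 784))
grad = rng.normal(0.0, 0.01, size=fp_w.shape)
fp_w, q_w = noisy_update(fp_w, grad, lr=0.4, n_bits=4, k=1.5, rng=rng)
print(q_w.min(), q_w.max(), np.unique(q_w).size)
```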

Supplementary Table 3. Test accuracy and convergence rate on the MNIST dataset for networks trained with BP and the SGD optimizer, with and without GRAPES modulation. In the "Def" column we indicate whether the elastic deformation scheme [11] was used. "DO" stands for dropout. The learning rate is η = 0.001 with ReLU activation and η = 0.01 with tanh activation. The columns marked with "augm η" indicate that the learning rate has been uniformly multiplied by a factor equal to the mean of the local modulation factor of GRAPES. The reported result is the average and standard deviation of the best test accuracy over five runs. The GRAPES modulation outperforms classic SGD in most cases. Note that, even in the case where GRAPES is used with a smaller learning rate than SGD (the network configuration with 10 hidden layers and tanh activations, where η = 0.01 for SGD and η = 0.001 for GRAPES; see Supplementary Table 10 for details on the learning rate), it provides a better convergence rate.

Supplementary Table 9. Test accuracy on the CIFAR-10 and CIFAR-100 datasets for residual networks trained with BP and the Adam optimizer, with and without GRAPES modulation, for learning rates larger and smaller than the optimized one. We trained the models with learning rates η = 1e-3 and η = 1e-1, respectively smaller and larger than the optimized learning rate η = 1e-2. The results confirm that η = 1e-2 is the optimal learning rate for both SGD and GRAPES. The reported results are for a single run.