Synaptic metaplasticity in binarized neural networks

While deep neural networks have surpassed human performance in multiple situations, they are prone to catastrophic forgetting: upon training on a new task, they rapidly forget previously learned ones. Neuroscience studies, based on idealized tasks, suggest that in the brain, synapses overcome this issue by adjusting their plasticity depending on their past history. However, such "metaplastic" behaviors do not transfer directly to mitigate catastrophic forgetting in deep neural networks. In this work, we interpret the hidden weights used by binarized neural networks, a low-precision version of deep neural networks, as metaplastic variables, and modify their training technique to alleviate forgetting. Building on this idea, we propose and demonstrate experimentally, in situations of multitask and stream learning, a training technique that reduces catastrophic forgetting without needing previously presented data or formal boundaries between datasets, and with performance approaching more mainstream techniques that rely on task boundaries. We support our approach with a theoretical analysis on a tractable task. This work bridges computational neuroscience and deep learning, and presents significant assets for future embedded and neuromorphic systems, especially when using novel nanodevices featuring physics analogous to metaplasticity.

Supplementary Algorithm 1 Forward function of the BNN, reproduced from [1].

Input: x. Output: ŷ, cache.
1: a_0 ← x ▷ Input is not binarized
2: for l = 1 to L do ▷ Loop over the layers
3: z_l ← W^b_l a^b_{l-1} ▷ Matrix multiplication (with a^b_0 = a_0)
4: a_l ← γ_l · (z_l − E(z_l)) / √(Var(z_l) + ε) + β_l ▷ Batch Normalization [2]
5: if l < L then ▷ If not the last layer
6: a^b_l ← Sign(a_l) ▷ Activation is binarized
7: end if
8: end for
9: ŷ ← a_L
10: return ŷ, cache

Supplementary Algorithm 2 Backward function of the BNN, reproduced from [1]. W^b = (W^b_l)_{l=1…L} are the binary weights, θ_BN = {(γ_l, β_l) | l = 1…L} are the Batch Normalization parameters. BackBatchNorm(·) specifies how to backpropagate through Batch Normalization [2]. L is the total number of layers, and the subscript l, when specified, is the layer index. 1_{|a_l|≤1} is the derivative of Hardtanh, taken as a replacement for backpropagating through the Sign activation.
Input: C, ŷ, W^b, θ_BN, cache. Output: (∂_W C, ∂_θ C).
1: g_{a_L} ← ∂C/∂ŷ ▷ Cost gradient with respect to the output
2: for l = L to 1 do ▷ Loop backward over the layers
3: if l < L then ▷ If not the last layer
4: g_{a_l} ← g_{a^b_l} · 1_{|a_l|≤1} ▷ Backprop through Sign
5: end if
6: (g_{z_l}, g_{γ_l}, g_{β_l}) ← BackBatchNorm(g_{a_l}, z_l, γ_l, β_l) ▷ See [2]
7: g_{a^b_{l-1}} ← g_{z_l} W^b_l ▷ Gradient with respect to the previous activation
8: g_{W^b_l} ← g_{z_l}ᵀ a^b_{l-1} ▷ Gradient with respect to the binary weights
9: end for
10: ∂_W C ← {g_{W^b_l} | l = 1…L}
11: ∂_θ C ← {g_{γ_l}, g_{β_l} | l = 1…L}
12: return (∂_W C, ∂_θ C)

The optimization is performed using the Adaptive Moment Estimation (Adam) algorithm [3]. As the Sign function is not differentiable at zero and its derivative is zero on R*, the derivative of the Hardtanh function is used as a replacement for the derivative of the Sign function during error backpropagation. The activation function is the Sign function, except at the output layer, and the input neurons are not binarized. We use batch normalization [2] at all layers, as detailed in Alg. 1. The following derivation for layer l,

a_l = γ_l (z_l − E(z_l)) / √(Var(z_l) + ε) + β_l = (γ_l / √(Var(z_l) + ε)) [ z_l − ( E(z_l) − (β_l/γ_l) √(Var(z_l) + ε) ) ],

so that

Sign(a_l) = Sign(γ_l) · Sign( z_l − [ E(z_l) − (β_l/γ_l) √(Var(z_l) + ε) ] ),

shows that, because the Sign function is invariant to any positive multiplicative constant applied to its input, the only task-dependent parameters we need to store for an inference hardware chip are the term between square brackets, along with the sign of γ_l. The number of task-dependent parameters scales as the number of neurons, which is orders of magnitude smaller than the number of synapses. The Adam optimizer updates the hidden weights with loss gradients computed using the binary weights only. We use a small weight decay of 10⁻⁷ in the Adam optimizer to make the floating-point hidden values more stable around zero. However, consolidated weights are not subject to weight decay, as we implement weight decay as a modification of the loss gradient, which is gradually suppressed by f_meta.
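The folding of batch normalization into a single per-neuron threshold described above can be checked numerically. The following minimal NumPy sketch (with illustrative values for γ and β, not the trained parameters) verifies that Sign(BN(z)) equals Sign(γ) times the sign of z minus one stored threshold:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pre-activations for one layer: batch of 8 samples, 5 neurons.
z = rng.normal(size=(8, 5))
gamma = np.array([1.5, -0.8, 0.6, -1.2, 2.0])   # BN scale (can be negative)
beta = np.array([0.3, -0.5, 0.1, 0.7, -0.2])    # BN shift
eps = 1e-5

mean, var = z.mean(axis=0), z.var(axis=0)
a = gamma * (z - mean) / np.sqrt(var + eps) + beta   # standard batch norm
sign_bn = np.sign(a)                                  # binarized activation

# Folded form: one threshold per neuron, plus the sign of gamma.
theta = mean - beta * np.sqrt(var + eps) / gamma      # task-dependent threshold
sign_folded = np.sign(gamma) * np.sign(z - theta)

assert np.array_equal(sign_bn, sign_folded)
```

Only `theta` and `sign(gamma)` need to be stored per neuron, which is the hardware saving the derivation points out.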

Supplementary Note 2: Training parameters
The batch normalization parameters were not learned for the Fashion-MNIST experiment, whereas they were learned for the CIFAR-10 experiment. For Fashion-MNIST, they are fixed to β = 0, γ = 1: the performance of the BNN with learned batch normalization parameters was inferior, as these parameters appear to overfit the subsets of data. In the CIFAR-10 experiment, the performance was higher with learned batch normalization parameters. The VGG-7 architecture consists of six convolutional layers with 3 × 3 kernels.

Supplementary Note 3: Implementation of Synaptic Intelligence
In this Supplementary Note, we discuss the implementation of the Synaptic Intelligence algorithm [4], designed for continual learning in full precision neural networks. When learning task µ, the algorithm consists in optimizing the loss function

L̃_µ = L_µ + c Σ_k Ω^µ_k (θ̃_k − θ_k)²,

where L_µ is the loss function associated with the current task and c Σ_k Ω^µ_k (θ̃_k − θ_k)² is a "surrogate loss" [4] compelling the current parameters θ_k to stay close to the parameters θ̃_k optimized for previous tasks. Ω^µ_k is the importance factor for parameter θ_k and is updated between tasks as

Ω^µ_k = Σ_{ν<µ} ω^ν_k / ((Δ^ν_k)² + ξ),

where Δ^ν_k is a normalization factor equal to the total change of parameter k over the latest learned task, and ξ is a small constant avoiding any division by zero. ω^ν_k is computed in an online fashion by approximating the path integral of the loss gradient along the parameter trajectory, and can be interpreted as the parameter-specific contribution to changes in the total loss.
As a control experiment, we reproduce the results of [4] for the permuted MNIST benchmark in Suppl. Fig. 2(a), with c = 0.1 and ξ = 0.1. In the case of binarized neural networks, we tried several ways of computing the importance factor Ω^µ_k, employing either the binarized weights or the hidden weights for ω^µ_k and Δ^ν_k. The best performance was achieved by using the binarized weight values for both (Suppl. Fig. 2(b)).

(a) Synaptic Intelligence applied to full precision neural networks with two hidden ReLU layers of sizes ranging from 512 to 4,096, for the permuted MNIST benchmark; results reproduced from [4]. (b) The best performing adaptation of Synaptic Intelligence to binarized neural networks (see Suppl. Note 3). The curves are averaged over five runs and shadows stand for one standard deviation.
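As an illustration of the bookkeeping described above, here is a minimal NumPy sketch of Synaptic Intelligence on a toy quadratic task. The task, learning rates, and variable names are illustrative choices of ours, not the experiment's actual settings:

```python
import numpy as np

rng = np.random.default_rng(1)

def grad(theta, target):
    """Gradient of a toy quadratic task loss 0.5 * ||theta - target||^2."""
    return theta - target

d, lr, c, xi = 10, 0.1, 0.1, 0.1
theta = np.zeros(d)
omega = np.zeros(d)            # online path-integral accumulator
Omega = np.zeros(d)            # consolidated importance factor
theta_ref = theta.copy()       # parameters at the last task boundary

# --- Task 1: plain SGD while accumulating omega online ---
target1 = rng.normal(size=d)
for _ in range(200):
    g = grad(theta, target1)
    step = -lr * g
    omega += -g * step          # per-parameter contribution to the loss decrease
    theta += step

# --- Task boundary: update importance, reset accumulators ---
delta = theta - theta_ref      # total parameter change over the task
Omega += omega / (delta**2 + xi)
theta_ref = theta.copy()
omega[:] = 0.0

# --- Task 2: task gradient plus gradient of the surrogate loss ---
target2 = rng.normal(size=d)
for _ in range(200):
    g = grad(theta, target2) + 2 * c * Omega * (theta - theta_ref)
    theta += -lr * g

# Important parameters are pulled back toward theta_ref.
assert np.all(Omega >= 0)
```

The surrogate term keeps the final parameters between the two targets, weighted by each parameter's importance.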

Supplementary Note 4: Use of a Metaplasticity Function f meta Featuring a Hard Threshold
In this note, we present a control experiment where the modulating function f_meta is a hard threshold function, such that f_meta(m, W^h) = 1 if |W^h| ≤ m and f_meta(m, W^h) = 0 otherwise: a hidden weight whose magnitude exceeds the threshold is irreversibly consolidated. The hyperparameter m is, in this case, the threshold value above which f_meta is zero. The value of m is obtained by hyperparameter tuning and set to m = 0.4.
We observe that the performance is modestly degraded when using such a threshold mechanism, in accordance with the theoretical evidence that high hidden weights correspond to binarized weights that are important to consolidate. The largest degradation is observed in the regime where the neural network is closest to its capacity in number of tasks (the network with 4,096-wide layers trained on nine or ten tasks).
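To make the two consolidation rules concrete, the sketch below implements a single hidden-weight update under both a smooth f_meta and the hard-threshold control. The functional form 1 − tanh²(m·W^h) and the convention of attenuating only the updates that shrink |W^h| are assumptions for illustration:

```python
import numpy as np

def f_meta_smooth(m, w_hidden):
    # Smooth function class (assumed form): decays with |w_hidden|.
    return 1.0 - np.tanh(m * w_hidden) ** 2

def f_meta_hard(m, w_hidden):
    # Control: hard threshold, weights beyond m are irreversibly consolidated.
    return np.where(np.abs(w_hidden) > m, 0.0, 1.0)

def metaplastic_step(w_hidden, grad, lr, m, f_meta):
    """One hidden-weight update. Updates that would shrink |w_hidden|
    (i.e. threaten a sign flip) are attenuated by f_meta; updates that
    grow |w_hidden| pass through unchanged (assumed convention)."""
    shrink = w_hidden * grad > 0          # descent step opposes w_hidden
    factor = np.where(shrink, f_meta(m, w_hidden), 1.0)
    return w_hidden - lr * factor * grad

w = np.array([0.1, 2.0, -2.0])
g = np.array([1.0, 1.0, 1.0])            # descent pushes all weights down

w_soft = metaplastic_step(w, g, lr=0.1, m=1.35, f_meta=f_meta_smooth)
w_hard = metaplastic_step(w, g, lr=0.1, m=0.4, f_meta=f_meta_hard)

# Large consolidated weights barely move (soft) or do not move at all (hard).
assert w_hard[1] == 2.0                                # hard threshold: frozen
assert abs(w_soft[1] - 2.0) < abs(w_soft[0] - 0.1)     # soft: strongly attenuated
```

The hard threshold freezes consolidated weights irreversibly, whereas the smooth form only makes sign flips increasingly unlikely.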

Supplementary Figure 2.
Comparison of different choices of f_meta when training ten permuted MNIST tasks. This plot compares two classes of f_meta functions. The bullets represent the metaplastic BNNs with the function class introduced in the body text with m = 1.35, while the squares denote an f_meta function with a hard threshold above which a weight is irreversibly consolidated. The threshold value is tuned to 0.4. The colors denote increasing network sizes. The curves are averaged over five runs and shadows stand for one standard deviation.

Supplementary Note 5: Mathematical proofs
Definition 1 (Quadratic Binary Task). Consider the loss function:

C(W) = ½ (W − W*)ᵀ H (W − W*),   (4)

with H ∈ R^{d×d} a symmetric positive definite matrix. Gradients are given by g(W) = H · (W − W*). We assume the following optimization scheme:

W^h_{t+1} = W^h_t − η g(Sign(W^h_t)),   (5)

where Sign returns the sign of a vector component-wise.

Lemma 1 (Condition for hidden weight confinement). Let W^h optimize a quadratic binary task according to the dynamics of Eq. (5). Let B∞ be the open unit ball for the infinity norm and B̄∞ its closure. Then:

W* ∈ B∞ ⇒ ∃ C > 0, ∀ t ∈ N, ‖W^h_t‖∞ < C,   (6)

W* ∉ B̄∞ ⇒ ‖W^h_t‖∞ → +∞ as t → +∞.   (7)

Proof of Lemma 1. We first prove Eq. (7). Assume that W* ∉ B̄∞, so that there exists at least one component i ∈ ⟦1, d⟧ such that |W*_i| > 1. Since H is symmetric positive definite, it is invertible. Taking the Euclidean scalar product of H⁻¹e_i with the update W^h_{t+1} − W^h_t yields:

⟨H⁻¹e_i, W^h_{t+1} − W^h_t⟩ = −η ⟨H⁻¹e_i, H (Sign(W^h_t) − W*)⟩ = −η ⟨e_i, Sign(W^h_t) − W*⟩ = η (W*_i − Sign(W^h_{i,t})),   (8)

where we have used that H⁻¹ is also symmetric. Since |W*_i| > 1, the sign of Sign(W^h_{i,t}) − W*_i is constant (and nonzero), so the component of W^h along H⁻¹e_i diverges. More precisely, assume W*_i > 1, so that W*_i − Sign(W^h_{i,t}) ≥ W*_i − 1 > 0. Summing Eq. (8) from time step 0 to t yields:

⟨H⁻¹e_i, W^h_t⟩ ≥ ⟨H⁻¹e_i, W^h_0⟩ + η t (W*_i − 1),

showing that lim_{t→+∞} ⟨H⁻¹e_i, W^h_t⟩ = +∞. Consequently, since ⟨H⁻¹e_i, W^h_t⟩ ≤ ‖H⁻¹e_i‖₁ ‖W^h_t‖∞, we get lim_{t→+∞} ‖W^h_t‖∞ = +∞. Similarly, if W*_i < −1, we obtain:

⟨H⁻¹e_i, W^h_t⟩ ≤ ⟨H⁻¹e_i, W^h_0⟩ − η t (|W*_i| − 1),

giving the same conclusion as above.
We now prove Eq. (6). Assume that W* ∈ B∞, i.e. ∀ i ∈ ⟦1, d⟧, |W*_i| < 1. Denoting ⟨x, y⟩_{H⁻¹} = xᵀ H⁻¹ y and ‖·‖_{H⁻¹} the associated norm, we have:

‖W^h_{t+1}‖²_{H⁻¹} = ‖W^h_t‖²_{H⁻¹} + 2 ⟨W^h_t, ΔW^h_t⟩_{H⁻¹} + ‖ΔW^h_t‖²_{H⁻¹}, with ΔW^h_t = −η H (Sign(W^h_t) − W*),

so that ‖W^h_t‖_{H⁻¹} decreases as soon as:

−2 ⟨W^h_t, ΔW^h_t⟩_{H⁻¹} > ‖ΔW^h_t‖²_{H⁻¹}.   (11)

We want to show that if W^h_t is large enough in the norm ‖·‖_{H⁻¹}, Eq. (11) is met. First note that, because the dimension is finite, there exist two constants α > 0 and β > 0 such that ∀ x ∈ R^d, α ‖x‖∞ ≤ ‖x‖_{H⁻¹} ≤ β ‖x‖∞. For the left-hand side of Eq. (11):

−⟨W^h_t, ΔW^h_t⟩_{H⁻¹} = η (W^h_t)ᵀ (Sign(W^h_t) − W*) = η (‖W^h_t‖₁ − ⟨W^h_t, W*⟩) ≥ η (1 − ‖W*‖∞) ‖W^h_t‖₁ ≥ η (1 − ‖W*‖∞) ‖W^h_t‖∞,

which is a positive constant (since ‖W*‖∞ < 1) times the infinity norm of W^h_t. For the right-hand side, denoting (e_α)_α and (λ_α)_α the eigenbasis of H and the associated eigenvalues, we have by the Cauchy–Schwarz inequality:

‖ΔW^h_t‖²_{H⁻¹} = η² (Sign(W^h_t) − W*)ᵀ H (Sign(W^h_t) − W*) ≤ η² max_α λ_α ‖Sign(W^h_t) − W*‖² ≤ η² max_α λ_α · d (1 + ‖W*‖∞)² =: M,

so the right-hand side of Eq. (11) is bounded by a constant M independent of t. So far we have shown that the left-hand side of Eq. (11) is lower bounded by a positive constant times the infinity norm of W^h_t, while the right-hand side is bounded. Therefore, to ensure Eq. (11) it suffices that:

‖W^h_t‖∞ > R := M / (2 η (1 − ‖W*‖∞)).

And because the update ΔW^h_t is bounded in the norm ‖·‖_{H⁻¹}, an absolute upper bound of ‖W^h_t‖_{H⁻¹} is:

‖W^h_t‖_{H⁻¹} ≤ max( ‖W^h_0‖_{H⁻¹}, β R + sup_t ‖ΔW^h_t‖_{H⁻¹} ).

Thus we have proven that W* ∈ B∞ ⇒ ∃ C > 0, ∀ t ∈ N, ‖W^h_t‖∞ < C. □

Lemma 2 (Hidden weight trajectory). Let W^h optimize a quadratic binary task according to the dynamics of Eq. (5), with H diagonal, H = diag(h_1, …, h_d) (the setting in which the dynamics decouples component-wise). Then, for every component i such that |W*_i| > 1:

lim_{t→+∞} W^h_{i,t} / t = η h_i (W*_i − Sign(W*_i)).

Proof of Lemma 2. With H diagonal, the dynamics of W^h_t defined in Eq. (5) rewrites component-wise:

W^h_{i,t+1} = W^h_{i,t} + η h_i (W*_i − Sign(W^h_{i,t})).   (14)

By Lemma 1, components W^h_i such that |W*_i| < 1 are bounded.

For components i where |W*_i| > 1, ΔW^h_{i,t} has the sign of W*_i, since Eq. (14) rewrites:

W^h_{i,t+1} − W^h_{i,t} = η h_i Sign(W*_i) ( |W*_i| − Sign(W*_i) Sign(W^h_{i,t}) ), with |W*_i| − Sign(W*_i) Sign(W^h_{i,t}) ≥ |W*_i| − 1 > 0,

so that W^h_{i,t} necessarily ends up having the same sign as W*_i; hence there exists t_{0,i} ∈ N such that:

∀ t ≥ t_{0,i}, Sign(W^h_{i,t}) = Sign(W*_i).

By definition of t_{0,i}, W^h_{i,t} and W*_i have opposite signs before t_{0,i}. Therefore, summing Eq. (14) between 0 and t ≥ t_{0,i} yields:

W^h_{i,t} = W^h_{i,0} + η h_i t_{0,i} (W*_i + Sign(W*_i)) + η h_i (t − t_{0,i}) (W*_i − Sign(W*_i)),   (18)

and dividing by t shows that lim_{t→+∞} W^h_{i,t} / t = η h_i (W*_i − Sign(W*_i)), which proves the lemma. □

Theorem 1 (Importance of hidden weights in a quadratic binary task). Let W^h optimize a quadratic binary task with diagonal H according to the dynamics of Eq. (5). Then, for any component i such that |W*_i| > 1, the variation of loss resulting from flipping Sign(W^h_{i,t}) (for t large enough) is:

ΔC_i = 2 h_i + (2/η) lim_{t→+∞} |W^h_{i,t}| / t,

so that the binary weights whose hidden weights grow fastest are also those whose flip increases the loss the most.

Proof of Theorem 1. Using Eq. (4) with H diagonal, the loss evaluated at the binary weights reads:

C(Sign(W^h)) = ½ Σ_i h_i (Sign(W^h_i) − W*_i)².

Using Lemma 2, for all components i such that |W*_i| > 1, there exists t_{0,i} such that for all t > t_{0,i}, Sign(W^h_{i,t}) = Sign(W*_i), and the corresponding term of the loss equals ½ h_i (|W*_i| − 1)². The increase in loss if such a binary component is switched is therefore:

ΔC_i = ½ h_i (|W*_i| + 1)² − ½ h_i (|W*_i| − 1)² = 2 h_i |W*_i|.   (20)

Using the explicit form of W^h_{i,t} in Eq. (18) along with Eq. (20): since W^h_{i,t} has the same sign as W*_i for t large enough, multiplying both sides of Eq. (18) by Sign(W*_i) and dividing by t yields lim_{t→+∞} |W^h_{i,t}| / t = η h_i (|W*_i| − 1), so that:

ΔC_i = 2 h_i |W*_i| = 2 h_i + (2/η) lim_{t→+∞} |W^h_{i,t}| / t. □
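The statements of Lemmas 1 and 2 can be illustrated numerically. The sketch below simulates the dynamics of Eq. (5) for a diagonal H with one target component inside the unit ball and two outside; the confined component stays bounded while the others grow linearly at the predicted rate (the specific values of h, W*, and η are arbitrary illustrative choices):

```python
import numpy as np

# Diagonal quadratic binary task: C(W) = 0.5 * sum(h * (W - w_star)**2),
# with gradients evaluated at the binarized weights sign(w_h).
h = np.array([1.0, 2.0, 0.5])
w_star = np.array([0.3, 1.8, -1.5])   # first component inside B_inf, others outside
eta = 0.01
T = 20000

w_h = np.zeros(3)
traj = np.empty((T, 3))
for t in range(T):
    grad = h * (np.sign(w_h) - w_star)
    w_h -= eta * grad
    traj[t] = w_h

# Lemma 1: the component with |w_star| < 1 stays confined, while
# components with |w_star| > 1 diverge.
assert np.abs(traj[:, 0]).max() < 2.0
assert np.abs(w_h[1]) > 10.0 and np.abs(w_h[2]) > 10.0

# Lemma 2: divergent components grow linearly at rate
# eta * h_i * (w_star_i - sign(w_star_i)).
rate = w_h[1] / T
assert np.isclose(rate, eta * h[1] * (w_star[1] - 1.0), rtol=0.05)
```

The confined component oscillates around zero, while the divergent ones behave as predicted by Eq. (18).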

Supplementary Note 6: Comparison Between the Hidden Weights of Binarized Neural Networks and the Weights of Full Precision Networks
In this supplementary note, we illustrate on a 2-D optimization task how the hidden weights of a binarized model differ from the usual full precision weights, and why the former are good candidates for synaptic consolidation. The color map of Suppl. Fig. 3 shows a 2-D landscape where the darker the color, the higher the cost. The global minimum is denoted in red by W*. The binarized model in Suppl. Fig. 3a has two parameters given by the signs of two hidden weights, W^h_x and W^h_y. The binarized model can thus be in four different states, given by the corners of the unit ball for the infinity norm, B∞. If we consider a case where W* is not a corner of B∞, the binarized model cannot converge. Instead, the hidden weights keep being updated by gradients evaluated at the binarized parameters. We show in Supplementary Note 5 that the vector W^h diverges if W* is outside B∞, a result that holds in any dimension of the problem. We intuitively see in Suppl. Fig. 3a that because W*_y is between -1 and 1, and W*_x > 1, the binarized value of W^h_x is more important than the binarized value of W^h_y with respect to optimization. On the other hand, Suppl. Fig. 3b shows the same optimization problem solved by a full precision model. The model converges to W*, and contrary to the binarized case, the knowledge of the final state of W_x and W_y cannot be leveraged to learn a second task.

Supplementary Note 7: Learning Rate Decay
In this supplementary note, we investigate the performance of a learning rate decay scheduler, to see how it compares to our metaplastic binarized neural network approach. We study the setting of learning six permuted MNISTs and investigate learning rate schedulers where the learning rate is divided by a constant factor between each task. We list in Supplementary Table 4 the performance for several values of the initial learning rate and dividing factor. For instance, an initial learning rate of 10⁻² and a dividing factor of 10 mean that the six tasks are learned with the learning rates 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶, and 10⁻⁷, respectively.

Supplementary Table 4. Permuted MNIST experiment with learning rate decay. Rows are indexed by (initial LR, dividing factor). The accuracy for each task is averaged over five runs and the standard deviation is given in parentheses. The best settings are in bold font.
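The schedule described above is straightforward to express in code; the small helper below (a hypothetical name, for illustration only) generates the per-task learning rates:

```python
import math

def task_learning_rates(initial_lr, dividing_factor, num_tasks):
    """Learning rate used for each sequentially learned task: the rate is
    divided by a constant factor between consecutive tasks."""
    return [initial_lr / dividing_factor**k for k in range(num_tasks)]

rates = task_learning_rates(1e-2, 10, 6)
expected = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7]
assert all(math.isclose(r, e) for r, e in zip(rates, expected))
```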

Supplementary Note 8: Increasing Synapse Complexity for Steady-State Continual Learning
In this supplementary note, we show that one limitation of the metaplasticity model presented in the main article can be alleviated by considering a more complex synaptic model, inspired by the metaplasticity model of Benna and Fusi [5]. The metaplasticity model presented in the main article has an aging property that prevents it from learning new tasks after a finite number of tasks has been learned (depending on the capacity of the network). For instance, Supplementary Figure 4(b) shows the accuracies of ten tasks learned by the metaplasticity model introduced in the main article. While the network successfully learns up to seven tasks, the last three tasks are not properly learned, because all the weights have been consolidated. The type of continual learning achieved by this network is therefore not steady-state, consistent with much of the machine learning literature on continual learning [6], but unlike the brain.
The metaplasticity model introduced in [5] describes synapses with several hidden variables interacting over a wide range of timescales through diffusion processes. The slowest variable features a leakage term, which makes it possible to reach a steady-state type of consolidated learning, where the newest memories can replace the earliest ones. Here, we propose training binarized neural networks with synapses featuring not a single hidden weight, but a collection of hidden variables interacting over a wide range of timescales in a way inspired by [5]. This approach can have several benefits. First, the hidden weights tend to evolve stochastically in conventional binarized neural networks due to the stochastic nature of data batches. In our original metaplasticity approach, this means that if a hidden weight is carried too far away from zero by this noise, it will be consolidated. The more complex synapses inspired by [5] can provide a cleaner signal for weight consolidation and constitute promising candidates to solve this issue of noise-induced consolidation. Second, and more importantly, thanks to the leakage on the slowest variable, we hope to provide the binarized neural network with a truly steady-state form of continual learning.
In our model, each synapse features four hidden variables (W^h_1, W^h_2, W^h_3, and W^h_4), which evolve through a chain of diffusion processes inspired by [5]: W^h_1 receives the loss gradients, and each variable relaxes toward its neighbors in the chain, with timescales increasing from W^h_1 to W^h_4. Our model differs from [5] in two ways. First, the gradient update of W^h_1 is modulated by f_meta when sign(W^h_1(t)) · sign(W^h_4(t)) > 0, because in our setting W^h_1 can change rapidly with respect to the long timescale of W^h_4; we introduce f_meta to accommodate this timescale asymmetry. Another difference comes from the sequential synaptic updates, which follow the gradient of a loss function and are therefore highly correlated on shorter timescales. For this reason, the influence of the slowest variable on W^h_1 through the diffusion chain cannot effectively protect it from the correlated gradients of the new task. We thus add a unidirectional feedback connection, parameterized by α (Suppl. Fig. 4(a)), between the slowest variable and W^h_1 to provide better consolidation. These two modifications allow W^h_4 to be more stable on the longer timescales of our setup, while allowing W^h_1 to react on its shorter ones. Our results, presented in Supplementary Figure 4(c) and discussed in the main article, show that binarized neural networks featuring such complex synapses can learn tasks sequentially, similarly to our simpler synapse model, and that, in addition, new tasks can be learned while older tasks are gradually forgotten. The histograms of hidden variables (Supplementary Figure 4(e)) also show that weights do not accumulate to high values.
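For concreteness, the sketch below implements one possible discrete-time version of such a multi-variable synapse: a diffusion chain with timescales growing along the chain, a feedback term α from the slowest variable, a leak enabling steady-state forgetting, and an f_meta-style attenuation. The coupling constants, the exact form of the consolidation condition, and the leak rate are all assumptions for illustration, not the parameters used in our experiments:

```python
import numpy as np

def chain_step(u, grad, eta=0.1, g=0.25, alpha=0.05, m=1.0):
    """One update of a four-variable synapse chain (illustrative sketch).
    u[0] is the fast hidden weight, u[3] the slowest variable.
    Couplings g, feedback alpha, and the f_meta form are assumptions."""
    u = u.copy()
    n = len(u)
    # Metaplastic modulation of the gradient on the fastest variable:
    # attenuated when the update opposes a consolidated slow variable.
    if u[0] * grad > 0 and u[0] * u[-1] > 0:
        f = 1.0 - np.tanh(m * u[-1]) ** 2
    else:
        f = 1.0
    u[0] += -eta * f * grad + alpha * (u[-1] - u[0])  # feedback from slow variable
    # Diffusion along the chain, with timescales growing as 2**k.
    for k in range(1, n):
        tau = 2.0 ** k
        u[k] += (g / tau) * (u[k - 1] - u[k])
        if k < n - 1:
            u[k] += (g / tau) * (u[k + 1] - u[k])
    # Leak on the slowest variable enables steady-state forgetting.
    u[-1] *= 1.0 - 1e-4
    return u

u = np.zeros(4)
for _ in range(2000):
    u = chain_step(u, grad=-1.0)   # a persistent gradient pushes the weight up

# The slow variables follow the fast one with increasing lag and smoothness.
assert u[0] > u[1] > u[2] > u[3] > 0.0
```

Each deeper variable is a low-pass filtered copy of the previous one, which is the "cleaner consolidation signal" property discussed above.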
We then show in Suppl. Fig. 5(a) how the model performs when a sequence of 20 tasks is learned. In this situation, the system reaches a "true" steady state, as observed by plotting the distributions of the hidden variables in Suppl. Fig. 5(b), superimposed over the three most recent tasks. We find that the capacity of the model in this true steady-state regime is reduced compared to the more transient regime observed during the first ten tasks in Suppl. Fig. 4: the accuracy of the last learned tasks drops more rapidly in Suppl. Fig. 5(a) than in Suppl. Fig. 4(c). This result is in accordance with the literature on this type of truly steady-state learning [5, 7, 8].

Supplementary Figure 4. (c) Same architecture but with our new algorithm with four hidden variables. The model is still able to learn several tasks sequentially, but older tasks are gradually forgotten and new tasks can always be learned. The curves are averaged over five runs and shadows stand for one standard deviation. (d) Trajectories of the hidden variables as a function of training iterations. The deeper the hidden variable, the slower and smoother its behavior, providing a cleaner signal for consolidation. (e) Distribution of the hidden variables after learning ten tasks. Unlike the distribution presented in the body text, hidden weights do not accumulate to ever-increasing values, and new tasks can always be learned.

Supplementary Figure 5. Steady-state regime. (a) Test accuracy on the ten most recent tasks when learning a long sequence of tasks. The stationary regime exhibits graceful forgetting, where the oldest tasks are forgotten and new tasks can always be learned. The curves are averaged over five runs and shadows stand for one standard deviation. (b) Distribution of hidden variables for each layer (horizontally) and each hidden variable (vertically). Distributions are superimposed over the three most recent tasks. We observe that the steady state has been reached.

Finally, we investigate the impact of removing the feedback process linking the slowest hidden variable to the first one, in multiple situations. We let the hidden variables evolve only through the main connections and remove the feedback process, setting α = 0 and f_meta = 1 in Suppl. Fig. 4(a). The results are listed in Supplementary Table 5 for 21 values of the synapse parameters, covering cases with more hidden variables and/or slower timescales. In all these situations, we observe some memory signal for Tasks 8 and 9, with varying accuracy depending on the parameter choice. However, the accuracy on Task 7 always falls back to near-random guessing, suggesting that catastrophic forgetting remains strong in the absence of our model modifications. This result is consistent with our interpretation that the influence of the slowest (last) hidden variable over the fastest one through the main connections is too weak to protect the first variable from the strongly correlated gradients of the current task.

Supplementary Note 9: Sequential Training of the MNIST and Fashion-MNIST Datasets
To test the ability of our binarized neural network to learn several tasks sequentially, we train it on two tasks in a more difficult situation than permuted MNISTs. When learning permuted versions of MNIST, the relevant input features do not overlap extensively between tasks, which makes sequential learning easier. For this reason, we now train a binarized neural network with two hidden layers of 4,096 units to learn sequentially the MNIST dataset and the Fashion-MNIST dataset [9], which consists of images of fashion items belonging to ten classes (results in Suppl. Fig. 6). Curves are averaged over five runs and shadows correspond to one standard deviation.

Supplementary Note 10: Sequential Training of the MNIST and USPS Datasets
In this note, we investigate the sequential training of two closely related tasks: the handwritten digits of the MNIST (Suppl. Fig. 7(a)) and of the United States Postal Service (USPS, Suppl. Fig. 7(b)) datasets. This situation differs from permuted MNIST (Fig. 2 in the main body text), sequential Fashion-MNIST / MNIST (Suppl. Note 9), and incremental CIFAR-10/CIFAR-100 (Suppl. Note 11), where the sequentially trained tasks were always largely uncorrelated in nature. We compare the accuracy of a metaplastic binarized neural network trained sequentially on MNIST and USPS with that of two networks trained independently on each task, featuring either half the number of hidden neurons (Fig. 4(c) in the main body text) or half the number of parameters (Fig. 4(d) in the main body text) of the metaplastic network. This comparison verifies whether, in this situation, a metaplastic network performs better than a network partitioned into two parts, with each partition trained on one task independently of the other. As the MNIST dataset is much larger than the USPS one, we follow the training protocol introduced in [10] and [11], where 2,000 training examples are used for MNIST and 1,800 for USPS. For this reason, we focus on relatively small neural networks. The small network is a convolutional neural network with three layers of 4 × 4 kernels with increasing feature maps of 6-10-15 (4,056 parameters), while the network with twice as many neurons (Fig. 4(c)) has feature maps 12-20-30 (14,832 parameters). Choosing the dimensions 10-15-20 yields a network with approximately twice as many parameters (8,160) (Fig. 4(d)).

Supplementary Note 11: Class Incremental Learning on CIFAR-10 and CIFAR-100 Features
In this note, we investigate a setting of class-incremental learning, where a network learns different subsets of the classes of the CIFAR-10 and CIFAR-100 datasets sequentially. We focus on the sequential training of the fully connected layers of a convolutional neural network. This choice is motivated by the fact that the ability to extract features from visual input presumably does not change over time: for instance, one does not usually forget how to recognize shapes, but one can forget more abstract concepts.
To extract relevant features from the CIFAR-10 and CIFAR-100 datasets, we therefore use the convolutional layers of a ResNet-18 network [12], pretrained on the ImageNet dataset and available in the PyTorch 1.1.0 library. This choice ensures that the feature extractor is fairly general, without having been trained on CIFAR images. We create a feature-extracted dataset of CIFAR-10 and CIFAR-100 by resizing CIFAR images from 32 × 32 to 220 × 220 pixels, and applying random crops of a 200 × 200 window as well as random horizontal flips. We then perform ten passes through the training set of each dataset, resulting in 500,000 training images for each training set. We perform only one pass through the test sets and do not apply data augmentation (we only resize the test images to 220 × 220 pixels and center-crop them to 200 × 200 pixels). The features obtained by this procedure are 512-dimensional vectors.
The architectures we use for learning the extracted features are binarized multilayer perceptrons of dimensions 512-2048-10 for CIFAR-10, and 512-2048-2048-100 for CIFAR-100. The results are shown in Supplementary Fig. 8 for datasets split into two subsets of classes. The subsets for CIFAR-10 are chosen by grouping together similar classes: subset 1 consists of the vehicle classes and the horse class, while subset 2 consists of the remaining animal classes. For CIFAR-100, the subsets of classes are chosen randomly.
We consider three settings for CIFAR-10 and CIFAR-100. Figs. 8(a) and (d) show the training results for a non-metaplastic setting. We see that, when the network starts learning the second subset of classes, it forgets the first subset of classes rapidly and entirely.
The results for a metaplastic network with task-dependent thresholds are shown in Figs. 8(b) and (e). The metaplasticity parameter m was optimized in each case by hyperparameter grid search. Learning in this situation is highly successful. For CIFAR-10, at the end of learning, the accuracy on both subsets approaches the maximum accuracies reached at the end of the corresponding subphases of Fig. 8(a). For CIFAR-100, the accuracy on the first subset approaches the maximum one reached in Fig. 8(d). The accuracy on the second subset is also very high, but remains below the maximum one reached in Fig. 8(d). These results highlight the applicability of our metaplasticity approach to datasets more sophisticated than MNIST. However, this situation does not correspond to a truly incremental learning situation, as the output is computed using information about which subset is at hand.
For this reason, in Figs. 8(c) and (f), we use the technique of "instance normalization" [13] to avoid the task dependency introduced through the neuron thresholds. In this situation, during testing, the network does not need to know to which subset of classes the presented image belongs. The metaplasticity parameter m was again optimized in each case by hyperparameter grid search. We see that incremental learning is successfully achieved, with final accuracies that nevertheless do not match those obtained with task-dependent thresholds in Figs. 8(b) and (e), highlighting the difficulty of this training situation.
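The distinction between the two normalization schemes can be made concrete with a small sketch: batch normalization at test time relies on stored (hence task-dependent) statistics, whereas instance normalization computes its statistics from each sample on the fly (the values below are illustrative stand-ins, not trained statistics):

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.normal(loc=3.0, scale=2.0, size=(4, 512))   # features for 4 test samples

# Batch normalization at test time uses running statistics collected on a
# given task's data: the stored statistics themselves are task dependent.
task_mean, task_std = 3.0, 2.0                      # stand-in for stored statistics
bn_out = (x - task_mean) / task_std

# Instance normalization computes statistics per sample, so no stored,
# task-dependent quantity is required at test time.
in_mean = x.mean(axis=1, keepdims=True)
in_std = x.std(axis=1, keepdims=True)
in_out = (x - in_mean) / (in_std + 1e-5)

# Each instance-normalized sample is standardized independently.
assert np.allclose(in_out.mean(axis=1), 0.0, atol=1e-7)
assert np.allclose(in_out.std(axis=1), 1.0, atol=1e-3)
```

Because the per-sample statistics replace the stored ones, the network no longer needs to be told which subset of classes an image comes from.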

Supplementary Figure 8. Class-incremental learning on CIFAR-10 features (a,b,c) and CIFAR-100 features (d,e,f), with the following settings: (a,d) non-metaplastic; (b,e) metaplastic with task-dependent neuron activation thresholds through Batch Normalization; (c,f) no dependency on task (Instance Normalization). The curves are averaged over five runs and shadows stand for one standard deviation.