Accurate deep neural network inference using computational phase-change memory

In-memory computing using resistive memory devices is a promising non-von Neumann approach for making energy-efficient deep learning inference hardware. However, due to device variability and noise, the network needs to be trained in a specific way so that transferring the digitally trained weights to the analog resistive memory devices will not result in significant loss of accuracy. Here, we introduce a methodology to train ResNet-type convolutional neural networks that results in no appreciable accuracy loss when transferring weights to phase-change memory (PCM) devices. We also propose a compensation technique that exploits the batch normalization parameters to improve the accuracy retention over time. We achieve a classification accuracy of 93.7% on CIFAR-10 and a top-1 accuracy of 71.6% on ImageNet benchmarks after mapping the trained weights to PCM. Our hardware results on CIFAR-10 with ResNet-32 demonstrate an accuracy above 93.5% retained over a one-day period, where each of the 361,722 synaptic weights is programmed on just two PCM devices organized in a differential configuration.


SUPPLEMENTARY FIGURES
Supplementary Figure 1. ResNet-34 network architecture for ImageNet classification^1. It has 32 convolution layers with 3 × 3 kernels, 3 convolution layers with 1 × 1 kernels, a first convolution layer with 7 × 7 kernels, and a final fully-connected layer. The network has 21,797,672 parameters. The first convolution layer downsamples the input by using a stride of 2 pixels, followed by a max-pooling layer with a kernel size of 3 × 3 and a stride of 2 to downsample the feature maps to a resolution of 56 × 56 pixels. Each residual connection with 1 × 1 convolution and the first layer of ResNet blocks 2, 3, and 4 downsample the input by using a stride of 2 pixels. A global average pooling layer before the final fully-connected layer downsamples the 7 × 7 input to 1 × 1 resolution. The final fully-connected layer computes the output prediction corresponding to 1,000 classes.

Supplementary Figure 2. Impact of different techniques on training with additive noise. a, Additive noise training of the ResNet-32 network on the CIFAR-10 dataset using combinations of different training techniques with η_tr = η_inf = 3.8%. Our training methodology (in black), which implements all the training techniques, achieves the best performance of all the combinations. b, Influence of the weight clip scale parameter α on the test accuracy on CIFAR-10 for different amounts of equal training (η_tr) and inference (η_inf) noise. From this experiment we observe that a clip scale of 2 is optimal for the ResNet-32 network. c, Test accuracy improvement from using each of the three training techniques in additive noise training of the ResNet-32 network on the CIFAR-10 dataset for η_tr = η_inf = 3.8%.

Supplementary Figure 3. Impact of injected noise during training on accuracy with PCM. Test accuracy of ResNet-32 on CIFAR-10 after transfer to PCM synapses for different values of the relative weight noise η_tr used for training. Although a value of η_tr = 3.8%, determined from hardware characterization, was used in the results presented in the main manuscript, a broad range of values of η_tr results in a similar accuracy after transferring the weights to PCM. This shows that only a rough estimate of η_tr is necessary to obtain satisfactory results on PCM. The error bars represent the standard deviation over 25 inference runs averaged over 10 training runs.

Supplementary Figure 5. Effect of PCM nonidealities on CIFAR-10 accuracy retention. Test accuracy of ResNet-32 on CIFAR-10 simulated with different parameters of the PCM model. All these simulations contain the experimentally measured programming noise applied at T_0 = 27.36 s (see Supplementary Note 2). In the legend, "model" denotes that the parameters of the PCM model described in Supplementary Note 2 are used for the corresponding nonideality. a, Mean accuracy with GDC over 25 inference runs, with error bars corresponding to one standard deviation. The results demonstrate that 1/f noise is mainly responsible for the random accuracy fluctuations over time. Drift variability and its dependence on the target conductance are responsible for the monotonic accuracy decrease over time. It can also be seen that the dependence of the mean drift exponent µ_ν on the target conductance state has a more detrimental effect on accuracy than random drift variability (σ_ν = 0) alone. b, Mean accuracy with AdaBS over 25 inference runs, with error bars corresponding to one standard deviation.
AdaBS can almost fully compensate for the dependence of the drift exponent on the target conductance, and only a slight drop of ∼0.2% over 1 year is observed when introducing random drift variability.

Supplementary Figure 6. Effect of random conductance variations on drift compensation methods. Test accuracy of ResNet-32 on CIFAR-10 as a function of the relative standard deviation of conductance variations with AdaBS, GDC, and no compensation method. The aim of this simulation is to define a clear criterion as to how much random conductance variation can be mitigated by the drift compensation methods while still recovering satisfactory network accuracy. We use a generic model for random conductance variations around the ideal target synaptic conductance G^l_{T,ij}: G^l_{ij} = G^l_{T,ij} × N(1, σ_rel²). σ_rel is thus defined as the relative standard deviation of the actual conductance G^l_{ij} with respect to the ideal target conductance G^l_{T,ij}. Using such a generic model allows us to define a universal criterion, independent of the device technology. Moreover, this model approximates fairly well the random conductance variations due to drift variability and 1/f noise in PCM, where the magnitude of the variations is proportional to the programmed conductance value (see Supplementary Equations (16) and (17) of Supplementary Note 2). We performed inference simulations with ResNet-32 on CIFAR-10 using this model of device conductance to represent the weights. We employed GDC, AdaBS, and no compensation during inference with different values of σ_rel. The network was trained by injecting noise with η_tr = 3.8%. The results clearly show that AdaBS compensates for random conductance variations much better than GDC, which performs almost the same as when no compensation technique is used. We define the criterion as the maximum relative conductance variation σ_rel that can be tolerated while maintaining an accuracy higher than 90%. By this definition, GDC can tolerate up to 20% variations, whereas AdaBS can tolerate 32.5%, an improvement of 1.6×. In terms of PCM drift variability, this corresponds approximately to a tolerable drift-exponent standard deviation σ_ν of up to 0.013 for GDC, compared with 0.02 for AdaBS, when considering a time span of 6 orders of magnitude from programming (e.g. from 30 seconds to 1 year). The error bars represent the standard deviation over 25 inference runs.

Supplementary Figure 7. Effect of stuck devices on inference accuracy with additive noise training. Test accuracy of ResNet-32 on CIFAR-10 as a function of the fraction of stuck synapses during inference. We compare the accuracy obtained with weights trained using either the additive noise training technique (with η_tr = 3.8%) or conventional training (i.e. no noise, η_tr = 0%). To simulate stuck-device behavior during inference, we randomly selected a fraction (denoted as the percentage of stuck synapses) of synapses in each layer. Each stuck synapse had its weight set to 0, −W_max, or W_max (W_max is the maximum absolute weight value of a layer). Note that the additive noise training algorithm that we employed did not take any of the stuck devices into account during training; the stuck devices were introduced only during inference. The results show that additive noise training is more robust to stuck faults than conventional training. With conventional training, the accuracy drops to 80% with just 1% of the devices stuck. In contrast, additive noise training allows the accuracy to remain above 90% for up to 4% stuck devices. This shows that additive noise training is effective in making the network robust to weight perturbations during inference, even for perturbations that are quite different from the generic noise applied during training.

Supplementary Figure 9. PCM-based deep learning inference simulator. Schematic representation of the implementation of a fully-connected layer in our TensorFlow simulation framework. The regular TensorFlow matrix multiplication operation is replaced with a custom operation that takes into account the device model presented in Supplementary Note 2. Each custom matrix multiplication is also associated with configurable quantization of the crossbar input/output to simulate digital-to-analog (DAC) and analog-to-digital (ADC) conversions. The drift correction module implements the drift correction techniques GDC (Supplementary Figure 4) and AdaBS (Supplementary Note 3). A similar structure holds for the convolution layers once the convolution kernels are re-arranged to form a matrix, as described in Figure 1b of the main manuscript.

Supplementary Note 1: Fast initial convergence on ImageNet
Before starting the additive noise training of ResNet-34 on the ImageNet dataset with η_tr = 3.8%, we performed additive noise inference with η_inf = 3.8% on the pretrained FP32 weights and achieved a top-1 accuracy of only 1.2%. This accuracy is very low and suggests a priori that it is very hard to train such a network with additive noise. However, repeating the inference on the ResNet-34 network after only 100 mini-batches of additive noise training already leads to a top-1 test accuracy of 65.9%, as shown in Supplementary Figure 10. The main reason for this quick recovery in accuracy is the updating of the batch normalization statistics (µ and σ²). The estimates of the statistics computed during the training phase become very inaccurate as soon as additive noise is injected in the network. Just by performing batch normalization statistics updates for 100 mini-batches, without any other weight update, we are already able to recover a top-1 accuracy of 48.28%. We also recorded the evolution of the parameters of ResNet-34 during the first 100 training updates (parameter updates by backpropagation and batch normalization statistics updates). We computed the evolution of the L2-norm of the difference between the parameters and their initial value. Supplementary Figure 12 shows the L2-norm values, confirming that the parameters changing the most during additive noise training are the weights of the first and last layers, the batch normalization parameters of the first layer, and those of the layers at the beginning of each ResNet block. It also suggests that the additional 17.62% gain in top-1 accuracy during the first 100 training updates that does not come from the batch normalization statistics update is mainly due to the weight updates of the first and last layers.

Supplementary Note 2: PCM characterization and model

Conductance drift
The conductance measured from the PCM is observed to drift over time according to the relation

G(t) = G(t_0) × (t/t_0)^(−ν),    (1)

where G(t_0) is the conductance measured at time t_0 and ν is the drift coefficient^4. The drift is attributed to the structural relaxation of an amorphous volume created after each programming event.
To obtain a reliable estimate of the drift coefficient, we iteratively programmed 10,000 PCM devices to target conductance values, G_T, defined at approximately 23 µs, and measured their temporal evolution up to 10^5 s, a time span of approximately 10 orders of magnitude. We obtained the drift coefficients by fitting the conductance evolution to Supplementary Equation (1). The mean and standard deviation of the extracted drift coefficients are plotted as a function of the target conductance at 23 µs in Supplementary Figure 13, together with the fits given by Supplementary Equations (2) and (3).
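
For illustration, Supplementary Equation (1) can be evaluated directly. The minimal NumPy sketch below uses an illustrative drift coefficient ν = 0.05, which is not one of our fitted values, to show how a programmed conductance decays over a day.

```python
import numpy as np

def drifted_conductance(g_t0, t, t0, nu):
    """Supplementary Equation (1): G(t) = G(t0) * (t / t0)^(-nu).

    g_t0 : conductance (uS) measured at the reference time t0 (s)
    t    : time (s) at which the conductance is evaluated, t >= t0
    nu   : drift coefficient (dimensionless)
    """
    return g_t0 * (t / t0) ** (-nu)

# Example: a device programmed to 10 uS at t0 = 20 s, read one day later
g_day = drifted_conductance(10.0, 20.0 + 86400.0, 20.0, 0.05)  # ~6.6 uS
```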

Read noise
PCM is known to exhibit 1/f^γ noise^5. The power spectral density, S_G, of this colored noise is given by

S_G(f) = Q G² / f^γ,    (4)

where G is the device conductance and Q is a factor determined by the phase configuration within the PCM. The variance of the read noise, σ²_nG, can be estimated by integrating S_G over the frequency range of the measurement. For γ = 1,

σ_nG = G Q_s √(ln(f_max / f_min)),    (5)

where Q_s = √Q. f_max is determined by the read pulse duration (T_read = 250 ns) in our experimental platform, and f_min is determined by the time over which the noise is integrated. To estimate the read noise variance of the devices, the 10,000 devices iteratively programmed at 23 µs were read 50 times within a time window of T = 90 s, which was repeated until 10^5 s.
The fifty reads over the interval T were used to determine the read noise standard deviation as a function of time and conductance, with f_max = 1/(2 T_read) and f_min = 1/(T + T_read). The average value of γ over the conductance range of operation was approximated to be 1.21 from an independent measurement. Q_s can be estimated from this information as a function of target conductance (Supplementary Figure 14a) and time (Supplementary Figure 14b). The behavior is captured using a relation of the form

Q_s(G_T, t) = (K / G_T^α) (t/t_1)^(ν_q),    (6)

where K = 0.0710, α = 0.618, and t_1 = 1.089 × 10^5 s. ν_q is observed to depend on the target conductance, as shown in Supplementary Figure 14c together with its fit line (Supplementary Equation (7)). The match between the estimated and modeled read noise for different target conductance values is shown in Supplementary Figure 14d. While the read noise data points in Supplementary Figure 14d are based on noise integrated in a fixed time window of 90 s, for inference experiments we are interested in the read noise with respect to the initial programmed conductance values. Hence, the read noise needs to be integrated from the point of iterative programming to the time instant at which we perform the inference^6. The read noise predicted by the model based on this growing time window is also shown in Supplementary Figure 14d.
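
The γ = 1 read-noise expression above is straightforward to evaluate. The sketch below assumes an illustrative constant Q_s = 0.02 rather than the fitted, state- and time-dependent relation.

```python
import numpy as np

T_READ = 250e-9  # read pulse duration (s), as in the text

def read_noise_std(g, t_window, q_s):
    """sigma_nG = G * Q_s * sqrt(ln(f_max / f_min)) for gamma = 1,
    with f_max = 1 / (2 * T_read) and f_min = 1 / (T + T_read)."""
    f_max = 1.0 / (2.0 * T_READ)
    f_min = 1.0 / (t_window + T_READ)
    return g * q_s * np.sqrt(np.log(f_max / f_min))

# Example: 5 uS device, 90 s integration window, assumed Q_s = 0.02
sigma = read_noise_std(5.0, 90.0, 0.02)  # ~0.44 uS
```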

Mapping the model parameters from 23 µs to 20 s
Inference experiments in the main article were performed based on target conductance values defined at approximately 20 s. Since the drift coefficients and read noise parameters are expected to be independent of conductance drift, we mapped the models of Supplementary Equations (2), (3), (6), and (7) to target conductance values defined at 20 s using the drift model. The modified equations, expressed in terms of the target conductance defined at 20 s, are given by Supplementary Equations (8)–(11) (see also Supplementary Figure 15).

Programming noise
The conductance drift and the slowly evolving 1/f^γ noise make it quite complex to accurately determine the programming noise of the PCM. Hence, the programming noise was determined so as to compensate for the difference between the variability predicted by the read noise and drift models and that exhibited by the devices. We applied two incremental programming noises in a state-dependent manner at two initial time points (T_0, T_1). The standard deviation of the programming noises as a function of the target conductance values is shown in Supplementary Figure 16. The noise applied at T_0 is the experimentally measured programming noise at 27.36 s reported in Figure 3b of the main manuscript. The noise applied at T_1 captures additional short-term relaxation effects occurring after iterative programming^7. The subsequent temporal evolution of the device was predicted using the drift and read noise models (Supplementary Equations (8)–(11)).
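
A hedged sketch of how such state-dependent programming noise can be applied in simulation is given below. The quadratic fit is a placeholder with made-up coefficients; it is not the measured curve of Supplementary Figure 16.

```python
import numpy as np

rng = np.random.default_rng()

def apply_programming_noise(g_target, sigma_of_g):
    """Perturb each target conductance with a Gaussian whose standard
    deviation depends on that device's own target conductance."""
    sigma = sigma_of_g(g_target)
    return g_target + rng.normal(0.0, sigma)

# Placeholder state-dependent fit (illustrative coefficients only)
sigma_fit = lambda g: np.maximum(0.26 + 0.12 * g - 0.008 * g ** 2, 0.0)

g_noisy = apply_programming_noise(np.array([2.0, 5.0, 8.0]), sigma_fit)
```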

Model response
Using the model equations, the PCM conductance evolution as a function of time was simulated with f_min = 1/(t + T_read). For all t ≥ T_1, the modeled conductance is G(t) = G_drift(t) + N(0, σ_nG).
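
Putting the pieces together, a minimal simulation of the model response for t ≥ T_1 could look as follows; the values of ν and Q_s are illustrative constants rather than our fitted, state-dependent parameters.

```python
import numpy as np

rng = np.random.default_rng()
T_READ = 250e-9  # read pulse duration (s)

def model_response(g_t1, t1, t, nu, q_s):
    """For t >= T1: deterministic drift of the conductance programmed at
    T1, plus zero-mean Gaussian read noise integrated over the growing
    window f_min = 1 / (t + T_read)."""
    g_drift = g_t1 * (t / t1) ** (-nu)          # Supplementary Eq. (1)
    f_max = 1.0 / (2.0 * T_READ)
    f_min = 1.0 / (t + T_READ)
    sigma_ng = g_drift * q_s * np.sqrt(np.log(f_max / f_min))
    return g_drift + rng.normal(0.0, sigma_ng)  # G(t) = G_drift(t) + N(0, sigma_nG)

# Example: read-out trace of a 5 uS device at logarithmically spaced times
times = np.logspace(2, 7, 6)                    # 100 s ... 10^7 s
trace = [model_response(5.0, 50.0, t, nu=0.05, q_s=0.02) for t in times]
```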

Supplementary Note 3: Adaptive batch normalization statistics (AdaBS) update technique
Drift in a PCM device depends on the initial target conductance value of the device and varies across the devices in an array. As a result, the conductance values decrease at different rates across devices. The read noise of the PCM devices also depends on the initial target conductance and on the actual value of the PCM conductance, and has a 1/f^γ frequency dependence. All these effects corrupt the weight distribution, and hence the activation distribution, of a DNN layer, causing the accuracy to degrade over time. The batch normalization layer can be leveraged to correct the activation distribution at the output of a PCM crossbar array, as it can adapt the statistics (i.e. mean and variance) of its inputs (i.e. the outputs of a crossbar) required for normalization. We demonstrate an advanced drift correction technique, adaptive batch normalization statistics (AdaBS), that improves the accuracy retention over time and outperforms the global drift compensation (GDC) method shown in Supplementary Figure 4.

Methodology
Supplementary Figure 17 shows the batch normalization layer operation in both the training and inference phases, along with the added AdaBS calibration. The batch normalization layer behaves differently in the training and inference phases of a DNN. During training, a batch normalization layer normalizes its input to zero mean and unit variance by computing the mean (µ_B) and variance (σ²_B) over a batch of images. The normalized input is then scaled and shifted by γ and β, which are learned through backpropagation. In parallel, a global running mean (µ) and variance (σ²) are computed by exponentially averaging µ_B and σ²_B, respectively, over all the training batches. The estimates of the global mean and variance, µ and σ², are then used during the inference phase. When performing forward propagation during inference, the batch normalization coefficients µ, σ², γ, and β are used for normalization, scale, and shift, without computing the mini-batch mean (µ_B) and variance (σ²_B). The calibration phase of AdaBS consists of recomputing and updating µ and σ² for every layer where batch normalization is present. We recompute µ and σ² by feeding the network a randomly sampled set of mini-batches from the training dataset. In recomputing µ and σ², hyper-parameters such as the mini-batch size (m) and the momentum (p) need to be carefully tuned to achieve the best network accuracy.
For AdaBS calibration, we observed that using an optimal value of the momentum is necessary to achieve a good inference accuracy evolution over time. To this end, we developed an algorithm that estimates the optimal value of the momentum through an empirical analysis. Consider the exponential accumulation of any statistic S of a batch normalization layer as given by Supplementary Equation (18). A momentum value that is too high will result in more contribution from the initial statistic S_0 and a lower contribution from the statistics computed over the n mini-batches used to recalibrate the final statistic S. On the other hand, a momentum value that is too low makes the final statistic dominated by the last few mini-batches, resulting in a noisy estimate.

To find the optimal number of calibration images as a fraction of the training dataset for the ResNet-32 network, we fixed the mini-batch size to m = 200 images. We performed AdaBS calibration on the ResNet-32 network with the weights read from the PCM hardware inference experiment. Supplementary Figure 19 shows the accuracy evolution with AdaBS for different fractions of the training dataset, with the optimal momentum computed using Supplementary Equation (20). From these results, we observe that 5% of the training set is sufficient for the AdaBS calibration of the ResNet-32 network; the accuracy retention with 5% of the training set is comparable to that obtained when using the entire training set.
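
The calibration itself reduces to the exponential accumulation of Supplementary Equation (18) applied to crossbar outputs. A minimal sketch in plain NumPy follows, with an illustrative momentum of 0.9 rather than the optimum given by Supplementary Equation (20).

```python
import numpy as np

def adabs_calibrate(calib_batches, mu0, var0, momentum):
    """Re-estimate the running statistics of one batch normalization
    layer from the crossbar outputs of calibration mini-batches, using
    Supplementary Equation (18): S_n = p * S_{n-1} + (1 - p) * S_B."""
    mu, var = mu0, var0
    for batch in calib_batches:                 # batch shape: (m, features)
        mu = momentum * mu + (1.0 - momentum) * batch.mean(axis=0)
        var = momentum * var + (1.0 - momentum) * batch.var(axis=0)
    return mu, var                              # replace the stored BN statistics

# Example: 13 mini-batches of m = 200 (2,600 images, ~5% of the training set)
batches = [np.random.randn(200, 64) for _ in range(13)]
mu, var = adabs_calibrate(batches, np.zeros(64), np.ones(64), momentum=0.9)
```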

Computational and memory overhead of AdaBS
Better accuracy retention is obtained at the cost of the additional digital operations that AdaBS requires during the calibration phase. Supplementary Figure 20 shows the number of digital computations required in the calibration and inference phases. The forward propagation operations are the operations that all the batch normalization layers require for 10,000 CIFAR-10 images in ResNet-32. The calibration phase operations are the computations required to calibrate the coefficients of all the batch normalization layers in ResNet-32. In forward propagation, neither GDC nor AdaBS incurs any additional computations. During the calibration phase, however, GDC uses fewer computations to obtain its calibration coefficients than AdaBS. The main intuition behind the GDC method is to compute a measure of the change in the spread of the conductance values in a crossbar array by estimating the ratio of the 1-norms of the conductance matrix at two distinct time instants. Because of this, the GDC calibration is independent of the input dataset used by the network, and GDC therefore has a smaller computational overhead during the calibration phase than AdaBS. AdaBS requires more computations during calibration because it computes the first- and second-order moments of the crossbar outputs over a set of images. A batch normalization layer can be implemented with different statistical estimators of center and spread instead of the mean and variance, with small impact on the network accuracy^8. In our computation overhead estimates, we therefore considered the following two versions of batch normalization:

• L2-BN: The original batch normalization layer, with the mean and variance as the statistical estimators of the center and spread, respectively, of the batch normalization layer inputs.
• L1-BN: A batch normalization layer with the mean and the mean absolute deviation as the statistical estimators of the center and spread, respectively, of the batch normalization layer inputs.
Of these two implementations, L1-BN requires fewer multiplication operations to compute its spread estimate, i.e. the mean absolute deviation, than L2-BN. We show in Supplementary Figure 21 that with the L1-BN implementation the ResNet-32 network accuracy and its evolution over time are unaffected compared to the L2-BN implementation.
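
As a quick illustration of the two spread estimators, the sketch below assumes the usual √(π/2) correction that makes the mean absolute deviation match the standard deviation for Gaussian inputs.

```python
import numpy as np

def spread_l2(x):
    """L2-BN spread: standard deviation of the layer inputs."""
    return np.sqrt(np.var(x, axis=0) + 1e-5)

def spread_l1(x):
    """L1-BN spread: mean absolute deviation, scaled by sqrt(pi/2) so
    that it matches the standard deviation for Gaussian inputs; it
    avoids the per-element squaring (multiplications) of L2-BN."""
    mu = np.mean(x, axis=0)
    return np.sqrt(np.pi / 2.0) * np.mean(np.abs(x - mu), axis=0)

x = np.random.randn(200, 64)        # one mini-batch of layer inputs
print(spread_l1(x) / spread_l2(x))  # ratios close to 1 for Gaussian data
```
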
Besides the computational overhead, AdaBS requires additional memory resources to store the calibration images. For CIFAR-10, it needs 32 × 32 × 3 × 2600 = 8 MB, and for ImageNet 224 × 224 × 3 × 1300 = 196 MB. Clearly, this amount of data is too large to be stored in on-chip memory. Hence, it would have to be stored either in off-chip DRAM or in non-volatile memory (Flash, PCM, RRAM, etc.). A simple analysis of the overhead from the off-chip data transfers can be made by assuming off-chip links with an energy consumption close to 1 pJ/bit^9,10. This results in an energy per image of 24.6 nJ for CIFAR-10 and 1.2 µJ for ImageNet. For the overall sizes of the calibration sets of the two datasets, we obtain 64 µJ per calibration for CIFAR-10 and 1.3 mJ per calibration for ImageNet. Further, using 13 Gbps as a conservative estimate of a realistic bandwidth^9,10, we obtain transfer times of 3.2 ms and 78.3 ms for CIFAR-10 and ImageNet, respectively. This overhead is reasonable given that the calibration is expected to be performed at time intervals larger than 10 seconds. Moreover, it can be performed less frequently as time elapses, because the conductance changes due to drift are a logarithmic function of time. Finally, note that the AdaBS calibration could potentially also be done using the images seen by the device during inference, which would avoid relying on pre-stored calibration images.

Supplementary Note 4: Comparison with other noise injection methods for training
To compare the effectiveness of the additive noise injection technique proposed in this manuscript with other noise injection methods, we considered the four noise injection methods described below. We used the same training procedure and hyperparameters as for the experiments presented in the main manuscript, unless specified otherwise.

Noise injection on the input activations
We studied injecting noise on the input activations of every layer during training. As given by Supplementary Equation (21), for any arbitrary layer with weight matrix W, noise is added to the input activations X of the layer, leading to a noisy output Y = W(X + N(0, σ²_in)). The noise is Gaussian distributed with zero mean and variance σ²_in.
We defined the ratio of the standard deviation of the additive noise injected on the activations (σ_in) to the spread of the input activations (c × σ_X) to be the same as η_tr, as given by Supplementary Equation (22).
It follows that σ_in = η_tr × c × σ_X, where σ_X is the standard deviation of the input activations and c is a tunable hyper-parameter for a given layer.
Since there is no evident prior value for c, a reasonable choice is to adapt its value for each convolutional layer such that Y is equal in distribution between additive noise training on the weights and on the input activations. For this, we performed a Kolmogorov–Smirnov test to compare the distributions of Y between both methods, and chose the value of c that led to the smallest Kolmogorov–Smirnov score. We experimentally found that the value of c is the same for the layers within a ResNet block. The values of c that gave the highest accuracy on ResNet-32 after transfer to PCM synapses with η_tr = 3.8%, found in a completely ad-hoc manner, were c = 4 for the first convolution layer, c = 4 for blocks 1 and 2, c = 2 for block 3, and c = 4 for the last fully-connected layer. We also had to clip the weights of the first convolution layer to the range [−0.8 × σ_W, 0.8 × σ_W].
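
A minimal sketch of this method for a fully-connected layer follows (plain NumPy; the matrix shapes and weight scale are illustrative).

```python
import numpy as np

rng = np.random.default_rng()

def forward_with_input_noise(w, x, eta_tr, c):
    """Supplementary Equations (21) and (22): add zero-mean Gaussian
    noise with sigma_in = eta_tr * c * sigma_X to the input activations."""
    sigma_in = eta_tr * c * np.std(x)
    x_noisy = x + rng.normal(0.0, sigma_in, size=x.shape)
    return x_noisy @ w                          # noisy output Y

x = rng.normal(size=(128, 256))                 # mini-batch of input activations
w = rng.normal(scale=0.05, size=(256, 64))      # layer weights
y = forward_with_input_noise(w, x, eta_tr=0.038, c=4.0)
```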

Noise injection on the preactivations
Noise injection on the preactivations has often been proposed for training DNNs for deployment on analog mixed-signal hardware^3,11. As per Supplementary Equation (24), for any arbitrary layer with weight matrix W, input activations X, and output Y, noise is added to the preactivations of the layer: Y = WX + N(0, σ²_preact). The noise is Gaussian distributed with zero mean and variance σ²_preact.
Again, for consistency, we defined the ratio of the standard deviation of the additive noise injected on the preactivations (σ_preact) to the spread of the preactivations (d × σ_X × W_max) to be the same as η_tr, as given by Supplementary Equation (25).
It follows that σ_preact = η_tr × d × σ_X × W_max, where σ_X is the standard deviation of the input activations, W_max is the maximum absolute value of the weights of a given layer, and d is a tunable hyper-parameter for a given layer. The value of d is chosen by matching the distributions of Y between additive noise injection on the weights and on the preactivations for different values of d using the Kolmogorov–Smirnov test, as explained earlier for additive noise on the input activations. The values of d that gave the highest accuracy on ResNet-32 after transfer to PCM synapses with η_tr = 3.8%, found in a completely ad-hoc manner, were d = 2.5 for the first convolution layer, d = 6 for blocks 1, 2, and 3, and d = 7 for the last fully-connected layer.
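
The corresponding sketch for preactivation noise, with σ_preact = η_tr × d × σ_X × W_max, is given below (shapes and values again illustrative).

```python
import numpy as np

rng = np.random.default_rng()

def forward_with_preact_noise(w, x, eta_tr, d):
    """Supplementary Equations (24) and (25): add zero-mean Gaussian
    noise with sigma_preact = eta_tr * d * sigma_X * W_max to the
    preactivations Y = X @ W."""
    sigma_preact = eta_tr * d * np.std(x) * np.max(np.abs(w))
    y = x @ w
    return y + rng.normal(0.0, sigma_preact, size=y.shape)

y = forward_with_preact_noise(rng.normal(scale=0.05, size=(256, 64)),
                              rng.normal(size=(128, 256)),
                              eta_tr=0.038, d=6.0)
```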

Noise injection on the network training dataset
We investigated the simple technique of injecting additive noise into the training dataset. This method can be seen simply as data augmentation but, since the data noise flows through the whole network, it could help make the network more robust to weight perturbations during inference. As shown in Supplementary Equation (27), for an input training image X_image, noise is added to every pixel of the image. The noise is Gaussian distributed with zero mean and variance σ²_data. The noisy image X̃_image is used for training the network. We defined the variance of the additive noise on the input dataset to be related to η_tr as given in Supplementary Equation (28), where g is a tunable hyper-parameter. The optimally tuned value of g that gave the highest accuracy on ResNet-32 after transfer to PCM synapses with η_tr = 3.8% was g = 0.3. This value cannot be deduced from any noise characterized on the PCM hardware.
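
A sketch of the data-noise method follows. Since Supplementary Equation (28) is not reproduced here, the simple proportional form σ_data = g × η_tr used below is an assumption.

```python
import numpy as np

rng = np.random.default_rng()

def noisy_training_image(x_image, eta_tr, g):
    """Supplementary Equation (27): add zero-mean Gaussian noise to every
    pixel. sigma_data = g * eta_tr is an assumed form of the relation in
    Supplementary Equation (28)."""
    sigma_data = g * eta_tr
    return x_image + rng.normal(0.0, sigma_data, size=x_image.shape)

img = rng.random((32, 32, 3))                   # a normalized CIFAR-10 image
img_noisy = noisy_training_image(img, eta_tr=0.038, g=0.3)
```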

Multiplicative noise injection on the weights
We studied multiplicative noise injection on the weights by multiplying the weights of every layer by Gaussian-distributed noise during training. As shown in Supplementary Equation (29), for any arbitrary layer with weight matrix W, input activations X, and output Y, the weights are multiplied element-wise by a Gaussian-distributed noise term: Y = (W ⊙ N(1, σ²_prod))X. The noise term has unity mean, to preserve the mean value of the weights, and variance σ²_prod.
Similar to additive weight noise, there is a range of values of σ_prod over which the network achieves close to baseline accuracy after training. Even though this range of admissible σ_prod values is broad, it is hard to map the observed PCM noise to σ_prod. One could think of performing a linear fit of the observed standard deviations in Figure 3b of the main manuscript and setting the intercept of the fit to 0. In this case, however, the noise would be underestimated, resulting in sub-optimal accuracy. The optimal value of σ_prod that gave the highest accuracy on ResNet-32 after transfer to PCM synapses was 0.1.
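
A sketch of multiplicative weight noise as in Supplementary Equation (29), using the σ_prod = 0.1 value reported above (shapes illustrative):

```python
import numpy as np

rng = np.random.default_rng()

def forward_with_mult_weight_noise(w, x, sigma_prod):
    """Supplementary Equation (29): scale every weight by a unity-mean
    Gaussian factor, preserving the mean value of the weights."""
    w_noisy = w * rng.normal(1.0, sigma_prod, size=w.shape)
    return x @ w_noisy

y = forward_with_mult_weight_noise(rng.normal(scale=0.05, size=(256, 64)),
                                   rng.normal(size=(128, 256)),
                                   sigma_prod=0.1)
```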

Results
Supplementary Figure 22 summarizes the results obtained with the four aforementioned methods for ResNet-32 on CIFAR-10, compared with the additive noise on weights method proposed in the main manuscript. As seen in Supplementary Figure 22a, the training procedure with all four methods could be adjusted such that a similar accuracy after transfer to PCM synapses is obtained. However, as described above, all four methods require tuning one or more noise-scaling hyperparameters in order to reach satisfactory accuracy after transfer to PCM synapses, and the values of these hyperparameters cannot be deduced from hardware observations in a straightforward manner. Therefore, in practice, the trained weights would have to be tested on hardware multiple times for different hyperparameter values in order to achieve satisfactory performance, which is undesirable. In contrast, the methodology proposed in the main manuscript, which uses additive noise on the weights, makes it possible to estimate the magnitude of the weight noise to inject during training, η_tr, from a simple one-time hardware characterization. Moreover, the exact value of η_tr does not have to be determined very precisely, because a wide range of values leads to similar accuracy after transfer to PCM synapses (see Supplementary Figure 3). Therefore, a single representative value of η_tr could be used for training a network and deploying it on multiple chips, as long as a similar device technology and programming algorithm are used and chip-to-chip variations are not too significant. Finally, we also found that training methods based on weight noise achieve better accuracy retention over time (see Supplementary Figure 22b), suggesting that weight noise mimics the behavior of the PCM hardware better.

Supplementary Figure 22. a, Comparison between different methods of adding noise to the ResNet-32 network during training. The hyperparameters of all methods were optimized so as to give the highest accuracy on CIFAR-10 after transfer to PCM synapses, except for additive noise on the weights, where η_tr = 3.8%, given by the hardware characterization, was used. b, Test accuracy over time of ResNet-32 on CIFAR-10 with GDC, computed with the PCM model, for the five different training methods. Methods based on weight noise achieve better accuracy retention.