Improving the robustness of analog deep neural networks through a Bayes-optimized noise injection approach

Analog deep neural networks (DNNs) provide a promising solution for deployment on resource-limited platforms, for example in mobile settings. However, the practicability of analog DNNs has been limited by their instability, which stems from multiple factors such as manufacturing variation and thermal noise. Here, we present a theoretically guaranteed noise injection approach that improves the robustness of analog DNNs without any hardware modification or sacrifice of accuracy: it guarantees that, within a certain range of parameter perturbations, the prediction results do not change. Experimental results demonstrate that our algorithmic framework can outperform state-of-the-art methods by a factor of 10 to 100 on tasks including image classification, object detection, and large-scale point cloud object detection in autonomous driving. Together, our results may serve as a way to ensure the robustness of analog deep neural network systems, especially for safety-critical applications.

In addition to the ReRAM drifting model used in the paper, we have extended the evaluation to other circuit simulation platforms that model the non-idealities of the memristor, to show that our method generalizes to different kinds of noise regardless of the hardware implementation details. Specifically, we deployed the baseline algorithms and our BayesFT optimization methods on MemTorch [1], an open-source simulation platform for memristive deep learning systems that integrates directly with the PyTorch library. In this simulation, we used the default setting of 8-bit ADC resolution with linear quantization.
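Linear quantization at a fixed ADC resolution can be sketched as follows; this is a minimal NumPy illustration of the idea, not MemTorch's internal implementation:

```python
import numpy as np

def linear_quantize(x, bits=8):
    """Uniformly quantize x to 2**bits levels over its observed range."""
    lo, hi = x.min(), x.max()
    n_levels = 2 ** bits - 1
    step = (hi - lo) / n_levels
    return np.round((x - lo) / step) * step + lo

x = np.linspace(-1.0, 1.0, 1000)
xq = linear_quantize(x, bits=8)
# the quantization error is bounded by half of one quantization step
assert np.max(np.abs(x - xq)) <= (2.0 / 255) / 2 + 1e-12
```

At 8 bits the rounding error is at most half a step, which is why the default resolution introduces only a mild accuracy penalty compared with the resistance variability studied below.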
In the simulation, we used the Voltage ThrEshold Adaptive Memristor (VTEAM) model [2], a general and accurate model for describing the behavior of voltage-controlled memristors.
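The VTEAM state equation, in which the internal state variable changes only when the applied voltage exceeds a threshold, can be sketched as follows; all parameter values here are illustrative placeholders rather than the fitted device constants of [2]-[5], and the window functions are omitted:

```python
import numpy as np

def vteam_dwdt(v, w, k_on=-1e4, k_off=1e3, v_on=-0.2, v_off=0.3,
               alpha_on=3, alpha_off=3):
    """VTEAM-style state derivative: w changes only beyond the voltage
    thresholds v_on < 0 < v_off. Parameter values are illustrative."""
    if v > v_off:
        return k_off * (v / v_off - 1.0) ** alpha_off
    if v < v_on:
        return k_on * (v / v_on - 1.0) ** alpha_on
    return 0.0  # sub-threshold voltage: no state change

# Euler integration of w under a constant over-threshold voltage pulse
w, dt = 0.0, 1e-6
for _ in range(100):
    w += vteam_dwdt(0.5, w) * dt
assert w > 0.0  # a positive over-threshold voltage increases the state
```

The threshold behavior is what makes VTEAM a convenient common denominator for the three device settings: only the fitted constants change, not the model structure.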
In order to better demonstrate the generalization of our BayesFT method, we selected three different groups of parameters in [2] for the experiment, which correspond to three kinds of experimental devices, namely Pt-Hf-Ti memristor [3], ferroelectric memristor [4], and metallic nanowire memristor [5].
The actual R_on and R_off parameters in the simulation software are sampled stochastically from a normal distribution, with means equal to the reference R_on and R_off values and a standard deviation σ that can be varied. As σ increases, the device-to-device variability increases and the inference accuracy of the analog neural network decreases. By comparing the degree of accuracy degradation at the same σ, we can compare the performance of the different algorithms.
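This device-to-device sampling can be sketched as follows; the reference resistances below are illustrative round numbers, not the fitted parameters of [3]-[5]:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_devices(r_on_mean, r_off_mean, sigma, n):
    """Draw per-device R_on / R_off around the reference values;
    a larger sigma means stronger device-to-device variability."""
    r_on = rng.normal(r_on_mean, sigma, n)
    r_off = rng.normal(r_off_mean, sigma, n)
    return r_on, r_off

# illustrative reference values (ohms), not the device parameters of [3]-[5]
r_on, r_off = sample_devices(1e4, 1e5, sigma=1e3, n=10000)
# the sample spread tracks the chosen sigma, which is the quantity
# swept along the horizontal axis of Fig. S1
```

Sweeping `sigma` and re-evaluating inference accuracy at each value reproduces the degradation-versus-variability comparison described above.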
Based on the above experimental settings, we applied our BayesFT optimization method and the baseline algorithms to MNIST classification with LeNet, since convolutional neural networks are the most common choice for image classification tasks.
As shown in Fig. S1(b), the results for the three different settings show the same trend. The accuracy of BayesFT-DO shows strong robustness, while the accuracy of FTNA drops rapidly as the resistance variance increases. The performance of ERM and ReRAM-V is intermediate between the BayesFT method and the FTNA method. Note that the R_on and R_off values of the three settings are of different orders of magnitude, so the scales of σ, i.e., of the horizontal axis, are different.

Results of PCM devices
In addition to ReRAM, our method also works on phase-change memory (PCM) devices. We conducted experiments on the IBM Analog Hardware Acceleration Kit [6] (IBM-aihwkit), which provides state-of-the-art statistical models of PCM arrays that can be used to simulate the various sources of noise present in real hardware during inference.
The model simulates three different sources of noise in the PCM array: programming noise, read noise, and temporal drift. After the network is trained in software, the trained parameters are mapped to target conductance values G_T. The conductance values are then programmed on the PCM device in hardware using a closed-loop, iterative write-read-verify scheme. The difference between G_T and the actual programmed conductance G_prog is characterized by the programming noise, which is described by Eqs. (1) and (2).
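A closed-loop write-read-verify programming loop can be sketched generically as follows; the update gain and write-noise magnitude are illustrative assumptions, not the hardware's actual pulse scheme:

```python
import numpy as np

rng = np.random.default_rng(1)

def program_conductance(g_target, n_iters=20, write_noise=0.05, gain=0.5):
    """Iteratively nudge the device toward g_target: each write pulse
    lands with a random error, and a read verifies the result."""
    g = 0.0
    for _ in range(n_iters):
        error = g_target - g                              # read + verify
        g += gain * error + rng.normal(0.0, write_noise)  # noisy write
    return g

g_prog = program_conductance(1.0)
# g_prog ends close to, but not exactly at, the target: the residual
# discrepancy is what the programming-noise model captures
```

Even with many verify iterations, the per-pulse stochasticity leaves a residual error, which is why programming noise has to be modeled statistically rather than assumed away.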
After the parameters have been programmed, the conductance of the PCM device drifts with time. Knowing the conductance g_prog at the time t_c of the last programming pulse, the conductance at time t is
g_drift(t) = g_prog (t / t_c)^(−ν),
where ν is sampled from N(µ_ν, σ_ν). Read noise occurs when the devices are read after they have been programmed. The standard deviation of the read noise σ_nG at time t is
σ_nG(t) = g_drift(t) Q_s √(t_read / t),
where t_read = 250 ns is the pulse width applied when reading the devices. Q_s, measured from the PCM device as a function of g_T, is given by
Q_s = min(0.0088 / g_T^0.65, 0.2).
The final simulated PCM conductance from the model at time t is then
g(t) = g_drift(t) + N(0, σ_nG(t)).
The noise and drift described above are long-term non-idealities. In addition, a short-term parameter noise is applied for each MAC computation; it does not modify the stored parameter matrix but perturbs the output:
y = (W + ξ) x, where each entry of ξ is drawn from N(0, σ),
so the effective parameters are Gaussian distributed; σ is determined by w_noise.
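The drift and read-noise behavior described above can be sketched numerically as follows; this is a minimal NumPy illustration, with illustrative drift parameters µ_ν and σ_ν and conductances in µS:

```python
import numpy as np

rng = np.random.default_rng(2)

def pcm_conductance(g_prog, t, t_c=20.0, mu_nu=0.05, sigma_nu=0.01,
                    t_read=250e-9):
    """Simulated PCM conductance at time t (seconds) after programming.
    mu_nu and sigma_nu are illustrative, not the fitted model values."""
    nu = rng.normal(mu_nu, sigma_nu)                # drift exponent
    g_drift = g_prog * (t / t_c) ** (-nu)           # temporal drift
    q_s = min(0.0088 / g_prog ** 0.65, 0.2)         # measured scaling factor
    sigma_ng = g_drift * q_s * np.sqrt(t_read / t)  # read-noise std
    return g_drift + rng.normal(0.0, sigma_ng)      # noisy read

g0 = 25.0  # programmed conductance (uS)
g_day = pcm_conductance(g0, t=86400.0)  # read one day after programming
assert g_day < g0  # drift lowers the conductance over time
```

Note how the read noise shrinks as √(t_read / t): at long time scales, temporal drift dominates the deviation from the programmed value.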
We deployed the baseline algorithm ERM and our methods BayesFT-DO, BayesFT-Ga, and BayesFT-La on IBM-aihwkit. We evaluated different neural network architectures, including LeNet and a three-layer MLP on MNIST, and AlexNet, VGG11, ResNet18, and PreAct-ResNet with different numbers of layers on CIFAR-10. Different values of w_noise were applied across all groups of experiments. The results are shown in Fig. S2.
For all eight sets of experiments, we observe that the accuracy of BayesFT-DO decreases more slowly than that of ERM, BayesFT-Ga, and BayesFT-La as w_noise increases, indicating that BayesFT-DO yields the best robustness. For example, on the MLP, when w_noise increases to 0.4, the accuracy of ERM decreases to 49.57%, while the accuracy of BayesFT-DO remains as high as 73.33%. On AlexNet, when w_noise increases to 0.01, the accuracy of ERM drops to 32.37%, while the accuracy of BayesFT-DO is still as high as 75.80%.
In the CIFAR-10 experiments, BayesFT-Ga and BayesFT-La achieve slightly higher accuracy than ERM. For VGG11, BayesFT-Ga and BayesFT-La show much better robustness than ERM. When w_noise increases to 0.02, the accuracy of ERM decreases to 15.81%, while that of BayesFT-DO is 77.49%, that of BayesFT-Ga is 71.63%, and that of BayesFT-La is 72.20%. This is because the ERM method uses the VGG11 structure with batch normalization, whereas our methods remove most of the normalization layers, which improves robustness.

Supplementary Note 2: Theoretical proof of robustness
In this section, before diving into the mathematical proof, we first give an intuitive account, illustrated in Figure S3, of the basic idea of our method. The aim is to prove that there exists a region around the original parameter point where the loss value is upper bounded by a constant, or, equivalently, the likelihood value is lower bounded by a constant. Because the likelihood landscape is highly non-convex with respect to the DNN parameters θ, we take an approach that does not rely on strong convexity assumptions. Consider a randomized version of the analog DNN with injected noise that actively perturbs the original parameters: by increasing the number of injected random noise samples, the worst possible likelihood value can be determined with higher confidence via concentration inequalities. With no assumptions on the shape of the likelihood functional, the functional must pass through all sampled points to remain consistent with our random measurements. With this constraint on the likelihood functional, we can conclude that within a closed set around the original parameters, the analog DNN maintains the correct output (i.e., the likelihood stays above the lower-bound constant). Therefore, as long as the magnitudes of the analog noise are within certain ranges, the robustness of the analog DNN is guaranteed. This indicates that an analog DNN can be equivalent to a digital DNN under some conditions, which could enable analog DNNs to be deployed in life-critical systems. Besides, during the DNN optimization phase with injected noise layers, we optimize over a region centered on the original parameters instead of a single parameter point, which improves the generalization ability of the analog DNN by avoiding sharp minima.
In the following, we provide a formal theoretical justification of the proposed algorithm. We first introduce the proposed randomized version of the analog DNN model:
where x is the input data; θ_0 denotes the original, unperturbed parameters of the analog DNN; π_0 is the random noise injected into the analog DNN; and ⊗ is the operator applying the random perturbation, which can be + for a Gaussian distribution or × for Dropout. In the following we write f_{π0}(θ_0, x) as f_{π0}(θ_0) for brevity. For simplicity, and without loss of generality, we consider a binary (0-1) classification problem. We want to prove that if f_{π0}(θ_0) > 1/2, then for any perturbation δ of the analog DNN's parameters within some range, the prediction of the analog DNN remains the same, i.e., f_{π0}(θ_0 ⊗ δ) > 1/2. To derive a tractable lower bound for f_{π0}(θ_0 ⊗ δ) and deduce the admissible δ, we relax f in the functional space F = {f : f(θ) ∈ [0, 1]}, the set of all functions bounded in [0, 1], along with an equality constraint at the original function f:
Theorem 1 (Lagrangian). Denote by π_δ the distribution of η ⊗ δ. Solving Inequality 8 is equivalent to solving the following problem:
where D_F(λπ_0, π_δ) is:
Theorem 1 is a Lagrangian duality statement, which can be proved by the minimax theorem. According to Theorem 1, for different distributions of the injected noise we can instantiate π_0 as the corresponding distribution type and derive the maximum allowable perturbation range B such that, for any δ ∈ B, the prediction of the analog DNN remains the same. In the remainder of Supplementary Note 2, we examine three distribution types, namely the Gaussian distribution, dropout (Binomial distribution), and the Laplace distribution, and derive B for each with mathematical proofs.
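The smoothed prediction f_{π0}(θ_0) can be estimated by Monte Carlo sampling of the injected noise; the following toy sketch uses a logistic model with additive Gaussian noise (⊗ = +), with all names and values illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def f_pi0(theta0, x, sigma=0.1, n_samples=2000):
    """Estimate the smoothed confidence E_eta[f(theta0 + eta, x)] for a
    toy logistic classifier f, with eta ~ N(0, sigma^2 I)."""
    confs = []
    for _ in range(n_samples):
        eta = rng.normal(0.0, sigma, size=theta0.shape)
        logit = (theta0 + eta) @ x            # perturbed parameters
        confs.append(1.0 / (1.0 + np.exp(-logit)))
    return np.mean(confs)

theta0 = np.array([1.0, -0.5])
x = np.array([2.0, 1.0])
p = f_pi0(theta0, x)
assert p > 0.5  # the smoothed prediction is confidently class 1, so a
                # robustness region around theta0 can be certified
```

More samples tighten the concentration-inequality confidence with which f_{π0}(θ_0) and hence the certified region B are determined.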
Gaussian distribution
In this case, π_0(η) is instantiated as a Gaussian distribution N(0, σ²I) and ⊗ is set as +. Noting that the Lagrangian objective is concave with respect to λ, minimizing over λ yields a closed-form lower bound L. We have L ≥ 1/2 if and only if r ≤ σΦ^{-1}(f_{π0}(θ_0)), where Φ^{-1} is the inverse standard Gaussian CDF; i.e., the analog DNN is robust within the perturbation range B = {δ : ∥δ∥_2 ≤ σΦ^{-1}(f_{π0}(θ_0))}. This reveals that the robustness to analog noise is determined by two factors: 1) the magnitude σ of the noise injected during training, and 2) the analog DNN's self-healing ability f_{π0}(θ_0). Larger injected noise may result in a more robust analog DNN, but can also reduce the accuracy it achieves.
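The perturbation range B = {δ : ∥δ∥_2 ≤ σΦ^{-1}(f_{π0}(θ_0))} can be evaluated directly; a minimal sketch using Python's standard-library inverse normal CDF:

```python
from statistics import NormalDist

def certified_radius(confidence, sigma):
    """Maximum L2 parameter perturbation with an unchanged prediction:
    r = sigma * Phi^{-1}(f_pi0(theta_0)), meaningful when confidence > 1/2."""
    return sigma * NormalDist().inv_cdf(confidence)

# larger injected noise sigma and higher smoothed confidence both
# enlarge the certified region
assert certified_radius(0.9, 0.2) > certified_radius(0.9, 0.1)
assert certified_radius(0.95, 0.1) > certified_radius(0.9, 0.1)
assert abs(certified_radius(0.5, 0.1)) < 1e-9  # no margin, no certificate
```

The radius grows with both factors identified above, but Φ^{-1} saturates slowly as the confidence approaches 1, which is why increasing σ without hurting f_{π0}(θ_0) is the more effective lever.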

Dropout
In this case, π_0(η) is instantiated as a Binomial distribution in which η is set to zero with probability p, and ⊗ is set as ×. Then, for ∥δ∥_2 ≤ r, the prediction confidence lower bound is:
where Θ is the dimension of the analog DNN's parameters. From the above inequality we can deduce that, as long as the dropout rate p is large enough and the randomized version of the analog DNN has high confidence in the correct prediction (f_{π0}(θ_0) close to 1), we obtain provably correct prediction results regardless of the analog noise. This explains the high robustness of BayesFT-DO over a range of tasks. It also reveals an inherent trade-off between robustness and accuracy: selecting a higher p decreases the performance gap between the model deployed on a GPU and on analog devices, but can also degrade accuracy due to the information lost through dropped-out neurons.
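The multiplicative dropout perturbation described here can be sketched as a toy forward pass; this is illustrative NumPy code, not the BayesFT-DO training implementation:

```python
import numpy as np

rng = np.random.default_rng(4)

def dropout_forward(w, x, p=0.3):
    """Multiplicative Binomial noise: each weight is zeroed with
    probability p, emulating the eta x theta perturbation (⊗ = x)."""
    mask = rng.random(w.shape) >= p        # keep each weight w.p. 1 - p
    return (w * mask) @ x / (1.0 - p)      # rescale to preserve expectation

w = rng.normal(size=(4, 8))
x = rng.normal(size=8)
# averaged over many noise draws, the output matches the clean forward pass
y_mean = np.mean([dropout_forward(w, x) for _ in range(20000)], axis=0)
assert np.allclose(y_mean, w @ x, atol=0.1)
```

Raising `p` strengthens the perturbation the network must tolerate during training, which mirrors the robustness-accuracy trade-off discussed above.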