Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators

Analog in-memory computing - a promising approach for energy-efficient acceleration of deep learning workloads - computes matrix-vector multiplications but only approximately, due to nonidealities that often are non-deterministic or nonlinear. This can adversely impact the achievable inference accuracy. Here, we develop a hardware-aware retraining approach to systematically examine the accuracy of analog in-memory computing across multiple network topologies, and investigate sensitivity and robustness to a broad set of nonidealities. By introducing a realistic crossbar model, we improve significantly on earlier retraining approaches. We show that many larger-scale deep neural networks - including convnets, recurrent networks, and transformers - can in fact be successfully retrained to show iso-accuracy with the floating-point implementation. Our results further suggest that nonidealities that add noise to the inputs or outputs, not the weights, have the largest impact on accuracy, and that recurrent networks are particularly robust to all nonidealities.


Introduction
The ever-increasing compute needed to train and use DNNs [72] has made hardware latency and energy-efficiency a growing concern. However, conventional processor architectures (e.g., CPUs and GPUs) incessantly transfer data between memory and processing through the 'von Neumann bottleneck,' inducing time and energy overheads that significantly degrade latency and energy-efficiency. Numerous hardware concepts have been introduced to accelerate DNN training and/or inference [77,35,68], by approximating MVMs and other arithmetic with custom floating-point representations such as bfloat16 [84] or DLFloat [2], or with reduced-precision fixed-point arithmetic to quantize synaptic weights and activations [76,15,16,67]. Compression and sparsification techniques can further reduce compute requirements by pruning weights and/or activations [3,27].

Figure 1 (caption excerpt): The DNN is first trained as a conventional floating-point model (FP32), then re-trained in a hardware-aware manner, by adding nonidealities and noise sources into the forward path and using SGD to improve the robustness to such generic nonidealities. HWA training is performed only once - no specific device or chip characteristics, such as failure maps, are taken into account during HWA training, so resulting models remain widely-deployable. This HWA-trained model is then programmed onto AIMC multiple times (here in simulation) and DNN accuracy is evaluated over time, taking into account conductance drift of PCM devices and read noise accumulation [56].
AIMC using non-volatile memory (NVM) elements is a promising mixed-signal approach for DNN acceleration [11,71,10], with weights stored using crossbar arrays of tuneable conductive elements. This enables approximate MVM computation directly in-memory, by applying activation vectors (as voltages or pulse durations) to the crossbar array, and then reading out analog physical quantities (instantaneous current or accumulated charge) [52,12,53]. As a 'non-von Neumann' architecture, AIMC performs MVM operations at the location of the stored weights, in a highly-parallel, fast, and energy-efficient manner [12] - but only approximately.
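To make the differential-conductance read-out concrete, the following is a minimal NumPy sketch of an idealized crossbar MVM. It is illustrative only: the conductance units, the g_max value, and the function names are our own assumptions, not AIHWKit's API, and all nonidealities are omitted.

```python
import numpy as np

def map_to_conductances(W, g_max=25.0):
    """Map a weight matrix onto a differential pair of conductance
    matrices (G+, G-), in arbitrary units.  Positive weights go to G+,
    negative weights to G-; g_max (an assumed maximum conductance)
    corresponds to the largest absolute weight."""
    scale = g_max / np.max(np.abs(W))
    G_pos = np.clip(W, 0, None) * scale
    G_neg = np.clip(-W, 0, None) * scale
    return G_pos, G_neg, scale

def crossbar_mvm(G_pos, G_neg, x, scale):
    """Idealized differential read-out: column currents from the positive
    and negative arrays are subtracted, then rescaled back to the
    original weight units."""
    return (G_pos @ x - G_neg @ x) / scale

W = np.random.randn(4, 8) * 0.1
x = np.random.randn(8)
G_pos, G_neg, s = map_to_conductances(W)
# Without nonidealities, the differential read-out reproduces W @ x.
assert np.allclose(crossbar_mvm(G_pos, G_neg, x, s), W @ x)
```

In real hardware the subtraction and rescaling are performed by analog circuitry and digital post-processing; the point of this sketch is only the weight-to-conductance-pair decomposition.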
The success of reduced-precision digital accelerators proved that DNNs can tolerate surprisingly coarse approximations of the underlying MVMs. While naive direct weight-quantization invariably leads to DNN accuracy loss, original model accuracies can often be recovered when DNNs are re-trained in a quantization-aware manner, even for aggressive reductions in precision. Weight-quantization into as few as 2-4 bits can often be tolerated without significant accuracy reduction [42,15,54]. This observation led to the development of quantization-aware training (QAT) methods, now commonly used when deploying DNNs onto reduced-precision digital hardware (e.g. [1]).
In general, since reducing MVM precision decreases the representational power of each DNN layer as compared to a full FP implementation, performance naturally suffers once the function approximation becomes too coarse for the task at hand [54]. In practice, QAT is known to have limits, and minimum MVM-precision requirements vary across DNN topologies. For instance, the first and last layers of CNNs are invariably implemented with high-precision FP, even in studies claiming CNN iso-accuracy at very low fixed-point precision [76,15].
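As background, QAT typically inserts a "fake quantization" step into the forward pass (with gradients passed straight through the non-differentiable rounding). A minimal sketch of such a weight quantizer, assuming standard uniform symmetric quantization and not taken from any cited work:

```python
import numpy as np

def fake_quantize(w, n_bits=4):
    """Uniform symmetric quantization of weights to n_bits, as used in
    the forward pass of quantization-aware training.  During training,
    gradients are typically passed 'straight through' this operation."""
    q_max = 2 ** (n_bits - 1) - 1
    delta = np.max(np.abs(w)) / q_max          # quantization step size
    return np.clip(np.round(w / delta), -q_max, q_max) * delta
```

At n_bits = 2 this collapses the weights onto at most three levels, illustrating why accuracy eventually suffers as precision drops.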
Up to now, it has been unclear how and to what degree DNNs can be re-trained to maintain accuracy on emerging AIMC technology. The successes of QAT cannot be directly translated to AIMC, since the MVM approximations arise from fundamentally different mechanisms. In AIMC, weights are represented by conductances that are physical properties of NVM devices. In many materials, such as PCM [9,45], resistive random-access memory (ReRAM) [33,34], conductive-bridge RAM (CBRAM) [49,21], or electro-chemical random-access memory (ECRAM) [47,59], these conductances are effectively continuous physical quantities, and stored weights are not quantized.
That said, effective AIMC weight-precision is impacted by various nonidealities, including thermal and 1/f noise, randomization during physical switching induced by electrical and thermal fluctuations, material inhomogeneities [14], and device-to-device variability introduced during device fabrication or operation. These issues cause both MVM read-out [56] and the writing or programming of the conductances [60,79,51] to be erroneous and non-deterministic. Worse yet, conductances can evolve over time after programming [6,4,7]. Finally, any nonlinearities within the analog circuitry performing summation will further degrade MVM precision. Such imperfections include 'IR-drop' voltages on wires and transistors, restrictions on input (output) dynamic range imposed by discretization and saturation of the digital-to-analog converter (DAC) (analog-to-digital converter (ADC)) components, and random noise or variability in the circuitry.
Whereas QAT becomes challenging as precision is deterministically reduced, MVM approximation in AIMC is tied to a non-deterministic signal-to-noise ratio. A number of previous studies have shown that noise-aware training - simple injection of noise onto weights or activations during DNN training - can make DNNs more robust for AIMC deployment [79,36,87,24,38,73]. However, such studies have typically been limited to one or two exemplary DNNs of a particular type (e.g., CNNs), using only a limited subset of nonidealities such as NVM noise. Other critical AIMC system aspects, such as output noise, saturation, and circuit nonlinearities, have been neglected. Moreover, since each study makes different hardware and NVM-device choices, it is difficult to generalize, compare, or combine them. Thus more realistic and standardized AIMC crossbar-models - which can support comparison of AIMC accuracy for hardware-aware trained DNN models across studies - are needed.
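The noise-injection idea behind such noise-aware training can be sketched as follows. The noise model and strengths here are illustrative placeholders, far simpler than the calibrated nonidealities introduced later in this paper; the function name and defaults are our own.

```python
import numpy as np

def noisy_forward(W, x, weight_noise=0.02, output_noise=0.04, rng=None):
    """One noise-aware forward MVM: Gaussian perturbations are added to
    the weights (re-drawn on every call, mimicking programming/read
    noise) and to the layer outputs (mimicking analog output noise).
    Noise strengths are illustrative, not hardware-calibrated."""
    rng = rng or np.random.default_rng()
    W_noisy = W + weight_noise * np.max(np.abs(W)) * rng.standard_normal(W.shape)
    y = W_noisy @ x
    return y + output_noise * rng.standard_normal(y.shape)
```

During retraining, SGD sees these perturbed outputs in the forward pass and thereby learns weight configurations whose function is insensitive to the injected noise.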
Although some promising, small-sized DNN prototype demonstrations exist [83,39,86,20,57,5,88], it remains unclear how robust the AIMC deployment of realistically-sized AI workloads will be. How will the various nonidealities of AIMC hardware impact DNN accuracy, across the various topologies and thus application domains? And how much of the lost accuracy could be recovered by hardware-aware training? Which crossbar-array design choices will be most effective in maintaining accuracy? And if necessary, what degree of improved device-to-device uniformity might be required - through better NVM device-fabrication - in order for AIMC to succeed on all DNN models? A systematic study comparing the various DNN topologies in terms of robustness to AIMC nonidealities is needed.
In this paper, we establish a robust HWA training framework by extending and improving existing training methods for AIMC to include previously neglected nonidealities (see Fig. 1 for an illustration). We define a standard inference model for PCM-based AIMC that can readily be extended to other types of NVM devices. We explore the functional suitability of AIMC across application domains by assessing the robustness of a wide set of DNN topologies. Finally, we estimate the individual impact of various AIMC nonidealities and gauge their relative importance for consideration in future hardware designs. Functions for our standard evaluation process are provided in the open-source IBM Analog Hardware Acceleration Toolkit (AIHWKit) [66], enabling future studies on noise robustness for AIMC to build seamlessly upon our work.
We find that various DNNs and AI workloads - ranging across image classification using CNNs, text-prediction and speech-to-text conversion using RNNs, and natural language processing using transformer networks - can actually be robustly deployed on AIMC given proper HWA training. We show - for the first time - iso-accuracy inference results (within 1% of the FP reference) using hardware-calibrated PCM models, for five out of the eleven AI workloads tested, even after 1 hour (or more) of conductance drift.
However, precision requirements are heterogeneous, and not all architectures reach this iso-accuracy target easily even after extensive HWA training, pinpointing the need for continued device improvement. We find that CNNs are typically much less robust to the various nonidealities and design choices of AIMC hardware. Interestingly, RNNs - already well-suited for AIMC given their large, dense MVMs [32] - also seem to be the most robust to the finite signal-to-noise ratio (SNR) of AIMC hardware. We further show that, among the various nonidealities tested, sensitivity to additive system noise at the output of each crossbar-array is the most critical for achieving good accuracy.

Figure 2 (caption): Our AIMC crossbar-model assumes that each array or "tile" approximates matrix-vector multiplication (MVM), y ≈ W x, using a scalar α (and vectors γ and β) to scale into (from) quantized inputs (outputs), so that input, weight, and output ranges can remain fixed. Negative weights are programmed onto a different conductance for current subtraction, and output noise is fully represented (see Eqs. 3 and 6).

Analog IMC standard MVM model
Our standard AIMC crossbar-model (see Figs. 2 and 3, and Eqs. 3 and 6 in Methods) encapsulates the critical nonidealities incurred during MVM operations, including the fixed dynamic-ranges of physical inputs (limited by maximum pulse-duration), weights (limited by maximum conductance), and outputs (limited by maximum output current). Floating-point inputs x are mapped to the fixed input range by a global scalar, α, which is optimized for each crossbar during HWA training, then held fixed for inference. Such optimization avoids issues created if α is chosen poorly (Fig. 3E). Similarly optimized scales (γ_i) and offsets (β_i) map ADC-counts of each output column to MVM-outputs y_i (see Eq. 3) that can be passed to subsequent digital compute for auxiliary operations (activation functions, etc.) [32]. We further assume a number of nonidealities, such as PCM programming errors and drift (Fig. 3A), IR-drops within the array (Fig. 3C), weight read noise, and system noise (Fig. 3B and Eq. 6). Each of these parameters has been carefully calibrated to existing PCM hardware [56]. All parameter settings of the AIMC crossbar-model are summarized in Table A.1.
We quantify MVM errors in computing ỹ with respect to the ideal outcome y through ε_M, the ratio of the l2-norm of the deviation (ỹ − y) relative to the l2-norm of the ideal outcome y (see Eq. 20). Figure 3B shows that, even after including PCM drift, the effective MVM error of the standard AIMC crossbar-model we will use throughout the paper roughly corresponds to 4-bit fixed-point quantization of weights or inputs.

Figure 3 (caption): (A) Comparisons between nonideal MVM results - computed with programmed weight matrices and uniform random inputs - and the ideal (FP32) expected results reveal significant deviations, due primarily to weight-programming errors and PCM conductance drift (shown here without any mean-drift compensation [4]). (B) Short-term noise sources induce cycle-to-cycle noise for repeated MVM calculations, even with the same programmed weight matrix. (C) "IR-drops" due to finite wire resistance result in an input-position dependency of the accumulated AIMC column-currents, and can cause outputs to further deviate from the ideal MVM. In an extreme case, an expected 0 output - the correct result when a linearly-graded weight matrix (ranging from -1 to 1 in order) is read with a constant input on all rows - can actually deviate drastically due to these position-dependent IR-drops. The deviation induced by these IR-drops increases as the fraction of ordered weights increases from 0 (completely unordered, typical case) to 1 (fully-ordered weights from -1 to 1, extreme case). (D) The MVM error ε*_M (Eq. 20) of our standard PCM-based AIMC inference model (using the nonideal MVM of Eqs. 3 and 6 together with hardware-calibrated PCM conductance noise and drift, Eqs. 7-14; dotted lines), ≈15%, roughly corresponds to fixed-point quantized digital MVMs (solid lines) at ∼4 bits. (E) Correlations of the MVM deviation, ỹ_i − y_i, vs. the desired MVM output y_i = Σ_j w_ij x_j illustrate the importance of proper input scaling α, for w_ij ∼ N(0, 0.246) and x_j ∼ N(0, 1). Red dots mimic the weight-to-activation correlations that SGD learning will produce, using x̃_j = ρ w_kj + (1 − ρ) x_j. Low α = 1 leads to input clipping (panels a, e, i) and ε*_M exceeding 35% (gray text). Intermediate α values can still lead to saturated outputs for correlated inputs, even without input clipping (panels f, j); excessive α values reduce clipping but increase ε*_M dramatically (panels d, h, l). We optimize α during HWA training, then keep it fixed during AIMC inference, minimizing ε*_M regardless of input correlation (panels c, g, k).
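The MVM-error metric ε_M used throughout the paper follows directly from its definition as a relative l2-norm (cf. Eq. 20); a one-line sketch:

```python
import numpy as np

def mvm_error(y_ideal, y_analog):
    """Relative MVM error epsilon_M: the l2-norm of the deviation between
    the analog and ideal outputs, divided by the l2-norm of the ideal
    output (cf. Eq. 20 of the text)."""
    return np.linalg.norm(y_analog - y_ideal) / np.linalg.norm(y_ideal)
```

For instance, the standard crossbar-model's ε*_M ≈ 15% corresponds to `mvm_error(...)` returning about 0.15 when averaged over representative inputs.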

DNN accuracy impact when directly using AIMC
To test the effect of AIMC nonidealities on a variety of AI workloads, we consider 11 medium- to large-scale DNNs of various topologies as a benchmark set (see Table 1). These cover a wide spectrum of target-applications (image classification, natural language processing, speech-to-text), network topologies (convolutional, recurrent, transformer with attention), model sizes (from 0.3M to 108M parameters), crossbar utilization (from 4% to 86%), total number of MVMs per data input (from 2.2K to 240K), MVM sizes (from 0.1G to 96.5G flops), average weight-matrix reuse-factor per data-input (from 17 to 1285), and network depth (up to 121 layers). Our benchmark set thus covers a wide variety of network topologies and challenges for AIMC.
For comparison, we first directly map weights produced by standard SGD training in FP32 onto our standard AIMC crossbar-model and evaluate the resulting test error, to measure the accuracy drop (with respect to the FP32 model) due to all the various AIMC nonidealities. For each crossbar, we first clip raw weight-values to 2.5× the standard deviation of the weight distribution, to disregard outliers and compactify the weight distribution. Output scales γ_i are initially estimated according to the absolute maximum value for each column (see Eq. 21). To adjust the digital parameters of our standard AIMC crossbar-model for these directly-mapped-from-software weights, we use our HWA-training flow - but without any weight noise injection, with weight learning rates set to zero, and for only 1000 batches. As expected, such "direct" mapping of DNNs onto AIMC, without any additional retraining of the weights, generally results in significant increases in test error (accuracy drop) in comparison to the floating-point reference (Table 2). Direct comparison of accuracy values between DNNs is complicated by the fact that these various AI tasks exhibit different worst-case (random guessing) and best-case (well-trained DNN model) accuracies. To quantify and compare accuracy drop across different topologies, we therefore define a normalized relative accuracy A^{1h}_*, which re-scales the AIMC test error ε^{1h}_test (at 1 hour PCM drift) by the distance between the original FP32 test error ε^{FP}_test and the "chance" test error ε_chance from random guessing, as follows:

A^{1h}_* = (ε_chance − ε^{1h}_test) / (ε_chance − ε^{FP}_test) × 100%.   (1)

Thus a value of A^{1h}_* = 100% means that the AIMC DNN achieves the same accuracy as the FP32 reference model (no accuracy drop at all), while a value of A^{1h}_* = 0% implies that the AIMC crossbar-model is so inaccurate that it is indistinguishable from random guessing.
Ideally, deploying a given DNN in an AIMC system should have no impact on model accuracy. We define our iso-accuracy target as A^{1h}_* > 99%, allowing less than a 1% drop in accuracy, as judged relative to the distance between the FP32 reference accuracy and the chance (random guessing) accuracy-floor. Table 2 shows that direct AIMC mapping fails to achieve this iso-accuracy target for almost all of the DNNs tested, establishing both the challenge posed by the nonidealities of AIMC (as compactly encapsulated by our standard crossbar-model, Figs. 2 and 3), as well as the need for HWA training methods that can greatly improve robustness and reduce these accuracy drops.

Table 2 (caption excerpt): Normalized accuracy A^{1h}_*, scaled to the range between the FP32 reference test-error and the test-error obtained by random guessing. Nearly all models fail to achieve iso-accuracy, defined as > 99% in this normalized accuracy and indicated in bold font. Note that for BERT and Albert only one GLUE task (MRPC) is used here.
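The normalized relative accuracy described above amounts to a simple rescaling of test errors; a minimal sketch, with hypothetical error values in the assertions:

```python
def normalized_accuracy(err_aimc, err_fp, err_chance):
    """Normalized relative accuracy in percent: 100% when the AIMC model
    matches the FP32 reference test error, 0% when it is at chance
    level (random guessing)."""
    return 100.0 * (err_chance - err_aimc) / (err_chance - err_fp)
```

With this convention, the iso-accuracy criterion is simply `normalized_accuracy(...) > 99.0`.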

HWA training improves AIMC accuracy for all DNNs
Building on previous approaches (e.g. [36,38,24]), we set out to retrain these 11 DNNs in a hardware-aware (HWA) manner. In our methodology of HWA training followed by delayed inference (Fig. 1), each DNN is retrained with noise injection using SGD. But in contrast to earlier approaches, we incorporate a much more comprehensive and realistic set of software-simulated AIMC nonidealities, including dynamic-range limitations, weight-programming errors, PCM drift, and system noise. Once a given DNN is trained and mapped to AIMC, inference accuracy is then gauged under noise and drift at various delays (1 second, 1 hour, 1 day, and 1 year) after programming the weights into the crossbar arrays. We also introduce a set of AIMC characteristics including input, output, and weight scales (see Fig. 3 and Methods), and introduce a new approach for optimizing these scaling-factors during HWA training for use during inference (see Sec. A.1).
As shown in Table 3, our HWA training approach significantly improves achievable AIMC accuracy across the full set of benchmark DNNs. The normalized accuracies (relative to the FP32 model) at one hour after programming are all higher than 96% (A^{1h}_*, towards the right edge of Table 3). This represents a significant improvement over the 'direct' weight mapping without retraining shown earlier (Table 2), while establishing a new state-of-the-art in HWA training, as revealed by detailed comparisons on ResNet-32 with CIFAR10 (see Table A.2).
Table 3 indicates that five of the 11 AI workloads can be trained to reach the A^{1h}_* > 99% iso-accuracy target, including the BERT transformer model as well as all workloads based on LSTMs (last 3 rows; see 'Type' column in Table 1). Most of the remaining workloads use CNNs and exhibit more pronounced accuracy drops of up to 3.6% on AIMC, although one CNN does reach iso-accuracy (WideResNet-32 on CIFAR100).
For some DNNs, we find that the regularization effect of the added AIMC nonidealities allows HWA training to actually improve the attainable accuracy (compare test-errors at 1 sec after programming for WideRN-16 and BERT). Both RNNs and transformers are quite robust when subject to PCM conductance drift over longer periods as well. The rightmost column of Table 3 shows the long-term relative accuracy of the DNNs, A^{1y}_*, for a hypothetical 1 year after programming without weight refresh.

Table 3 (caption excerpt): Rightmost two columns show the normalized accuracy, scaled between the FP reference and chance error, at 1 hour and 1 year after weight programming. Note that PCM drift is a post-programming physical effect that is initially rapid but then slows down logarithmically in time [45]. This means that the multiplicative conductance-changes induced by drift between 1 second and 1 hour (time-since-programming increased 3600×), and between 1 hour and 1 year (time-since-programming increased 8760×), are actually quite similar. HWA training hyper-parameters (injected noise strength, etc.) were chosen to produce the best average accuracy across the four widely-spaced time-points shown here. Other choices could be made to focus on performance at either longer or shorter periods of drift. Models deemed iso-accurate (A^{1h}_*, A^{1y}_* > 99%) are marked in bold. BERT and Albert results are averaged across eight GLUE tasks, as evaluated on validation datasets; SWB300 results are averaged over two benchmark tasks; results for Speech-SWB300 use HWA training with distilling (see Methods).
While the RNNs and transformers remain near iso-accuracy over time, larger CNNs with higher-resolution ImageNet inputs show the largest drop in accuracy. The deep DenseNet-121 (121 layers), as well as the large WideResNet-50 (69M parameters), are clearly the most challenging models for AIMC. That said, resiliency to long-term drift is greatly improved by HWA training as compared to "direct" deployment without retraining. For instance, the HWA-trained models for both Speech-SWB300 and LSTM-PTB remain iso-accurate out to a year, unlike the directly-mapped models (Table 2).
In general, we find that CNNs are more difficult to train to iso-accuracy for AIMC deployment than RNNs and transformers. In terms of AIMC workload execution latency and system mapping [32], CNNs are already less well-suited for resistive crossbar arrays, due to the uneven temporal re-use between layers and the spatial under-utilization of the large analog tiles by the small kernel matrices (see Table 1), although some optimization and mapping tricks [65] are available. Our results here indicate that AIMC noise-robustness issues will pose additional challenges when implementing CNNs on AIMC systems.

Sensitivity of HWA trained models to various AIMC nonidealities
To determine which nonidealities are particularly problematic for analog inference across DNNs, we "stress test" our HWA-trained models. For each individual nonideality, such as PCM programming error or IR-drop, we vary its strength and evaluate the resulting inference accuracy across DNNs using our base HWA-trained model. Our standard AIMC MVM model exhibits ε*_M ≈ 15% (see Fig. 3 and Eq. 20), but combines many nonidealities. To estimate the relative accuracy impact due to each individual nonideality, we boost only that parameter value until the MVM error increases to ε*_M = 20%, and then re-measure DNN accuracy. Even at constant MVM error, each parameter changes a different aspect of the AIMC compute. For instance, output noise is applied at each MVM, whereas PCM programming errors are only applied during programming and then persist throughout inference. Other nonidealities, such as IR-drop or ADC "S-shaped" nonlinearity, change the shape of the MVM deviations (Fig. 4A), causing large outputs to incur very significant MVM error. As a result, even at an identical average MVM error of ε*_M = 20%, the impact on DNN accuracy can be much more pronounced. Such nonidealities are particularly detrimental for DNN inference, and thus deserve additional attention in future hardware designs or HWA training methods.
To gauge the relative impact of each individually-boosted nonideality parameter, Fig. 4B shows the loss in normalized accuracy A^{1h}, defined not with respect to the FP32 model error (as A^{1h}_* of Eq. 1 is), but with respect to our standard AIMC crossbar-model (at 1 hour drift). A value of 0% means that boosting this particular nonideality has no impact on accuracy, as compared to our standard AIMC crossbar-model. A value of 100% means that simply boosting this nonideality to the same MVM error of ε*_M = 20% has degraded DNN accuracy to the level of random guessing.
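The per-nonideality boosting procedure above can be sketched as a one-dimensional search for the boost factor that brings the measured MVM error up to the 20% target. Here `mvm_err_fn` is a hypothetical callable standing in for a full crossbar simulation, assumed to be monotonically increasing in the boost factor; all names and ranges are our own.

```python
def boost_to_target(mvm_err_fn, target=0.20, lo=1.0, hi=100.0, iters=40):
    """Bisection search for the boost factor of a single nonideality
    parameter such that the average MVM error reaches the target
    (e.g. 20%).  Assumes mvm_err_fn is monotone and that the target is
    bracketed by [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mvm_err_fn(mid) < target:
            lo = mid          # error still too small: boost further
        else:
            hi = mid          # error too large: back off
    return 0.5 * (lo + hi)
```

The DNN accuracy is then re-measured once with the returned boost factor applied to that single nonideality.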
Clearly, DNN accuracy reacts very differently to individual nonidealities. We observe that nonidealities that effectively add noise to the inputs or outputs - such as ADC and DAC resolution, additive output noise, and the S-shaped nonlinearity of the ADC - have the largest impact on DNN accuracy, as normalized to impact on average MVM error. CNNs are the most sensitive DNN topology, while RNNs are the least sensitive (in particular the PTB-LSTM network).
Nonidealities that mostly affect weight-precision (all other nonidealities listed in Fig. 4B) have a much less severe impact on DNN accuracy. In contrast to additive output noise, such weight-related nonidealities all scale with the input norm, and thus vanish when no inputs are applied. Since it arises from large currents, IR-drop becomes negligible when either inputs or weights are reduced (in either amplitude or frequency of occurrence). Such weight-related nonidealities impact CNNs slightly more than RNNs or transformers. In particular, DenseNet-121, with its small kernel matrices and high tile re-use factor, seems the most affected by weight disturbances. Fig. 4 shows it is not enough to focus only on weight-related nonidealities, as most previous studies have done, when investigating AIMC.
We use this sensitivity analysis to assess additional device properties that our standard AIMC crossbar-model assumes to be ideal. For instance, imperfect device yield - where some fraction of the weight conductances are "stuck" either at zero (PCM reset), at ĝ_max (PCM set), or at some intermediate random value - is shown to have the same modest effect on DNN accuracy as other weight-related parameters. Weight asymmetry - a systematic difference in effective weight for positive versus negative inputs, such that −w(−|x|) ≠ w(|x|) - is shown to have only modest impact on DNN accuracy. Interestingly, RNNs and transformers are the models impacted by such polarity-dependent device response, since the ReLU activations used in CNNs cannot create negative inputs. Finally, systematic PCM programming errors - applied once to the conductance values and then remaining constant through repeated MVMs - are shown to have a slightly larger effect than the cycle-to-cycle short-term PCM read-noise that is redrawn for every MVM.
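A device-yield experiment of this kind can be sketched by randomly "sticking" a fraction of the conductances at zero (reset) or at ĝ_max (set); the default fractions and function name below are illustrative placeholders, not hardware-derived values.

```python
import numpy as np

def apply_stuck_devices(G, frac_reset=0.003, frac_set=0.0002,
                        g_max=25.0, rng=None):
    """Simulate imperfect yield on a conductance matrix G: a random
    fraction of devices is stuck at zero (reset state), and another
    fraction is stuck at g_max (set state).  Returns a new matrix;
    fractions and g_max are illustrative assumptions."""
    rng = rng or np.random.default_rng(0)
    u = rng.random(G.shape)
    G = np.where(u < frac_reset, 0.0, G)          # failed-at-zero devices
    G = np.where(u > 1.0 - frac_set, g_max, G)    # failed-at-g_max devices
    return G
```

Evaluating DNN accuracy on such perturbed conductance matrices quantifies how many failed devices a given model can tolerate.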

AIMC robustness of DNN topologies
To extract the specific sensitivities of each individual DNN, we find the threshold value x* at which each nonideality degrades accuracy to A^{1h}(x) = 99%, with respect to the standard AIMC crossbar-model. From scans of A^{1h} as each nonideality is increased (Fig. 5A), we use linear interpolation to identify x* from the intersection with the dotted line at A^{1h} = 99%.
The grid in Fig. 5B shows this threshold value x*, for each nonideality and each DNN. For example, considering just total PCM noise, even small increases beyond the current hardware-calibrated values markedly degrade ResNet-18 (x* = 1.2× for A^{1h} = 99%), while LSTM-PTB is not affected until this particular nonideality is significantly larger (x* = 3×). The colors ranging from red to green in Fig. 5 illustrate the relative sensitivity among the DNNs, obtained by scaling x* linearly between the minimal and maximal values across the 11 DNNs. For many of these nonidealities, yet again RNNs tend to be the most robust, followed by small CNNs on the CIFAR dataset.
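The linear interpolation used to extract x* can be sketched as follows; the accuracy scan in the usage example is a toy curve, not actual data from Fig. 5.

```python
import numpy as np

def threshold_strength(strengths, accuracies, target=99.0):
    """Interpolate the nonideality strength x* at which a (monotonically
    decreasing) normalized-accuracy scan crosses the target value.
    np.interp needs increasing x-coordinates, so we interpolate the
    reversed accuracy -> strength relation."""
    return float(np.interp(target, accuracies[::-1], strengths[::-1]))

# Toy scan: accuracy falls below 99% between boost factors 2x and 3x.
xs = np.array([1.0, 2.0, 3.0, 4.0])
accs = np.array([100.0, 99.5, 98.5, 96.0])
x_star = threshold_strength(xs, accs, target=99.0)   # midway: 2.5
```

Repeating this for every (DNN, nonideality) pair yields the grid of thresholds shown in Fig. 5B.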
Some nonideality parameters can be increased quite dramatically with respect to our standard AIMC crossbar-model baseline. For instance, DAC precision can be lowered from 8-bit to 6-bit without any retraining, with little accuracy impact across all DNNs - this could produce considerable energy savings and throughput improvements for AIMC designs. Also, IR-drop can be increased beyond the baseline before becoming problematic, and short-term weight noise could be up to 3× larger, similarly informing future AIMC designs, both with and without PCM devices. While direct examination of Fig. 5 might suggest that IR-drop could be increased by 15× without issue, note that the assumptions inherent in our IR-drop calculations, concerning average rather than instantaneous currents, imply a smaller safety margin of perhaps 3× (see Methods).
We also estimated the effect of imperfect PCM device yield. Even the least robust model can tolerate 0.27% failed-at-zero devices (stuck in the reset state, at random locations), rising to 3-4% for some of the RNNs. However, DNN accuracies are more sensitive to devices stuck either at random intermediate conductance values or at ĝ_max (in the set state). As few as 0.02% such failed devices would already cause a noticeable accuracy drop in some large CNNs. However, our analysis assumes only one pair of conductances per weight - since many existing AIMC designs provide multiple pairs of PCM devices per weight [39,57], such additional redundancy can potentially relax these stringent device-yield requirements.

Impact of weight distributions on AIMC MVM fidelity
The MVM error of each AIMC tile is affected by the shape of the weight distribution in interesting ways. While weight-clipping might seem disadvantageous, directly programming a very "long-tailed" weight distribution by mapping its largest outlying weight-value to ĝ_max can cause even larger problems. Such mappings tend to produce low average output currents that fail to exploit the available ADC range, leading to larger MVM errors due to ADC quantization, output noise, and other nonidealities that remain independent of the reduced output signal-levels.
To show this effect, we calculate the MVM error for different arbitrarily-constructed weight-distribution shapes, obtained by sampling the generalized normal distribution (Eq. 2), p(x) ∝ exp(−(|x − µ|/α)^β), where we use α = 1 and µ = 0. As β increases, this distribution becomes more compact, moving through the Laplace (β = 1) and normal (β = 2) distributions along the way (see red curves above Fig. 6A). Fig. 6A shows the MVM error ε_M at 1 hour drift, for weight-values sampled from Eq. 2 as β increases from long-tailed (β ≤ 1) to compact (high β) weight distributions. Here we map weights directly to conductance values, with the maximum weight assigned to ĝ_max; inputs are uniformly distributed between (−1, 1). MVM error increases rapidly for longer-tailed distributions (β ≤ 1).
One simple measure of a distribution's shape is the kurtosis, obtained by dividing the fourth central moment of the distribution, ⟨(x − µ)^4⟩, by its variance squared, ⟨(x − µ)^2⟩^2. In the plots and the remainder of this section we use the excess kurtosis - defined as the kurtosis minus 3, so that its value is 0 for normal distributions. Since kurtosis increases for long-tailed distributions, we find that lower kurtosis - and thus a more compact weight distribution - means lower MVM error (Fig. 6B). Fortunately, our HWA training and conductance-mapping approach tends to inherently produce more compact conductance distributions, for several reasons. First, the individual digital scales γ_i available for each MVM output (see Eq. 3) are initialized to scale conductances by the absolute maximal value of each weight-matrix column, rather than by the overall maximum across the entire weight matrix. With each column individually scaled, the overall conductance distribution becomes more compact than the original weight distribution. During HWA training, these digital scales are optimized - which may lead the system to clip some output columns - and any large weight deviations and outliers created during training are also clipped. Finally, since AIMC nonidealities cause large weights and outputs to increase the errors that SGD is attempting to correct, HWA training should be expected to drive towards more compact weight distributions during retraining.
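The sampling and kurtosis computations described above can be sketched as follows; the Gamma-transform sampler (using |x/α|^β ~ Gamma(1/β)) and the fixed seed are our own implementation choices.

```python
import numpy as np

def sample_gen_normal(beta, n, alpha=1.0, rng=None):
    """Draw n samples from the zero-mean generalized normal distribution
    p(x) ~ exp(-(|x|/alpha)**beta), via the standard Gamma transform:
    |x/alpha|**beta is Gamma(1/beta)-distributed, with a random sign."""
    rng = rng or np.random.default_rng(0)
    mag = alpha * rng.gamma(1.0 / beta, 1.0, n) ** (1.0 / beta)
    return mag * rng.choice([-1.0, 1.0], n)

def excess_kurtosis(x):
    """Fourth central moment over squared variance, minus 3 (so that
    the normal distribution scores 0 and the Laplace scores 3)."""
    d = x - x.mean()
    return np.mean(d**4) / np.mean(d**2) ** 2 - 3.0

laplace_like = sample_gen_normal(beta=1.0, n=400_000)   # long-tailed
normal_like = sample_gen_normal(beta=2.0, n=400_000)    # compact
```

The long-tailed β = 1 samples show markedly higher excess kurtosis than the β = 2 samples, mirroring the trend in Fig. 6B.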
Indeed, we find that our HWA training and mapping scheme greatly increases the compactness of the conductance distributions for each layer, as indicated by the kurtosis values shown for our 11 DNN models in Fig. 6C. Hashed bars show kurtosis for direct mapping of the FP32 model without HWA training, using a single global digital scale-factor per layer. Solid bars illustrate that our columnwise-scaled and HWA-trained models get mapped into conductance distributions that are significantly more compact, which helps reduce both MVM and DNN error.

Figure 7 (caption excerpt): (B) Normalized accuracy A^{1h}_* as the most critical layers in these four CNNs are exempted from PCM noise, plotted as the fraction of noise-exempt layers is increased, in order from most-sensitive to least-sensitive. (C) Corresponding cumulative fraction of weight-parameters that are PCM-noise-exempted. For ResNet-50 and ResNet-18, reducing PCM noise in just a few layers (dotted vertical lines) allows an AIMC crossbar-model to achieve iso-accuracy (dashed horizontal line). For DenseNet-121, more than half of the network parameters need to be exempted, while for WideResNet, even the non-PCM nonidealities would need to be improved.

Improving AIMC fidelity of selected layers to reach iso-accuracy in large CNNs
Our results show that larger CNNs, particularly those trained on the ImageNet database, are the most challenging for AIMC. Even with HWA training, our standard AIMC crossbar-model cannot achieve iso-accuracy for these DNN models (Table 3). Clearly the fidelity of the MVMs must be further improved, either through better materials or through hardware design choices. For instance, designers could dedicate multiple conductance pairs per weight [44] to reduce PCM programming errors, but at the cost of larger tile area and energy. Or designers could average the results from multiple passes through the tile to reduce the effects of cycle-to-cycle PCM read and additive output noise, but at significant cost to latency, throughput, and energy-efficiency. Given these unpleasant tradeoffs, such approaches should be used as sparingly as possible, ideally only on the small set of DNN layers that truly require the extra resources, which can then allow the entire model to achieve iso-accuracy. Thus we need to determine which layers in ImageNet CNNs are the most sensitive to AIMC nonidealities, and then assess whether improving just a small subset of these layers would have sufficient impact. To do this, we sequentially introduce AIMC nonidealities at each layer of the HWA-trained DNNs individually, while turning off all nonidealities in all other layers (using FP32 operations on their HWA-trained weight matrices). By repeating this process over the L layers with different overall PCM noise settings, we can determine the sensitivity and relative importance of single layers.
We first rank the layers according to accuracy impact for each DNN by exposing each layer to significant PCM noise with all other layers exempted from noise (Fig. 7A). Then, in order from most- to least-sensitive layer, we introduce this noise-exemption into multiple layers (Fig. 7B), causing the normalized accuracy at 1 hour drift, A_1h*, with respect to the FP32 model to increase as more and more model parameters are made noise-exempt (Fig. 7C). Eventually the 99% iso-accuracy target is achieved (dashed horizontal line) and then exceeded for most of these models. For Fig. 7A, the one layer being assessed sees 15× the usual PCM noise; for Fig. 7B, the layers not yet PCM-noise-exempted see our standard AIMC crossbar-model. While PCM-noise-exempt layers experience no long-term conductance noise, programming errors, or drift, they are still subject to the same cycle-to-cycle read noise, additive output noise, and DAC/ADC quantization choices in our standard AIMC crossbar-model.
For ResNet-18 and ResNet-50, we find that improving just a few layers can achieve iso-accuracy (A_1h* ≥ 99%, dashed line in Fig. 7B). This involves only 7% and 2% of the model parameters, respectively (Fig. 7C). Improving MVM fidelity for such a limited number of parameters should prove less costly than doing so across the full DNN. However, for the other two ImageNet DNNs, a considerably larger fraction of the model (> 60%) would need to be improved to reach iso-accuracy. These DNNs will therefore need further advances in either HWA training or the overall AIMC specifications in order to support AIMC deployment without a significant accuracy drop.
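The ranking-and-exemption procedure can be sketched as follows; all layer names, accuracies, and parameter counts here are illustrative placeholders, not measured values:

```python
# Hypothetical per-layer sensitivity results: normalized accuracy when ONLY
# that layer sees strong PCM noise (all other layers noise-exempt), plus the
# layer's parameter count. All numbers are illustrative placeholders.
layers = {
    "conv1":  {"acc_when_noisy": 0.901, "params": 9_408},
    "layer3": {"acc_when_noisy": 0.978, "params": 7_098_368},
    "layer4": {"acc_when_noisy": 0.952, "params": 14_964_736},
    "fc":     {"acc_when_noisy": 0.990, "params": 2_049_000},
}

# Rank most- to least-sensitive (lowest accuracy under noise first).
ranked = sorted(layers, key=lambda name: layers[name]["acc_when_noisy"])

# Exempt layers in that order and track the cumulative fraction of model
# parameters that must be made noise-exempt (cf. Fig. 7C).
total = sum(l["params"] for l in layers.values())
cumulative, running = [], 0
for name in ranked:
    running += layers[name]["params"]
    cumulative.append((name, running / total))
```

Plotting the achieved accuracy against this cumulative parameter fraction then reveals how cheaply iso-accuracy can be bought for a given model.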

Discussion
We have introduced a new approach for successfully deploying DNNs onto realistic AIMC inference hardware, at or near iso-accuracy. Our standard AIMC crossbar-model incorporates well-known but hardware-calibrated nonidealities caused by the analog devices, such as read noise, programming errors, and conductance drift. Going well beyond previous studies, our model also includes nonidealities due to MVM circuit integration, such as system noise, DAC and ADC quantization, and dynamically-computed IR-drop. Finally, our model fully addresses the fixed dynamic-range constraints on inputs, weights, and outputs found in all AIMC systems, but previously neglected.
While a few aspects of this study are not directly applicable to hardware designed around non-PCM devices, our standard AIMC crossbar-model and our carefully-designed inference protocols can readily serve as the basic core for studying such systems. The intuition we have developed in terms of how various types of noise affect different families of DNN models is also readily transferable. As such, the present work establishes a baseline that can both guide - and be compared against - future AIMC simulation studies. To make this even more straightforward, our standard AIMC crossbar-model has been incorporated into our open-source AIHWKit [66], which is based on the popular ML framework PyTorch [62].
Our AIMC crossbar-model is not a replacement for the detailed circuit simulations essential to hardware verification. We use many simplifications and abstractions of the various AIMC nonidealities, since our goal is quick and relatively realistic functional verification of larger DNN workloads. For instance, we assume noise sources are Gaussian, avoiding physically-modeled distributions that would be more accurate but significantly slower. We also devised a method to rapidly approximate IR-drop which adjusts dynamically with the input. We intentionally ignore static crossbar effects that would change the conductance value systematically [31,69], since read-write-verify conductance programming can readily adapt to such effects.
Some prior works proposed using on-chip or chip-in-the-loop training methods [36,83,31,88,85], which can greatly increase the attainable accuracy by addressing the specific fabrication variations found on that particular chip. However, we strongly believe that the time and cost of such individualized preparation is likely to be untenable for widespread deployment. Thus in this paper, we have focused on HWA training that is general enough to be performed once per model per AIMC chip-family, greatly simplifying the deployment onto individual chips. That said, our HWA training approach could readily be combined with more sophisticated online compensation methods, with on-chip or chip-in-the-loop training, or with more than one device pair used per weight, including optimization of how weights are assigned across these conductances [50].
Since HWA training is performed in software before deployment, it has no first-order impact on the latency, throughput, or energy-efficiency of AIMC hardware. However, as we have shown, HWA training is essential to understanding the tradeoffs between accuracy and these important system performance metrics. For instance, because of the sequential nature of the layers of a deep network, shallower but wider layers should generally be preferable for AIMC, since higher utilization of the large matrices stored on the crossbar arrays does not significantly change the runtime [25,65] and helps improve energy-efficiency. In terms of noise robustness, excessively deep DNNs have disadvantages. Among the ImageNet CNNs tested, DenseNet-121 showed the worst accuracy drop from its FP32 model, while WideResNet-50 offered the best raw test-error at 1 hour drift (23.82%, versus 24.82% for ResNet-50, Table 3). Together with information on latency, throughput, and energy-efficiency, this kind of information on available accuracy gain is critical when deciding which DNN model to deploy.
A few previous studies have attempted to improve the robustness of DNNs to nonidealities by noise-aware training, where multiplicative or additive Gaussian noise [38,36] is added to weights or activations during training. Similarly, other studies seeking to prevent overfitting or to enhance robustness to adversarial attacks have injected noise into standard floating-point training as a regularization technique [74,81,26,37,58,63,48]. While all these methods qualitatively increase the noise robustness of DNNs, the quantitative benefits on real AIMC can neither be accurately reported nor fully optimized by these studies. Since our HWA approach keeps weights mapped in conductance units, a variety of realistic hardware-relevant constraints can be incorporated in a straightforward manner. These include the complexities of PCM programming, as well as the shallow input-output ranges, IR-drop, and quantization affecting the MVM compute - aspects neglected in most previous studies.
We have tried distilling with the FP model as a teacher (similar to [90]) and found some benefits when HWA training time is limited. However, since the improvements offered by distilling disappeared at longer training times for most DNN models, we mostly report results without distilling. We did find that accuracy with distilling is significantly higher for the HMM Speech LSTM, and these results are shown in Table 3, implying that distilling can be helpful for some DNNs.
Rather than simple Gaussian weight noise [36], we use the expected weight-noise distribution characterized from PCM measurements [56]. We find that applying read noise on the weights - together with the correct incorporation of injected longer-term programming noise when modifying the weight matrix during the backward pass - is crucial for achieving AIMC robustness. One drawback of our approach is that this type of noise injection is currently applied only once per mini-batch, which reduces the effectiveness of the noise as batch-size increases. One possible improvement would be to sample the weight-noise sources multiple times per mini-batch. Such an extension of our methods should further improve the noise robustness of the HWA-trained DNNs.

Conclusion
In summary, we show that comprehensive hardware-aware (HWA) training can greatly enhance the robustness of a variety of deep neural networks (DNNs) - including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers - to the unavoidable device and circuit nonidealities of emerging analog in-memory computing (AIMC) accelerators. In five of the 11 models studied, the techniques we introduce lead to software-equivalent accuracy, defined as 99% of the accuracy-performance offered by the original DNN model beyond random guessing. Across all models, HWA training reduces the worst-case gap in raw model accuracy from 21.81% down to just 2.65%.
Through a systematic sensitivity analysis, we identify the nonidealities that are most critical for maintaining accuracy in future system designs. For instance, we observe that nonidealities that effectively add noise to the inputs or outputs - such as ADC and DAC resolution, additive output noise, and S-shaped non-linearity of the ADC - have the largest impact on DNN accuracy. We also show that certain DNN topologies, such as RNNs or shallower CNNs, can tolerate more AIMC nonidealities than others.
By making this standard AIMC crossbar-model available in the open-source AIHWKit [66], we make it possible for future advances in HWA training techniques to be readily compared to these results. By pinpointing the measures needed to compensate for imperfect AIMC hardware, the tools introduced here enable better understanding and optimization of the tradeoffs between model accuracy and desirable performance characteristics such as latency, throughput, and energy-efficiency. Continued coordination between HWA training and architectural assessments may even lead to brand-new DNN topologies, specifically designed to maximize the benefits of AIMC hardware: accurate inference at high speed and low power.

Affine transform in tile periphery
We assume that each output column of the analog crossbar has a floating-point scale γ_i and offset β_i available, which together implement an affine transformation. We assume that conductances can be linearly mapped to weight values, so that we can normalize the analog weight values from −1 to 1, corresponding to −ĝ_max, …, ĝ_max (see conductance programming below). This affine transform then maps the column's physical output (e.g. current), as quantized by an ADC into integers within a certain range, to the value expected by the DNN for the next layer (e.g. activation). Note that such ADC conversion using a scale and bias per column is already available in prototypes [39] but has not previously been incorporated into studies on HWA training.
This digital periphery of an analog MVM can thus be summarized as

y_i = γ_i quant_q^{b_out}( F_i( quant_q^1( x / α ) ) ) + β_i,   (3)

where F describes the analog aspects of the AIMC MVM (see "Analog MVM model"), and quant_q^b(z) describes linear quantization of clip_{−b}^{b}(z) into 2^q − 1 values in −b, …, b centered around 0. One bin is discarded to force an odd total number of bins, symmetric around zero. Here clip_a^b(z) constrains z between minimum a and maximum b, and α is a scalar, per-crossbar value which determines the usable input range. This can either be a learned parameter which is then held fixed during inference (static input range), or can depend dynamically on the current input vector x (dynamic input range). While our main results assume a static input range, we examine performance improvements for the dynamic option (Supplemental Material, Sec. A.1). The scales γ_i determine the mapping of conductances to weight values, individually for each crossbar column i. During HWA training we allow SGD to optimize this parameter, starting from values initialized as described below. β_i is used to implement the bias of the MVM, which we implement in digital (FP) precision here. We assume q = 8-bit quantization, and investigate lower precision as part of our sensitivity analysis.
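A sketch of this digital periphery, under the quantization convention described above (symmetric linear quantization with 2^q − 1 levels, one bin discarded); the function names `quant` and `periphery` are our own illustrative choices:

```python
import numpy as np

def quant(z, q=8, b=1.0):
    """Symmetric linear quantization of z into 2**q - 1 levels in [-b, b]
    (one bin dropped so the level count is odd and zero is representable)."""
    levels = 2 ** (q - 1) - 1           # 127 bins on each side for q = 8
    z = np.clip(z, -b, b)
    return np.round(z / b * levels) / levels * b

def periphery(x, F, alpha, gamma, beta, q=8, b_out=10.0):
    """Digital affine transform around an analog MVM F (sketch of Eq. 3):
    scale and quantize the input, run the analog MVM, quantize the output,
    then apply the per-column scale gamma_i and offset beta_i."""
    x_q = quant(x / alpha, q=q, b=1.0)
    y_q = quant(F(x_q), q=q, b=b_out)
    return gamma * y_q + beta
```

With an identity weight matrix and unit scales, the round-trip reproduces the input to within the quantization bin widths.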

Dynamic MVM range
A critical feature of our crossbar-model is that it fully encompasses the finite dynamic-range constraints on inputs, weights, and outputs that will be present and unavoidable in any real AIMC implementation. Since both inputs and weights are normalized within −1, …, 1 (in analog units), our output-bound setting of b_out = 10 means that just 10 fully-on inputs, applied to rows containing maximal-value weights, would fully saturate the output. This is a conservative choice that works for modest-size crossbars and for our assumption that positive current contributions (produced by weight and activation pairs of the same sign) and negative contributions (weights and activations of opposite signs) cancel within the array. This mode is energy-efficient and minimizes IR-drops, but requires the ADC to be capable of measuring bipolar currents [39]. If the crossbar is made much larger, or the positive and negative terms are integrated separately, this may increase energy usage and exacerbate IR-drops, but simplify the ADC design. Furthermore, such choices will likely alter the overall dynamic-range limitations, calling for a re-optimization of b_out.

Analog MVM model
Our basic model is illustrated in Fig. 3A. The analog MVM ỹ = F(x̃) in Eq. 3, for the quantized, clipped, and scaled input vector x̃ ≡ quant_q^1(x/α), takes the following general form:

ỹ_i = f_i^NL( Σ_j ( w̃_ij(t) + σ_w(w̃_ij) ξ_ij ) x̃_j ) + σ_out ξ_i,   (6)

where the analog weights w̃_ij(t) represent normalized conductances with programming errors, drift, and long-term noise up to time t_eval applied (see "Weight programming" below). We include point-wise non-linear functions f_i^NL(x) to support special cases such as ADC nonlinearities; in our standard model, f_i^NL(x) ≡ x. Normal random numbers (ξ_i, ξ_ij ∼ N(0, 1)) are drawn for each MVM, representing additive output noise with standard deviation σ_out = 0.04, and short-term weight noise σ_w(w̃) that depends on the current weight values (see "Short-term PCM read noise"), respectively. Since the analog output values running from −10, …, 10 get quantized into digital values from −127, …, 127 (8-bit), this choice of σ_out = 0.04 corresponds to almost exactly half of one ADC quantization bin.
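The noisy analog MVM of Eq. 6 can be sketched as follows (our own simplified implementation: no drift, IR-drop, or output quantization, and a weight-proportional short-term noise as derived later in "Short-term PCM read noise"):

```python
import numpy as np

rng = np.random.default_rng(0)

def analog_mvm(w_tilde, x_tilde, sigma_out=0.04, sigma_w0=0.0175,
               f_nl=lambda y: y):
    """One noisy analog MVM, a simplified sketch of Eq. 6. Fresh Gaussian
    noise is drawn per call: per-weight short-term read noise with std
    proportional to |w|, then additive output noise of std sigma_out."""
    noise_w = sigma_w0 * np.abs(w_tilde) * rng.standard_normal(w_tilde.shape)
    y = f_nl((w_tilde + noise_w) @ x_tilde)
    return y + sigma_out * rng.standard_normal(y.shape)
```

Because the noise is redrawn per call, two evaluations of the same input give slightly different outputs, mimicking the cycle-to-cycle variations of Fig. 3B.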

Weight programming
We adopt a previously-described and -characterized weight-programming and drift model for PCM devices [56]. We assume that the crossbar provides one pair of conductances per weight, where the first (second) member of the device pair is programmed to a conductance between reset (0) and set (ĝ_max) to handle positive (negative) weights, with the non-active conductance programmed to reset. Only the active conductance is considered in our model. Although recent prototypes support two pairs per weight [57,39], having only one conductance pair increases the weight density and thus compute efficiency, and poses a more difficult challenge in terms of accuracy and yield.
Each column w_i of each weight matrix is mapped to a column of target conductances ĝ_i. We first initialize each affine scale coefficient using the maximum weight found in that column, γ_i = max_j |w_ij|. This allows each weight to be mapped to a scaled target conductance, ĝ_ij = ĝ_max w_ij / γ_i. In our HWA training approach (described below), after this initialization of target conductances and affine scales based on the FP32 model weights, we then use SGD to further optimize both the mapped target conductances and the scales γ_i separately. Table 3 uses this learned weight-to-conductance mapping when evaluating AIMC inference performance.
In a real AIMC system, a positive ĝ value gets programmed onto a different physical device than if that particular ĝ had been negative. We here assume that only one of the two devices is programmed to the target conductance, whereas the other device is always at reset conductance (ĝ_ij = 0). In this case, one can simplify and compute the MVM directly with signed conductances, as done in our model. The programmed conductances g^P_ij differ from the desired target values ĝ_ij as g^P_ij = ĝ_ij + σ_P(ĝ_ij) ξ due to programming noise, assumed to be Gaussian (ξ ∼ N(0, 1)). In turn, the standard deviation of this programming noise depends on the target conductance as

σ_P(ĝ) = c_0 + c_1 (ĝ/ĝ_max) + c_2 (ĝ/ĝ_max)^2,   (7)

where c_0 = 0.26348 µS, c_1 = 1.9650 µS, and c_2 = −1.1731 µS, as obtained by fitting to extensive PCM hardware data [56].
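A sketch of the columnwise conductance mapping and programming-error injection, assuming a quadratic-in-normalized-conductance reading of the fit above and the ĝ_max = 5 µS set conductance quoted in the IR-drop section (function names ours):

```python
import numpy as np

G_MAX = 5.0                              # set conductance in uS (value from the IR-drop section)
C0, C1, C2 = 0.26348, 1.9650, -1.1731    # programming-noise fit coefficients, in uS

def sigma_prog(g_target):
    """Std of the Gaussian programming error as a quadratic function of the
    normalized target-conductance magnitude (our reading of Eq. 7)."""
    r = np.abs(g_target) / G_MAX
    return C0 + C1 * r + C2 * r ** 2

def program_weights(w_col, rng):
    """Map one weight column to signed target conductances using the
    columnwise scale gamma_i = max|w|, then add programming noise."""
    gamma = np.max(np.abs(w_col))
    g_target = G_MAX * w_col / gamma
    g_prog = g_target + sigma_prog(g_target) * rng.standard_normal(w_col.shape)
    return g_prog, gamma
```

Note that the error std is largest near full-set conductance (about 1.06 µS at ĝ = ĝ_max) and falls back to c_0 at reset.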

Weight drift and read noise
Once a PCM device is programmed, it exhibits both conductance drift and 1/f (long-term) read noise. As briefly described below, both are modeled in a statistical manner based on measurements of doped-Ge2Sb2Te5 (d-GST) mushroom PCMs from a large device array integrated in 90 nm CMOS technology [56].
PCM drift PCM conductance drift, attributed to post-programming structural relaxation, follows the empirical relation

g_D(t_eval) = g_P (t_eval / t_0)^{−ν},   (8)

where g_D(t_eval) is the conductance measured at time t_eval after programming (assumed to complete at t_0 = 20 s [55]) and ν is the drift coefficient. The drift coefficients for each device are assumed to be normally distributed, that is, ν ∼ N(µ_ν, σ_ν), where the mean and standard deviation are empirically determined by fitting to experimental data. Each of µ_ν and σ_ν is expressed by a clipped linear function in log-space (using the clip function of Eq. 5),

y(x) = clip_{y_min}^{y_max}( a log x + b ),

where x ≡ ĝ/ĝ_max. The parameters for µ_ν are a = −0.0155, b = 0.0244, y_min = 0.049, and y_max = 0.1. For σ_ν the parameters are a = −0.0125, b = −0.0059, y_min = 0.008, and y_max = 0.045. The drift coefficient ν_ij thus determined for each device is used to model the conductance at any time t_eval using Eq. 8.
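A sketch of the drift model under the stated fits; the log base in the clipped linear fit is an assumption (natural log here), and the function names are ours:

```python
import numpy as np

T0 = 20.0  # programming assumed complete at t0 = 20 s

def clipped_linear(x, a, b, y_min, y_max):
    """Clipped linear function in log-space used for the drift-statistics
    fits (log base assumed natural here)."""
    return float(np.clip(a * np.log(x) + b, y_min, y_max))

def sample_nu(g_ratio, rng):
    """Draw a per-device drift coefficient nu ~ N(mu_nu, sigma_nu),
    where g_ratio = g_hat / g_hat_max."""
    mu = clipped_linear(g_ratio, -0.0155, 0.0244, 0.049, 0.1)
    sd = clipped_linear(g_ratio, -0.0125, -0.0059, 0.008, 0.045)
    return rng.normal(mu, sd)

def drift(g_prog, nu, t_eval):
    """Power-law conductance drift (Eq. 8): g_D(t) = g_P * (t / t0)**(-nu)."""
    return g_prog * (t_eval / T0) ** (-nu)
```

At t_eval = t_0 the programmed conductance is returned unchanged; it then decays monotonically over time.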
PCM read noise PCM is also known to exhibit low-frequency noise such as random telegraph noise (RTN) and 1/f^γ noise with γ ∈ [0.9, 1.1]. We follow the empirical noise model of [56], which assumes γ = 1 and arrives at a read-noise standard deviation accumulated over time t of ([56])

σ_read(t) = ĝ Q_s(ĝ) sqrt( ln( (t + t_r) / (2 t_r) ) ),   (10)

where Q_s(ĝ) is measured to be

Q_s(ĝ) = min( c_1 (ĝ/ĝ_max)^{c_2}, c_3 ),   (11)

with c_1 = 0.0088, c_2 = −0.65, and c_3 = 0.2. This read noise, evaluated at t = t_eval, is added to the post-drift conductance g_D(t_eval) to arrive at the final PCM conductance

g_ij(t_eval) = clip_{ĝ_min}^{∞}( g_D(t_eval) + σ_read(t_eval) ξ ),   (12)

where we set ĝ_min = 0 here and ξ ∼ N(0, 1). The weight values w̃_ij of the crossbar array for Eq. 6 are then obtained by scaling and combining positive and negative parts,

w̃_ij = ( g_ij / ĝ_max ) sgn(w_ij).   (13)

These long-term PCM effects are applied to all weights prior to the evaluation at time t_eval, and the weights are subsequently fixed during the evaluation of the test set. Short-term weight noise, redrawn for each MVM, is included separately in Eq. 6 as described in the following paragraph.
Short-term PCM read noise When evaluating the AIMC DNN at a time t_eval, the analog weights W̃ are established as described in Eq. 13. However, weights are often re-used multiple times for a single input, e.g. across image pixels in a CNN or across sequence tokens in an RNN or transformer model. Here, short-term weight noise can cause small but perceptible cycle-to-cycle variations (Fig. 3B).
Modifying the weight matrix at each MVM would be highly inefficient for our HWA training software running on GPUs. To efficiently model such short-term read noise, we use the read-noise definition Eq. 10 to set σ_w in Eq. 6, but refer the resulting noise to the output ỹ_i. Assuming zero-mean independent normal distributions, we can sum the variances as

(σ̃^w_i)^2 = Σ_j σ_w^2(w̃_ij) x̃_j^2,   (14)

implying that the weight dependence of the read noise can be approximated as σ_w(w̃) ∝ |w̃|. Thus the weight noise σ_w in Eq. 6 effectively adds ξ_i σ̃^w_i (with ξ_i ∼ N(0, 1)) to the analog output ỹ_i. The parameter σ_w^0 can be identified with c_1 sqrt( ln( (∆t + t_r) / (2 t_r) ) ) for read noise accumulated over a time period ∆t (Eq. 10, [56]). Assuming a read duration of t_r = 250 ns and an approximate waiting time between two consecutive MVMs (∆t) 100× longer, we find σ_w^0 ≈ 0.0175.
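The quoted σ_w^0 ≈ 0.0175 can be reproduced from the accumulated-read-noise expression, assuming the square-root-of-log form of Eq. 10:

```python
import math

c1 = 0.0088          # read-noise fit coefficient from [56]
t_r = 250e-9         # read duration (s)
dt = 100 * t_r       # assumed waiting time between consecutive MVMs

# sigma_w0 = c1 * sqrt(ln((dt + t_r) / (2 t_r))), cf. Eq. 10
sigma_w0 = c1 * math.sqrt(math.log((dt + t_r) / (2 * t_r)))
```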

Drift compensation
For evaluation times t_eval long after NVM programming, the conductance drift (Eq. 8) can be compensated in the digital domain without any expensive re-programming [46,4]. This can be done by running a number of analog MVMs on some known test inputs {x_k} immediately after weight programming and recording the overall output magnitude s_ref = Σ_k Σ_i |ỹ_i(x_k)|. At time t_eval, just before beginning inference, the same inputs can be applied to measure s_t_eval. We then correct the MVM outputs by adjusting the digital γ_i (see Eq. 3) by the ratio s_ref / s_t_eval to accommodate the average conductance decrease due to drift. We assume one global drift compensation applied to all columns, although this could be done individually for each column if the per-column magnitudes s_ref,i can be measured sufficiently accurately. Other, more sophisticated drift compensation and adaptive refresh methods, including in-memory re-training, could potentially be applied as well (e.g. [36]).
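A sketch of the global drift-compensation factor; `mvm_ref` and `mvm_now` stand for the same crossbar read out at programming time and at t_eval, respectively (names ours):

```python
import numpy as np

def output_magnitude(mvm, test_inputs):
    """Summed |output| over a set of known test vectors."""
    return sum(np.abs(mvm(x)).sum() for x in test_inputs)

def drift_compensation_factor(mvm_ref, mvm_now, test_inputs):
    """Global scale s_ref / s_t applied to the digital gammas so that the
    average conductance decay due to drift is compensated."""
    s_ref = output_magnitude(mvm_ref, test_inputs)
    s_now = output_magnitude(mvm_now, test_inputs)
    return s_ref / s_now
```

If drift has uniformly reduced all outputs to 80% of their reference magnitude, the compensation factor is exactly 1/0.8 = 1.25.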

Crossbar tile size
The NVM crossbars available on an AIMC chip are of finite size, typically ranging from 256 × 256 (e.g. [39]) to 512 × 512 (e.g. [57]). We assume a tile size of 512 × 512, and assume that enough crossbars are available to support separate crossbars for each weight matrix. Any weight matrix with input dimension > 512 is divided into roughly equal parts for programming onto as many tiles as necessary. Partially-used tiles have their weights situated at the bottom of the crossbar, to minimize interference and potential IR-drop, and unused inputs are clamped to zero.
Each tile computes an MVM (Eq. 6) using its own periphery (Eq. 3). Inter-tile summation is performed at FP precision (FP16), after affine scaling but before being passed to subsequent digital compute such as activation functions. Because our AIMC nonidealities have no dependencies across output columns, the HWA training code does not need to explicitly break the compute along the output dimension into tile-sized chunks. This helps the simulations run more efficiently on GPUs.
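The input-dimension tiling and digital inter-tile summation can be sketched as follows (ideal per-tile MVMs for brevity; function names ours):

```python
import numpy as np

TILE = 512

def split_for_tiles(W):
    """Split a weight matrix along its input dimension into roughly equal
    chunks that each fit a 512-row crossbar."""
    n_in = W.shape[1]
    n_tiles = -(-n_in // TILE)                 # ceiling division
    return np.array_split(W, n_tiles, axis=1)

def tiled_mvm(tiles, x):
    """Per-tile MVMs (ideal here, analog in the real model) followed by
    digital inter-tile summation at FP precision."""
    xs = np.array_split(x, len(tiles))
    return sum(Wt @ xt for Wt, xt in zip(tiles, xs))
```

For an ideal (noise-free) tile model the tiled result matches the monolithic MVM, so the splitting itself introduces no error.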

IR-drop
Ideally, the voltage along each long bitline in the crossbar would remain constant, so that conductances with the same value would contribute the same current, whether in the farthest or nearest row from where the peripheral circuitry holds the bitline voltage and measures currents. In a physical crossbar, however, IR-drops imposed by finite wire resistance cause the bitline voltage to vary [13], especially as instantaneous currents get large. To keep the simulation time reasonable, we make a number of approximations when modeling this effect. IR-drop is modeled independently for each crossbar column, because any column-to-column differences will be implicitly corrected (to first order) when programming the weights with an appropriate read-write-verify scheme.
However, within each crossbar column, the current contributed by each weight depends on the local bitline voltage, which in turn depends on the other currents being generated elsewhere along the column by that particular input vector. This situation evolves throughout the integration period due to the pulse-length modulation of the inputs as well as any resulting transients, including the response of the peripheral circuit establishing the bitline voltage. Here, for simplicity and speed of computation for large DNNs, we only consider the average integration current.
The steady-state bitline voltages ṽ_i can be computed by solving the linear equation system (Eq. 15) that balances, at each crosspoint node, the current flowing along the bitline against the current injected through the devices, where g_w is the wire conductance between the crosspoint nodes and g_i^{+/−} is the weight programmed onto either the positive or negative conductance (with the other programmed into the reset condition, g = 0). The individual input voltages v_i^− and v_i^+ of the spatially-ordered inputs i are linearly prorated from the supply voltages (v_ref ± V_read) to represent the time-averaged current. The analog output current, collected at location i = 0, is given by g_w(ṽ_0 − v_ref), with V_read = 0.2 V.
This linear system (Eq. 15) can be solved by inverting the coefficient matrix produced by a given input vector. To speed up the simulation and avoid inverting a 512 × 512 matrix for each MVM, we further approximate the solution with a quadratic expression (Eq. 18), which computes the IR-drop correction directly from the normalized weights and inputs. Here γ is the unitless product of the wire resistance between adjacent crosspoints (assumed 0.35 Ω) and the maximal (set) conductance of the device (g_max = 5 µS), and n is the number of crosspoints occupied by the weight matrix. We assume that smaller weight matrices are located at the lower edge of the crossbar to avoid excess IR-drop. We use Eq. 18 to dynamically approximate the IR-drop across the 512 input channels in Eq. 6 when computing normalized MVM outputs ỹ in all our results. Multiplying these normalized outputs by g_max V_read produces the (time-averaged) physical output currents. To amplify these IR effects for the sensitivity analysis (Fig. 4), we simply multiply the IR-drop error ∆y_i^{IR-drop} by a varying scaling factor. For large inputs, where current flows throughout the integration window, our estimations using the time-averaged current are quite accurate. However, for small inputs, where much of the current flow occurs in a small portion of the integration window, instantaneous and average currents differ strongly, and the IR-drop will be underestimated. We find that for a normally-distributed weight matrix and random but correlated inputs (as in Fig. 3E), IR-drop deviations are underestimated by roughly a factor of 5. Unfortunately, similar conditions arise across many of our DNNs. Fortunately, our sensitivity analysis (Fig. 4) finds that scaling our time-averaged IR-drop approximation by a factor of > 10× does not significantly impact the accuracy of the DNNs, so we can still conclude that DNNs are reasonably robust to IR-drop, albeit by a modest rather than large safety margin. Since IR-drop depends heavily both on the hardware design (crossbar size, wire resistances, and absolute device conductances) and on the input and weight distributions, detailed circuit-based simulations using the intended workload(s) will remain a critical part of assessing new hardware designs.
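For intuition, the per-column nodal analysis can be written as a toy linear solve; this is our own simplified formulation (single column, signed device conductances merged, sense node held at 0 V, time-averaged currents), not the paper's exact Eq. 15:

```python
import numpy as np

def bitline_currents(g_dev, u_in, g_wire):
    """Solve steady-state node voltages along one bitline via nodal analysis
    (toy model): wire conductance g_wire between adjacent nodes, device
    conductances g_dev driven by input voltages u_in, and the sense node held
    at 0 V below node 0. Returns (actual, ideal) output current."""
    n = len(g_dev)
    A = np.zeros((n, n))
    b = -g_dev * u_in
    for i in range(n):
        A[i, i] -= g_dev[i] + g_wire   # device branch + link to node below
        if i > 0:
            A[i, i - 1] = g_wire
        if i < n - 1:
            A[i, i + 1] = g_wire
            A[i, i] -= g_wire          # link to node above
    v = np.linalg.solve(A, b)
    actual = g_wire * v[0]                 # current flowing into the sense node
    ideal = float(np.sum(g_dev * u_in))    # limit of zero wire resistance
    return actual, ideal
```

With 0.35 Ω segments and devices of at most 5 µS, the solved current falls measurably but not catastrophically below the ideal value, consistent with the modest IR-drop corrections discussed above.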

PCM device yield
Emerging memory devices such as PCM exhibit imperfect yield, and some fraction of the devices in a given crossbar array will simply not switch properly [40,14]. PCM devices can end up stuck-at-set (conductance ĝ_max), stuck-at-reset (conductance 0), or stuck-at-random (stuck somewhere between 0 and ĝ_max). In our sensitivity analysis (Fig. 4), we vary the fraction of failed devices and randomly select their locations.

S-shaped ADC output non-linearity
The output level might saturate more gradually than the desired linear response due to non-linearity in the ADC [80,39]. To estimate the impact of this effect for our sensitivity analysis (Fig. 4), we define f_i^NL in Eq. 6 as an S-shaped saturating function of the output.

PCM polarity
Depending on the hardware and unit-cell design, positive and negative inputs might not create perfectly symmetric read currents. The measured conductance of a PCM device can depend on whether the read current passes from the top to the bottom electrode, or vice versa. This read-polarity dependence can cause weights to appear systematically altered for negative inputs as compared to positive inputs. Although the average effect can be corrected by adjusting read voltages, device-to-device or conductance-dependent variations can remain. To model this effect in our sensitivity analysis, we separate positive and negative inputs into two phases (setting a negative input to 0 in the positive phase, and vice versa), and scale each weight in the negative phase by (1 + a_ij), where a_ij ∼ N(0, σ_a). We then vary this new nonideality parameter σ_a as the "weight asymmetry std."

MVM error calculation
To quantify the fidelity of the analog MVM, we calculate the expected deviation of the analog MVM from the ideal MVM as the MVM error ε_M, defined by the relative normalized deviation (see e.g. [8])

ε_M = ⟨ ‖ỹ_k − y_k‖ / ‖y_k‖ ⟩_k,   (20)

where y_k = W x_k is the ideal MVM output for input vector x_k using matrix W, and ỹ_k is the actual AIMC output considering all hardware-related nonidealities, as defined in Eq. 3.
The MVM error is obviously zero if the AIMC output is equal to the ideal outcome, but otherwise it depends on both the particular weight matrix W and the set of input vectors x_k used to estimate Eq. 20. To best reflect the impact of the nonidealities on the DNN, the inputs x_k should ideally be taken from the distribution of actual input activation vectors, and W should be the target weight matrix, for the specific DNN layer in question.
However, to quantify the MVM error independently of the DNN in question, we calculate the standard MVM error ε*_M using normally-distributed weights, w_ij ∼ N(0, 0.246), and uniform inputs, x_i ∼ U(−1, 1), with a tile size of 512 × 512. For our standard AIMC crossbar-model as described in 'AIMC standardized evaluation model', the standard MVM error is ε*_M = 15% (not considering drift).
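One way to compute the MVM error under the relative-deviation convention above (our reading of Eq. 20; averaging over input vectors):

```python
import numpy as np

def mvm_error(y_ideal, y_actual):
    """Relative normalized deviation between ideal and AIMC outputs,
    averaged over test inputs (rows index the input vectors k)."""
    num = np.linalg.norm(y_actual - y_ideal, axis=-1)
    den = np.linalg.norm(y_ideal, axis=-1)
    return float(np.mean(num / den))
```

For example, a uniform 10% scaling error on every output yields an MVM error of exactly 0.1.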

AIMC hardware-aware DNN training
Robustness to the nonidealities of AIMC inference hardware can be improved by hardware-aware (HWA) training - a DNN re-training method that applies the expected nonidealities in the forward pass of SGD, while the backward pass is performed at regular FP precision.
Our HWA training approach is to use the general form of the expected analog MVM nonidealities as described in Eq. 6, together with injection of the expected programming errors (but without any conductance drift). Further, we use the HWA training step to also establish the digital peripheral parameters of Eq. 3, in particular the static input range α (see 'Learning the input range') and the weight-to-conductance mapping γ_i (see 'Learning of weight-to-conductance conversion factors'). Additionally, we found that ramping up the injected programming-error strength (see 'Re-training with weight noise injection'), fixed scales and individual learning rates per tile (see 'Learning of weight-to-conductance conversion factors'), weight clipping (see 'Weight mapping and clipping'), and distilling (see 'Distilling with floating-point teacher') improved the robustness and achievable accuracy in the presence of AIMC nonidealities.
In general, HWA training starts from an already FP-trained DNN, and hyper-parameters (learning rate, injected noise strength) are optimized. We verified the effectiveness of our new HWA training approach on the very same DNNs used in a previous study [36] and found, on average, a > 10% decrease in AIMC test error at long t_eval times. This directly indicates the improvement of our approach over previous methods (see Table A.2).
In the following paragraphs, our new HWA training methods are presented in more detail.

Re-training with weight noise injection
Injecting noise to improve robustness to nonidealities has been suggested by a number of studies (e.g. [36,24,38]), and has become one of the hallmarks of HWA training for AIMC. In previous studies, noise has been injected in multiple ways, such as output [24,36], input [36], or weight noise [36,38], and different types of weight-noise distributions have been used, such as additive (scaled by the current maximal weight [36]) or multiplicative [38] Gaussian. The methods for injecting weight noise have also differed. For instance, Joshi et al. [36] added newly drawn Gaussian weight noise to the weight matrix reversibly for each image input (not mini-batch), only during the forward pass (and not during the backward pass, which was done with the actual weight matrix). However, it is more mathematically correct to also apply these same weight perturbations during the backward pass (but not to the reference weights to which updates are applied), as is commonly done for weight regularization techniques such as drop-connect [82]. Furthermore, although the exact noise-injection method (input, output, or weight noise) does not seem to matter much [36], generic additive Gaussian noise does not conform to the expected AIMC noise structure. For instance, PCM programming errors are actually conductance-value dependent, not simply additive.
Here, we improve on these earlier approaches in the following ways. First, rather than using just a generic noise term, we apply all expected nonidealities and hardware design choices (given by Eq. 6) during HWA retraining. This includes dynamic range limitations, system noise, and analog-digital conversions, all of which were previously ignored. Second, we inject weight noise in a mathematically consistent way in both forward and backward passes, redrawing from the random distributions once per mini-batch. Third, we draw the weight noise from the (scaled) expected programming-error distribution (see Eq. 7) instead of using generic additive or multiplicative Gaussian distributions. Finally, the scale of the injected weight noise is ramped up linearly over a number of epochs, which we found to improve HWA training. See Sec. A.2 (Supplemental Material) for the detailed hyper-parameters and noise settings used for each DNN.
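As an illustration, the per-mini-batch injection scheme can be sketched as follows. The conductance-dependent error model and its coefficients here are illustrative placeholders, not the calibrated PCM model of Eq. 7:

```python
import numpy as np

def programming_error_std(w_norm, scale):
    # Hypothetical conductance-dependent error model: the std grows with the
    # (normalized) weight magnitude, unlike generic additive Gaussian noise.
    # The coefficients 0.05 and 0.1 are illustrative assumptions.
    return scale * (0.05 + 0.1 * np.abs(w_norm))

def noisy_weights(w_ref, epoch, ramp_epochs, max_scale, rng):
    # The injected-noise scale is ramped up linearly over `ramp_epochs` epochs.
    scale = max_scale * min(1.0, epoch / ramp_epochs)
    noise = rng.normal(0.0, programming_error_std(w_ref, scale))
    # The same perturbed weights are used in BOTH the forward and backward
    # passes of a mini-batch; the SGD update is applied to the clean w_ref.
    return w_ref + noise

rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=(4, 4))
w_tilde = noisy_weights(w, epoch=2, ramp_epochs=10, max_scale=1.0, rng=rng)
```

In a real training loop, `noisy_weights` would be called once per mini-batch, and the resulting perturbed matrix reused for the backward pass of that same batch.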

Learning of weight-to-conductance conversion factors
To achieve a good weight-to-conductance conversion, we train the γ_i scale-factors in Eq. 3 using SGD. To improve the training of CNN models, it is beneficial to represent these scale-factors as γ_i = γ̃_i κ, where both the column-wise γ̃_i and the per-tile κ factors can be learned. Whether either factor is learned is treated as a hyper-parameter for a particular DNN. If not learned, γ_i is initialized by the weight mapping described below (see 'Weight mapping and clipping') and κ is set to 1.
For CNNs, where the matrix sizes vary widely, the learned values γ̃_i are uniquely scaled for each weight matrix by a fixed value c_aws, which re-scales the learning rates per tile so that the trained parameters all have similar magnitude ≈ 1. This auto-weight scaling factor is set to the value suggested by Xavier weight initialization [22,64], c_aws = √(3/n), where n is the input dimension of the weight matrix.
If κ is learned, we encourage the learning of larger outputs and weights by down-scaling the output range to [−1, 1], which typically improves the signal-to-noise ratio; thus κ = κ̃/b_out, where b_out is the fixed output bound of Eq. 3 and κ̃ is a per-tile learnable scalar initialized to b_out (and subject to weight decay).
Note that during inference evaluation, the digital periphery can simply apply one scale-factor per output-column, since the various scale-factors described above can be re-combined after the completion of HWA training.
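One plausible reading of this scale-factor decomposition, sketched in NumPy. The exact multiplicative composition of c_aws, γ̃_i, and κ̃ is our assumption based on the description above:

```python
import numpy as np

def init_tile_scales(n_in, n_out, b_out):
    # Auto-weight scaling factor, assumed here in the Xavier-style form
    # sqrt(3/n) with n the input dimension of the weight matrix.
    c_aws = np.sqrt(3.0 / n_in)
    gamma_tilde = np.ones(n_out)  # learnable column-wise factors, magnitude ~1
    kappa_tilde = b_out           # learnable per-tile scalar, initialized to b_out
    return c_aws, gamma_tilde, kappa_tilde

def combined_output_scales(c_aws, gamma_tilde, kappa_tilde, b_out):
    # gamma_i = gamma~_i * kappa with kappa = kappa~ / b_out (so kappa = 1
    # at initialization). After HWA training all factors collapse into a
    # single scale per output column, applied by the digital periphery.
    kappa = kappa_tilde / b_out
    return c_aws * gamma_tilde * kappa
```

At initialization the combined scale reduces to c_aws per column, and training then adjusts γ̃_i and κ̃ while their learning rates stay comparable across tiles of different sizes.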

Weight mapping and clipping
Since we use the output scales γ_i to keep the analog weights w̃_ij of Eq. 6 mapped in (normalized) conductance units (within −1, …, 1), the FP weights w_ij of the trained DNN need to be mapped to conductances before initiating HWA training. For that, we initially set w̃_ij ← w_ij/γ_i so that γ_i w̃_ij = w_ij. We keep training from creating excessively large analog weights by clipping w̃_ij to this same range after each update. In some cases (see Supplemental Information), we encourage the learning of larger analog weights to maintain signal-to-noise ratio by remapping weights according to Eq. 21 once every epoch.
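A minimal NumPy sketch of the mapping and clipping steps. The per-output abs-max choice for initializing γ_i is one plausible reading, and the periodic Eq. 21 remapping is not reproduced here:

```python
import numpy as np

def map_weights_to_conductances(W):
    # W has shape (d_out, d_in); gamma_i is one scale per output i.
    # Abs-max mapping so that w~_ij = w_ij / gamma_i lies in [-1, 1]
    # and gamma_i * w~_ij recovers w_ij exactly.
    gamma = np.abs(W).max(axis=1, keepdims=True)
    gamma = np.where(gamma > 0, gamma, 1.0)  # guard against all-zero rows
    return W / gamma, gamma

def clip_analog_weights(W_tilde):
    # Applied after each SGD update to keep the analog weights inside
    # the normalized conductance range.
    return np.clip(W_tilde, -1.0, 1.0)
```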

Learning the input range
The input range clipping bound c_input in Eq. 3 is learned during HWA training. To encourage a smaller clipping value (and thus a more compact input distribution), a decay is introduced. To augment the gradient update for the clipping bound, we scale the gradient updates by the current bound value. For small data sets (such as for transformer fine-tuning tasks), the HWA training is too short to learn the clipping bound from scratch. In such cases, we initialize c_input to the average absolute maximal value of the input vectors over a number of mini-batches before starting HWA training, subject to a cap (nominally max(c_input) = 10).
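The input-range learning rule can be sketched as follows. The straight-through-style gradient, the update scaling, and the decay form are assumptions patterned on the description above:

```python
import numpy as np

def clip_inputs(x, alpha):
    # Forward pass: inputs are statically clipped to the learned range.
    return np.clip(x, -alpha, alpha)

def update_input_range(x, alpha, grad_out, lr, decay=1e-3):
    # Only clipped entries contribute to the bound gradient (straight-through
    # estimate): +grad for x > alpha, -grad for x < -alpha, 0 otherwise.
    g = np.where(x > alpha, grad_out,
                 np.where(x < -alpha, -grad_out, 0.0)).sum()
    g = g * alpha       # scale the update by the current bound value (assumed)
    g += decay * alpha  # decay term encourages a more compact input range
    return alpha - lr * g

def init_input_range(batches, cap=10.0):
    # Initialization for short fine-tuning runs: average abs-max over a
    # number of mini-batches, capped at `cap` (nominally 10).
    return min(cap, float(np.mean([np.abs(b).max() for b in batches])))
```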

Distilling with floating-point teacher
If the model output dimension is large, as for the LSTM models with large vocabulary size, HWA training greatly benefits from distillation with the FP model. In knowledge distillation [29], an already trained "teacher" model augments the usual one-hot labels with expected class probabilities, which can drive a "student" model to a good solution more rapidly than training with the one-hot label vectors alone. We apply distillation at the last layer, with the FP model (without any AIMC nonidealities) as the teacher and the HWA-trained model described above as the student. The temperature controlling the distribution of pseudo-probabilities was fixed to 10, and the training loss was a weighted mixture of 75% distillation loss and 25% regular loss.
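A self-contained NumPy sketch of this loss, with temperature 10 and the 75%/25% mixture. The common T² gradient-rescaling factor is omitted for simplicity:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=10.0, lam=0.75):
    # Soft targets from the FP teacher at temperature T.
    p_teacher = softmax(teacher_logits, T)
    log_p_student_T = np.log(softmax(student_logits, T))
    kd = -(p_teacher * log_p_student_T).sum(axis=-1).mean()
    # Regular cross entropy against the one-hot labels.
    log_p = np.log(softmax(student_logits))
    ce = -log_p[np.arange(len(labels)), labels].mean()
    # 75% distillation loss, 25% regular loss, as used in the text.
    return lam * kd + (1.0 - lam) * ce
```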

HWA training experiments
We applied and optimized the HWA training process described in this section to a variety of AI workloads, including text prediction, speech-to-text translation, and image classification, as listed in Table 1. In general, our HWA training approach treats these DNNs similarly, since an overly DNN-specific re-training approach would be impractical. In Sec. A.2, we detail any specific differences used in the HWA training of these DNNs, including learning rates and injected noise strength. We select the last available checkpoint rather than the best one, and we repeat experiments multiple times and average the results to obtain repeatable numbers.

A.1 Static input range learning versus dynamic scaling
In general it is difficult, if not impossible, to simultaneously achieve perfect utilization of the input activation range, weight range, and output activation range for every input and across every tile in the network.
In our hardware model, we assume that input ranges are statically clipped (and scaled) by a learnable input range α (see Sec. A.1). This input range is thus fixed, not re-determined during inference, and is therefore not dynamically adjusted based on the actual values of each input vector. If dynamically adjusted, the input range might be better utilized (e.g. very small inputs might not be buried in the noise floor thanks to a dynamic re-scaling of the input range), which could improve the signal-to-noise ratio of the AIMC MVM. For instance, it has been suggested to scale the input range dynamically by the absolute maximum input value of each input vector to maximize the output range (the so-called noise management in [23]). One drawback of dynamic scaling, however, is that determining the absolute maximum of each input vector requires more on-chip digital computation than simply applying a predetermined scaling factor. Another caveat of enlarging the input range too much is that larger inputs might saturate the output range. In our hardware model, we assume a relatively shallow input-to-output dynamic ratio (IO-ratio); that is, we assume that 10 fully-on inputs together with 10 fully-on weights (i.e. at g_max) would saturate the output bound, so that larger output values are clipped (compare to Fig. 3).
To test whether the trained input ranges in our static approach are sufficiently learned, we additionally turn on the described noise management, so that too-small inputs would see an improved SNR. To avoid potential bound saturation when testing the effect of dynamic input scaling, we also turned on the so-called bound management (which dynamically down-scales the input only if the output was clipped; see [23]). We found that across DNNs, the improvement in performance (as measured with A_1s) is less than 0.1% for all DNNs except the transformers, where the improvement is more pronounced but still only on the order of 1%. Altogether, we conclude that our HWA training method learned the static input range well and could adapt the weights and inputs to the shallow IO-ratio assumption across DNNs. A dynamic input scaling mechanism thus seems unnecessary to reach good inference accuracy when using our HWA training approach. However, such a mechanism might be beneficial to increase the flexibility of models that must be deployed with little or no HWA training.
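The difference between the static scheme and dynamic noise management can be illustrated with the following simplified, noise-free sketch; real hardware would add MVM noise and quantization between the scalings:

```python
import numpy as np

def static_mvm(x, W_tilde, alpha, b_out):
    # Static scheme: fixed learned input range alpha; outputs clipped at b_out.
    x_c = np.clip(x, -alpha, alpha) / alpha
    return np.clip(W_tilde @ x_c, -b_out, b_out)

def dynamic_mvm(x, W_tilde, b_out):
    # "Noise management": rescale each input vector by its abs-max so that
    # small inputs are not buried in the noise floor. This costs an extra
    # digital max-reduction (and re-scaling) per input vector.
    s = np.abs(x).max()
    if s == 0.0:
        return np.zeros(W_tilde.shape[0])
    return np.clip(W_tilde @ (x / s), -b_out, b_out) * s
```

Note that the dynamic variant can itself saturate the output bound when inputs are scaled up too far, which is why the bound management described above is enabled alongside it in the test.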

A.2.1 LSTM on Penn Treebank Dataset
Here the DNN model is a two-layer LSTM with a hidden size of 650 for word-based prediction on the Penn Treebank dataset (PTB) [78], using cross-entropy loss. The encoder layer is implemented digitally, whereas the decoder layer is assumed to be on AIMC and consists of 10K classes, each corresponding to a word in the dictionary. We performed HWA training generally as described in 'AIMC hardware aware DNN training', with the following particularities. The HWA model is trained for 60 epochs using a conventionally FP-trained model as a starting point. We scan several hyper-parameters and find the best HWA-trained model to have a base learning rate of 0.01, a learning rate decay of 0.95 (applied after each epoch if the test error is not improving on the validation set), a dropout ratio of 0.5, injection of PCM programming error at 5× the nominal scaling, SGD momentum of 0.9, and a maximum gradient norm of 10. We use a weight decay (L2 regularization) factor of 10^−5. Finally, we randomly select 1% of devices to be stuck at ĝ_min = 0 throughout training to provide some added robustness (i.e. drop-connect [82]).
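The stuck-at-zero device selection can be sketched as follows. In the experiment the mask would be drawn once and kept fixed throughout training; here the function simply draws one on each call:

```python
import numpy as np

def stuck_at_zero_mask(shape, frac, rng):
    # Each device is independently selected with probability `frac`
    # (1% in the experiment) to be stuck at g_min = 0.
    return rng.random(shape) >= frac  # True = device works normally

def apply_stuck_at_zero(W_tilde, mask):
    # Zero out the selected devices; reusing the same mask every step makes
    # this a drop-connect-style regularizer with a fixed failure pattern.
    return W_tilde * mask
```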
The HWA model is trained for a few (1-5) epochs on audio sequences with length capped at 500 frames, using SGD with momentum as the optimizer, a learning rate of 10^−3, and the methods described in 'AIMC hardware aware DNN training'. The injected weight noise scale was set to 1. Here, neither down-scaling nor auto-scaling was used, and IR-drop was set to 0 during training.

Figure 1: In our hardware-aware (HWA) training setup, DNNs are first trained in 32-bit floating-point

Figure 4: Comparison of the relative impact of various AIMC nonidealities on DNN accuracy. (A) AIMC deviations (ỹ − y) from the ideal MVM output y are shown for uncorrelated (blue dots) and weakly-correlated (ρ = 0.05) random activations (red dots), as a single nonideality is increased until the standard MVM error reaches ǫ*_M = 20%. All other parameters remain fixed to our standard AIMC crossbar-model (ǫ*_M = 15%, Fig. 3). For instance, IR-drop needs to be scaled 11.7× to incur ǫ*_M = 20%. Even at constant ǫ*_M = 20%, the MVM deviations are structured differently, and thus the impact on DNN accuracy can vary significantly. (B) Grid shows the loss in normalized accuracy (A_1h) over the base HWA-trained model at 1 hour after programming when boosting a given nonideality to ǫ*_M = 20%. Thus 0% means no accuracy impact despite the amplified nonideality, whereas 100% means a drop to chance level. For the HMM LSTM, sensitivity is reported on a portion of the training set (instead of the benchmark set) directly on the LSTM output, without the Hidden Markov Model, to speed up computations. For the transformer models, only one GLUE task is evaluated (MRPC).
[Figure 5 grid axes: DNNs — ResNet-32 (CIFAR-10), WideResNet-16 (CIFAR-100), ResNet-18 (ImageNet), ResNet-50 (ImageNet), DenseNet-121 (ImageNet), WideResNet-50 (ImageNet), BERT (MRPC), ALBERT (MRPC), LSTM-HMM (SWB-300), LSTM (PTB), RNN-T (SWB-300); nonidealities — DAC precision, ADC precision, ADC S-shaped nonlinearity, additive output noise, total PCM noise, PCM programming noise, PCM drift variability, prob. failed at zero, prob. stuck at g_max, prob. stuck at random, short-term PCM read noise, weight asymmetry std, IR drop.]

Figure 5: Sensitivities of individual AIMC nonidealities across DNNs. (A) As a single nonideality parameter is increased from our standard setting, accuracy A_1h eventually drops to 99% (relative to the accuracy of the standard AIMC crossbar-model). Four nonidealities are shown, with DNN line colors matching the text-label colors in (B). (B) Grid shows x*, the threshold value at which a particular nonideality produces A_1h = 99% (the DNN's curve crosses the dotted line in (A)). For instance, reducing DAC precision from 8 bit down to 4.8 bit, while maintaining all other parameters of the standard AIMC crossbar-model, causes exactly 1% additional accuracy loss in the LSTM-PTB model. Text-label colors at top match the lines in (A); grid colors reflect the relative sensitivity index r_S = (x* − min x*)/(max x* − min x*), with min and max taken across all DNNs. r_S = 1 (red) indicates the most sensitive and r_S = 0 (green) the least sensitive DNN. RNNs are generally observed to be more robust to AIMC nonidealities than CNNs, even with the limited hyper-parameter tuning available for RNN-T due to its large number of MVM FLOPS and parameters.

Figure 6: HWA training reduces MVM error by creating more compact conductance distributions. (A) MVM error decreases as constructed conductance distributions, produced by a generalized normal distribution (Eq. 2), are made more compact by increasing β. Example distributions in red at top show β = 1 (Laplace distribution), β = 2 (normal distribution), and even more compact distributions for higher β. 'SEM' indicates standard error of the mean. (B) Data from (A) replotted as a function of the (excess) kurtosis of the distribution. By the definition of excess kurtosis, a normal distribution (β = 2 in (A)) has a value of 0, with positive values for longer-tailed distributions (i.e. β < 2) and negative values for more compact distributions (i.e. β > 2). Note that longer-tailed distributions (large kurtosis) lead to higher MVM error, while more compact distributions (lower kurtosis) reduce it. (C) Kurtosis of the conductance values per layer, comparing HWA-trained models (solid bars) to FP32 weight data scaled by the overall absolute maximum weight (hashed bars). Column-wise scaling, and the tuning of both weights and scaling parameters during HWA training, lead to significantly more compact distributions with smaller kurtosis values.

Figure 7: Layer-wise breakdown for ImageNet CNNs. (A) Bar charts reveal the relative impact that different DNN layers have on AIMC accuracy when the PCM conductances in just that layer are made very noisy (overall PCM noise scale set to 15), while all other layers see only minimal PCM noise (overall noise scale set to 0). The height of each bar segment, arranged in sequential DNN layer order, corresponds to the relative impact of that layer; colors simply delineate layer boundaries. Note that ResNet-50 and WideResNet-50 have very similar graphs since their layers differ only in width. (B) Accuracy A*_1h as the most critical layers in these four CNNs are exempted from PCM noise, plotted as the fraction of noise-exempt layers is increased, in order from most to least sensitive. (C) Corresponding cumulative fraction of weight parameters that are PCM-noise-exempt. For ResNet-50 and ResNet-18, reducing PCM noise in just a few layers (dotted vertical lines) allows an AIMC crossbar-model to achieve iso-accuracy (dashed horizontal line). For DenseNet-121, more than half of the network parameters need to be exempted, while for WideResNet, even the non-PCM nonidealities would need to be improved.
which models an S-shaped saturation with variable slope, scaled to approximately cover the full output range. Each of the d_out outputs has an independent ADC and thus a slightly different (pre-determined) shape, ζ_i = µ_ζ(1 + σ_ζ ξ) with ξ ∼ N(0, 1) and µ_ζ = 14. σ_ζ is only varied in the sensitivity analysis ('ADC S-shaped nonlinearity'); for our standard AIMC crossbar-model, µ_ζ and σ_ζ are both set to 0, causing f^NL_i(z) = z.

Table 3: Test error in % ± standard error of the mean (across 15-25 inference repeats per training trial and up to 3 training trials) for DNN deployment on AIMC crossbars after HWA training.