Neural-network training can be slow and energy intensive, owing to the need to transfer the weight data for the network between conventional digital memory chips and processor chips. Analogue non-volatile memory can accelerate the neural-network training algorithm known as backpropagation by performing parallelized multiply–accumulate operations in the analogue domain at the location of the weight data. However, the classification accuracies of such in situ training using non-volatile-memory hardware have generally been less than those of software-based training, owing to insufficient dynamic range and excessive weight-update asymmetry. Here we demonstrate mixed hardware–software neural-network implementations that involve up to 204,900 synapses and that combine long-term storage in phase-change memory, near-linear updates of volatile capacitors and weight-data transfer with ‘polarity inversion’ to cancel out inherent device-to-device variations. We achieve generalization accuracies (on previously unseen data) equivalent to those of software-based training on various commonly used machine-learning test datasets (MNIST, MNIST-backrand, CIFAR-10 and CIFAR-100). The computational energy efficiency of 28,065 billion operations per second per watt and throughput per area of 3.6 trillion operations per second per square millimetre that we calculate for our implementation exceed those of today’s graphical processing units by two orders of magnitude. This work provides a path towards hardware accelerators that are both fast and energy efficient, particularly on fully connected neural-network layers.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We acknowledge management support from B. Kurdi, C. Lam, W. Wilcke, S. Narayan, T. C. Chen, W. Haensch, R. Divakaruni, J. Welser and D. Gil, and discussions with P. Solomon, S. Kim, A. Sebastian, K. Hosokawa and S. C. Lewis. This work was performed as part of the ‘Neuromorphic Devices & Architectures’ project under the auspices of the IBM Research Frontiers Institute (https://www.research.ibm.com/frontiers). We acknowledge advice and support from H. Riel, S. Gowda, D. Maynard and the member companies of the IBM RFI.
Reviewer information
Nature thanks G. C. Adam, R. Legenstein and the other anonymous reviewer(s) for their contribution to the peer review of this work.
Extended data figures and tables
Extended Data Fig. 1 Flow chart comparing eventual and currently implemented DNN acceleration approaches.
a, Comparison between an eventual analogue-memory-based hardware implementation and our mixed software–hardware experiment. Although we do not implement CMOS neurons, we mimic their behaviour closely. In both schemes, weight update is performed on only the 3T1C g devices, and these contributions are later transferred to the PCM devices (G+ and G−). Owing to wall-clock throughput issues in our experiment, we have to perform all of the weight transfers at once. By contrast, in an eventual hardware implementation, weight transfer would take place on a distributed, column-by-column basis. Ideally, transfer for any weight column would be performed at a point in time when the neural-network computation, focused on some other layer, leaves that particular array core temporarily idle. b, Guidelines for optimizing the choice of transfer interval, depending on the time constant of the capacitor and the dynamic range of g. Because training of one image is performed in 240 ns, training of 8,000 images is performed in 8,000 × 240 ns = 1.92 ms, which is a substantial fraction of the time constant of the capacitor (5.16 ms). Despite allowing more of the dynamic range of g to be used, a longer transfer interval would probably suffer from poor retention of information in any volatile g device. However, even in the ideal case of an infinitely long time constant, the transfer interval would still need to be limited, owing to the finite dynamic range of g. A long transfer interval would probably result in g values saturating owing to weight updates, leading to loss of training information before transfer. c, Guidelines for optimizing the choice of gain factor F. We define ‘efficacy of post-transfer tuning’ as the inverse of the overall residual error after g tuning. Because a larger gain factor F means more available dynamic range for each weight, larger F is desirable. However, large F also amplifies any programming errors on the PCM devices due to intrinsic device variability and limits the correction that g can provide during post-transfer tuning. The efficacy would definitely decrease monotonically, although perhaps not linearly as is sketched here. The value we chose (F = 3) represents a reasonable trade-off for the PCM and 3T1C devices used here. For other situations, F can be initially estimated as F = DRg/σ, where DRg is the g dynamic range and σ is the standard deviation of the PCM programming error. Additional optimization comes with neural-network training, which includes the weak effect of drift contribution.
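The arithmetic behind panels b and c can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the exponential-decay model in `retained_fraction` is an assumption (the caption quotes a time constant but not the shape of the decay), and the device numbers passed to `initial_gain_factor` are invented so that the estimate lands on F = 3.

```python
import math

# Values quoted in the caption.
TIME_PER_IMAGE_S = 240e-9        # training time per image
CAP_TIME_CONSTANT_S = 5.16e-3    # time constant of the 3T1C capacitor


def transfer_interval_seconds(n_images: int) -> float:
    """Training time covered by one transfer interval of n_images."""
    return n_images * TIME_PER_IMAGE_S


def retained_fraction(n_images: int) -> float:
    """Fraction of a stored signal left after one transfer interval.

    ASSUMPTION: single-exponential decay, exp(-t / tau). The caption only
    quotes a time constant; the real leakage behaviour may differ.
    """
    return math.exp(-transfer_interval_seconds(n_images) / CAP_TIME_CONSTANT_S)


def initial_gain_factor(dr_g: float, sigma_pcm: float) -> float:
    """Starting estimate F = DR_g / sigma (both in the same conductance unit)."""
    if sigma_pcm <= 0:
        raise ValueError("sigma_pcm must be positive")
    return dr_g / sigma_pcm


if __name__ == "__main__":
    t = transfer_interval_seconds(8_000)
    print(f"transfer interval: {t * 1e3:.2f} ms")
    print(f"interval / time constant: {t / CAP_TIME_CONSTANT_S:.2f}")
    print(f"retained fraction (assumed exponential): {retained_fraction(8_000):.2f}")
    # Hypothetical device numbers, chosen only so that F comes out as 3.
    print(f"F estimate: {initial_gain_factor(dr_g=6.0, sigma_pcm=2.0):.1f}")
```

The estimate from `initial_gain_factor` is only a starting point; as the caption notes, further optimization comes from neural-network training itself.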
Extended Data Fig. 2 Weight-update requests and resulting net weight change observed during neural network training.
a–d, Simulation results based on MNIST 20-epoch simulations for the 2PCM + 3T1C cell with full CMOS variability and transfer polarity inversion (matched with the experimental results; a, b) and for the 2PCM cell (c, d). a, c, Correlation between the aggregate weight update across 16,000 training images (for 2PCM + 3T1C, this corresponds to two consecutive transfer intervals) and the total number of pulses applied to obtain this weight update. b, d, Correlation between the aggregate number of pulses and the total number of programming pulses applied. The points chosen for Fig. 3 (±100, 1,000 for 2PCM + 3T1C and ±10, 50 for 2PCM) represent typical values requested by the backpropagation algorithm. Insets show vertical cross-sections at the point where the aggregate sum of all individual weight changes ΔW is zero (sum of pulses is zero).
(Extension of Fig. 5.) a–f, Weight probability density functions (PDFs) and cumulative distribution functions (CDFs) of device conductances for MNIST-backrand (a, b), CIFAR-10 transfer learning (c, d) and CIFAR-100 transfer learning (e, f). Results are shown for the initial condition and increasing epochs, from 1 to 20. For the CIFAR-100 experiment only, we increased the transfer interval to 16,000 images to reduce the overall wall-clock time.
(Extension of Fig. 6.) a–d, Simulation results as in Fig. 6b, extended to all experiments performed: MNIST results (as in Fig. 6b; a), MNIST-backrand (b), CIFAR-10 transfer (c) and CIFAR-100 transfer (d). We introduce two parameters, xLR and δLR, to modify the crossbar-compatible weight-update scheme from its original conception (ref. 10). The upstream neurons fire a number of weight-update pulses based on the x input signal, the global learning rate η and the xLR coefficient; downstream neurons fire pulses depending on the error signal, the global η and the new δLR coefficient. xLR and δLR are both constant throughout training: xLR enables differentiation between upstream and downstream pulsing, but is constant across all layers; δLR enables careful tuning of the importance of δ for each weight layer. xLR modulation can provide substantial accuracy benefits for MNIST-backrand (b) and δLR modulation is beneficial for CIFAR-100 and particularly for MNIST (a, d). Although momentum and learning-rate (LR) decay are commonly used techniques (ref. 33), their absence would not have greatly affected our experimental results. Example triage mostly provides a wall-clock advantage, but also a slight improvement in accuracy for CIFAR-10/100 transfer learning by avoiding ‘useless’ weight updates.
a, When the network classifies the output correctly (that is, the highest neuron output matches the highest ground truth), the safety margin is the positive difference between the correct neuron and the next-largest neuron. b, When the classification is incorrect, the safety margin is a negative number that indicates the gap by which the correct output neuron failed to be the highest neuron value. Preferably, we would like to calculate the safety margin for every image in each epoch, because safety margins change after each backpropagation. This is the choice made within our experiment; in a full-chip implementation of an analogue-memory-based neural-network hardware accelerator with an effective minibatch size of 1, this would be fairly straightforward. Alternatively, either for minibatch-based training or for analogue hardware, we envision using a highly pipelined copy of the network designed for fast forward inference to compute safety margins using a recent copy of the network weights. These slightly ‘stale’ safety margins could then be used to implement example triage. c, Focus probability from 0% to 100% as a function of safety margin defined from −1 to 1. For all safety margins below some ‘acceptable’ threshold, the probability of choosing to perform backpropagation on this training example is 100%. As the safety margin increases above the acceptable threshold, the focus probability decreases linearly to a non-zero minimum focus probability, to ensure that some number of already well-learned images are also backpropagated despite their high safety margin. The mapping of safety margin to focus probability can be changed during training. In addition, reducing either the focus probability or the learning rate for examples with large negative safety margins (pink dotted line) avoids damage to overall generalization in pursuit of training examples that the network may never be able to successfully classify.
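The safety margin and the piecewise-linear focus probability described in panels a–c can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the default values for `threshold`, `min_focus` and `margin_at_min` are hypothetical, and the optional reduction for large negative margins (pink dotted line) is not modelled.

```python
import random
from typing import Sequence


def safety_margin(outputs: Sequence[float], correct_index: int) -> float:
    """Correct-neuron output minus the largest other output.

    Positive when the example is classified correctly, negative otherwise.
    """
    correct = outputs[correct_index]
    best_other = max(v for i, v in enumerate(outputs) if i != correct_index)
    return correct - best_other


def focus_probability(margin: float,
                      threshold: float = 0.2,      # hypothetical 'acceptable' threshold
                      min_focus: float = 0.1,      # hypothetical non-zero floor
                      margin_at_min: float = 1.0   # assumed margin where the floor is reached
                      ) -> float:
    """Probability of backpropagating an example with the given safety margin."""
    if threshold >= margin_at_min:
        raise ValueError("threshold must be below margin_at_min")
    if margin <= threshold:
        return 1.0                                  # always train on hard examples
    if margin >= margin_at_min:
        return min_focus                            # still sample a few easy ones
    slope = (1.0 - min_focus) / (margin_at_min - threshold)
    return 1.0 - slope * (margin - threshold)       # linear decrease in between


def should_backpropagate(margin: float, rng: random.Random) -> bool:
    """Randomly decide whether to train on this example (example triage)."""
    return rng.random() < focus_probability(margin)


if __name__ == "__main__":
    rng = random.Random(0)
    outputs = [0.1, 0.7, 0.2]                       # made-up neuron outputs
    m = safety_margin(outputs, correct_index=1)
    print(m, focus_probability(m), should_backpropagate(m, rng))
```

Because the minimum focus probability is non-zero, well-learned examples are still backpropagated occasionally, which matches the intent stated in the caption.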
During training (shown here for MNIST), the cumulative distribution of the safety margin shifts to the right, as training improves performance on the training examples. The intercept at a safety margin of zero represents the training error. Example triage can be thought of as the realization that the network does not need to train on all of the examples in the far right of this cumulative distribution, but should instead focus on those at small positive safety margins and below, with only a few training examples chosen from among those at high safety margins. The farther the safety margin distribution moves to the right, the larger the acceleration factor that example triage can provide. Example triage can be considered a form of curriculum learning (ref. 44) based on the safety margin, as a highly accurate analogue measure of the current degree of certainty of the neural network. However, a substantial difference is that curriculum learning focuses on the beginning of training, with the philosophy of starting with easy examples and moving to difficult training examples. By contrast, example triage becomes effective only once the network shows some degree of performance on the training set, and is then designed to skip over easy examples in favour of difficult training examples.
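The relation between the cumulative distribution and the training error can be made concrete with a short sketch. The function names and the sample margins are hypothetical, and treating a margin of exactly zero as a misclassification is an assumption, because the caption does not say how ties are counted.

```python
from typing import Sequence


def margin_cdf(margins: Sequence[float], x: float) -> float:
    """Empirical cumulative distribution: fraction of margins at or below x."""
    if not margins:
        raise ValueError("margins must not be empty")
    return sum(1 for m in margins if m <= x) / len(margins)


def training_error(margins: Sequence[float]) -> float:
    """Intercept of the cumulative distribution at a safety margin of zero.

    ASSUMPTION: a margin of exactly zero counts as an error.
    """
    return margin_cdf(margins, 0.0)


def fraction_above(margins: Sequence[float], threshold: float) -> float:
    """Share of examples whose margin exceeds the threshold.

    These are the examples that triage would mostly skip.
    """
    return 1.0 - margin_cdf(margins, threshold)


if __name__ == "__main__":
    # Made-up safety margins for eight training examples.
    margins = [-0.5, -0.1, 0.05, 0.3, 0.6, 0.8, 0.9, 0.95]
    print(training_error(margins))
    print(fraction_above(margins, 0.3))
```

As the distribution moves to the right during training, `fraction_above` grows for a fixed threshold, which is where the acceleration from example triage comes from.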
The measured cumulative distribution function of the conductances of 512 × 1,024 devices programmed from the full reset state with eight-step set transition rampdown pulse sequences ranging from 1.7 ns to 550 ns in step size (that is, from 13.6 ns to 4.4 μs in total duration) is shown. Even though the degree of control is worse for high conductances (above 20 μS), to the extent that the monotonicity of the mapping from duration to conductance is disrupted, the vast majority of devices are programmed to conductances below 20 μS (see Fig. 4 and Extended Data Fig. 9).
Extended Data Fig. 8 Analysis of weight transfer from lower- to higher-significance conductance pairs.
a–c, Distributions obtained before and after the last transfer in the MNIST experiment: g and gshared distributions before transfer (a), the voltage on the capacitor of g (b) and the distribution of weights (c). gshared devices are implemented as an average of the read current from three 3T1C devices for every 128 dedicated g devices to help to reduce variability. Just before transfer, the voltages on both g and gshared are programmed to 0.5 V after their contribution to the weight has been extracted. d–f, Just after the PCM transfer, the polarity of g is inverted; the dedicated g devices are then tuned to correct the transfer error during PCM programming operation. This leads to a broad distribution of voltages on these capacitors, centred at lower voltages than just before transfer (e). During the long transfer interval, charge leakage in all capacitors (through both NFETs and the PFET) causes voltages to increase towards about 0.8 V. During post-transfer tuning, the lowest voltage available to the charge subtraction circuitry is increased so that no 3T1C device can be programmed below 0.25 V (cut-off visible in e). Because all 3T1C conductances below that capacitor voltage are effectively zero (see Extended Data Fig. 10a), if any device were allowed to return to the weight-update operations with such an extremely low capacitor voltage, the network would be forced to fire many positive weight updates before it could effectively change that weight. Although g and gshared show different shapes, the weight distribution is nearly the same as before transfer. The last transfer is shown not because it is the easiest but because it is the most important. The network has very little ability to recover from mistakes made during these last few transfers. However, data extracted for any of the other transfers throughout training would be almost indistinguishable from those shown here for the last transfer operation.
Correlation maps obtained from the last two transfers in the MNIST experiment illustrate a typical transfer operation. The target weight Wtransfer that we attempt to write into the PCM devices is not exactly the overall weight W, but instead Wtransfer = W − offset − [g(V = 0.5 V) − gshared(V = 0.5 V)]. The final two (bracketed) terms are the residual difference between the conductances of the g and gshared devices even when initialized to the same voltage, which allows the PCM devices to compensate partially for CMOS variability during transfer. The offset, equal to 2 μS, is included because g devices are not equally good at compensating positive and negative conductance errors. At the initialization voltage of 0.5 V, device conductance is relatively small (see Extended Data Fig. 10a), providing less dynamic range to move to smaller conductances and to correct PCM devices programmed to weights that are too positive. The initial 0.5 V was chosen carefully, to accommodate substantial ‘decay’ towards 0.8 V, providing much more dynamic range for increasing 3T1C conductance. A positive offset value strongly favours negative errors, allowing us to exploit the capability for g values to increase. When Wtransfer is positive but smaller than the offset, we reset both PCM devices and use g to correct the residual error. a, Correlation between the weight portion encoded in PCMs before transfer, that is, F(G+ − G−), with Wtransfer. Here we expect a difference because the neural-network training has changed the weights; we now need to checkpoint these weight changes from volatile storage on the 3T1C devices into non-volatile storage on the PCM devices. b, Correlation between the desired Wtransfer conductance differences and the actual F(G+ − G−) values obtained after the PCM programming operation. With perfect devices and no offset, this should be a diagonal line along y = x. The variability we see is caused partly by PCM programming error (unintended), partly by the intentional offset and partly by CMOS initialization mismatch (where we are intentionally aiming for a ‘wrong’ PCM conductance difference to help to compensate for our flawed CMOS devices). c, Correlation between the weights before (Wpre) and after (Wpost) transfer, after post-transfer tuning of g to compensate for programming errors in b. The goal of the transfer operation is to obtain Wpost = Wpre, which would correspond to all points falling on the diagonal y = x. The effect of post-transfer tuning is clear by comparing the variability in b to the near-ideal behaviour in c. d–f, As in a–c, but for negative polarity transfer. Because the polarity of g is inverted, the offset is negative, and so the large dynamic range can be used to increase g to compensate for positive errors in PCM weight.
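The transfer-target rule for the positive-polarity case can be written out as a small sketch. The function and parameter names are hypothetical and all conductances are assumed to be in microsiemens. Only the positive-polarity formula from the caption is modelled, because the caption does not give the full expression used after polarity inversion.

```python
GAIN_F = 3.0      # gain factor F quoted for these devices
OFFSET_US = 2.0   # transfer offset quoted in the caption, in microsiemens


def transfer_target(w_us: float,
                    g_init_us: float,
                    g_shared_init_us: float,
                    offset_us: float = OFFSET_US) -> float:
    """W_transfer = W - offset - [g(0.5 V) - g_shared(0.5 V)].

    g_init_us and g_shared_init_us are the conductances of the dedicated and
    shared 3T1C devices when both are initialized to 0.5 V.
    """
    return w_us - offset_us - (g_init_us - g_shared_init_us)


def pcm_target_difference(w_transfer_us: float,
                          gain: float = GAIN_F,
                          offset_us: float = OFFSET_US) -> float:
    """Target G+ - G- such that F * (G+ - G-) equals W_transfer.

    Returns 0.0 (both PCM devices reset) when W_transfer is positive but
    smaller than the offset, leaving the residual error for g to correct.
    """
    if 0.0 < w_transfer_us < offset_us:
        return 0.0
    return w_transfer_us / gain


if __name__ == "__main__":
    # Made-up example: W = 10 uS with a 0.3 uS mismatch between g and g_shared.
    target = transfer_target(10.0, g_init_us=1.5, g_shared_init_us=1.2)
    print(target, pcm_target_difference(target))
```

Subtracting the 0.5 V mismatch term is what lets the PCM pair absorb part of the CMOS variability, so that g starts its post-transfer tuning from a smaller residual error.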
a–f, Monte Carlo circuit simulations of parameter variability in 3T1C cells: measured conductance versus instantaneous voltage on the capacitor VC (a); PDF of the measured conductance at VC = 0.5 V (b); change in voltage versus the instantaneous voltage for up pulses (c); PDF of change in up voltage at VC = 0.5 V (d); change in voltage versus the instantaneous voltage for down pulses (e); and PDF of change in down voltage at VC = 0.5 V (f). Each graph shows data from 1,000 trials. Bold lines in a, c and e and dotted lines in b, d and f show the nominal transistor response. a, b, Variability in the read transistor whose gate is tied to the capacitor; c–f, variability due to variation in threshold voltage in the PMOS pull-up/NMOS pull-down FETs.