Abstract
Realizing increasingly complex artificial intelligence (AI) functionalities directly on edge devices calls for unprecedented energy efficiency of edge hardware. Compute-in-memory (CIM) based on resistive random-access memory (RRAM)^{1} promises to meet such demand by storing AI model weights in dense, analogue and non-volatile RRAM devices, and by performing AI computation directly within RRAM, thus eliminating power-hungry data movement between separate compute and memory^{2,3,4,5}. Although recent studies have demonstrated in-memory matrix-vector multiplication on fully integrated RRAM-CIM hardware^{6,7,8,9,10,11,12,13,14,15,16,17}, it remains a goal for a RRAM-CIM chip to simultaneously deliver high energy efficiency, versatility to support diverse models and software-comparable accuracy. Although efficiency, versatility and accuracy are all indispensable for broad adoption of the technology, the interrelated trade-offs among them cannot be addressed by isolated improvements on any single abstraction level of the design. Here, by co-optimizing across all hierarchies of the design from algorithms and architecture to circuits and devices, we present NeuRRAM—a RRAM-based CIM chip that simultaneously delivers versatility in reconfiguring CIM cores for diverse model architectures, energy efficiency that is two times better than previous state-of-the-art RRAM-CIM chips across various computational bit-precisions, and inference accuracy comparable to software models quantized to four-bit weights across various AI tasks, including accuracy of 99.0 per cent on MNIST^{18} and 85.7 per cent on CIFAR-10^{19} image classification, 84.7 per cent accuracy on Google speech command recognition^{20}, and a 70 per cent reduction in image-reconstruction error on a Bayesian image-recovery task.
Main
Early research in the area of resistive random-access memory (RRAM) compute-in-memory (CIM) focused on demonstrating artificial intelligence (AI) functionalities on fabricated RRAM devices while using off-chip software and hardware to implement essential functionalities such as analogue-to-digital conversion and neuron activations for a complete system^{2,3,6,20,21,22,23,24,25,26,27}. Although these studies proposed various techniques to mitigate the impacts of analogue-related hardware non-idealities on inference accuracy, the AI benchmark results reported were often obtained by performing software emulation based on characterized device data^{3,5,21,24}. Such an approach often overestimates accuracies compared with fully hardware-measured results owing to incomplete modelling of hardware non-idealities.
More recent studies have demonstrated fully integrated RRAM complementary metal–oxide–semiconductor (CMOS) chips capable of performing in-memory matrix-vector multiplication (MVM)^{6,7,8,9,10,11,12,13,14,15,16,17}. However, for a RRAM-CIM chip to be broadly adopted in practical AI applications, it needs to simultaneously deliver high energy efficiency, the flexibility to support diverse AI model architectures and software-comparable inference accuracy. So far, there has not been a study aimed at simultaneously improving all three of these aspects of a design. Moreover, AI application-level benchmarks in previous studies have limited diversity and complexity. None of the studies have experimentally measured multiple edge AI applications with complexity matching those in MLPerf Tiny, a commonly used benchmark suite for edge AI hardware^{28}. The challenge arises from the interrelated trade-offs between efficiency, flexibility and accuracy. The highly parallel analogue computation within the RRAM-CIM architecture brings superior efficiency, but makes it challenging to realize the same level of functional flexibility and computational accuracy as in digital circuits. Meanwhile, attaining algorithmic resiliency to hardware non-idealities becomes more difficult for more complex AI tasks owing to the use of less over-parameterized models on the edge^{29,30}.
To address these challenges, we present NeuRRAM, a 48-core RRAM-CIM hardware encompassing innovations across the full stack of the design. (1) At the device level, 3 million RRAM devices with high analogue programmability are monolithically integrated with CMOS circuits. (2) At the circuit level, a voltage-mode neuron circuit supports variable computation bit-precision and activation functions while performing analogue-to-digital conversion at low power consumption and compact area footprint. (3) At the architecture level, a bidirectional transposable neurosynaptic array (TNSA) architecture enables reconfigurability in dataflow directions with minimal area and energy overheads. (4) At the system level, 48 CIM cores can perform inference in parallel and support various weight-mapping strategies. (5) Finally, at the algorithm level, various hardware-algorithm co-optimization techniques mitigate the impact of hardware non-idealities on inference accuracy. We report fully hardware-measured inference results for a range of AI tasks including image classifications using CIFAR-10^{19} and MNIST^{18} datasets, Google speech command recognition^{20} and MNIST image recovery, implemented with diverse AI models including convolutional neural networks (CNNs)^{31}, long short-term memory (LSTM)^{32} and probabilistic graphical models^{33} (Fig. 1e). The chip is measured to achieve an energy-delay product (EDP) lower than previous state-of-the-art RRAM-CIM chips, while it operates over a range of configurations to suit various AI benchmark applications (Fig. 1d).
Reconfigurable RRAM-CIM architecture
A NeuRRAM chip consists of 48 CIM cores that can perform computation in parallel. A core can be selectively turned off through power gating when not actively used, whereas the model weights are retained by the non-volatile RRAM devices. Central to each core is a TNSA consisting of 256 × 256 RRAM cells and 256 CMOS neuron circuits that implement analogue-to-digital converters (ADCs) and activation functions. Additional peripheral circuits along the edge provide inference control and manage RRAM programming.
The TNSA architecture is designed to offer flexible control of dataflow directions, which is crucial for enabling diverse model architectures with different dataflow patterns. For instance, in CNNs that are commonly applied to vision-related tasks, data flows in a single direction through layers to generate data representations at different abstraction levels; in LSTMs that are used to process temporal data such as audio signals, data travel recurrently through the same layer for multiple time steps; in probabilistic graphical models such as a restricted Boltzmann machine (RBM), probabilistic sampling is performed back and forth between layers until the network converges to a high-probability state. Besides inference, the error backpropagation during gradient-descent training of multiple AI models requires reversing the direction of dataflow through the network.
However, conventional RRAM-CIM architectures are limited to performing MVM in a single direction by hard-wiring rows and columns of the RRAM crossbar array to dedicated circuits on the periphery to drive inputs and measure outputs. Some studies implement reconfigurable dataflow directions by adding extra hardware, which incurs substantial energy, latency and area penalties (Extended Data Fig. 2): executing bidirectional (forwards and backwards) dataflow requires either duplicating power-hungry and area-hungry ADCs at both ends of the RRAM array^{11,34} or dedicating a large area to routing both rows and columns of the array to shared data converters^{15}; the recurrent connections require writing the outputs to a buffer memory outside of the RRAM array, and reading them back for the next time-step computation^{35}.
The TNSA architecture realizes dynamic dataflow reconfigurability with little overhead. Whereas in conventional designs, CMOS peripheral circuits such as ADCs connect at only one end of the RRAM array, the TNSA architecture physically interleaves the RRAM weights and the CMOS neuron circuits, and connects them along the length of both rows and columns. As shown in Fig. 2e, a TNSA consists of 16 × 16 of such interleaved corelets that are connected by shared bit-lines (BLs) and word-lines (WLs) along the horizontal direction and source-lines (SLs) along the vertical direction. Each corelet encloses 16 × 16 RRAM devices and one neuron circuit. The neuron connects to 1 BL and 1 SL out of the 16 BLs and the 16 SLs that pass through the corelet, and is responsible for integrating inputs from all the 256 RRAMs connecting to the same BL or SL. Sixteen of these RRAMs are within the same corelet as the neuron; the other 240 are within the other 15 corelets along the same row or column. Specifically, Fig. 2f shows that the neuron within corelet (i, j) connects to the (16i + j)th BL and the (16j + i)th SL. Such a configuration ensures that each BL or SL connects uniquely to a neuron, while doing so without duplicating neurons at both ends of the array, thus saving area and energy.
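The corelet wiring rule can be sanity-checked with a short script. Under the mapping of Fig. 2f, every one of the 256 BLs and 256 SLs is claimed by exactly one neuron; the index arithmetic below follows that mapping, and the variable names are ours:

```python
# Verify the TNSA corelet wiring: the neuron in corelet (i, j) connects to
# BL (16*i + j) and SL (16*j + i), so each of the 256 BLs and 256 SLs maps
# to exactly one neuron, with no duplicated neurons at the array edges.
def tnsa_wiring(n=16):
    bl_of, sl_of = {}, {}
    for i in range(n):
        for j in range(n):
            bl_of[n * i + j] = (i, j)   # which corelet's neuron owns this BL
            sl_of[n * j + i] = (i, j)   # which corelet's neuron owns this SL
    return bl_of, sl_of

bl_of, sl_of = tnsa_wiring()
# Full, duplicate-free coverage of all 256 lines in each direction.
assert len(bl_of) == 256 and len(sl_of) == 256
# The mapping is a transpose: BL (16i+j) and SL (16j+i) share one neuron.
assert bl_of[16 * 3 + 5] == sl_of[16 * 5 + 3] == (3, 5)
```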
Moreover, a neuron uses its BL and SL switches for both its input and output: it not only receives the analogue MVM output coming from BL or SL through the switches but also sends the converted digital results to peripheral registers through the same switches. By configuring which switch to use during the input and output stages of the neuron, we can realize various MVM dataflow directions. Figure 2g shows the forwards, backwards and recurrent MVMs enabled by the TNSA. To implement forwards MVM (BL to SL), during the input stage, input pulses are applied to the BLs through the BL drivers, get weighted by the RRAMs and enter the neuron through its SL switch; during the output stage, the neuron sends the converted digital outputs to SL registers through its SL switch; to implement recurrent MVM (BL to BL), the neuron instead receives input through its SL switch and sends the digital output back to the BL registers through its BL switch.
Weights of most AI models take both positive and negative values. We encode each weight as the difference in conductance between two RRAM cells on adjacent rows along the same column (Fig. 2h). The forwards MVM is performed using a differential input scheme, where BL drivers send input voltage pulses with opposite polarities to adjacent BLs. The backwards MVM is performed using a differential output scheme, where we digitally subtract outputs from neurons connecting to adjacent BLs after the neurons finish analogue-to-digital conversions.
To maximize the throughput of AI inference on 48 CIM cores, we implement a broad selection of weight-mapping strategies that allow us to exploit both model parallelism and data parallelism (Fig. 2a) through multi-core parallel MVMs. Using a CNN as an example, to maximize data parallelism, we duplicate the weights of the most computationally intensive layers (early convolutional layers) to multiple cores for parallel inference on multiple data; to maximize model parallelism, we map different convolutional layers to different cores and perform parallel inference in a pipelined fashion. Meanwhile, we divide the layers whose weight dimensions exceed the RRAM array size into multiple segments and assign them to multiple cores for parallel execution. A more detailed description of the weight-mapping strategies is provided in Methods. The intermediate data buffers and partial-sum accumulators are implemented by a field-programmable gate array (FPGA) integrated on the same board as the NeuRRAM chip. Although these digital peripheral modules are not the focus of this study, they will eventually need to be integrated within the same chip in production-ready RRAM-CIM hardware.
Efficient voltage-mode neuron circuit
Figure 1d and Extended Data Table 1 show that the NeuRRAM chip achieves 1.6 to 2.3 times lower EDP and 7 to 13 times higher computational density (measured by throughput per million RRAMs) at various MVM input and output bit-precisions than previous state-of-the-art RRAM-based CIM chips, despite being fabricated at an older technology node^{17,18,19,20,21,22,23,24,25,26,27,36}. The reported energy and delay are measured for performing an MVM with a 256 × 256 weight matrix. It is noted that these numbers and those reported in previous RRAM-CIM work represent the peak energy efficiency achieved when the array utilization is 100%, and do not account for energy spent on intermediate data transfer. Network-on-chip and program scheduling need to be carefully designed to achieve good end-to-end application-level energy efficiency^{37,38}.
Key to the NeuRRAM's EDP improvement is a novel in-memory MVM output-sensing scheme. The conventional approach is to use voltage as input, and measure the current as the result based on Ohm's law (Fig. 3a). Such a current-mode sensing scheme cannot fully exploit the high-parallelism nature of CIM. First, simultaneously turning on multiple rows leads to a large array current. Sinking the large current requires peripheral circuits to use large transistors, whose area needs to be amortized by time-multiplexing between multiple columns, which limits 'column parallelism'. Second, MVM results produced by different neural-network layers have drastically different dynamic ranges (Fig. 3c). Optimizing ADCs across such a wide dynamic range is difficult. To equalize the dynamic range, designs typically activate a fraction of input wires every cycle to compute a partial sum, and thus require multiple cycles to complete an MVM, which limits 'row parallelism'.
NeuRRAM improves computation parallelism and energy efficiency by virtue of a neuron circuit implementing a voltage-mode sensing scheme. The neuron performs analogue-to-digital conversion of the MVM outputs by directly sensing the settled open-circuit voltage on the BL or SL line capacitance^{39} (Fig. 3b): voltage inputs are driven on the BLs whereas the SLs are kept floating, or vice versa, depending on the MVM direction. WLs are activated to start the MVM operation. The voltage on the output line settles to the weighted average of the voltages driven on the input lines, where the weights are the RRAM conductances. Upon deactivating the WLs, the output is sampled by transferring the charge on the output line to the neuron sampling capacitor (C_{sample} in Fig. 3d). The neuron then accumulates this charge onto an integration capacitor (C_{integ}) for subsequent analogue-to-digital conversion.
Such voltage-mode sensing obviates the need for power-hungry and area-hungry peripheral circuits to sink large current while clamping voltage, improving energy and area efficiency and eliminating output time-multiplexing. Meanwhile, the weight normalization owing to the conductance weighting in the voltage output (Fig. 3c) results in an automatic output dynamic range normalization for different weight matrices. Therefore, MVMs with different weight dimensions can all be completed within a single cycle, which significantly improves computational throughput. To eliminate the normalization factor from the final results, we precompute its value and multiply it back to the digital outputs from the ADC.
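The settling and renormalization steps above can be sketched numerically; the conductance and voltage ranges below are illustrative stand-ins, not the chip's actual operating points:

```python
import numpy as np

# Voltage-mode sensing sketch: with inputs V driven on the BLs and the SL
# floating, the SL settles to the conductance-weighted average
#   V_out = sum(g_i * V_i) / sum(g_i),
# i.e. the MVM result automatically normalized by the total conductance.
rng = np.random.default_rng(0)
g = rng.uniform(1e-6, 4e-5, size=256)   # RRAM conductances (S), illustrative
v = rng.uniform(-0.1, 0.1, size=256)    # input voltages relative to V_ref

v_out = np.dot(g, v) / g.sum()          # settled open-circuit line voltage

# The normalization factor sum(g) is known from the programmed weights, so it
# is precomputed and multiplied back after digitization to recover the MVM.
mvm = v_out * g.sum()
assert np.isclose(mvm, np.dot(g, v))
```

Because the output is an average rather than a raw current sum, its range stays bounded regardless of how many rows are active, which is what allows single-cycle MVMs over full 256-row inputs.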
Our voltage-mode neuron supports MVM with 1-bit to 8-bit inputs and 1-bit to 10-bit outputs. The multi-bit input is realized in a bit-serial fashion where charge is sampled and integrated onto C_{integ} for 2^{n−1} cycles for the nth least significant bit (LSB) (Fig. 3e). For MVM inputs greater than 4 bits, we break the bit sequence into two segments, compute the MVM for each segment separately and digitally perform a shift-and-add to obtain the final results (Fig. 3f). Such a two-phase input scheme improves energy efficiency and overcomes voltage headroom clipping at high input precisions.
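A behavioural sketch of this input scheme follows; it is our reconstruction of the arithmetic in Fig. 3e,f (function names and the unit-charge model are ours), showing that repeat-sampling the nth LSB for 2^{n−1} cycles plus a digital shift-and-add reproduces the full multi-bit dot product:

```python
# Bit-serial multi-bit input sketch: the n-th least significant input bit is
# sampled-and-integrated for 2**(n-1) cycles, so the accumulated charge is
# proportional to the multi-bit dot product. Inputs wider than 4 bits are
# split into two segments combined by a digital shift-and-add (Fig. 3f).
def bit_serial_dot(weights, inputs, bits=8, segment=4):
    lo = [x & ((1 << segment) - 1) for x in inputs]   # low-segment bits
    hi = [x >> segment for x in inputs]               # high-segment bits

    def integrate(xs, nbits):
        charge = 0
        for n in range(1, nbits + 1):                 # n-th LSB of each input
            bit_vec = [(x >> (n - 1)) & 1 for x in xs]
            for _ in range(2 ** (n - 1)):             # repeat-sample cycles
                charge += sum(w * b for w, b in zip(weights, bit_vec))
        return charge

    # Two-phase scheme: shift-and-add the separately integrated segments.
    return integrate(lo, segment) + (integrate(hi, bits - segment) << segment)

w, x = [3, -1, 2], [200, 17, 255]
assert bit_serial_dot(w, x) == sum(wi * xi for wi, xi in zip(w, x))  # 1093
```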
The multi-bit output is generated through a binary search process (Fig. 3g). Every cycle, neurons add or subtract a C_{sample}V_{decr} amount of charge from C_{integ}, where V_{decr} is a bias voltage shared by all neurons. Neurons then compare the total charge on C_{integ} with a fixed threshold voltage V_{ref} to generate a 1-bit output. From the most significant bit (MSB) to the least significant bit (LSB), V_{decr} is halved every cycle. Compared with other ADC architectures that implement a binary search, our ADC scheme eliminates the residue amplifier of an algorithmic ADC, and does not require an individual DAC for each ADC to generate reference voltages like a successive approximation register (SAR) ADC^{40}. Instead, our ADC scheme allows sharing a single digital-to-analogue converter (DAC) across all neurons to amortize the DAC area, leading to a more compact design. The multi-bit MVM is validated by comparing ideal and measured results, as shown in Fig. 3h and Extended Data Fig. 5. More details on the multi-bit input and output implementation can be found in Methods.
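The binary search can be modelled in a few lines. This is our reconstruction of the scheme in Fig. 3g (the exact on-chip ordering of the compare and charge-decrement steps may differ): each cycle the comparator emits one bit against the fixed threshold, the decrement charge steers the residue toward zero, and V_{decr} is halved from MSB to LSB:

```python
# Binary-search ADC sketch: one comparator decision per cycle, charge
# decrement halved every cycle (supplied by a single DAC shared by all
# neurons), no residue amplifier and no per-neuron reference DAC.
def binary_search_adc(v_in, bits=8, full_scale=1.0):
    code, residue, v_decr = 0, v_in, full_scale / 2
    for _ in range(bits):
        bit = 1 if residue >= 0 else 0          # compare with fixed threshold
        code = (code << 1) | bit
        residue += -v_decr if bit else v_decr   # add/subtract C_sample*V_decr
        v_decr /= 2                             # halved every cycle, MSB->LSB
    return code

def decode(code, bits=8, full_scale=1.0):
    """Map the sign-steered bit pattern back to a voltage estimate."""
    return sum((1 if (code >> (bits - k)) & 1 else -1) * full_scale / 2 ** k
               for k in range(1, bits + 1))

# The search reconstructs any in-range input to within one LSB of full scale.
assert abs(decode(binary_search_adc(0.3)) - 0.3) < 1.0 / 2 ** 8
```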
The neuron can also be reconfigured to directly implement Rectified Linear Unit (ReLU)/sigmoid/tanh as activations when needed. In addition, it supports probabilistic sampling for stochastic activation functions by injecting pseudo-random noise generated by a linear-feedback shift register (LFSR) block into the neuron integrator. All the neuron circuit operations are performed by dynamically configuring a single amplifier in the neuron as either an integrator or a comparator during different phases of operation, as detailed in Methods. This results in a more compact design than other work that merges ADC and neuron activation functions within the same module^{12,13}. Although most existing CIM designs use time-multiplexed ADCs for multiple rows and columns to amortize the ADC area, the compactness of our neuron circuit allows us to dedicate a neuron to each pair of BL and SL, and tightly interleave the neuron with RRAM devices within the TNSA architecture, as can be seen in Extended Data Fig. 11d.
Hardware-algorithm co-optimizations
The innovations on the chip architecture and circuit design bring superior efficiency and reconfigurability to NeuRRAM. To complete the story, we must ensure that AI inference accuracy can be preserved under various circuit and device non-idealities^{3,41}. We developed a set of hardware-algorithm co-optimization techniques that allow NeuRRAM to deliver software-comparable accuracy across diverse AI applications. Importantly, all the AI benchmark results presented in this paper are obtained entirely from hardware measurements on complete datasets. Although most previous efforts (with a few exceptions^{8,17}) have reported benchmark results using a mixture of hardware characterization and software simulation, for example, emulating the array-level MVM process in software using measured device characteristics^{3,5,21,24}, such an approach often fails to model the complete set of non-idealities existing in realistic hardware. As shown in Fig. 4a, these non-idealities may include (1) voltage drops on input wires (R_{wire}), (2) on RRAM array drivers (R_{driver}) and (3) on crossbar wires (for example, BL resistance R_{BL}), (4) limited RRAM programming resolution, (5) RRAM conductance relaxation^{41}, (6) capacitive coupling from simultaneously switching array wires, and (7) limited ADC resolution and dynamic range. Our experiments show that omitting certain non-idealities in simulation leads to over-optimistic predictions of inference accuracy. For example, the third and the fourth bars in Fig. 5a show a 2.32% accuracy difference between simulation and measurement for CIFAR-10 classification^{19}, where the simulation accounts for only non-idealities (5) and (7), which are the ones previous studies most often modelled^{5,21}.
Our hardware-algorithm co-optimization approach includes three main techniques: (1) model-driven chip calibration, (2) noise-resilient neural-network training and analogue weight programming, and (3) chip-in-the-loop progressive model fine-tuning. Model-driven chip calibration uses the real model weights and input data to optimize chip operating conditions such as input voltage pulse amplitude, and records any ADC offsets for subsequent cancellation during inference. Ideally, the MVM output voltage dynamic range should fully utilize the ADC input swing to minimize discretization error. However, without calibration, the MVM output dynamic range varies with network layers even with the weight normalization effect of voltage-mode sensing. To calibrate MVM to the optimal dynamic range, for each network layer, we use a subset of training-set data as calibration input to search for the best operating conditions (Fig. 4b). Extended Data Fig. 6 shows that different calibration input distributions lead to different output distributions. To ensure that the calibration data can closely emulate the distribution seen at test time, it is therefore crucial to use training-set data as opposed to randomly generated data during calibration. It is noted that when performing MVM on multiple cores in parallel, shared bias voltages cannot be optimized for each core separately, which might lead to suboptimal operating conditions and additional accuracy loss (detailed in Methods).
Stochastic non-idealities such as RRAM conductance relaxation and read noise degrade the signal-to-noise ratio (SNR) of the computation, leading to an inference accuracy drop. Some previous work obtained a higher SNR by limiting each RRAM cell to store a single bit, and encoding higher-precision weights using multiple cells^{9,10,16}. Such an approach lowers the weight memory density. Accompanying that approach, the neural network is trained with weights quantized to the corresponding precision. In contrast, we utilize the intrinsic analogue programmability of RRAM^{42} to directly store high-precision weights and train the neural networks to tolerate the lower SNR. Instead of training with quantized weights, which is equivalent to injecting uniform noise into weights, we train the model with high-precision weights while injecting noise with the distribution measured from RRAM devices. RRAMs on NeuRRAM are characterized to have a Gaussian-distributed conductance spread, caused primarily by conductance relaxation. Therefore, we inject Gaussian noise into weights during training, similar to a previous study^{21}. Figure 5a shows that the technique significantly improves the model's immunity to noise, from a CIFAR-10 classification accuracy of 25.34% without noise injection to 85.99% with noise injection. After the training, we program the non-quantized weights to RRAM analogue conductances using an iterative write–verify technique, described in Methods. This technique enables NeuRRAM to achieve an inference accuracy equivalent to models trained with 4-bit weights across various applications, while encoding each weight using only two RRAM cells, which is two times denser than previous studies that require one RRAM cell per bit.
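A minimal stand-in for this training technique is sketched below. The tiny logistic-regression task, the noise level and all hyperparameters are illustrative (the paper trains full networks with device-characterized noise); the key pattern is that Gaussian noise perturbs the weights on every forward pass while gradient updates are applied to the clean high-precision weights:

```python
import numpy as np

# Noise-resilient training sketch: inject Gaussian weight noise (std taken
# from device characterization; here a made-up value) in the forward pass,
# update the clean weights, and verify the model tolerates perturbation
# at "deployment" time.
rng = np.random.default_rng(1)
X = rng.normal(size=(512, 8))
y = (X[:, 0] - 0.5 * X[:, 1] > 0).astype(float)   # synthetic labels

w = np.zeros(8)
noise_std = 0.05            # stand-in for the measured conductance spread
lr = 0.5
for _ in range(300):
    w_noisy = w + rng.normal(scale=noise_std, size=w.shape)  # inject noise
    p = 1.0 / (1.0 + np.exp(-(X @ w_noisy)))                 # noisy forward
    grad = X.T @ (p - y) / len(y)
    w -= lr * grad                                           # update clean w

# Deployment with relaxed ("noisy") weights should still classify well.
w_deployed = w + rng.normal(scale=noise_std, size=w.shape)
acc = np.mean((1 / (1 + np.exp(-(X @ w_deployed))) > 0.5) == y)
assert acc > 0.85
```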
By applying the above two techniques, we can already measure inference accuracy comparable to or better than software models with 4-bit weights on Google speech command recognition, MNIST image recovery and MNIST classification (Fig. 1e). For deeper neural networks, we found that the error caused by those non-idealities that have nonlinear effects on MVM outputs, such as voltage drops, can accumulate through layers, and becomes more difficult to mitigate. In addition, multi-core parallel MVM leads to large instantaneous current, further exacerbating non-idealities such as voltage drops on input wires ((1) in Fig. 4a). As a result, when performing multi-core parallel inference on a deep CNN, ResNet-20^{43}, the measured accuracy on CIFAR-10 classification (83.67%) is still 3.36% lower than that of a 4-bit-weight software model (87.03%).
To bridge this accuracy gap, we introduce a chip-in-the-loop progressive fine-tuning technique. Chip-in-the-loop training mitigates the impact of non-idealities by measuring training error directly on the chip^{44}. Previous work has shown that fine-tuning the final layers using the back-propagated gradients calculated from hardware-measured outputs helped improve accuracy^{5}. We find this technique to be of limited effectiveness in countering those nonlinear non-idealities. Such a technique also requires reprogramming RRAM devices, which consumes additional energy. Our chip-in-the-loop progressive fine-tuning overcomes nonlinear model errors by exploiting the intrinsic nonlinear universal approximation capacity of the deep neural network^{45}, and furthermore eliminates the need for weight reprogramming. Figure 4d illustrates the fine-tuning procedure. We progressively program the weights one layer at a time onto the chip. After programming a layer, we perform inference using the training-set data on the chip up to that layer, and use the measured outputs to fine-tune the remaining layers that are still training in software. In the next time step, we program and measure the next layer on the chip. We repeat this process until all the layers are programmed. During the process, the non-idealities of the programmed layers can be progressively compensated by the remaining layers through training. Figure 5b shows the efficacy of this progressive fine-tuning technique. From left to right, each data point represents a new layer programmed onto the chip. The accuracy at each layer is evaluated by using the chip-measured outputs from that layer as inputs to the remaining layers in software. The cumulative CIFAR-10 test-set inference accuracy is improved by 1.99% using this technique. Extended Data Fig. 8a further illustrates the extent to which fine-tuning recovers the training-set accuracy loss at each layer, demonstrating the effectiveness of the approach in bridging the accuracy gap between software and hardware measurements.
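The procedure reduces to a simple loop in a toy setting. Below, a two-layer linear model stands in for the network, an emulated "chip" layer applies a systematic gain error, and a least-squares refit stands in for gradient-descent fine-tuning of the layers still in software; none of this reflects the chip's actual training stack, only the structure of the compensation:

```python
import numpy as np

# Chip-in-the-loop progressive fine-tuning, reduced to a two-layer linear toy.
# Layer 1 is "programmed" onto an emulated chip with a 10% gain droop; the
# still-in-software layer 2 is refit on chip-measured activations so that it
# absorbs layer 1's non-ideality, with no reprogramming of layer 1.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
W1, W2 = rng.normal(size=(6, 5)), rng.normal(size=(5, 3))
target = X @ W1 @ W2                        # ideal all-software outputs

def chip_layer1(x):
    return 0.9 * (x @ W1)                   # emulated non-ideal chip layer

# Naive deployment: keep the software-trained W2 -> the gain error propagates.
naive_err = np.abs(chip_layer1(X) @ W2 - target).mean()

# Fine-tuning step: measure layer-1 outputs on the "chip", then refit W2 on
# those measured activations against the original training targets.
acts = chip_layer1(X)
W2_ft, *_ = np.linalg.lstsq(acts, target, rcond=None)
ft_err = np.abs(acts @ W2_ft - target).mean()

assert ft_err < 1e-6 < naive_err            # downstream layer absorbs the error
```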
Using the techniques described above, we achieve inference accuracy comparable to software models trained with 4-bit weights across all the measured AI benchmark tasks. Figure 1e shows that we achieve a 0.98% error rate on MNIST handwritten digit recognition using a 7-layer CNN, a 14.34% error rate on CIFAR-10 object classification using ResNet-20, a 15.34% error rate on Google speech command recognition using a 4-cell LSTM, and a 70% reduction in L2 image-reconstruction error compared with the original noisy images on MNIST image recovery using an RBM. Some of these numbers do not yet reach the accuracies achieved by full-precision digital implementations. The accuracy gap mainly comes from the low-precision (≤4-bit) quantization of inputs and activations, especially at the most sensitive input and output layers^{46}. For instance, Extended Data Fig. 8b presents an ablation study showing that quantizing input images to 4-bit alone results in a 2.7% accuracy drop for CIFAR-10 classification. By contrast, the input layer accounts for only 1.08% of the compute and 0.16% of the weights of a ResNet-20 model. Therefore, these sensitive layers can be offloaded to higher-precision digital compute units with little overhead. In addition, applying more advanced quantization techniques and optimizing training procedures such as data augmentation and regularization should further improve the accuracy for both quantized software models and hardware-measured results.
Table 1 summarizes the key features of each demonstrated model. Most of the essential neural-network layers and operations are implemented on the chip, including all the convolutional, fully connected and recurrent layers, neuron activation functions, batch normalization and the stochastic sampling process. Other operations such as average pooling and element-wise multiplications are implemented on an FPGA integrated on the same board as NeuRRAM (Extended Data Fig. 11a). Each of the models is implemented by allocating the weights to multiple cores on a single NeuRRAM chip. We developed a software toolchain to allow easy deployment of AI models on the chip^{47}. The implementation details are described in Methods. Fundamentally, each of the selected benchmarks represents a general class of common edge AI tasks: visual recognition, speech processing and image denoising. These results demonstrate the versatility of the TNSA architecture and the wide applicability of the hardware-algorithm co-optimization techniques.
The NeuRRAM chip simultaneously improves efficiency, flexibility and accuracy over existing RRAM-CIM hardware by innovating across the entire hierarchy of the design, from a TNSA architecture enabling reconfigurable dataflow directions, to an energy- and area-efficient voltage-mode neuron circuit, and to a series of algorithm-hardware co-optimization techniques. These techniques can be applied more generally to other non-volatile resistive memory technologies such as phase-change memory^{8,17,21,23,24}, magnetoresistive RAM^{48} and ferroelectric field-effect transistors^{49}. Going forwards, we expect NeuRRAM's peak energy efficiency (EDP) to improve by another two to three orders of magnitude while supporting bigger AI models when scaling from 130-nm to 7-nm CMOS and RRAM technologies (detailed in Methods). Multi-core architecture design with a network-on-chip that realizes efficient and versatile data transfers and inter-array pipelining is likely to be the next major challenge for RRAM-CIM^{37,38}, which needs to be addressed by further cross-layer co-optimization. As resistive memory continues to scale towards offering terabits of on-chip memory^{50}, such a co-optimization approach will equip CIM hardware on the edge with sufficient performance, efficiency and versatility to perform complex AI tasks that today can be done only on the cloud.
Methods
Core block diagram and operating modes
Figure 2d and Extended Data Fig. 1 show the block diagram of a single CIM core. To support versatile MVM directions, most of the design is symmetrical in the row (BLs and WLs) and column (SLs) directions. The row and column register files store the inputs and outputs of MVMs, and can be written externally either by a Serial Peripheral Interface (SPI) or by a random-access interface that uses an 8-bit address decoder to select one register entry, or internally by the neurons. The SL peripheral circuits contain an LFSR block used to generate the pseudo-random sequences used for probabilistic sampling. It is implemented as two LFSR chains propagating in opposite directions. The registers of the two chains are XORed to generate spatially uncorrelated random numbers^{51}. The controller block receives commands and generates control waveforms for the BL/WL/SL peripheral logic and the neurons. It contains a delay-line-based pulse generator with tunable pulse width from 1 ns to 10 ns. It also implements the clock-gating and power-gating logic used to turn off the core in idle mode. Each WL, BL and SL of the TNSA is driven by a driver consisting of multiple pass gates that supply different voltages. On the basis of the values stored in the register files and the control signals issued by the controller, the WL/BL/SL logic decides the state of each pass gate.
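A behavioural model of this pseudo-random source is sketched below. The two chains are modelled as right- and left-shifting 16-bit Fibonacci LFSRs with standard maximal-length taps; the tap polynomials and seeds are our assumptions, not necessarily those used on chip:

```python
# Pseudo-random source sketch: two LFSR chains shifting in opposite
# directions, with the XOR of the two chains supplying the output bits.
# Taps: x^16 + x^14 + x^13 + x^11 + 1 (right shift) and its reciprocal
# (left shift) -- standard maximal-length choices, assumed here.
def lfsr16_right(s):
    bit = ((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
    return ((s >> 1) | (bit << 15)) & 0xFFFF

def lfsr16_left(s):
    bit = ((s >> 15) ^ (s >> 4) ^ (s >> 2) ^ (s >> 1)) & 1
    return ((s << 1) | bit) & 0xFFFF

def xored_stream(seed_a, seed_b, n):
    a, b, out = seed_a, seed_b, []
    for _ in range(n):
        a, b = lfsr16_right(a), lfsr16_left(b)
        out.append((a ^ b) & 1)       # XOR of the two chains -> output bit
    return out

bits = xored_stream(0xACE1, 0x1D05, 1000)
# The XORed stream should be close to balanced between 0s and 1s.
assert 0.35 < sum(bits) / len(bits) < 0.65
```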
The core has three main operating modes: a weight-programming mode, a neuron-testing mode and an MVM mode (Extended Data Fig. 1). In the weight-programming mode, individual RRAM cells are selected for read and write. To select a single cell, the registers at the corresponding row and column are programmed to '1' through random access with the help of the row and column decoders, whereas the other registers are reset to '0'. The WL/BL/SL logic turns on the corresponding driver pass gates to apply a set/reset/read voltage on the selected cell. In the neuron-testing mode, the WLs are kept at ground voltage (GND). Neurons receive inputs directly from the BL or SL drivers through their BL or SL switches, bypassing the RRAM devices. This allows us to characterize the neurons independently from the RRAM array. In the MVM mode, each input BL and SL is driven to V_{ref} − V_{read}, V_{ref} + V_{read} or V_{ref} depending on the register value at that row or column. If the MVM is in the BL-to-SL direction, we activate the WLs that are within the input vector length while keeping the rest at GND; if the MVM is in the SL-to-BL direction, we activate all the WLs. After the neurons finish analogue-to-digital conversion, the pass gates from the BLs and SLs to the registers are turned on to allow neuron-state readout.
Device fabrication
RRAM arrays in NeuRRAM are in a one-transistor–one-resistor (1T1R) configuration, in which each RRAM device is stacked on top of, and connected in series with, a selector n-type metal-oxide-semiconductor (NMOS) transistor that cuts off the sneak path and provides current compliance during RRAM programming and reading. The selector NMOS, the CMOS peripheral circuits and the bottom four back-end-of-line interconnect metal layers are fabricated in a standard 130-nm foundry process. Owing to the higher voltage required for RRAM forming and programming, the selector NMOS and the peripheral circuits that directly interface with the RRAM arrays use thick-oxide input/output (I/O) transistors rated for 5-V operation. All the other CMOS circuits, in the neurons, digital logic, registers and so on, use core transistors rated for 1.8-V operation.
The RRAM device is sandwiched between the metal-4 and metal-5 layers, as shown in Fig. 2c. After the foundry completes the fabrication of the CMOS and the bottom four metal layers, we use a laboratory process to finish the fabrication of the RRAM devices, the metal-5 interconnect, and the top metal pad and passivation layers. The RRAM device stack consists of a titanium nitride (TiN) bottom-electrode layer, a hafnium oxide (HfO_{x}) switching layer, a tantalum oxide (TaO_{x}) thermal-enhancement layer^{52} and a TiN top-electrode layer. These layers are deposited sequentially, followed by a lithography step to pattern the lateral structure of the device array.
RRAM write–verify programming and conductance relaxation
Each neural-network weight is encoded by the differential conductance between two RRAM cells on adjacent rows along the same column. The first RRAM cell encodes the positive weight, and is programmed to a low-conductance state (g_{min}) if the weight is negative; the second cell encodes the negative weight, and is programmed to g_{min} if the weight is positive. Mathematically, the conductances of the two cells are \({\rm{\max }}({g}_{{\rm{\max }}}\frac{W}{{w}_{{\rm{\max }}}},{g}_{{\rm{\min }}})\) and \({\rm{\max }}(-{g}_{{\rm{\max }}}\frac{W}{{w}_{{\rm{\max }}}},{g}_{{\rm{\min }}})\), respectively, where g_{max} and g_{min} are the maximum and minimum conductances of the RRAMs, w_{max} is the maximum absolute value of the weights, and W is the unquantized high-precision weight.
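A minimal sketch of this differential encoding (the function and parameter names are our own; g_min = 1 μS and g_max = 40 μS match the CNN settings reported below):

```python
import numpy as np

def weight_to_conductance(W, w_max, g_min=1e-6, g_max=40e-6):
    """Map weights to differential RRAM conductance pairs (in siemens).

    The effective weight read out during MVM is proportional to
    g_pos - g_neg; the cell encoding the opposite sign rests at g_min.
    """
    g_pos = np.maximum(g_max * W / w_max, g_min)
    g_neg = np.maximum(-g_max * W / w_max, g_min)
    return g_pos, g_neg

g_pos, g_neg = weight_to_conductance(np.array([0.5, -0.25, 0.0]), w_max=1.0)
```

For example, a weight of 0.5 (with w_max = 1) maps to a 20-μS positive cell with the negative cell at g_min.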
To program an RRAM cell to its target conductance, we use an incremental-pulse write–verify technique^{42}. Extended Data Fig. 3a,b illustrates the procedure. We start by measuring the initial conductance of the cell. If the value is below the target conductance, we apply a weak set pulse that aims to slightly increase the cell conductance. Then we read the cell again. If the value is still below the target, we apply another set pulse, with its amplitude incremented by a small amount. We repeat such set–read cycles until the cell conductance is within an acceptance range of the target value or overshoots to the other side of the target. In the latter case, we reverse the pulse polarity to reset, and repeat the same procedure as with set. During the set/reset pulse train, the cell conductance is likely to bounce up and down multiple times until it eventually enters the acceptance range or reaches a timeout limit.
There are a few trade-offs in selecting the programming conditions. (1) A smaller acceptance range and a higher timeout limit improve programming precision, but require a longer time. (2) A higher g_{max} improves the signal-to-noise ratio (SNR) during inference, but leads to higher energy consumption and more programming failures for cells that cannot reach high conductance. In our experiments, we set the initial set pulse voltage to 1.2 V and the reset pulse voltage to 1.5 V, both with an increment of 0.1 V and a pulse width of 1 μs. An RRAM read takes 1–10 μs, depending on the conductance. The acceptance range is ±1 μS of the target conductance. The timeout limit is 30 set–reset polarity reversals. We used g_{min} = 1 μS for all the models, and g_{max} = 40 μS for CNNs and g_{max} = 30 μS for LSTMs and RBMs. With such settings, 99% of the RRAM cells can be programmed into the acceptance range within the timeout limit. On average, each cell requires 8.52 set/reset pulses. In the current implementation, the speed of this write–verify process is limited by the external control of the DAC and ADC. If everything were integrated into a single chip, such write–verify would take on average 56 µs per cell. Having multiple copies of the DAC and ADC to write–verify multiple cells in parallel would further improve RRAM programming throughput, at the cost of more chip area.
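The pulse-train logic can be sketched as a simulation against a toy device model (the helper names and the linear device response are our own assumptions; the initial voltages, increment and reversal limit follow the settings above):

```python
def write_verify(read, apply_pulse, target_uS, tol_uS=1.0, v_set0=1.2,
                 v_reset0=1.5, v_step=0.1, max_reversals=30, max_pulses=500):
    """Incremental-pulse write-verify: read, pulse with growing amplitude,
    reverse polarity on overshoot, stop inside the acceptance range."""
    polarity = 'set' if read() < target_uS else 'reset'
    v = v_set0 if polarity == 'set' else v_reset0
    reversals = pulses = 0
    while reversals <= max_reversals and pulses < max_pulses:
        g = read()
        if abs(g - target_uS) <= tol_uS:
            return True                      # within acceptance range
        need = 'set' if g < target_uS else 'reset'
        if need != polarity:                 # overshoot: reverse polarity
            polarity, v = need, (v_set0 if need == 'set' else v_reset0)
            reversals += 1
        apply_pulse(polarity, v)
        pulses += 1
        v += v_step                          # increment pulse amplitude

    return False

# Toy device: each pulse moves the conductance by 1 uS per volt applied.
g = [5.0]
ok = write_verify(lambda: g[0],
                  lambda pol, v: g.__setitem__(0, g[0] + (v if pol == 'set' else -v)),
                  target_uS=20.0)
```

With this idealized device the loop converges to within ±1 μS of the 20-μS target after a handful of set pulses.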
Besides the longer programming time, another reason not to use an overly small write–verify acceptance range is RRAM conductance relaxation: RRAM conductance changes over time after programming. Most of the change happens within a short time window (less than 1 s) immediately after programming, beyond which the change becomes much slower, as shown in Extended Data Fig. 3d. The abrupt initial change is called ‘conductance relaxation’ in the literature^{41}. Its statistics follow a Gaussian distribution at all conductance states except when the conductance is close to g_{min}. Extended Data Fig. 3c,d shows the conductance relaxation measured across the whole g_{min}-to-g_{max} conductance range. We found that the loss of programming precision owing to conductance relaxation is much higher than that caused by the write–verify acceptance range. The average standard deviation across all levels of initial conductance is about 2.8 μS. The maximum standard deviation is about 4 μS, which is close to 10% of g_{max}.
To mitigate the relaxation, we use an iterative programming technique. We iterate over the RRAM array multiple times. In each iteration, we measure all the cells and reprogram those whose conductance has drifted outside the acceptance range. Extended Data Fig. 3e shows that the standard deviation becomes smaller with more programming iterations. After three iterations, the standard deviation falls to about 2 μS, a 29% decrease from the initial value. We use three iterations in all our neural-network demonstrations and perform inference at least 30 min after programming, such that the measured inference accuracy accounts for these conductance relaxation effects. By combining the iterative programming with our hardware-aware model-training approach, the impact of relaxation can be largely mitigated.
Implementation of MVM with multi-bit inputs and outputs
The neuron and the peripheral circuits support MVM at configurable input and output bit-precisions. An MVM operation consists of an initialization phase, an input phase and an output phase. Extended Data Fig. 4 illustrates the neuron circuit operation. During the initialization phase (Extended Data Fig. 4a), all BLs and SLs are precharged to V_{ref}. The sampling capacitors C_{sample} of the neurons are also precharged to V_{ref}, whereas the integration capacitors C_{integ} are discharged.
During the input phase, each input wire (either BL or SL, depending on the MVM direction) is driven to one of three voltage levels, V_{ref} − V_{read}, V_{ref} and V_{ref} + V_{read}, through three pass gates, as shown in Fig. 3b. During forwards MVM, under differential-row weight mapping, each input is applied to a pair of adjacent BLs. The two BLs are driven to opposite voltages with respect to V_{ref}. That is, when the input is 0, both wires are driven to V_{ref}; when the input is +1, the two wires are driven to V_{ref} + V_{read} and V_{ref} − V_{read}; and when the input is −1, to V_{ref} − V_{read} and V_{ref} + V_{read}. During backwards MVM, each input is applied to a single SL. The difference operation is performed digitally after the neurons finish their analogue-to-digital conversions.
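In code form, the per-input drive on a differential BL pair looks like this (the V_ref and V_read values are placeholders, not the chip's actual biases):

```python
def bl_pair_voltages(x, v_ref=0.5, v_read=0.25):
    """Return the (BL+, BL-) drive voltages for a ternary input x in {-1, 0, +1}."""
    if x == 0:
        return (v_ref, v_ref)                    # no differential drive
    if x > 0:
        return (v_ref + v_read, v_ref - v_read)  # +1: BL+ high, BL- low
    return (v_ref - v_read, v_ref + v_read)      # -1: BL+ low, BL- high
```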
After biasing the input wires, we pulse the WLs that have inputs for 10 ns while keeping the output wires floating. Once the voltages of the output wires settle to \({V}_{j}=\frac{{\sum }_{i}{V}_{i}{G}_{{ij}}}{{\sum }_{i}{G}_{{ij}}}\), where G_{ij} represents the conductance of the RRAM at the ith row and the jth column, we turn off the WLs to stop all current flow. We then sample the charge remaining on the output-wire parasitic capacitance onto C_{sample} located within the neurons, followed by integrating the charge onto C_{integ}, as shown in Extended Data Fig. 4b. The sampling pulse is 10 ns (limited by the 100-MHz external clock from the FPGA); the integration pulse is 240 ns, limited by the large integration capacitor (104 fF), which was chosen conservatively to ensure functional correctness and to allow testing of different neuron operating conditions.
The multi-bit input digital-to-analogue conversion is performed in a bit-serial fashion. For the nth least significant bit (LSB), we apply a single pulse to the input wires, followed by sampling and integrating charge from the output wires onto C_{integ} for 2^{n−1} cycles. At the end of the multi-bit input phase, the complete analogue MVM output is stored as charge on C_{integ}. For example, as shown in Fig. 3e, when the input vectors are 4-bit signed integers with 1 sign bit and 3 magnitude bits, we first send pulses corresponding to the first (least significant) magnitude bit to the input wires, followed by sampling and integrating for one cycle. For the second and third magnitude bits, we again apply one pulse to the input wires for each bit, followed by sampling and integrating for two cycles and four cycles, respectively. In general, for n-bit signed integer inputs, we need a total of n − 1 input pulses and 2^{n−1} − 1 sampling and integration cycles.
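The bit-serial scheme can be emulated numerically: one 'analogue' pass per magnitude bit, integrated for 2^k cycles. This idealized model (our own sketch) ignores sampling noise and headroom limits:

```python
def bitserial_mvm(x_int, g, n_bits=4):
    """Emulate the bit-serial input DAC: for the k-th magnitude bit (LSB
    first), pulse once and integrate the analogue dot product 2**k times."""
    total = 0.0
    for k in range(n_bits - 1):                      # n-1 magnitude bits
        charge = sum((((abs(v) >> k) & 1) * (1 if v >= 0 else -1)) * gi
                     for v, gi in zip(x_int, g))     # one analogue MVM pass
        total += charge * (2 ** k)                   # 2**k integration cycles
    return total

# For 4-bit signed inputs this uses 3 pulses and 1 + 2 + 4 = 7 cycles,
# and reproduces the exact dot product of inputs and conductances.
result = bitserial_mvm([3, -2, 1], [1.0, 2.0, 3.0])
```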
Such a multi-bit input scheme becomes inefficient at high input bit-precisions owing to the exponentially increasing number of sampling and integration cycles. Moreover, headroom clipping becomes an issue as the charge integrated on C_{integ} saturates with more integration cycles. The headroom clipping can be overcome by using a lower V_{read}, but at the cost of a lower SNR, so the overall MVM accuracy might not improve when using higher-precision inputs. For instance, Extended Data Fig. 5a,c shows the measured root-mean-square error (r.m.s.e.) of the MVM results: quantizing inputs to 6-bit (r.m.s.e. = 0.581) does not improve the MVM accuracy compared with 4-bit (r.m.s.e. = 0.582), owing to the lower SNR.
To solve both issues, we use a two-phase input scheme for inputs of more than 4 bits. Figure 3f illustrates the process. To perform MVM with 6-bit inputs and 8-bit outputs, we divide the inputs into two segments, the first containing the three most significant bits (MSBs) and the second containing the three LSBs. We then perform the MVM, including the output analogue-to-digital conversion, for each segment separately. For the MSBs, the neurons (ADCs) are configured to output 8 bits; for the LSBs, the neurons output 5 bits. The final results are obtained by shifting and adding the two outputs in the digital domain. Extended Data Fig. 5d shows that the scheme lowers the MVM r.m.s.e. from 0.581 to 0.519. Extended Data Fig. 12c–e further shows that such a two-phase scheme both extends the input bit-precision range and improves the energy efficiency.
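A digital-domain sketch of the two-phase split, with an idealized per-segment MVM standing in for the 3-bit analogue conversion (unsigned magnitudes only, for brevity):

```python
def two_phase_mvm(x6, g, mvm3):
    """Split 6-bit inputs into 3-bit MSB and LSB segments, convert each
    separately, then shift-and-add the two digitized outputs."""
    msb = [v >> 3 for v in x6]       # top three bits of each input
    lsb = [v & 0b111 for v in x6]    # bottom three bits of each input
    return (mvm3(msb, g) << 3) + mvm3(lsb, g)

# Stand-in for the per-segment analogue MVM plus ADC (assumed ideal here).
ideal = lambda x, g: sum(xi * gi for xi, gi in zip(x, g))
out = two_phase_mvm([37, 5], [2, 3], ideal)
```

With an ideal per-segment conversion, the shift-and-add recovers the full 6-bit dot product exactly.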
Finally, during the output phase, the analogue-to-digital conversion is again performed in a bit-serial fashion, through a binary search process. First, to generate the sign bit of the outputs, we disconnect the feedback loop of the amplifier to turn the integrator into a comparator (Extended Data Fig. 4c). We drive the right side of C_{integ} to V_{ref}. If the integrated charge is positive, the comparator output will be GND, and the supply voltage VDD otherwise. The comparator output is then inverted, latched and read out to the BL or SL via the neuron BL or SL switch, before being written into the peripheral BL or SL registers.
To generate k magnitude bits, we add or subtract charge from C_{integ} (Extended Data Fig. 4d), followed by comparison and readout, for k cycles. From MSB to LSB, the amount of charge added or subtracted is halved every cycle. Whether to add or to subtract is determined automatically by the comparison result stored in the latch from the previous cycle. Figure 3g illustrates this process. A sign bit of ‘1’ is first generated and latched in the first cycle, representing a positive output. To generate the most significant magnitude bit, the latch turns on the path from V_{decr−} = V_{ref} − V_{decr} to C_{sample}. The charge sampled by C_{sample} is then integrated on C_{integ} by turning on the negative feedback loop of the amplifier, resulting in C_{sample}V_{decr} of charge being subtracted from C_{integ}. In this example, C_{sample}V_{decr} is greater than the original amount of charge on C_{integ}, so the total charge becomes negative and the comparator generates a ‘0’ output. To generate the second magnitude bit, V_{decr} is reduced by half. This time, the latch turns on the path from V_{decr+} = V_{ref} + 1/2V_{decr} to C_{sample}. As the total charge on C_{integ} after integration is still negative, the comparator outputs a ‘0’ again in this cycle. We repeat this process until the least significant magnitude bit is generated. Note that if the initial sign bit is ‘0’, all subsequent magnitude bits are inverted before readout.
Such an output conversion scheme is similar to an algorithmic ADC or a successive-approximation-register (SAR) ADC, in the sense that a binary search is performed over n cycles for an n-bit output. The difference is that an algorithmic ADC uses a residue amplifier, and a SAR ADC requires a multi-bit DAC for each ADC, whereas our scheme needs no residue amplifier and uses a single DAC that outputs 2 × (n − 1) different V_{decr+} and V_{decr−} levels, shared by all neurons (ADCs). As a result, our scheme enables a more compact design by time-multiplexing one amplifier for integration and comparison, eliminating the residue amplifier and amortizing the DAC area across all neurons in a CIM core. For CIM designs that use a dense memory array, such a compact design allows each ADC to be time-multiplexed by a smaller number of rows and columns, thus improving throughput.
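The binary-search conversion can be modelled as below. This is an idealized sketch of our own: charge is a plain number, and the sign is folded by taking the absolute value, relying on the invert-at-readout step described above:

```python
def neuron_adc(q_integ, q_full, n_mag_bits):
    """Bit-serial binary-search ADC: halve the charge decrement each cycle
    and add or subtract it based on the previous comparator decision."""
    sign = 1 if q_integ >= 0 else 0
    q, last = abs(q_integ), 1        # latched sign bit seeds the first step
    bits, step = [], q_full / 2      # MSB charge decrement
    for _ in range(n_mag_bits):
        q = q - step if last else q + step   # latch selects V_decr- or V_decr+
        last = 1 if q >= 0 else 0            # comparator output
        bits.append(last)
        step /= 2                            # V_decr halves every cycle
    return sign, bits                        # magnitude bits, MSB first
```

For a full-scale charge of 8 and 3 magnitude bits, an integrated charge of 5.5 digitizes to sign 1 and bits [1, 0, 1], that is, ⌊5.5⌋ = 5.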
To summarize, both the configurable MVM input and output bit-precisions and the various neuron activation functions are implemented using different combinations of four basic operations: sampling, integration, comparison and charge decrement. Importantly, all four operations are realized by a single amplifier configured in different feedback modes. As a result, the design achieves versatility and compactness at the same time.
Multi-core parallel MVM
NeuRRAM supports performing MVMs in parallel on multiple CIM cores. Multi-core MVM brings additional challenges to computational accuracy, because certain hardware non-idealities that do not manifest in single-core MVM become more severe with more cores. These include voltage drop on the input wires, core-to-core variation and supply-voltage instability. Voltage drop on the input wires (non-ideality (1) in Fig. 4a) is caused by the large current drawn simultaneously from a shared voltage source by multiple cores. It makes the equivalent weights stored in each core vary with the applied inputs, and therefore has a nonlinear, input-dependent effect on the MVM outputs. Moreover, as different cores lie at different distances from the shared voltage source, they experience different amounts of voltage drop. Therefore, we cannot optimize the read-voltage amplitude separately for each core to make its MVM output occupy exactly the full neuron input dynamic range.
These non-idealities together degrade the multi-core MVM accuracy. Extended Data Fig. 5e,f shows that, when performing convolution in parallel on three cores, the outputs of convolutional layer 15 have a measured r.m.s.e. of 0.383, compared with 0.318 obtained by performing the convolution sequentially on the three cores. In our ResNet-20 experiment, we performed two-core parallel MVMs for convolutions within block 1 (Extended Data Fig. 9a), and three-core parallel MVMs for convolutions within blocks 2 and 3.
The voltage-drop issue can be partially alleviated by making the wires that carry large instantaneous currents as low-resistance as possible, and by employing a power-delivery network with a more optimized topology. But the issue will persist and become worse as more cores are used. Therefore, our experiments aim to study the efficacy of algorithm-hardware co-optimization techniques in mitigating the issue. Note also that, for a full-chip implementation, additional modules such as intermediate-result buffers, partial-sum accumulators and a network-on-chip would need to be integrated to manage inter-core data transfers. Program scheduling should also be carefully optimized to minimize the buffer size and the energy spent on intermediate data movement. Although there are studies on such full-chip architectures and scheduling^{37,38,53}, they are outside the scope of this study.
Noise-resilient neural-network training
During noise-resilient neural-network training, we inject noise into the weights of all fully connected and convolutional layers during the forwards pass to emulate the effects of RRAM conductance relaxation and read noise. The distribution of the injected noise is obtained by RRAM characterization. We used the iterative write–verify technique to program RRAM cells into different initial conductance states and measured their conductance relaxation after 30 min. Extended Data Fig. 3d shows that the measured conductance relaxation has an absolute mean of <1 μS (g_{min}) at all conductance states. The highest standard deviation is 3.87 μS, about 10% of g_{max} (40 μS), found at an initial conductance state of about 12 μS. Therefore, to simulate such conductance relaxation behaviour during inference, we inject Gaussian noise with zero mean and a standard deviation equal to 10% of the maximum weight of a layer.
We train models with different levels of noise injection, from 0% to 40%, and select the model that achieves the highest inference accuracy at the 10% noise level for on-chip deployment. We find that injecting higher noise during training than during testing improves a model's noise resiliency. Extended Data Fig. 7a–c shows that the best test-time accuracy in the presence of 10% weight noise is obtained with 20% training-time noise injection for CIFAR-10 image classification, 15% for Google voice command classification and 35% for RBM-based image reconstruction.
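A minimal sketch of the training-time noise injection for one layer, with NumPy standing in for the actual training framework (fresh noise is drawn on every forwards pass; the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_forward(x, W, noise_frac=0.2):
    """Forwards pass with Gaussian weight noise whose standard deviation is
    noise_frac times the layer's maximum |weight| (e.g. 0.2 for CIFAR-10)."""
    sigma = noise_frac * np.abs(W).max()
    return x @ (W + rng.normal(0.0, sigma, size=W.shape))
```

During training, the gradient is taken through the noisy weights while the clean weights are updated, so the model learns to be insensitive to the perturbation.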
For CIFAR-10, the better initial accuracy obtained by the model trained with 5% noise is most likely due to the regularization effect of noise injection. A similar phenomenon has been reported in the neural-network quantization literature, where a model trained with quantization occasionally outperforms the full-precision model^{54,55}. In our experiments, we did not apply additional regularization on top of noise injection for models trained without noise, which might result in suboptimal accuracy.
For the RBM, Extended Data Fig. 7d further shows how the reconstruction error decreases with the number of Gibbs sampling steps for models trained with different noise levels. In general, models trained with higher noise converge faster during inference. The model trained with 20% noise reaches the lowest error at the end of 100 Gibbs sampling steps.
Extended Data Fig. 7e shows the effect of noise injection on the weight distribution. Without noise injection, the weights have a Gaussian distribution; the neural-network outputs depend heavily on a small fraction of large weights, and thus become vulnerable to noise. With noise injection, the weights are distributed more uniformly, making the model more noise-resilient.
To implement the models efficiently on NeuRRAM, the inputs to all convolutional and fully connected layers are quantized to 4-bit or below. The input bit-precisions of all the models are summarized in Table 1. We perform the quantized training using the parameterized clipping activation technique^{46}. The accuracies of some of our quantized models are lower than those of state-of-the-art quantized models because we apply <4-bit quantization to the most sensitive input and output layers of the neural networks, which has been reported to cause large accuracy degradation and is thus often excluded from low-precision quantization^{46,54}. To obtain better accuracy for quantized models, one could use higher precision for the sensitive input and output layers, apply more advanced quantization techniques, and use more optimized data preprocessing, data augmentation and regularization during training. However, the focus of this work is to achieve comparable inference accuracy on hardware and in software while keeping all these variables the same, rather than to obtain state-of-the-art inference accuracy on all the tasks. The aforementioned quantization and training techniques would be equally beneficial for both our software baselines and our hardware measurements.
Chip-in-the-loop progressive fine-tuning
During the progressive chip-in-the-loop fine-tuning, we use the chip-measured intermediate outputs from a layer to fine-tune the weights of the remaining layers. Importantly, to fairly evaluate the efficacy of the technique, we do not use the test-set data (for either training or selecting checkpoints) during the entire fine-tuning process. To avoid overfitting to a small fraction of the data, measurements should be performed on the entire training set. We reduce the learning rate to 1/100 of the initial learning rate used for training the baseline model, and fine-tune for 30 epochs, although we observed that the accuracy generally plateaus within the first 10 epochs. The same weight noise injection and input quantization are applied during the fine-tuning.
Implementations of CNNs, LSTMs and RBMs
We use CNN models for the CIFAR-10 and MNIST image-classification tasks. The CIFAR-10 dataset consists of 50,000 training images and 10,000 testing images belonging to 10 object classes. We perform image classification using ResNet-20^{43}, which contains 21 convolutional layers and 1 fully connected layer (Extended Data Fig. 9a), with batch normalizations and ReLU activations between the layers. The model is trained using the Keras framework. We quantize the inputs of all convolutional and fully connected layers to a 3-bit unsigned fixed-point format, except for the first convolutional layer, where we quantize the input image to 4-bit because the inference accuracy is more sensitive to the input quantization. For the MNIST handwritten-digit classification, we use a seven-layer CNN consisting of six convolutional layers and one fully connected layer, with max-pooling between layers to downsample the feature maps. The inputs to all the layers, including the input image, are quantized to a 3-bit unsigned fixed-point format.
All the parameters of the CNNs, including those of the convolutional layers, the fully connected layers and the batch normalization, are implemented on a single NeuRRAM chip. Other operations, such as partial-sum accumulation and average pooling, are implemented on an FPGA integrated on the same board as the NeuRRAM. These operations amount to only a small fraction of the total computation, and integrating them in digital CMOS would incur negligible overhead; the FPGA implementation was chosen to provide greater flexibility during testing and development.
Extended Data Fig. 9a–c illustrates the process of mapping a convolutional layer onto the chip. To implement the weights of a four-dimensional convolutional layer with dimensions H (height), W (width), I (number of input channels) and O (number of output channels) on two-dimensional RRAM arrays, we flatten the first three dimensions into a one-dimensional vector, and append the bias term of each output channel to each vector. If the range of the bias values is B times the weight range, we evenly divide the bias values and implement them using B rows. Furthermore, we merge the batch-normalization parameters into the convolutional weights and biases after training (Extended Data Fig. 9b), and program the merged Wʹ and bʹ onto the RRAM arrays such that no explicit batch normalization needs to be performed during inference.
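The batch-norm merge uses the standard inference-time folding identity (a sketch with per-output-channel parameters and W flattened to shape (O, HWI); the function name is ours):

```python
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mu, var, eps=1e-5):
    """Fold y = gamma * (W @ x + b - mu) / sqrt(var + eps) + beta
    into merged weights W' and biases b'."""
    scale = gamma / np.sqrt(var + eps)        # one scale per output channel
    return W * scale[:, None], scale * (b - mu) + beta

# The folded layer reproduces conv followed by batch norm exactly.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 9)), rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mu, var = rng.normal(size=4), rng.random(4) + 0.1
Wf, bf = fold_batchnorm(W, b, gamma, beta, mu, var)
x = rng.normal(size=9)
```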
Under the differential-row weight-mapping scheme, the parameters of a convolutional layer are converted into a conductance matrix of size (2(HWI + B), O). If the conductance matrix fits into a single core, an input vector is applied to 2(HWI + B) rows and broadcast to O columns in a single cycle, performing HWIO multiply–accumulate (MAC) operations in parallel. Most ResNet-20 convolutional layers have a conductance-matrix height 2(HWI + B) that is greater than the RRAM array length of 256. We therefore split them vertically into multiple segments, and map the segments either onto different cores that are accessed in parallel, or onto different columns within a core that are accessed sequentially. The details of the weight-mapping strategies are described in the next section.
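The segmentation arithmetic for one layer can be sketched as follows (the function names are ours):

```python
import math

def conductance_matrix_shape(H, W, I, O, B=1):
    """Differential-row mapping: two rows per weight plus B bias rows."""
    return 2 * (H * W * I + B), O

def num_segments(rows, cols, array_size=256):
    """Vertical (and, if needed, horizontal) tiles to fit 256 x 256 cores."""
    return math.ceil(rows / array_size), math.ceil(cols / array_size)

# Example: a 3 x 3 x 16 -> 16 convolution with one bias row needs
# 2 * (3*3*16 + 1) = 290 rows, so it splits into two vertical segments.
rows, cols = conductance_matrix_shape(3, 3, 16, 16)
segs = num_segments(rows, cols)
```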
The Google speech command dataset consists of 65,000 1-s-long audio recordings of voice commands, such as ‘yes’, ‘up’, ‘on’ and ‘stop’, spoken by thousands of different people. The commands are categorized into 12 classes. Extended Data Fig. 9d illustrates the model architecture. We use the Mel-frequency cepstral coefficient encoding approach to encode every 40-ms frame of audio into a length-40 vector. With a hop length of 20 ms, we obtain a time series of 50 steps for each 1-s recording.
We build a model that contains four parallel LSTM cells. Each cell has a hidden state of length 112. The final classification is based on the summation of the outputs from the four cells. Compared with a single-cell model, the four-cell model reduces the classification error (of an unquantized model) from 10.13% to 9.28% by leveraging additional cores on the NeuRRAM chip. Within a cell, in each time step, we compute the values of the four LSTM gates (input, activation, forget and output) from the inputs of the current step and the hidden states of the previous step. We then perform element-wise operations between the four gates to compute the new hidden-state value. The final logit outputs are calculated from the hidden states of the final time step.
Each LSTM cell has three weight matrices that are implemented on the chip: an input-to-hidden-state matrix of size 40 × 448, a hidden-state-to-hidden-state matrix of size 112 × 448 and a hidden-state-to-logits matrix of size 112 × 12. The element-wise operations are implemented on the FPGA. The model is trained using the PyTorch framework. The inputs to all the MVMs are quantized to a 4-bit signed fixed-point format. All the remaining operations are quantized to 8-bit.
An RBM is a type of generative probabilistic graphical model. Instead of being trained to perform discriminative tasks such as classification, it learns the statistical structure of the data itself. Extended Data Fig. 9e shows the architecture of our image-recovery RBM. The model consists of 794 fully connected visible neurons, corresponding to 784 image pixels plus 10 one-hot encoded class labels, and 120 hidden neurons. We train the RBM using the contrastive divergence learning procedure in software.
During inference, we send 3-bit images with partially corrupted or blocked pixels to the model running on a NeuRRAM chip. The model then performs back-and-forth MVMs and Gibbs sampling between the visible and hidden neurons for ten cycles. In each cycle, the neurons sample binary states h and v from the MVM outputs based on the probability distributions \(p({h}_{j}=1|{\bf{v}})=\sigma ({a}_{j}+{\sum }_{i}{v}_{i}{w}_{ij})\) and \(p({v}_{i}=1|{\bf{h}})=\sigma ({b}_{i}+{\sum }_{j}{h}_{j}{w}_{ij})\), where σ is the sigmoid function, a_{j} is a bias for the hidden neurons (h) and b_{i} is a bias for the visible neurons (v). After sampling, we reset the uncorrupted pixels (visible neurons) to the original pixel values. The final inference performance is evaluated by computing the average L2 reconstruction error between the original image and the recovered image. Extended Data Fig. 10 shows examples of the measured image recovery.
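The recovery loop can be sketched as follows, with NumPy standing in for the on-chip MVMs (a is the hidden bias, b the visible bias, and `known` masks the uncorrupted pixels; a toy problem size replaces the 794 × 120 model):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gibbs_recover(v0, W, a, b, known, n_steps=10):
    """Alternate visible->hidden and hidden->visible Gibbs sampling,
    clamping the uncorrupted visible units after every step."""
    v = v0.copy()
    for _ in range(n_steps):
        h = (rng.random(W.shape[1]) < sigmoid(a + v @ W)).astype(float)
        v = (rng.random(W.shape[0]) < sigmoid(b + W @ h)).astype(float)
        v[known] = v0[known]             # reset the known pixels
    return v

# Toy example: 8 visible units, 4 hidden units, half the pixels known.
W = rng.normal(size=(8, 4))
v0 = (rng.random(8) > 0.5).astype(float)
known = np.arange(8) < 4
v_rec = gibbs_recover(v0, W, rng.normal(size=4), rng.normal(size=8), known)
```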
When mapping the 794 × 120 weight matrix onto multiple cores of the chip, we try to keep the MVM output dynamic range of each core relatively consistent, such that the recovery performance does not rely excessively on the computational accuracy of any single core. To achieve this, we assign adjacent pixels (visible neurons) to different cores such that every core sees a downsampled version of the whole image, as shown in Extended Data Fig. 9f. Utilizing the bidirectional MVM functionality of the TNSA, the visible-to-hidden neuron MVM is performed in the SL-to-BL direction in each core; the hidden-to-visible neuron MVM is performed in the BL-to-SL direction.
Weight-mapping strategy onto multiple CIM cores
To implement an AI model on a NeuRRAM chip, we convert the weights, biases and other relevant parameters (for example, batch normalization) of each model layer into a single two-dimensional conductance matrix, as described in the previous section. If the height or the width of a matrix exceeds the RRAM array size of a single CIM core (256 × 256), we split the matrix into multiple smaller conductance matrices, each with a maximum height and width of 256.
We consider three factors when mapping these conductance matrices onto the 48 cores: resource utilization, computational load balancing and voltage drop. The top priority is to ensure that all conductance matrices of a model are mapped onto a single chip, such that no reprogramming is needed during inference. If the total number of conductance matrices does not exceed 48, we can map each matrix onto a single core (case (1) in Fig. 2a) or onto multiple cores. There are two scenarios in which we map a single matrix onto multiple cores. (1) When a model has different computational intensities, defined as the amount of computation per weight, for different layers (for example, CNNs often have higher computational intensity in the earlier layers owing to larger feature-map dimensions), we duplicate the more computationally intensive matrices across multiple cores and operate them in parallel to increase throughput and balance the computational loads across the layers (case (2) in Fig. 2a). (2) Some models have ‘wide’ conductance matrices (output dimension >128), such as our image-recovery RBM. If the entire matrix were mapped onto a single core, each input driver would need to supply a large current to its connected RRAMs, resulting in a significant voltage drop on the driver and deteriorating the inference accuracy. Therefore, when there are spare cores, we can split the matrix vertically into multiple segments and map them onto different cores to mitigate the voltage drop (case (6) in Fig. 2a).
By contrast, if a model has more than 48 conductance matrices, we need to merge some matrices so that they fit onto a single chip. Smaller matrices are merged diagonally such that they can be accessed in parallel (case (3) in Fig. 2a); bigger matrices are merged horizontally and accessed by time-multiplexing the input rows (case (4) in Fig. 2a). When selecting the matrices to merge, we avoid those in the same two categories described in the previous paragraph: (1) those with high computational intensity (for example, the early layers of ResNet-20), to minimize the impact on throughput; and (2) those with a ‘wide’ output dimension (for example, the late layers of ResNet-20, which have a large number of output channels), to avoid a large voltage drop. For instance, in our ResNet-20 implementation, among a total of 61 conductance matrices (Extended Data Fig. 9a: 1 from the input layer, 12 from block 1, 17 from block 2, 28 from block 3, 2 from the shortcut layers and 1 from the final dense layer), we map each of the conductance matrices in blocks 1 and 3 onto a single core, and merge the remaining matrices to occupy the 8 remaining cores.
Table 1 summarizes the core usage for all the models. Note that, for partially occupied cores, unused RRAM cells are either unformed or programmed to a high-resistance state, and the WLs of unused rows are not activated during inference; therefore, they do not consume additional energy during inference.
Testsystem implementation
Extended Data Fig. 11a shows the hardware test system for the NeuRRAM chip. The NeuRRAM chip is configured by, receives inputs from and sends outputs to a Xilinx Spartan-6 FPGA that sits on an Opal Kelly integrated FPGA board. The FPGA communicates with a PC via a USB 3.0 module. The test board also houses voltage DACs that provide the various bias voltages required for RRAM programming and MVM, and ADCs to measure RRAM conductance during write–verify programming. The entire board is powered through a standard ‘cannon-style’ d.c. power connector and integrated switching regulators on the Opal Kelly board, such that no external lab equipment is needed to operate the chip.
To enable fast implementation of various machine-learning applications on the NeuRRAM chip, we developed a software toolchain that provides Python-based application programming interfaces (APIs) at multiple levels. The low-level APIs provide access to the basic operations of each chip module, such as RRAM read and write and neuron analogue-to-digital conversion; the middle-level APIs include essential operations required for implementing neural-network layers, such as multi-core parallel MVMs with configurable bit-precision and RRAM write–verify programming; the high-level APIs integrate various middle-level modules to provide complete implementations of neural-network layers, such as weight mapping and batch inference of convolutional and fully connected layers. The software toolchain allows software developers who are unfamiliar with the NeuRRAM chip design to deploy their machine-learning models on the chip.
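The layering can be illustrated with a small stand-in class. All names here (`NeuRRAMStub`, `rram_write`, `mvm`, `map_dense_layer`, `infer`) are hypothetical placeholders chosen for this sketch; they are not the released toolchain's actual identifiers, and the analogue behaviour is idealized as exact arithmetic.

```python
import numpy as np

class NeuRRAMStub:
    """Idealized stand-in mimicking the three API levels: low-level device
    access, middle-level MVM primitives and high-level layer deployment."""
    def __init__(self, rows=256, cols=256):
        self.g = np.zeros((rows, cols))          # conductance matrix of one core

    # Low-level: raw RRAM cell access
    def rram_write(self, row, col, conductance):
        self.g[row, col] = conductance

    # Middle-level: one whole-core matrix-vector multiplication
    def mvm(self, x):
        return self.g.T @ x

    # High-level: map a weight matrix, then run batched inference
    def map_dense_layer(self, w):
        self.g[:w.shape[0], :w.shape[1]] = w

    def infer(self, batch):
        return np.stack([self.mvm(x) for x in batch])
```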
Power and throughput measurements
To characterize MVM energy efficiency at various input and output bit-precisions, we measure the power consumption and latency of the MVM input and output stages separately. Because the two stages are performed independently, as described in the sections above, the total energy and the total time are simply the sums over the two stages. As a result, we can easily obtain the energy efficiency of any combination of input and output bit-precisions.
To measure the input-stage energy efficiency, we generate a 256 × 256 random weight matrix with a Gaussian distribution, split it into two segments, each with dimension 128 × 256, and program the two segments onto two cores using the differential-row weight mapping. We measure the power consumption and latency of performing 10 million MVMs, or equivalently 655 billion MAC operations. The comparison with previous work shown in Fig. 1d uses the same workload as the benchmark.
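The workload arithmetic can be checked directly: each 256 × 256 MVM performs one MAC per weight, and (following the convention used in Extended Data Fig. 12) one MAC counts as two operations.

```python
# Verify that 10 million 256x256 MVMs correspond to 655 billion MACs.
ROWS, COLS = 256, 256
NUM_MVMS = 10_000_000

macs = NUM_MVMS * ROWS * COLS   # one MAC per weight per MVM
ops = 2 * macs                  # one MAC counts as two operations
print(f"{macs / 1e9:.0f} billion MACs, {ops / 1e12:.2f} trillion ops")
```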
Extended Data Fig. 12a shows the energy per operation consumed during the input and output stages of MVMs at various bit-precisions. The inputs are in signed integer format, in which the first bit represents the sign and the remaining bits represent the magnitude. One-bit (binary) and two-bit (ternary) inputs show similar energy because each input wire is driven to one of three voltage levels in both cases; binary input is simply a special case of ternary input. Note that the curve shown in Extended Data Fig. 12a is obtained without the two-phase operation; as a result, energy increases super-linearly with input bit-precision. Like the inputs, the outputs are also represented in signed integer format. The output-stage energy consumption grows linearly with output bit-precision because one additional binary search cycle is needed for every additional bit. The output stage consumes less energy than the input stage because it does not involve toggling the highly capacitive WLs, which are driven at a higher voltage, as we discuss below.
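The sign-magnitude, bit-serial input scheme can be modelled numerically: each magnitude bit of the input vector is applied as a ternary (−1/0/+1) pulse cycle, and the per-bit results are combined by shift-and-add. This sketch reproduces only the arithmetic, not the analogue circuit behaviour or its non-idealities.

```python
import numpy as np

def bit_serial_mvm(g, x, bits=4):
    """Bit-serial MVM with sign-magnitude inputs.
    g: conductance matrix (rows x cols); x: signed integers with |x| < 2**(bits-1)."""
    sign = np.sign(x)
    mag = np.abs(x)
    acc = np.zeros(g.shape[1])
    for b in range(bits - 1):                 # one pulse cycle per magnitude bit
        pulse = ((mag >> b) & 1) * sign       # ternary drive level: -1, 0 or +1
        acc += (g.T @ pulse) * (1 << b)       # shift-and-add accumulation
    return acc
```

Applied to a small example, the bit-serial result matches the direct matrix-vector product exactly.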
For the MVM measurements shown in Extended Data Fig. 12b–e, the MVM output stage is assumed to use 2-bit-higher precision than the inputs to account for the additional bit-precision required for partial-sum accumulation. The partial-sum bit-precision required by the voltage-mode sensing implemented in NeuRRAM is much lower than that required by conventional current-mode sensing. As explained above, conventional current-sensing designs can activate only a fraction of the rows in each cycle, and therefore need many partial-sum accumulation steps to complete an MVM. By contrast, the proposed voltage-sensing scheme can activate all 256 input wires in a single cycle, and therefore requires fewer partial-sum accumulation steps and lower partial-sum precision.
Extended Data Fig. 12b shows the energy consumption breakdown. A large fraction of the energy is spent switching on and off the WLs that connect to the gates of the RRAM select transistors. These are thick-oxide I/O transistors, used to withstand the high voltage during RRAM forming and programming, and are sized large enough (width 1 µm and length 500 nm) to provide sufficient current for RRAM programming. As a result, they require high operating voltages and add large capacitance to the WLs, both of which contribute to high power consumption (P = fCV^{2}, where f is the frequency at which the capacitance C is charged and discharged to voltage V). Simulation shows that each of the 256 access transistors contributes about 1.5 fF to a WL, and the WL drivers combined contribute about 48 fF to each WL; additional WL capacitance comes mostly from inter-wire capacitance to neighbouring BLs and WLs. The WL energy is expected to decrease significantly if RRAMs can be written at a lower voltage and operated at a lower conductance state, and if a smaller transistor with better drivability can be used.
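A back-of-envelope estimate follows from the simulated capacitances above. We assume the 1.3-V WL swing quoted in the scaling projection, and omit the inter-wire capacitance because no figure is given for it, so this underestimates the true WL energy.

```python
# Energy to charge a capacitance C to voltage V once is C * V**2 per full
# charge/discharge cycle (the f in P = f*C*V^2 converts this to power).
N_ROWS = 256
C_ACCESS = 1.5e-15        # F, contribution of each access transistor to a WL
C_DRIVERS = 48e-15        # F, all WL drivers combined, per WL
V_WL = 1.3                # V, assumed WL swing (from the scaling projection)

c_wl = N_ROWS * C_ACCESS + C_DRIVERS    # ~432 fF, excluding inter-wire parasitics
e_per_toggle = c_wl * V_WL**2           # ~0.73 pJ per WL charge/discharge cycle
print(f"C_WL = {c_wl * 1e15:.0f} fF, E = {e_per_toggle * 1e12:.2f} pJ per toggle")
```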
For applications that require probabilistic sampling, the two counter-propagating LFSR chains generate random Bernoulli noise and inject it as voltage pulses into the neurons. We measure each noise-injection step to consume on average 121 fJ per neuron, or 0.95 fJ per weight, which is small compared with the other sources of energy consumption shown in Extended Data Fig. 12b.
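The noise-generation principle can be modelled in software: two independently seeded LFSRs are clocked and their output bits XORed to produce a pseudo-random Bernoulli stream, echoing the chip's two counter-propagating LFSR chains (Extended Data Fig. 1d). The register width, taps and seeds below are arbitrary choices for this sketch, not the chip's actual configuration.

```python
def lfsr(seed, taps):
    """Galois-configuration LFSR generator yielding one pseudo-random bit per step."""
    state = seed
    while True:
        out = state & 1
        state >>= 1
        if out:
            state ^= taps
        yield out

def bernoulli_stream(n, seed_a=0xACE1, seed_b=0x1D87):
    """XOR the outputs of two independently seeded 16-bit LFSRs, bit by bit."""
    a = lfsr(seed_a, taps=0xB400)   # taps of a maximal-length 16-bit polynomial
    b = lfsr(seed_b, taps=0xB400)
    return [next(a) ^ next(b) for _ in range(n)]
```

The resulting stream is approximately balanced between 0s and 1s, as expected for Bernoulli(0.5) noise.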
Extended Data Fig. 12c–e shows the measured latency, peak throughput and throughput-power efficiency of the 256 × 256 MVMs. Note that we used EDP, rather than throughput-power efficiency in tera-operations per second per watt (TOPS W^{−1}, the reciprocal of energy per operation), as the figure of merit for comparing designs, because EDP captures the time-to-solution aspect in addition to energy consumption. As in previous work in this field, the reported throughput and energy efficiency represent peak values at 100% CIM array utilization, and do not include the time and energy spent on buffering and moving intermediate data. Future work that integrates intermediate-data buffers, partial-sum accumulators and so on within a single complete CIM chip should report energy efficiency measured on end-to-end AI applications.
Projection of NeuRRAM energy efficiency with technology scaling
The current NeuRRAM chip is fabricated using a 130-nm CMOS technology. We expect the energy efficiency to improve with technology scaling. Importantly, isolated scaling of CMOS transistors and interconnects is not sufficient for overall energy-efficiency improvement: RRAM device characteristics must be optimized jointly with CMOS. The current RRAM array density under a 1T1R configuration is limited not by the fabrication process but by the RRAM write current and voltage. The current NeuRRAM chip uses large thick-oxide I/O transistors as the ‘T’ to withstand the >4-V RRAM forming voltage and to provide enough write current. Only by lowering both the forming voltage and the write current can we obtain higher density and, therefore, lower parasitic capacitance for improved energy efficiency.
Assuming that RRAM devices at a newer technology node can be programmed at a logic-compatible voltage level, and that the required write current can be reduced such that the size of the connecting transistor keeps shrinking, the EDP improvements will come from (1) lower operating voltage and (2) smaller wire and transistor capacitance, that is, Energy ∝ CV^{2} and Delay ∝ CV/I. At 7 nm, for instance, we expect the WL switching energy (Extended Data Fig. 12b) to decrease by about 22.4 times, comprising 2.6 times from WL voltage scaling (1.3 V → 0.8 V) and 8.5 times from capacitance scaling (the capacitances from the select transistors, WL drivers and wires are all assumed to scale with the minimum metal pitch, 340 nm → 40 nm). Peripheral circuit energy (dominated by the neuron readout process) is projected to decrease by 42 times, comprising 5 times from VDD scaling (1.8 V → 0.8 V) and 8.5 times from smaller parasitic capacitance. The energy consumed by the MVM pulses and the charge-transfer process is independent of the RRAM conductance range, because the power consumption and the settling time of the RRAM array scale with the same conductance factor, which cancels in their product. Specifically, the energy per RRAM MAC is E_{MAC} = C_{par} var(V_{in}), limited only by the parasitic capacitance per unit RRAM cell, C_{par}, and the variance of the driven input voltage, var(V_{in}). The MVM energy consumption will therefore decrease by approximately 34 times, comprising 4 times from read-voltage scaling (0.5 V → 0.25 V) and 8.5 times from smaller parasitic capacitance. Overall, we expect an energy consumption reduction of about 34 times when scaling the design from 130 nm to 7 nm.
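Each projected factor above follows directly from Energy ∝ CV^{2}: a squared voltage ratio multiplied by the common capacitance-scaling factor. A quick check of the arithmetic:

```python
# Reproduce the projected energy-scaling factors from the text.
cap_scale = 340 / 40                          # minimum metal pitch scaling, 8.5x

wl_scale = (1.3 / 0.8) ** 2 * cap_scale       # WL switching: ~2.6 x 8.5 ~ 22.4x
periph_scale = (1.8 / 0.8) ** 2 * cap_scale   # neuron readout: ~5 x 8.5 ~ 42x
mvm_scale = (0.5 / 0.25) ** 2 * cap_scale     # MVM pulses: 4 x 8.5 = 34x

print(f"WL {wl_scale:.1f}x, peripheral {periph_scale:.0f}x, MVM {mvm_scale:.0f}x")
```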
In terms of latency, the current design is limited by the long neuron integration time, caused primarily by the relatively large integration capacitor (104 fF), which was chosen conservatively to ensure functional correctness and to allow testing of different neuron operating conditions. At more advanced technology nodes, one could use a much smaller capacitor to achieve higher speed. The main concern in scaling down the capacitor is that fabrication-induced capacitor mismatch would take up a higher fraction of the total capacitance, resulting in a lower SNR. However, previous ADC designs have used a unit capacitor as small as 50 aF (ref. ^{56}; 340 times smaller than our C_{sample}). For a more conservative design, a study has shown that, in a 32-nm process, a 0.45-fF unit capacitor has only 1.2% average standard deviation^{57}. Besides, the integration time also depends on the drive current of the transistors. Assuming that the transistor current density (µA µm^{−1}) stays relatively unchanged after VDD scaling, and that the transistor width in the neuron scales with the contacted gate pitch (310 nm → 57 nm), the total transistor drive current will decrease by 5.4 times. As a result, when scaling C_{sample} from 17 fF to 0.2 fF, and C_{integ} proportionally from 104 fF to 1.22 fF, the latency will improve by 15.7 times. Therefore, conservatively, we expect the overall EDP to improve by at least 535 times when scaling the design from 130-nm to 7-nm technology. Extended Data Table 2 shows that such scaling would enable NeuRRAM to deliver higher energy and area efficiency than today's state-of-the-art edge inference accelerators^{58,59,60,61}.
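The end-to-end figures combine as follows: C_{sample} shrinks 85 times while the drive current drops 5.4 times, giving the ~15.7-times latency gain, which together with the ~34-times energy reduction yields the quoted EDP improvement. This is a check of the stated arithmetic, not a new projection.

```python
# Combine the projected latency and energy factors into the EDP improvement.
cap_ratio = 17 / 0.2               # C_sample scaling: 17 fF -> 0.2 fF, 85x
current_ratio = 5.4                # transistor drive-current reduction
latency_improvement = cap_ratio / current_ratio      # ~15.7x

energy_improvement = 34            # overall energy reduction from above
edp_improvement = energy_improvement * latency_improvement   # ~535x
print(f"latency {latency_improvement:.1f}x, EDP {edp_improvement:.0f}x")
```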
Data availability
The datasets used for benchmarks are publicly available^{18,19,20}. Other data that support the findings of this study are available in a public repository^{47}.
Code availability
The software toolchain used to test and deploy AI tasks on the NeuRRAM chip, and the code used to perform noise-resilient model training and chip-in-the-loop progressive model fine-tuning, are available in a public repository^{47}.
References
Wong, H. S. P. et al. Metal-oxide RRAM. Proc. IEEE 100, 1951–1970 (2012).
Prezioso, M. et al. Training and operation of an integrated neuromorphic network based on metal-oxide memristors. Nature 521, 61–64 (2015).
Ambrogio, S. et al. Equivalent-accuracy accelerated neural-network training using analogue memory. Nature 558, 60–67 (2018).
Ielmini, D. & Wong, H. S. P. In-memory computing with resistive switching devices. Nat. Electron. 1, 333–343 (2018).
Yao, P. et al. Fully hardware-implemented memristor convolutional neural network. Nature 577, 641–646 (2020).
Mochida, R. et al. A 4M synapses integrated analog ReRAM based 66.5 TOPS/W neural-network processor with cell current controlled writing and flexible network architecture. In Symposium on VLSI Technology, Digest of Technical Papers 175–176 (IEEE, 2018).
Chen, W. H. et al. CMOS-integrated memristive non-volatile computing-in-memory for AI edge processors. Nat. Electron. 2, 420–428 (2019).
Khaddam-Aljameh, R. et al. HERMES core: a 14-nm CMOS and PCM-based in-memory compute core using an array of 300-ps/LSB linearized CCO-based ADCs and local digital processing. In IEEE Symposium on VLSI Circuits, Digest of Technical Papers JFS2-5 (IEEE, 2021).
Hung, J. M. et al. A four-megabit compute-in-memory macro with eight-bit precision based on CMOS and resistive random-access memory for AI edge devices. Nat. Electron. 4, 921–930 (2021).
Xue, C. X. et al. A 1Mb multibit ReRAM computing-in-memory macro with 14.6ns parallel MAC computing time for CNN-based AI edge processors. In IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers 388–390 (IEEE, 2019).
Cai, F. et al. A fully integrated reprogrammable memristor–CMOS system for efficient multiply–accumulate operations. Nat. Electron. 2, 290–299 (2019).
Ishii, M. et al. On-chip trainable 1.4M 6T2R PCM synaptic array with 1.6K stochastic LIF neurons for spiking RBM. In International Electron Devices Meeting (IEDM), Technical Digest 14.2.1–14.2.4 (IEEE, 2019).
Yan, B. et al. RRAM-based spiking nonvolatile computing-in-memory processing engine with precision-configurable in situ nonlinear activation. In Symposium on VLSI Technology, Digest of Technical Papers T86–T87 (IEEE, 2019).
Wan, W. et al. A 74 TMACS/W CMOS-RRAM neurosynaptic core with dynamically reconfigurable dataflow and in-situ transposable weights for probabilistic graphical models. In IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers 498–500 (IEEE, 2020).
Liu, Q. et al. A fully integrated analog ReRAM based 78.4TOPS/W compute-in-memory chip with fully parallel MAC computing. In IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers 500–502 (IEEE, 2020).
Xue, C. X. et al. A CMOS-integrated compute-in-memory macro based on resistive random-access memory for AI edge devices. Nat. Electron. 4, 81–90 (2021).
Narayanan, P. et al. Fully on-chip MAC at 14 nm enabled by accurate row-wise programming of PCM-based weights and parallel vector-transport in duration-format. IEEE Trans. Electron Devices 68, 6629–6636 (2021).
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2323 (1998).
Krizhevsky, A. & Hinton, G. Learning Multiple Layers of Features from Tiny Images (2009); https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
Warden, P. Speech commands: a dataset for limited-vocabulary speech recognition. Preprint at https://arxiv.org/abs/1804.03209 (2018).
Joshi, V. et al. Accurate deep neural network inference using computational phase-change memory. Nat. Commun. 11, 2473 (2020).
Alibart, F., Zamanidoost, E. & Strukov, D. B. Pattern classification by memristive crossbar circuits using ex situ and in situ training. Nat. Commun. 4, 2072 (2013).
Eryilmaz, S. B. et al. Experimental demonstration of array-level learning with phase-change synaptic devices. In International Electron Devices Meeting (IEDM), Technical Digest 25.5.1–25.5.4 (IEEE, 2013).
Burr, G. W. et al. Experimental demonstration and tolerancing of a large-scale neural network (165 000 synapses) using phase-change memory as the synaptic weight element. IEEE Trans. Electron Devices 62, 3498–3507 (2015).
Eryilmaz, S. B. et al. Training a probabilistic graphical model with resistive switching electronic synapses. IEEE Trans. Electron Devices 63, 5004–5011 (2016).
Sheridan, P. M. et al. Sparse coding with memristor networks. Nat. Nanotechnol. 12, 784–789 (2017).
Yao, P. et al. Face classification using electronic synapses. Nat. Commun. 8, 15199 (2017).
Banbury, C. et al. MLPerf Tiny benchmark. In Conference on Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks (2021).
Roy, S., Sridharan, S., Jain, S. & Raghunathan, A. TxSim: modeling training of deep neural networks on resistive crossbar systems. IEEE Trans. Very Large Scale Integr. Syst. 29, 730–738 (2021).
Yang, T. J. & Sze, V. Design considerations for efficient deep neural networks on processing-in-memory accelerators. In International Electron Devices Meeting (IEDM), Technical Digest 22.1.1–22.1.4 (IEEE, 2019).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Koller, D. & Friedman, N. Probabilistic Graphical Models: Principles and Techniques (Adaptive Computation and Machine Learning series) (MIT Press, 2009).
Su, J. W. et al. A 28nm 64Kb inference-training two-way transpose multibit 6T SRAM compute-in-memory macro for AI edge chips. In IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers 240–242 (IEEE, 2020).
Guo, R. et al. A 5.1pJ/neuron 127.3us/inference RNN-based speech recognition processor using 16 computing-in-memory SRAM macros in 65nm CMOS. In IEEE Symposium on VLSI Circuits, Digest of Technical Papers 120–121 (IEEE, 2019).
Wang, Z. et al. Fully memristive neural networks for pattern classification with unsupervised learning. Nat. Electron. 1, 137–145 (2018).
Shafiee, A. et al. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proc. 2016 43rd International Symposium on Computer Architecture (ISCA) 14–26 (IEEE/ACM, 2016).
Ankit, A. et al. PUMA: a programmable ultra-efficient memristor-based accelerator for machine learning inference. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 715–731 (ACM, 2019).
Wan, W. et al. A voltage-mode sensing scheme with differential-row weight mapping for energy-efficient RRAM-based in-memory computing. In Symposium on VLSI Technology, Digest of Technical Papers (IEEE, 2020).
Murmann, B. Digitally assisted data converter design. In European Conference on Solid-State Circuits (ESSCIRC) 24–31 (IEEE, 2013).
Zhao, M. et al. Investigation of statistical retention of filamentary analog RRAM for neuromorphic computing. In International Electron Devices Meeting (IEDM), Technical Digest 39.4.1–39.4.4 (IEEE, 2018).
Alibart, F., Gao, L., Hoskins, B. D. & Strukov, D. B. High-precision tuning of state for memristive devices by adaptable variation-tolerant algorithm. Nanotechnology 23, 762–775 (2012).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016).
Cauwenberghs, G. & Bayoumi, M. A. Learning on Silicon: Adaptive VLSI Neural Systems (Kluwer Academic, 1999).
Hornik, K., Stinchcombe, M. & White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989).
Choi, J. et al. PACT: parameterized clipping activation for quantized neural networks. Preprint at https://arxiv.org/abs/1805.06085 (2018).
Wan, W. weierwan/Neurram_48core: Initial Release (Version 1.0) [Computer software]. Zenodo https://doi.org/10.5281/zenodo.6558399 (2022).
Jung, S. et al. A crossbar array of magnetoresistive memory devices for in-memory computing. Nature 601, 211–216 (2022).
Jerry, M. et al. Ferroelectric FET analog synapse for acceleration of deep neural network training. In International Electron Devices Meeting (IEDM), Technical Digest 6.2.1–6.2.4 (IEEE, 2018).
Jiang, Z. et al. Next-generation ultrahigh-density 3D vertical resistive switching memory (VRSM), Part II: design guidelines for device, array, and architecture. IEEE Trans. Electron Devices 66, 5147–5154 (2019).
Cauwenberghs, G. An analog VLSI recurrent neural network learning a continuous-time trajectory. IEEE Trans. Neural Netw. 7, 346–361 (1996).
Wu, W. et al. A methodology to improve linearity of analog RRAM for neuromorphic computing. In Symposium on VLSI Technology, Digest of Technical Papers 103–104 (IEEE, 2018).
Ji, Y. et al. FPSA: a full system stack solution for reconfigurable ReRAM-based NN accelerator architecture. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 733–747 (ACM, 2019).
Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R. & Modha, D. S. Learned step size quantization. In International Conference on Learning Representations (ICLR) (2020).
Jung, S. et al. Learning to quantize deep networks by optimizing quantization intervals with task loss. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 4345–4354 (IEEE/CVF, 2019).
Stepanovic, D. & Nikolic, B. A 2.8 GS/s 44.6 mW time-interleaved ADC achieving 50.9 dB SNDR and 3 dB effective resolution bandwidth of 1.5 GHz in 65 nm CMOS. IEEE J. Solid-State Circuits 48, 971–982 (2013).
Tripathi, V. & Murmann, B. Mismatch characterization of small metal fringe capacitors. IEEE Trans. Circuits Syst. I Regul. Pap. 61, 2236–2242 (2014).
Chen, Y. H., Krishna, T., Emer, J. S. & Sze, V. Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-State Circuits 52, 127–138 (2017).
Zimmer, B. et al. A 0.32–128 TOPS, scalable multi-chip-module-based deep neural network inference accelerator with ground-referenced signaling in 16 nm. IEEE J. Solid-State Circuits 55, 920–932 (2020).
Lee, J. et al. UNPU: an energy-efficient deep neural network accelerator with fully variable weight bit precision. IEEE J. Solid-State Circuits 54, 173–185 (2019).
Pei, J. et al. Towards artificial general intelligence with hybrid Tianjic chip architecture. Nature 572, 106–111 (2019).
Murmann, B. ADC Performance Survey 1997–2021 (2021); https://web.stanford.edu/~murmann/adcsurvey.html
Acknowledgements
This work is supported in part by NSF Expeditions in Computing (Penn State, award number 1317470), the Office of Naval Research (Science of AI program), the SRC JUMP ASCENT Center, Stanford SystemX Alliance, Stanford NMTRI, Beijing Innovation Center for Future Chips, National Natural Science Foundation of China (61851404), and Western Digital Corporation.
Author information
Contributions
W.W., R.K., S.B.E., S.J., H.S.P.W. and G.C. designed the NeuRRAM chip architecture and circuits. W.W., S.B.E., W.Z. and D.W. implemented physical layout of the chip. W.Z., H.Q., B.G. and H.W. contributed to the RRAM device fabrication and integration with CMOS. W.W., R.K., S.D. and G.C. developed the test system. W.W. developed the software toolchain, implemented the AI models on the chip and conducted all chip measurements. W.W., C.S. and S.J. worked on the development of AI models. W.W., R.K., C.S., P.R., S.J., H.S.P.W. and G.C. contributed to the experiment design and analysis and interpretation of the measurements. B.G., S.J., H.W., H.S.P.W. and G.C. supervised the project. All authors contributed to the writing and editing of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature thanks Matthew Marinella and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 Peripheral driver circuits for TNSA and chip operating modes.
a, Driver circuits' configuration under the weight-programming mode. b, Under the neuron-testing mode. c, Under the MVM mode. d, Circuit diagram of the two counter-propagating LFSR chains XORed to generate pseudo-random sequences for probabilistic sampling.
Extended Data Fig. 2 Various MVM dataflow directions and their CIM implementations.
Left, various MVM dataflow directions commonly seen in different AI models. Middle, conventional CIM implementation of various dataflow directions. Conventional designs typically locate all peripheral circuits such as ADCs outside of RRAM array. The resulting implementations of bidirectional and recurrent MVMs incur overheads in area, latency, and energy. Right, the Transposable Neurosynaptic Array (TNSA) interleaves RRAM weights and CMOS neurons across the array and supports diverse MVM directions with minimal overhead.
Extended Data Fig. 3 Iterative write–verify RRAM programming.
a, Flowchart of the incremental-pulse write–verify technique used to program RRAMs into a target analogue conductance range. b, An example sequence of the write–verify programming. c, RRAM conductance distribution measured during and after write–verify programming. Each blue dot represents one RRAM cell measured during write–verify. The grey shades show that RRAM conductance relaxation causes the distribution to spread out from the target values; the darker shade shows that iterative programming helps narrow the distribution. d, Standard deviation of conductance change measured at different initial conductance states and at different time durations after the initial programming. The initial conductance relaxation happens at a faster rate than the longer-term retention degradation. e, Standard deviation of conductance relaxation decreases with increasing iterative programming cycles. f, Distribution of the number of SET/RESET pulses needed to reach the conductance acceptance range.
Extended Data Fig. 4 Four basic neuron operations that enable MVM with multibit inputs and outputs.
a, Initialization: pre-charge the sampling capacitor C_{sample} and the output wires (SLs), and discharge the integration capacitor C_{integ}. b, Sampling and integration: sample the SL voltage onto C_{sample}, followed by integrating the charge onto C_{integ}. c, Comparison and readout: the amplifier is switched into comparator mode to determine the polarity of the integrated voltage; comparator outputs are written out of the neuron through the outer feedback loop. d, Charge decrement: charge is added to or subtracted from C_{integ} through the outer feedback loop, depending on the value stored in the latch.
Extended Data Fig. 5 Scatter plots of measured MVMs vs. ideal MVMs.
Results in a–d are generated using the same 64 × 64 normally distributed random matrix and 1,000 uniformly distributed floating-point vectors ∈ [−1, 1]. a, Forward MVM using the differential-input scheme with inputs quantized to 4 bits and outputs to 6 bits. b, Backward MVM using the differential-output scheme. The higher RMSE is caused by the greater voltage drop on each SL driver, which needs to drive 128 RRAM cells, compared with the 64 cells driven by each BL driver during forward MVM. c, The MVM root-mean-square error (RMSE) does not decrease when increasing inputs from 4-bit (a) to 6-bit, because a lower input voltage must be used, which leads to a worse signal-to-noise ratio. d, The two-phase operation reduces the MVM RMSE with 6-bit inputs by breaking the inputs into two segments and performing the MVMs separately, such that the input voltage does not need to be reduced. e,f, Outputs from the conv15 layer of ResNet-20, whose weights are divided across 3 CIM cores. The layer outputs show a higher RMSE when the MVM is performed in parallel on the 3 cores (f) than sequentially on the 3 cores (e).
Extended Data Fig. 6 Data distribution with and without modeldriven chip calibration.
Left, distribution of inputs to the final fully connected layer of ResNet-20 when the inputs are generated from (top to bottom) CIFAR-10 test-set data, training-set data and random uniform data. Right, distribution of outputs from the final fully connected layer of ResNet-20. The test set and the training set have similar distributions, whereas random uniform data produce a markedly different output distribution. To ensure that the MVM output voltage dynamic range during testing is calibrated to occupy the full ADC input swing, the calibration data should come from training-set data, which closely resemble the test-set data.
Extended Data Fig. 7 Noiseresilient training of CNNs, LSTMs and RBMs.
a, Change in CIFAR-10 test-set classification accuracy under different weight noise levels during inference. Noise is represented as a fraction of the maximum absolute value of the weights. Different curves represent models trained with different levels of noise injection. b, Change in voice-command recognition accuracy with weight noise level. c, Change in MNIST image-reconstruction error with weight noise level. d, Decrease in image-reconstruction error with the number of Gibbs sampling steps during RBM inference. e, Differences in weight distributions when trained without and with noise injection.
Extended Data Fig. 8 Measured chip inference performance.
a, CIFAR-10 training-set accuracy loss due to hardware non-idealities, and accuracy recovery at each step of the chip-in-the-loop progressive fine-tuning. From left to right, each data point represents a new layer programmed onto the chip. The blue solid lines represent the accuracy loss measured when performing inference of that layer on-chip. The red dotted lines represent the measured recovery in accuracy obtained by fine-tuning subsequent layers. b, Ablation study showing the impact of input, activation and weight quantization, and of weight noise injection, on inference errors.
Extended Data Fig. 9 Implementation of various AI models.
a, Architecture of ResNet-20 for CIFAR-10 classification. b, The batch-normalization parameters are merged into convolutional weights and biases before mapping on-chip. c, Illustration of the process of mapping the 4-dimensional weights of a convolutional layer to NeuRRAM CIM cores. d, Architecture of the LSTM model used for Google speech command recognition. The model contains 4 parallel LSTM cells and makes predictions based on the sum of the outputs from the 4 cells. e, Architecture of the RBM model used for MNIST image recovery. During inference, MVMs and Gibbs sampling are performed back and forth between visible and hidden neurons. f, Process of mapping the RBM onto NeuRRAM CIM cores. Adjacent pixels are assigned to different cores to equalize the MVM output dynamic range across cores.
Extended Data Fig. 10 Chipmeasured image recovery using RBM.
Top, recovery of MNIST test-set images with a randomly selected 20% of pixels flipped to their complementary intensity. Bottom, recovery of MNIST test-set images with the bottom third of the pixels occluded.
Extended Data Fig. 11 NeuRRAM test system and chip micrographs at various scales.
a, A NeuRRAM chip wire-bonded to a package. b, Measurement board that connects a packaged NeuRRAM chip (left) to a field-programmable gate array (FPGA, right). The board houses all the components necessary to power, operate and measure the chip; no external lab equipment is needed for chip operation. c, Micrograph of a 48-core NeuRRAM chip. d, Zoomed-in micrograph of a single CIM core. e, Zoomed-in micrograph of 2 × 2 corelets within the TNSA. One neuron circuit occupies 1,270 μm^{2}, which is >100× smaller than most ADC designs in 130 nm summarized in an ADC survey^{62}. f, Chip area breakdown.
Extended Data Fig. 12 Energy consumption, latency, and throughput measurement results.
a, Measured energy consumption per operation during the MVM input stage (without the two-phase operation) and output stage, where one multiply–accumulate (MAC) counts as two operations. b, Energy consumption breakdown at various MVM input and output bit-precisions. Outputs are 2 bits higher than inputs during an MVM to account for the additional precision required by partial-sum accumulation. c, Latency of performing one MVM with a 256 × 256 weight matrix. d, Peak computational throughput (in giga-operations per second). e, Throughput-power efficiency (in tera-operations per second per watt).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wan, W., Kubendran, R., Schaefer, C. et al. A compute-in-memory chip based on resistive random-access memory. Nature 608, 504–512 (2022). https://doi.org/10.1038/s41586-022-04992-8