A computing-in-memory macro based on three-dimensional resistive random-access memory

Huo, Qiang; Yang, Yiming; Wang, Yiming; Lei, Dengyun; Fu, Xiangqu; Ren, Qirui; Xu, Xiaoxin; Luo, Qing; Xing, Guozhong; Chen, Chengying; Si, Xin; Wu, Hao; Yuan, Yiyang; Li, Qiang; Li, Xiaoran; Wang, Xinghua; Chang, Meng-Fan; Zhang, Feng; Liu, Ming

doi:10.1038/s41928-022-00795-x

Article
Open access
Published: 26 July 2022

Nature Electronics volume 5, pages 469–477 (2022)


Abstract

Non-volatile computing-in-memory macros that are based on two-dimensional arrays of memristors are of use in the development of artificial intelligence edge devices. Scaling such systems to three-dimensional arrays could provide higher parallelism, capacity and density for the necessary vector–matrix multiplication operations. However, scaling to three dimensions is challenging due to manufacturing and device variability issues. Here we report a two-kilobit non-volatile computing-in-memory macro that is based on a three-dimensional vertical resistive random-access memory fabricated using a 55 nm complementary metal–oxide–semiconductor process. Our macro can perform 3D vector–matrix multiplication operations with an energy efficiency of 8.32 tera-operations per second per watt when the input, weight and output data are 8, 9 and 22 bits, respectively, and the bit density is 58.2 bit µm^–2. We show that the macro offers more accurate brain MRI edge detection and improved inference accuracy on the CIFAR-10 dataset than conventional methods.

Main

The use of convolutional neural networks (CNNs) has led to improvements in computer vision, natural language processing and speech recognition. These methods do, however, have tremendous computational and storage demands^1,2,3. In particular, CNNs require a large parameter space for operation, especially the three-dimensional (3D) CNNs used in video recognition or medical applications such as brain magnetic resonance imaging (MRI)^4,5,6. In artificial intelligence (AI) edge devices^7,8,9 based on the von Neumann architecture, performance and energy efficiency are limited by the large amount of data transfer between processing and memory units. The energy consumption of memory access alone can be more than 100 times greater than that of floating-point computation, and memory performance has, in general, developed far more slowly than processor technologies^10.

Non-volatile computing-in-memory (nvCIM) architectures based on memristor technologies could be used to break this memory-wall bottleneck. Previous nvCIM work has focused on resistive random-access memory (RRAM) systems that use a two-dimensional (2D) layout^{11,12,13,14,15,16,17,18,19}. But the use of 2D RRAM nvCIMs to create large-scale 3D CNNs has a number of challenges (Fig. 1a,b). When conventional parallel word-line (WL) input in situ vector–matrix multiplication (PWIVMM) schemes are used, an increased number of weights, multibit weight representation and wider metal lines to carry the higher bit-line (BL) readout current (I_BL) are needed, which makes the array area large. In addition, when the number of RRAM cells involved in vector–matrix multiplication (VMM) operations is large, the error accumulated from cell conductance drift can make the output levels difficult to distinguish. Here 3D RRAMs could provide higher parallelism, capacity and density for VMM operations. Their increased RRAM cell resistance can decrease the total I_BL (and thus the power) during VMM operation. However, a small I_BL can also increase the latency (T_AC), especially when the input vector or weight matrix is sparse. Moreover, the more difficult manufacturing of 3D RRAM creates the potential for greater error accumulation. Previous work on 3D RRAM-based nvCIMs has been limited to the memory array, and effective 3D computing systems that include peripheral circuits are lacking^{20,21,22,23,24,25,26,27,28,29,30,31}.

**Fig. 1: Challenges of applying 2D RRAM to large 3D CNNs and proposed 3D nvCIM scheme.**

In this Article, we report a two-kilobit (kb) 3D vertical resistive random-access memory (VRRAM)-based nvCIM macro. We create this high-density computing scheme by combining a multilevel self-selective (MLSS) 3D VRRAM array (Supplementary Fig. 1) with an antidrift multibit analogue input-weight multiplication (ADINWM) scheme. A current-amplitude-discrete-shaping (CADS) scheme included in ADINWM is used to enlarge the I_BL sensing margin (SM) and eliminate the distortion error typically observed in conventional PWIVMM schemes. Our 3D VRRAM operates at nanoampere current levels to reduce the system power; to minimize latency, we use an analogue multiplier (AM) with a gate precharge switch follower and a direct small-current converter (DSCC). The peripheral circuits of our nvCIM macro are fabricated using 55 nm complementary metal–oxide–semiconductor (CMOS) technology. Our 3D nvCIM macro achieves computing precision with a range of input-, weight- and output-bit levels, and offers improved inference accuracies compared with conventional approaches when tested using the Modified National Institute of Standards and Technology (MNIST) and Canadian Institute For Advanced Research (CIFAR-10) datasets.

System architecture

Figure 2a shows the architecture of our 3D VRRAM-based nvCIM macro, which is used to implement 1-bit input to 2-bit weight (1bIN–2bW), 4-bit input to 5-bit weight (4bIN–5bW) and 8-bit input to 9-bit weight (8bIN–9bW) VMM operations. The structure comprises four parts: the MLSS 3D VRRAM array, the CADS circuit, the DSCC and a digital processor. Specific information about the eight-layer 3D VRRAM is given in Supplementary Fig. 1 and Methods, including the writing scheme, schematic, interior image and distribution of the readout current I_BL. This array has 32 WLs and 8 × 8 BLs. The 3D VRRAM contains only WLs and BLs because it uses a self-selective cell (SSC), which avoids the planar electrodes and select line (SL) found in conventional 3D VRRAMs. The SSC is realized using a bilayer HfO₂/TaO_x device. In conventional 3D VRRAMs, using the WL or SL as the input cannot guarantee that all the devices participate in VMM operations in one computing cycle^27,29. When a pillar electrode is used as the input and a planar electrode as the output, all the devices participate in the calculation in one cycle, but the number of neurons is limited because it depends on the number of layers^29. With the WL as the input and the BL as the neuron output, all the devices of our array engage in VMM operations in one cycle and the number of neurons is not limited. Therefore, this structure is better suited to highly parallel VMM computing in one cycle than the conventional 3D VRRAM structure. Compared with previous work^28, this array is easy to manufacture. The array size can be increased to 10 Mb with future technology advancements^32. Each VRRAM cell can be set to two or four conductance states, corresponding to a 1-bit or 2-bit weight, respectively. Multibit weights are composed of low-bit weights^33 through the cooperation of peripheral circuits.
In this work, the positive weights (PWs) and negative weights (NWs) are stored in two adjacent layers (positive-weight layer (PWL) and negative-weight layer (NWL), respectively) of 3D VRRAMs. Four-bit PWs are represented by four 1-bit weights stored in the PWL. Eight-bit PWs are represented by four 2-bit weights stored in the PWL. The NW case is similar to the PW case. The 5-bit weight of the 4bIN–5bW operation consists of a 4-bit weight in the PWL and NWL each. Similarly, the 9-bit weight of the 8bIN–9bW operation consists of an 8-bit weight in the PWL and NWL each.
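
The weight composition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the cell ordering (least-significant cell first) is an assumption, and so is the choice to leave one layer at zero for a given sign; the text states only that positive and negative weights sit in adjacent layers and that the result is their difference.

```python
def split_signed_weight(w, cell_bits=2, cells=4):
    """Split a signed weight into PWL and NWL cell values (LSB cell first).

    cell_bits=2, cells=4 models the 8-bit magnitude of the 8bIN-9bW case;
    cell_bits=1, cells=4 models the 4-bit magnitude of the 4bIN-5bW case.
    Assumption: a positive weight is stored only in the PWL and a negative
    weight only in the NWL.
    """
    limit = (1 << (cell_bits * cells)) - 1
    if not -limit <= w <= limit:
        raise ValueError("weight magnitude does not fit in the cells")
    pos = w if w > 0 else 0
    neg = -w if w < 0 else 0
    mask = (1 << cell_bits) - 1

    def to_cells(magnitude):
        return [(magnitude >> (cell_bits * i)) & mask for i in range(cells)]

    return to_cells(pos), to_cells(neg)


def join_signed_weight(pwl_cells, nwl_cells, cell_bits=2):
    """Recombine the cell values: weight = PWL magnitude - NWL magnitude."""
    def value(cell_values):
        return sum(c << (cell_bits * i) for i, c in enumerate(cell_values))

    return value(pwl_cells) - value(nwl_cells)


pwl, nwl = split_signed_weight(-77)   # 77 = 0b01001101
print(pwl, nwl)                       # [0, 0, 0, 0] [1, 3, 0, 1]
print(join_signed_weight(pwl, nwl))   # -77
```

With an 8-bit magnitude in each layer, the representable range in this sketch is −255 to 255, which is one way to read the "9-bit weight" label.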

**Fig. 2: 3D VRRAM-based nvCIM macro using ADINWM scheme.**

In previous works, the PWIVMM scheme is widely adopted^{11,12,13,14,16,17,18}. The red lines in Fig. 2a represent the data flow of the conventional PWIVMM scheme when implementing the 1bIN–2bW operation. Figure 2b demonstrates the concrete work flow of the PWIVMM scheme in this work. The input voltages (IN_WL0–IN_WL31) enter the VRRAM in parallel through the WL switches. The difference in I_BL processed by the DSCC in the PWL and NWL is the result of the VMM operation. However, as shown in Fig. 1b and the formula shown in Fig. 1a, when using multilevel RRAMs, the distributions of the final output results may overlap due to error accumulation caused by the conductance drift of multiple cells. When the number n of RRAM cells involved in VMM computing is large, the distribution overlap is more obvious. For 3D RRAMs, the difficulty in manufacturing makes the cell conductance fluctuate greatly. Therefore, the problem is more serious than that of the 2D RRAM array, which limits the number of cells involved in computing and further limits computing parallelism.
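
The dependence on n can be made concrete with a small sketch. It assumes, purely for illustration, that each cell contributes an independent zero-mean current error with the same standard deviation; the article does not give this statistical model, and the numerical values, function names and the k-sigma criterion below are hypothetical.

```python
import math


def summed_current_sigma(sigma_cell_na, n_cells):
    """Standard deviation (nA) of the sum of n independent cell errors."""
    return sigma_cell_na * math.sqrt(n_cells)


def max_parallel_cells(sigma_cell_na, margin_na, k=3.0):
    """Largest n for which k * sigma of the summed current stays within
    the sensing margin (assumed criterion for distinguishable levels)."""
    return int((margin_na / (k * sigma_cell_na)) ** 2)


# Hypothetical numbers: 0.5 nA spread per cell, 5 nA sensing margin.
print(summed_current_sigma(0.5, 16))   # 2.0
print(max_parallel_cells(0.5, 5.0))    # 11
```

Under these assumptions the spread of the summed current grows with the square root of n, so a larger per-cell spread sharply reduces the number of cells that can be summed in parallel before adjacent output levels overlap.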

In this work, the ADINWM scheme is proposed and the data flow of the 8bIN–9bW VMM operation is depicted (Fig. 2a, blue lines). The following discussion mainly focuses on operations in the PWL for illustration; the situation is the same for the NWL. For all precision operations, V_READ is applied to the WL in series. For the 8bIN–9bW operation, the number of bits of four serial cells (W_n0–W_n3) on the same WL is set as 2 bits. The inputs are IN0–IN7 and each I_BL represents 2-bit data. After I_BL shaping by the CADS circuit, we obtain four discrete stable currents (I_Shaped0–I_Shaped3). Figure 2b shows that the 8bIN–9bW operation can be attained by the difference between the results of the 8bIN–8bPW and 8bIN–8bNW operations. For 8bIN–8bPW multiplication, according to the formula and table shown in Fig. 2c, it is divided into four 4-bit multiplications and the results correspond to four output currents (namely, I_OUTLL, I_OUTLH, I_OUTHL and I_OUTHH). These multiplications are realized by 4-bit AMs. The currents are converted into an 8-bit digital signal by the DSCC and then shifted (OUTHH shifts 8 bits to the left, and OUTHL and OUTLH shift 4 bits to the left) and combined to complete the 8-bit multiplication operation for the input and PW in PWL. All the multiplication results from 32 WLs are accumulated after 32 cycles in series by a digital processor. Finally, the difference with the result of the 8bIN–8bNW VMM operation in the NWL is used to obtain the final VMM result. For 4bIN–4bPW multiplication (Supplementary Fig. 2), the number of bits of four serial cells is set as 1 bit. After I_BL shaping by the CADS circuit, the four currents (I_Shaped0–I_Shaped3) are combined with 4-bit inputs (IN0–IN3) by two 4–2-bit AMs (multiplying 2-bit input by 4-bit input). The outputs are the multiplication results I_OUTL and I_OUTH, which are later sampled by the DSCC and combined to get the 4-bit multiplication result. 
The result accumulates with the results from the remaining 31 rows in series by a digital processor. The complete result is obtained by subtracting the result of the 4bIN–4bNW operation in the NWL. The 1bIN–2bW operation is very similar to the 4bIN–5bW operation, except that only one I_Shaped from a 1-bit cell needs to be multiplied by the 1-bit input through the 1-bit AM.
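
The shift-and-add step of the 8-bit multiplication can be checked with a short sketch. The arithmetic follows the shifts given in the text (OUTHH by 8 bits, OUTHL and OUTLH by 4 bits); the function names are hypothetical, and which partial product is called LH and which HL is an assumption, since the defining table in Fig. 2c is not reproduced here. Ideal, error-free partial products are assumed.

```python
def mul8_via_nibbles(x, w):
    """8-bit x 8-bit product built from four 4-bit x 4-bit partial products."""
    x_lo, x_hi = x & 0xF, x >> 4
    w_lo, w_hi = w & 0xF, w >> 4
    out_ll = x_lo * w_lo          # stands in for the digitized I_OUTLL
    out_lh = x_lo * w_hi          # stands in for the digitized I_OUTLH
    out_hl = x_hi * w_lo          # stands in for the digitized I_OUTHL
    out_hh = x_hi * w_hi          # stands in for the digitized I_OUTHH
    return (out_hh << 8) + ((out_hl + out_lh) << 4) + out_ll


def adinwm_vmm_column(inputs, pos_weights, neg_weights):
    """Accumulate one product per WL in series, then take PWL minus NWL."""
    acc_pos = acc_neg = 0
    for x, wp, wn in zip(inputs, pos_weights, neg_weights):
        acc_pos += mul8_via_nibbles(x, wp)
        acc_neg += mul8_via_nibbles(x, wn)
    return acc_pos - acc_neg


print(mul8_via_nibbles(200, 123))                           # 24600
print(adinwm_vmm_column([255] * 32, [255] * 32, [0] * 32))  # 2080800
```

The largest magnitude over 32 WLs, 32 × 255 × 255 = 2,080,800, is below 2^21, so a sign bit plus 21 magnitude bits suffices, which is consistent with the 22-bit output quoted in the abstract.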

The term antidrift in the ADINWM scheme refers to avoiding the cumulative error caused by device conductance variation. This is achieved by reading the current through the WLs in series, stabilizing the current to be calculated through current shaping, and completing the multiply–accumulate (MAC) computing by combining near-memory multibit analogue multiplication with the digital processing unit. The scheme thereby overcomes the error accumulation problem caused by the conductance drift of multiple cells in the PWIVMM scheme.

CADS scheme for error mitigation

There are two challenges for I_BL. First, fluctuations in I_BL caused by drift in the RRAM cell conductance make adjacent I_BL levels difficult to distinguish; consequently, the SM becomes smaller. In this work, the SM is defined as half the interval between two adjacent readout currents. Because a 3D VRRAM is much more difficult to manufacture, this problem is more prominent. Second, when applying the conventional PWIVMM scheme, each weight must have one-to-one correspondence with the RRAM cell conductance to ensure accurate VMM operations. However, this places stringent demands on the write circuit and the device itself. To overcome these challenges, the CADS scheme is proposed.

Figure 3a illustrates the CADS circuit. CADS consists of one input current mirror, three current comparators and three output current mirrors. The circuit can convert fluctuating I_BL into a discrete stable current and enlarge the SM for the next accurate operations (Fig. 3b). Besides, as long as I_BL is within a certain range, it can be converted into a fixed value. Therefore, the VRRAM cell conductance corresponding to a specific weight can be within a certain range rather than a fixed value, which mitigates the requirement of the writing circuit. Figure 3c shows the operation of current shaping by the CADS circuit. Further, I_BL is copied three times through the current mirror, which connects three current comparators with different threshold currents (I_Sn) controlled by three switches (namely, S1, S2 and S3). For the 8bIN–9bW operation (2-bit cell), S1, S2 and S3 are at a high level and three current comparators work simultaneously. If I_BL is higher than I_Sn, the corresponding V_Sn is low and the switch of the output current mirror will be enabled to control the multiples of the benchmark current (I_BM = 10 nA). Finally, I_Shaped can be selected from 0, 10, 20 and 30 nA depending on the range of I_BL of the VRRAM cell. For the 4bIN–5bW operation, one of S1, S2 and S3 is at a high level. When I_BL is higher than I_SX, V_SX is low and the corresponding switch of the output current mirror is turned on to obtain I_BM, namely, I_Shaped. After I_BL shaping by CADS, the SM of I_BL is enlarged to 5 nA and a discrete, stable I_Shaped is suitable for the next accurate computing.
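
Behaviourally, the current shaping described above acts as a threshold quantizer, which can be sketched as follows. The benchmark current of 10 nA and the output levels of 0, 10, 20 and 30 nA come from the text; the threshold currents I_Sn are not given, so the values used here (5, 15 and 25 nA) are hypothetical, as are the function names.

```python
I_BM_NA = 10.0  # benchmark current I_BM from the text, in nA


def cads_shape(i_bl_na, thresholds_na=(5.0, 15.0, 25.0)):
    """2-bit cell mode: all three comparators active.

    Each comparator whose threshold is exceeded enables one multiple of
    I_BM, so the output is 0, 10, 20 or 30 nA. Thresholds are assumed.
    """
    level = sum(1 for t in thresholds_na if i_bl_na > t)
    return level * I_BM_NA


def cads_shape_1bit(i_bl_na, threshold_na=5.0):
    """1-bit cell mode: a single comparator selects 0 or I_BM."""
    return I_BM_NA if i_bl_na > threshold_na else 0.0


for i_bl in (3.1, 12.3, 18.7, 27.9):
    print(i_bl, "->", cads_shape(i_bl))
# 3.1 -> 0.0, 12.3 -> 10.0, 18.7 -> 20.0, 27.9 -> 30.0
```

Any I_BL that falls between two thresholds maps to the same shaped current, which is why the cell conductance only needs to lie within a range; with 10 nA between adjacent outputs, half the interval gives the 5 nA SM stated above.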

**Fig. 3: CADS circuit for error mitigation.**

DSCC and AM scheme for high speed and performance

Due to the nanoampere operation current of 3D VRRAMs, the power consumption of the overall nvCIM macro can be greatly reduced at the cost of system latency. High energy efficiency therefore requires high-performance peripheral circuits.

Figure 4a shows the operation of the 1 million samples per second (MSPS) 8-bit DSCC. The DSCC mainly comprises an 8-bit current digital-to-analogue converter (DAC) consisting of current mirrors, a successive approximation register (SAR) logic circuit and a comparator composed of a differential amplifier. The DSCC can directly sample and convert small currents, eliminating the time required for a conventional analogue-to-digital converter (ADC) to convert from analogue current to analogue voltage and reducing T_AC. In addition, the power of the DSCC mainly depends on the input current (Supplementary Notes 2 and 3), which means that it can save considerable energy when implementing VMM operations between sparse matrices (such as sparse weights)^34 or when the readout current of the RRAM is small. Furthermore, we add a structure that sets an initial voltage of 0.9 V to keep the complementary current mirror in the saturation region and reduce the voltage swing, which reduces the latency from 10.00 to 0.83 µs (Supplementary Note 2). Therefore, the DSCC is well suited to VMM operations in RRAM with current as the output. In all precision operations of our work, an 8-bit DSCC is adopted because the number of output bits can be adjusted as needed. Taking the 8bIN–9bW operation as an example, the inputs of the DSCC are the output currents I_OUTLL, I_OUTLH, I_OUTHL and I_OUTHH of the AM. The difference between the DAC current and input current drives the load capacitor to generate the corresponding response voltage. The voltage is converted into a logic signal by double-sampling amplification and transmitted to the SAR logic and current DAC. After eight double samplings, the final results (namely, OUTLL, OUTLH, OUTHL and OUTHH) are sent to a digital processor for the next computing step. As the main power of the DSCC depends on the input currents, the nanoampere operation current significantly reduces the power consumption.
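
The successive-approximation loop at the heart of such a converter can be sketched as below. This is a generic, idealized SAR model rather than the authors' circuit: the full-scale current, the function name and the comparison convention are assumptions, and analogue effects (settling, double sampling, mirror mismatch) are ignored.

```python
def sar_convert(i_in_na, full_scale_na, bits=8):
    """Idealized SAR conversion of a current into an unsigned code.

    One comparison is made per bit, from MSB to LSB: the trial code sets
    the current DAC, and the bit is kept if the input is at least as
    large as the DAC current. full_scale_na is a hypothetical parameter.
    """
    lsb = full_scale_na / (1 << bits)
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        if i_in_na >= trial * lsb:   # comparator decision
            code = trial
    return code


print(sar_convert(100.5, 256.0))   # 100
print(sar_convert(0.0, 256.0))     # 0
print(sar_convert(1000.0, 256.0))  # 255 (saturates at full scale)
```

The loop runs exactly eight times for an 8-bit result, matching the eight sampling steps mentioned in the text, and a lower output resolution is obtained simply by stopping after fewer bits.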

**Fig. 4: High-performance 8-bit AM and 8-bit DSCC.**

Figure 4b illustrates the 4-bit AM comprising current doublers with gate precharge switch followers. For the 8bIN–9bW operation, four shaped currents (I_Shaped0–I_Shaped3) are input to the AM, and these currents are multiplied by inputs IN0–IN7 through current doublers to get the output currents I_OUTLL, I_OUTLH, I_OUTHL and I_OUTHH. Due to the large parasitic capacitance of the output gate of the current doubler, the gate precharge switch follower has been added to reduce T_AC. As demonstrated in Fig. 4c, when the gate precharge mode is used, en and S2 are turned on at the beginning of the cycle; then, the follower controlling the output gate is started. The output (V_WSF) quickly approaches the target value; then, the follower will be off and S1 will be on. The direct connection between the input gate and output gate of the current mirror can eliminate the follower error and accurate V_WSF can be obtained. In this way, the T_AC of AM can be reduced by 54%.

The combination of the two schemes reduces the total T_AC from 10 to 1 µs due to the dominant role of the AM (Supplementary Note 5). Details of the AMs for the 1bIN–2bW and 4bIN–5bW operations are provided in Supplementary Note 4.

Evaluation of 3D VRRAM-based nvCIM macro

Figure 5a,b demonstrates the input 3D brain MRI and hardware automatic test platform with 3D VRRAM-based nvCIM macro as the computing core (Methods and Supplementary Fig. 6). This platform controlled by a field-programmable gate array (FPGA) implements the 1bIN–2bW, 4bIN–5bW and 8bIN–9bW VMM operations based on the ADINWM scheme and realizes the 1bIN–2bW VMM operation under the PWIVMM scheme for comparison. Figure 5c depicts the ideal results by software simulation and the experimental results using the ADINWM and PWIVMM schemes under the 1bIN–2bW VMM configuration for the edge detection of 3D brain MRI. The result with the ADINWM scheme is close to the ideal result, whereas the result of the PWIVMM scheme introduces more noise due to error accumulation caused by the drift in the conductance of the 3D VRRAM cell.

For the 4bIN–5bW and 8bIN–9bW configurations, compared with pure software simulation, we lose only 0.81% and 0.84% inference accuracy, respectively, on the MNIST dataset with a self-defined CNN. For the 1bIN–2bW configuration, the ADINWM scheme improves the inference accuracy on the MNIST dataset from 93.89% (under the PWIVMM scheme) to 94.70% (Supplementary Fig. 7 and Methods). Figure 5d presents a comparison of inference accuracy on the CIFAR-10 dataset for the VGG-8 model between the ADINWM scheme, the PWIVMM scheme and the ideal result without the drift of cell conductance at the 8bIN–9bW configuration. All the tests are completed on the software platform (Methods). Compared with the conventional method, the ADINWM scheme can effectively alleviate the accuracy drop caused by device fluctuations and achieve an inference accuracy of 90.54% on the CIFAR-10 dataset.

The power analysis of the 3D VRRAM-based nvCIM macro is provided in Supplementary Figs. 8 and 9 and Supplementary Note 6. The power of our chip is at the nanowatt level owing to the higher resistance of the 3D VRRAM cell, which makes it suitable for extremely low-power edge computation. In the 1bIN–2bW operation using the PWIVMM scheme, the main power is consumed by the 3D VRRAM, whereas under the ADINWM scheme the DSCC consumes the most energy, especially for the 1bIN–2bW operation. Figure 5e,f demonstrates the energy efficiency and bit density of our work. This work achieves 62.11 tera-operations per second per watt (TOPS W^–1) at the 1bIN–2bW operation, and 29.94 and 8.32 TOPS W^–1 at the 4bIN–5bW and 8bIN–9bW operations, respectively. The 3D VRRAM cell density is at least 16.6 times higher than that of previous 2D RRAM-based works. Therefore, this structure of nvCIM based on a multilayer 3D RRAM has important density and power consumption advantages for a large-capacity 2D/3D CNN calculation. A comparison between this work and previous nvCIM works is shown in Supplementary Table 1. The die photo and summary of our chip are depicted in Supplementary Fig. 10.

Conclusions

We have reported a 2-kb 3D VRRAM nvCIM macro that includes peripheral circuits and is fabricated using 55 nm CMOS technology. Our computing-in-memory (CIM) system achieves an energy efficiency of 62.11 TOPS W^–1 in a 1bIN–2bW configuration, 29.94 TOPS W^–1 in a 4bIN–5bW configuration and 8.32 TOPS W^–1 when using an 8bIN–9bW configuration. The bit density is 58.2 bit µm^–2, which is higher than previous 2D RRAM-based systems. The ADINWM scheme can improve the inference accuracy compared with more common PWIVMM schemes. The macro can be used for brain MRI edge detection and shows a 0.91% improvement in inference accuracy using the CIFAR-10 dataset compared with conventional approaches. Our scheme for AI edge computation is also potentially useful in the design of nvCIM macros based on other emerging 3D RRAM technologies that may suffer from device stability issues. Future work will focus on building a larger 3D VRRAM and the construction of a CIM system with the capacity to contain an entire CNN.

Methods

3D VRRAM-based hardware platform

The hardware platform consists of three main parts:

a power management system that supplies power to the other modules;
a printed circuit board carrying the 3D VRRAM-based nvCIM macro, in which a 2-kb 3D VRRAM and peripheral circuits (a CADS circuit, AM, DSCC and partial digital circuits) are bonded and packaged in a stacked way and communicate with the outside through the printed circuit board;
an FPGA board with an embedded processor, which is responsible for data management between the nvCIM macro and the host computer as well as the logic control of the nvCIM macro.

For the 1bIN–2bW configuration, we employ the developed platform for the edge detection of 3D brain MRI, which is one of the main effects of convolution kernels from the first several layers of the CNN. Specifically, 3D Prewitt kernels—the prevailing kernels for the edge detection of 3D images—are unrolled into 1 × 27 vectors and programmed into 3D VRRAM cells belonging to different BLs by the peripheral write circuits and used for convolving on 3D brain MRI.
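
The unrolling step can be illustrated with a short sketch. It uses the common separable definition of a 3D Prewitt kernel (a [−1, 0, 1] difference along one axis, uniform smoothing along the other two); the exact kernel values and the unrolling order used on the chip are not specified in the text, so both are assumptions here, as are the function names.

```python
def prewitt3d(axis):
    """3 x 3 x 3 Prewitt kernel differentiating along `axis`
    (0 = z, 1 = y, 2 = x), indexed as kernel[z][y][x]."""
    diff = (-1, 0, 1)
    return [[[diff[(z, y, x)[axis]] for x in range(3)]
             for y in range(3)]
            for z in range(3)]


def unroll(kernel):
    """Flatten the kernel into a 1 x 27 vector (z-major order, assumed)."""
    return [v for plane in kernel for row in plane for v in row]


def to_pw_nw(vector):
    """Split signed weights into 1-bit positive and negative parts."""
    pw = [1 if v > 0 else 0 for v in vector]
    nw = [1 if v < 0 else 0 for v in vector]
    return pw, nw


vec = unroll(prewitt3d(2))
pw, nw = to_pw_nw(vec)
print(len(vec), vec[:3])   # 27 [-1, 0, 1]
print(sum(pw), sum(nw))    # 9 9
```

Every entry is −1, 0 or 1, so each weight needs only a 1-bit positive part and a 1-bit negative part, which fits the 1bIN–2bW configuration used for this experiment.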

When adopting the PWIVMM scheme, the 27 pixels in the receptive field are input in parallel into the 3D VRRAM through the WL switches in the form of 27 1-bit input voltages each time. Since each pixel is 8-bit data, the VMM operation in the receptive field is completed after eight consecutive cycles, and the edge detection image is obtained after multiple sliding window operations (the 1-bit input data flow is shown in Fig. 2). For the ADINWM scheme, the pixels are converted into the corresponding 1-bit input voltages and serially input into the AM, and the VMM operation in the receptive field is completed after 8 × 27 cycles; the edge detection image is obtained after multiple sliding window operations (the 1-bit input data flow is shown in Supplementary Fig. 2).

Details of 3D VRRAM

The details of the 3D VRRAM are shown in Supplementary Fig. 1. The size of the eight-layer 3D VRRAM used in this work is 8 × 8 × 32 (corresponding to 64 BLs and 32 WLs; Fig. 2), which is limited by the current process. The detailed fabrication flow of the 3D VRRAM is given in Supplementary Note 1.

The V_w/2 writing scheme is applied to the 3D VRRAM. Here V_w refers to the write voltage. During the writing process, the WLs and BLs of the selected cells are biased at V_w and ground, respectively, whereas both WLs and BLs of the unselected cells are biased at V_w/2.
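
The effect of this biasing can be seen by computing the voltage across every cell. The sketch below is a plain illustration of the half-bias rule as stated; the function name, the small array size and the write voltage of 2.0 V are hypothetical and not taken from the article.

```python
def half_bias_cell_voltages(n_wl, n_bl, sel_wl, sel_bl, v_w):
    """Voltage across each cell (WL minus BL) under the V_w/2 scheme.

    Selected WL is at V_w and selected BL at ground; every other line
    is held at V_w/2.
    """
    wl = [v_w if i == sel_wl else v_w / 2 for i in range(n_wl)]
    bl = [0.0 if j == sel_bl else v_w / 2 for j in range(n_bl)]
    return [[wl[i] - bl[j] for j in range(n_bl)] for i in range(n_wl)]


drops = half_bias_cell_voltages(n_wl=3, n_bl=4, sel_wl=1, sel_bl=2, v_w=2.0)
for row in drops:
    print(row)
# [0.0, 0.0, 1.0, 0.0]
# [1.0, 1.0, 2.0, 1.0]
# [0.0, 0.0, 1.0, 0.0]
```

Only the selected cell sees the full V_w; cells sharing its WL or BL see V_w/2, and all remaining cells see zero, so unselected cells stay below the write voltage.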

Evaluation of inference accuracy on MNIST and CIFAR-10 datasets

The detailed configuration of the self-defined CNN is depicted in Supplementary Fig. 7. The feature extraction layer contains six 5 × 5 convolution kernels and a 2 × 2 max pooling layer. The classification layer is a 200 × 10 fully connected layer. As a proof-of-concept demonstration, six 5 × 5 convolution kernels learned by the self-defined CNN on the MNIST dataset at the 1bIN–2bW, 4bIN–5bW and 8bIN–9bW configurations are unrolled into multiple 1 × 25 vectors and mapped on 3D VRRAM cells belonging to different BLs. Similar to the edge detection of 3D images, the learned convolution kernels are used to extract features of handwritten digits under the control of the FPGA platform. The results of the VMM operations on hardware are input to the next modules implemented in software, including the activation functions, max pooling layer and classification layer.

The quantized VGG-8 CNN model for inference on CIFAR-10 is based on the ‘WAGE’ method^35. The CADS, AM and DSCC circuits are implemented by functions embedded in the inference process of VGG-8. In principle, the inference accuracy could be obtained on hardware by repeatedly writing the weights of the CNN to the 3D VRRAM; however, with current technology, the large number of write operations required by a large CNN may cause the device to break down. Larger 3D VRRAMs or multiple arrays in our future work will resolve this problem.

Data availability

The data that support the plots within this paper and other findings of this study are available from the corresponding authors upon reasonable request. The MRI data in the paper are available at http://www.ia.unc.edu/MSseg.

Code availability

The code that supports the experimental platforms and proposed 3D VRRAM-based nvCIM test chip is available from the corresponding authors upon reasonable request.

References

Lecun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012).
Sze, V., Chen, Y.-H., Yang, T.-J. & Emer, J. S. Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE 105, 2295–2329 (2017).
Xu, X. et al. Scaling for edge inference of deep neural networks. Nat. Electron. 1, 216–222 (2018).
Ji, S. et al. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35, 221–231 (2013).
Hegde, K., Agrawal, R., Yao, Y. & Fletcher, C. Morph: flexible acceleration for 3D CNN-based video understanding. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) 933–946 (IEEE, 2018).
Liu, S. et al. Cambricon: an instruction set architecture for neural networks. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) 393–405 (IEEE, 2016).
Shin, D. et al. DNPU: an 8.1TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks. In 2017 IEEE International Solid-State Circuits Conference (ISSCC) 240–241 (IEEE, 2017).
Chen, Y.-H., Krishna, T., Emer, J. S. & Sze, V. Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid State Circuits 52, 127–138 (2017).
Pandiyan, D. & Wu, C. Quantifying the energy cost of data movement for emerging smart phone workloads on mobile platforms. In 2014 IEEE International Symposium on Workload Characterization (IISWC) 171–180 (IEEE, 2014).
Chen, W. H. et al. CMOS-integrated memristive non-volatile computing-in-memory for AI edge processors. Nat. Electron. 2, 420–428 (2019).
Cai, F. et al. A fully integrated reprogrammable memristor–CMOS system for efficient multiply–accumulate operations. Nat. Electron. 2, 290–299 (2019).
Yao, P. et al. Fully hardware-implemented memristor convolutional neural network. Nature 577, 641–646 (2020).
Chi, P. et al. PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. ACM SIGARCH Comput. Archit. News 44, 27–39 (2016).
Xue, C. et al. A 22nm 2Mb ReRAM compute-in-memory macro with 121-28TOPS/W for multibit MAC computing for tiny AI edge devices. In 2020 IEEE International Solid-State Circuits Conference—(ISSCC) 244–246 (IEEE, 2020).
Liu, Q. et al. A fully integrated analog ReRAM based 78.4TOPS/W compute-in-memory chip with fully parallel MAC computing. In 2020 IEEE International Solid-State Circuits Conference—(ISSCC) 500–502 (IEEE, 2020).
Xue, C. et al. A 1Mb multibit ReRAM computing-in-memory macro with 14.6ns parallel MAC computing time for CNN based AI edge processors. In 2019 IEEE International Solid-State Circuits Conference—(ISSCC) 388–390 (IEEE, 2019).
Chen, W.-H. et al. A 65 nm 1 Mb nonvolatile computing-in-memory ReRAM macro with sub-16 ns multiply-and-accumulate for binary DNN AI edge processor. In 2018 IEEE International Solid-State Circuits Conference—(ISSCC) 494–495 (IEEE, 2018).
Lee, M. et al. 2-stack 1D-1R cross-point structure with oxide diodes as switch elements for high density resistance RAM applications. In 2007 IEEE International Electron Devices Meeting 771–774 (IEEE, 2007).
Lee, M. et al. Stack friendly all-oxide 3D RRAM using GaInZnO peripheral TFT realized over glass substrates. In 2008 IEEE International Electron Devices Meeting 1–4 (IEEE, 2008).
Yoon, H. et al. Vertical cross-point resistance change memory for ultra high density non-volatile memory applications. In 2009 Symposium on VLSI Technology 26–27 (IEEE, 2009).
Chen, H. et al. HfOx based vertical resistive random access memory for cost-effective 3D cross-point architecture without cell selector. In 2012 International Electron Devices Meeting 20.7.1–20.7.4 (IEEE, 2012).
Yu, S. et al. 3D vertical RRAM—scaling limit analysis and demonstration of 3D array operation. In 2013 Symposium on VLSI Technology T158–T159 (IEEE, 2013).
Deng, Y. et al. Design and optimization methodology for 3D RRAM arrays. In 2013 IEEE International Electron Devices Meeting 25.7.1–25.7.4 (IEEE, 2013).
Shulaker, M. M. et al. Three-dimensional integration of nanotechnologies for computing and data storage on a single chip. Nature 547, 74–78 (2017).
Adam, G. C. et al. 3-D memristor crossbars for analog and neuromorphic computing applications. IEEE Trans. Electron Devices 64, 312–318 (2017).
Li, Z., Chen, P. Y., Xu, H. & Yu, S. Design of ternary neural network with 3-D vertical RRAM array. IEEE Trans. Electron Devices 64, 2721–2727 (2017).
Lin, P. et al. Three-dimensional memristor circuits as complex neural networks. Nat. Electron. 3, 225–232 (2020).
Li, H. et al. Four-layer 3D vertical RRAM integrated with FinFET as a versatile computing unit for brain-inspired cognitive information processing. In 2016 IEEE Symposium on VLSI Technology 1–2 (IEEE, 2016).
Luo, Q. et al. 8-layers 3D vertical RRAM with excellent scalability towards storage class memory applications. In 2017 IEEE International Electron Devices Meeting (IEDM) 2.7.1–2.7.4 (IEEE, 2017).
Huo, Q. et al. Demonstration of 3D convolution kernel function based on 8-layer 3D vertical resistive random access memory. IEEE Electron Device Lett. 41, 497–500 (2020).
Xu, X. et al. Fully CMOS compatible 3D vertical RRAM with self-aligned self-selective cell enabling sub-5nm scaling. In 2016 IEEE Symposium on VLSI Technology 1–2 (IEEE, 2016).
Boybat, I. et al. Neuromorphic computing with multi-memristive synapses. Nat. Commun. 9, 2514 (2018).
Han, S. et al. EIE: efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) 243–254 (ACM, 2016).
Wu, S. et al. Training and inference with integers in deep neural networks. In 2018 International Conference on Learning Representations (ICLR) (2018).


Acknowledgements

This work is supported by the National Key Research Plan of China (nos. 2018YFB0407500, 2019YFB2205100, 2021YFB3601300 and 2019Q1NRC001); the Strategic Priority Research Program of the Chinese Academy of Sciences, China (grant no. XDB44000000); and the National Natural Science Foundation of China (grant nos. 61720106013, 61888102, 61922083, 61834009 and 62025406).

Author information

These authors contributed equally: Qiang Huo, Yiming Yang.

Authors and Affiliations

Institute of Microelectronics of Chinese Academy of Sciences, Beijing, China
Qiang Huo, Xiangqu Fu, Qirui Ren, Xiaoxin Xu, Qing Luo, Guozhong Xing, Hao Wu, Yiyang Yuan, Feng Zhang & Ming Liu
Beijing Institute of Technology, Beijing, China
Yiming Yang, Yiming Wang, Xiaoran Li & Xinghua Wang
School of Integrated Circuits, Guangdong University of Technology, Guangzhou, China
Dengyun Lei
Xiamen University of Technology, Xiamen, China
Chengying Chen
University of Electronic Science and Technology of China, Chengdu, China
Xin Si & Qiang Li
National Tsing Hua University, Hsinchu, Taiwan
Meng-Fan Chang

Authors

Qiang Huo
Yiming Yang
Yiming Wang
Dengyun Lei
Xiangqu Fu
Qirui Ren
Xiaoxin Xu
Qing Luo
Guozhong Xing
Chengying Chen
Xin Si
Hao Wu
Yiyang Yuan
Qiang Li
Xiaoran Li
Xinghua Wang
Meng-Fan Chang
Feng Zhang
Ming Liu

Contributions

Q.H., Y. Yang and F.Z. conceived the concept of this work and designed the CIM chip and experiments. X.X., Q. Luo and D.L. contributed to the 3D VRRAM design, measurements and device optimizations. Y. Yang, D.L. and Q.H. built the test platform. Q.H. conceived the CNN algorithm and simulation design. Q.H., X.F. and Q.R. wrote the manuscript. F.Z. and M.L. supervised this work. All the authors contributed to the results and discussion and commented on the manuscript.

Corresponding authors

Correspondence to Xinghua Wang or Feng Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Electronics thanks Lu Lu, Jae-sun Seo and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–10, Table 1 and Notes 1–6.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Huo, Q., Yang, Y., Wang, Y. et al. A computing-in-memory macro based on three-dimensional resistive random-access memory. Nat Electron 5, 469–477 (2022). https://doi.org/10.1038/s41928-022-00795-x


Received: 04 June 2021
Accepted: 14 June 2022
Published: 26 July 2022
Issue Date: July 2022
DOI: https://doi.org/10.1038/s41928-022-00795-x
