Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Article
  • Published:

A four-megabit compute-in-memory macro with eight-bit precision based on CMOS and resistive random-access memory for AI edge devices

Abstract

Non-volatile computing-in-memory (nvCIM) architecture can reduce the latency and energy consumption of artificial intelligence computation by minimizing the movement of data between the processor and memory. However, artificial intelligence edge devices with high inference accuracy require large-capacity nvCIM macros capable of high-bit-precision dot-product operations. Here we report a four-megabit nvCIM macro that combines memory cells with peripheral circuitry and is based on 22-nm-foundry binary resistive random-access memory devices and complementary metal–oxide–semiconductor (CMOS) processes. The fully CMOS-integrated macro features an asymmetrically modulated input-and-calibration scheme, a calibrated-and-weighted current-to-voltage stacking read scheme, and input-shaping hardware to overcome the challenges involved in designing large-capacity nvCIM macros with high bit precision. The macro offers latencies between 5.2 and 15.2 ns and energy efficiency between 194.4 and 15.6 tera-operations per second per watt in binary to 8-bit-input–8-bit-weight dot-product operations.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Fig. 1: Overview of the proposed 4 Mb ReRAM nvCIM macro.
Fig. 2: AMIC scheme.
Fig. 3: Readout scheme of the proposed nvCIM.
Fig. 4: Proposed IN-S scheme and measurement results.

Similar content being viewed by others

Data availability

The data that support the plots within this paper and other findings of this study are available from the corresponding author upon reasonable request.

Code availability

The code that supports the experimental platforms and proposed nvCIM test chip is available from the corresponding author upon reasonable request.

References

  1. Xue, C.-X. et al. 15.4 A 22nm 2Mb ReRAM compute-in-memory macro with 121-28TOPS/W for multibit MAC computing for tiny AI edge devices. In 2020 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers 244–245 (IEEE, 2020).

  2. Xue, C.-X. et al. 24.1 A 1Mb multibit ReRAM computing-in-memory macro with 14.6ns parallel MAC computing time for CNN based AI edge processors. In 2019 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers 388–390 (IEEE, 2019).

  3. Chen, W.-H. et al. CMOS-integrated memristive non-volatile computing-in-memory for AI edge processors. Nat. Electron. 2, 420–428 (2019).

    Article  Google Scholar 

  4. Tang, K.-T. et al. Considerations of integrating computing-in-memory and processing-in-sensor into convolutional neural network accelerators for low-power edge devices. In 2019 IEEE Symposium on VLSI Technology T166–T167 (IEEE, 2019).

  5. Mochida, R. et al. A 4M synapses integrated analog ReRAM based 66.5 TOPS/W neural-network processor with cell current controlled writing and flexible network architecture. In 2018 IEEE Symposium on VLSI Technology 175–176 (IEEE, 2018).

  6. Yao, P. et al. Fully hardware-implemented memristor convolutional neural network. Nature 577, 641–646 (2020).

    Article  Google Scholar 

  7. Wan, W. et al. 33.1 A 74 TMACS/W CMOS-RRAM neurosynaptic core with dynamically reconfigurable dataflow and in-situ transposable weights for probabilistic graphical models. In 2020 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers 498–499 (IEEE, 2020).

  8. Liu, Q. et al. 33.2 A fully integrated analog ReRAM based 78.4TOPS/W compute-in-memory chip with fully parallel MAC computing. In 2020 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers 500–502 (IEEE, 2020).

  9. Cai, F. et al. A fully integrated reprogrammable memristor–CMOS system for efficient multiply–accumulate operations. Nat. Electron. 2, 290–299 (2019).

    Article  Google Scholar 

  10. Li, C. et al. Analogue signal and image processing with large memristor crossbars. Nat. Electron. 1, 52–59 (2018).

    Article  Google Scholar 

  11. Wang, Z. et al. Fully memristive neural networks for pattern classification with unsupervised learning. Nat. Electron. 1, 137–145 (2018).

    Article  Google Scholar 

  12. Ambrogio, S. et al. Equivalent-accuracy accelerated neural-network training using analogue memory. Nature 558, 60–67 (2018).

    Article  Google Scholar 

  13. Wu, F. et al. Brain-inspired computing exploiting carbon nanotube FETs and resistive RAM: hyperdimensional computing case study. In 2018 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers 492–494 (IEEE, 2018).

  14. Zidan, M.-A. et al. The future of electronics based on memristive systems. Nat. Electron. 1, 22–29 (2018).

    Article  Google Scholar 

  15. Ielmini, D. et al. In-memory computing with resistive switching devices. Nat. Electron. 1, 333–343 (2018).

    Article  Google Scholar 

  16. Yao, P. et al. Face classification using electronic synapses. Nat. Commun. 8, 15199 (2017).

    Article  Google Scholar 

  17. Sheridan, P. et al. Sparse coding with memristor networks. Nat. Nanotechnol. 12, 784–789 (2017).

    Article  Google Scholar 

  18. Li, H. et al. Hyperdimensional computing with 3D VRRAM in-memory kernels: device-architecture co-design for energy-efficient, error-resilient language recognition. In 2016 IEEE International Electron Devices Meeting (IEDM) 16.1.1–16.1.4 (IEEE, 2016).

  19. Chen, B. et al. Efficient in-memory computing architecture based on crossbar arrays. In 2015 IEEE International Electron Devices Meeting (IEDM) 17.5.1–17.5.4 (IEEE, 2015).

  20. Prezioso, M. et al. Training and operation of an integrated neuromorphic network based on metal-oxide memristors. Nature 521, 61–64 (2015).

    Article  Google Scholar 

  21. Wong, H.-S. P. et al. Memory leads the way to better computing. Nat. Nanotechnol. 10, 191–194 (2015).

    Article  Google Scholar 

  22. Yang, J. J. et al. Memristive devices for computing. Nat. Nanotechnol. 8, 13–24 (2013).

    Article  Google Scholar 

  23. Borghetti, J. et al. ‘Memristive’ switches enable ‘stateful’ logic operations via material implication. Nature 464, 873–876 (2010).

    Article  Google Scholar 

  24. Ney, A. et al. Programmable computing with a single magnetoresistive element. Nature 425, 485–487 (2003).

    Article  Google Scholar 

  25. Chou, C.-C. et al. A 22nm 96KX144 RRAM macro with a self-tracking reference and a low ripple charge pump to achieve a configurable read window and a wide operating voltage range. In 2020 IEEE Symposium on VLSI Circuits 1–2 (IEEE, 2020).

  26. Dong, Q. et al. 15.3 A 351TOPS/W and 372.4GOPS compute-in-memory SRAM macro in 7nm FinFET CMOS for machine-learning applications. In 2020 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers 242–244 (IEEE, 2020).

  27. Gonugondla, S. K. et al. A 42pJ/decision 3.12TOPS/W robust in-memory machine learning classifier with on-chip training. In 2018 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers 490–492 (IEEE, 2018).

  28. Biswas, A. et al. Conv-RAM: an energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications. In 2018 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers 488–490 (IEEE, 2018).

  29. Levisse, A. et al. Write termination circuits for RRAM: a holistic approach from technology to application considerations. IEEE Access 8, 109297–109308 (2020).

    Article  Google Scholar 

  30. Chang, M.-F. et al. 19.4 Embedded 1Mb ReRAM in 28nm CMOS with 0.27-to-1V read using swing-sample-and-couple sense amplifier and self-boost-write-termination scheme. In 2014 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers 332–333 (IEEE, 2014).

  31. Liu, Y. et al. 4.7 A 65nm ReRAM-enabled nonvolatile processor with 6× reduction in restore time and 4× higher clock frequency using adaptive data retention and self-write-termination nonvolatile logic. In 2016 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers 84–86 (IEEE, 2016).

  32. Wu, T. F. et al. 14.3 A 43pJ/cycle non-volatile microcontroller with 4.7μs shutdown/wake-up integrating 2.3-bit/cell resistive RAM and resilience techniques. In 2019 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers 226–228 (IEEE, 2019).

  33. Jain, P. et al. 13.2 A 3.6Mb 10.1Mb/mm2 embedded non-volatile ReRAM macro in 22nm FinFET technology with adaptive forming/set/reset schemes yielding down to 0.5V with sensing time of 5ns at 0.7V. In 2019 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers 212–214 (IEEE, 2019).

  34. Lee, C.-F. et al. A 1.4Mb 40-nm embedded ReRAM macro with 0.07um2 bit cell, 2.7mA/100MHz low-power read and hybrid write verify for high endurance application. In 2017 IEEE Asian Solid-State Circuits Conference (A-SSCC) 9–12 (IEEE, 2017).

  35. Shulaker, M. et al. Three-dimensional integration of nanotechnologies for computing and data storage on a single chip. Nature 547, 74–78 (2017).

    Article  Google Scholar 

  36. Seo, S. et al. Artificial optic-neural synapse for colored and color-mixed pattern recognition. Nat. Commun. 9, 5106 (2018).

    Article  Google Scholar 

  37. Zhou, F. et al. Optoelectronic resistive random access memory for neuromorphic vision sensors. Nat. Nanotechnol. 14, 776–782 (2019).

    Article  Google Scholar 

  38. Yu, S. et al. Binary neural network with 16 Mb RRAM macro chip for classification and online training. In 2016 IEEE International Electron Devices Meeting (IEDM) 16.2.1–16.2.4 (IEEE, 2016).

  39. Cassinerio, M. et al. Logic computation in phase change materials by threshold and memory switching. Adv. Mater. 25, 5975–5980 (2013).

  40. Abbey, T. et al. An embedded environmental control micro-chamber system for RRAM memristor characterisation. In 2018 IEEE International Symposium on Circuits and Systems (ISCAS) 1–4 (IEEE, 2018).

  41. Boybat, I. et al. Neuromorphic computing with multi-memristive synapses. Nat. Commun. 9, 2514 (2018).

    Article  Google Scholar 

  42. Gallo, M. L. et al. Mixed-precision in-memory computing. Nat. Electron. 1, 246–253 (2018).

    Article  Google Scholar 

  43. Xue, C.-X. et al. A CMOS-integrated compute-in-memory macro based on resistive random-access memory for AI edge devices. Nat. Electron 4, 81–90 (2021).

    Article  Google Scholar 

  44. Chen, W.-H. et al. A 65nm 1Mb nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors. In 2018 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers 494–496 (IEEE, 2018).

  45. Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 29, 141–142 (2012).

    Article  Google Scholar 

  46. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images (Univ. Toronto, 2009); http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

  47. Biswas, A. & Chandrakasan, A. P. CONV-SRAM: an energy-efficient SRAM with in-memory dot-product computation for low-power convolutional neural networks. IEEE J. Solid State Circuits 54, 217–230 (2019).

    Article  Google Scholar 

Download references

Acknowledgements

We express our appreciation for support from NVM-DTP of TSMC, TSMC-NTHU JDP, CR of TSMC, NTHU and MOST-Taiwan for their technical and financial support.

Author information

Authors and Affiliations

Authors

Contributions

C.-X.X. and J.-M.H. designed the circuits for the nvCIM macro and test chip. C.-X.X., J.-M.H., H.-Y.K., Y.-H.H., S.-P.H., F.-C.C., T.-W.L., C.-J.J., K.-T.T., C.-C.L., R.-S.L., C.-C.H., M.-S.H. and M.-F.C. contributed ideas. H.-Y.K., Y.-H.H., S.-P.H., F.-C.C., T.-W.L., C.-J.J., W.-S.K. and C.-I.S. built the test measurement system and testing flow for the ReRAM nvCIM macro. F.-C.C. and T.-W.L. built the CIFAR-100 demonstration system. C.-X.X. and J.-M.H. performed the analysis and measurements of the nvCIM macro. C.-C.C., Y.-D.C., T.-Y.J.C. and M.-F.C. managed the project. C.-X.X., Y.-D.C., M.-S.H and M.-F.C. wrote the paper.

Corresponding author

Correspondence to Meng-Fan Chang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Electronics thanks Deliang Fan, Miao Hu and Tony T. Kim for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Summary table and resistance distribution of foundry provided 22 nm 1T1R ReRAM cell.

a, The cell size of the production-ready 22 nm 1T1R SLC ReRAM cell was 53F2 26. 27 The set and reset voltages ranged from 1.62~3.63 V, and endurance was estimated at >10 K set/reset 28 cycles. b, Measured cell resistance distribution of 22 nm ReRAM cell. The proposed on-chip 29 program and verify circuit reduced cell-resistance variation by 0.6x in LRS and 0.63x in HRS cells, 30 resulting in an average resistance ratio (HRS/LRS; R-ratio) of 4.51.

Extended Data Fig. 2 Circuits and operational waveform of proposed calibrated-and-weighted 35 current-to-voltage stacking scheme (CW-CVS).

a, Schematic illustrations of CW-CVS and SCCS. b, Operational waveform of CW-CVS scheme used to transform post-calibration dataline current (IDLC) into a stacked voltage signal with place values for multibit partial dot-product results. c, Simulated VSUM1 distribution with and without SCCS. When SCCS was not used, severe overlapping of neighboring partial dot-product values in the voltage domain resulted in a small signal margin. SCCS reduced overlap between neighboring partial dot-product values, resulting in sufficient signal margin for full-precision sensing.

Extended Data Fig. 3 Circuitry and operating waveform of 2-bit-output full-range 48 voltage-mode sense amplifier (2bFR-VSA).

a, Schematic illustration of 2bFR-VSA with differential-voltage double-coupling behavior made possible by two serial capacitor-pairs (C0-C2 and C1-C3). b, Operational waveform for output SAOUT[1:0]=3. c, Breakdown of each operating phase. d, Proposed 2bFR-VSA reduced input-offset voltage by 1.76x-2.91x across the entire input-voltage range (that is, 0 V~VDD), compared to conventional VSA with the same area but without an input-offset suppression scheme. e, Readout yield versus input signal margin of 2bFR-VSA based on 10 K Monte Carlo simulations under typical corners at 250 C. The readout yield of 2bFR-VSA reached 99% under input signal margin of 14 mV and 99.9% under input signal margin of 18 mV.

Extended Data Fig. 4 Yield in macro-level dot-product of nvCIM macro.

Simulated distribution of partial dot-product values readout by 2bFR-VSA, taking into account the process variation typical of ReRAM devices and CMOS transistors.

Extended Data Fig. 5 Comparison table of 2bFR-VSA and CNV-VSA.

Compared a conventional 1b-VSA with 2-bit sequential sensing, the 2bFR-VSA achieved 1.97x improvement in readout latency and a 1.23x improvement in energy consumption. The proposed 2bFR-VSA outperformed conventional 1b-VSA by 1.62x in terms of FoM (latency x Energy x Area).

Extended Data Fig. 6 Concept of digital shift-and-add (DSF) circuits.

Block diagram showing the readout path of the 14b DOUT, including 6b analog-domain readout and digital shift-and-add (DSF) circuits.

Extended Data Fig. 7 Analysis of accuracy under the proposed input-shaping scheme.

a, The improvements in inference accuracy provided by IN-S when applied to an nvCIM macro with reduced output precision can be attributed to (1) an increase in the number of full-precision partial dot-product readouts, and (2) fewer accumulated quantization errors resulting from a reduction in analog readout precision. b, Three examples illustrating changes in quantization error under the IN-S scheme. Case 1: Quantization error remains unchanged; Case 2: Quantization error decreases; Case 3: Quantization error increases. c. IN-S increased the number of non-zero inputs in the MSBs by 8.5% when classifying CIFAR-100 with ResNet-20 model. d, Distribution of the three outcomes of IN-S when classifying CIFAR-100 with ResNet-20 model. e, Simulated quantization error values across layers of the ResNet-20 model trained for the CIFAR-100 dataset. The input-shaping scheme (IN-S) reduced the accumulated quantization error by 45.9%~28.2%, when implemented in conjunction with the proposed macro using the asymmetric quantization readout scheme. f, When applied to an nvCIM with 4b reduced precision for the LSBs, IN-S produced overall reductions in accumulated quantization error that exceeded the overall increases by 2.33x. Overall, IN-S processing reduced the accumulation of quantization errors by 36.4%.

Extended Data Fig. 8 Fluctuations in macro-level dot-product of nvCIM macro.

Measured fluctuations in macro-level dot-product accuracy of proposed nvCIM macro under various temperatures and power supply voltages (VDD). The measured dot-product accuracy was accumulated from the 18 convolution layers of the ResNet-20 model trained for the CIFAR-100 dataset.

Extended Data Fig. 9 Die Photo, chip summary and area breakdown with position chart of recent silicon verified nvCIM macros.

a, Die photo and testchip summary table. b, Basic circuits in the memory macro accounted for 82.23% of the 4 Mb macro area. Computing related circuits (AMIC, CW-CVS, 2bFR-VSA, DSF) accounted for 17.13% of the 4 Mb macro area. c, Position chart of recent silicon verified nvCIM works. Our work achieved the highest figure of merit (FoM), where the FoM was the product of energy efficiency, input bit precision, weight bit precision, and output ratio. Our proposed 4 Mb nvCIM macro also provided the largest memory capacity.

Extended Data Fig. 10 Experiment platform for evaluating proposed nvCIM macro.

a, Experiment platform comprising 22 nm 4 Mb nvCIM testchip, an FPGA as system controller with intermediate data processor, and a PC to display classification results. b, Flow chart of inference process performed on the experiment platform. While the 4 Mb nvCIM macro performed dot-product operations (convolutions), the FPGA fed the input to the nvCIM macro and summed up the batch of partial dot products. The FPGA also performed ReLU operations, pooling, and 1st layer convolution functions. The PC platform presented the intermediate data generated during inference as well as the final inference results.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Hung, JM., Xue, CX., Kao, HY. et al. A four-megabit compute-in-memory macro with eight-bit precision based on CMOS and resistive random-access memory for AI edge devices. Nat Electron 4, 921–930 (2021). https://doi.org/10.1038/s41928-021-00676-9

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s41928-021-00676-9

This article is cited by

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing