
  • Article

A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference

Abstract

Analogue in-memory computing (AIMC) with resistive memory devices could reduce the latency and energy consumption of deep neural network inference tasks by directly performing computations within memory. However, to achieve end-to-end improvements in latency and energy consumption, AIMC must be combined with on-chip digital operations and on-chip communication. Here we report a multicore AIMC chip designed and fabricated in 14 nm complementary metal–oxide–semiconductor technology with backend-integrated phase-change memory. The fully integrated chip features 64 AIMC cores interconnected via an on-chip communication network. It also implements the digital activation functions and additional processing involved in individual convolutional layers and long short-term memory units. With this approach, we demonstrate near-software-equivalent inference accuracy with ResNet and long short-term memory networks, while implementing all the computations associated with the weight layers and the activation functions on the chip. For 8-bit input/output matrix–vector multiplications, in the four-phase (high-precision) or one-phase (low-precision) operational read mode, the chip can achieve a maximum throughput of 16.1 or 63.1 tera-operations per second at an energy efficiency of 2.48 or 9.76 tera-operations per second per watt, respectively.
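The central operation accelerated by AIMC is the matrix–vector multiplication (MVM). The toy model below (an illustrative sketch only, not the chip's actual signal chain; the noise level and quantization scheme are assumptions) shows how 8-bit input/output quantization and analogue weight noise jointly bound the precision of such an MVM:

```python
import numpy as np

rng = np.random.default_rng(0)

def aimc_mvm(weights, x, noise_std=0.02, bits=8):
    """Toy model of an analogue in-memory MVM (illustrative only).

    Weights are perturbed with Gaussian noise to mimic conductance
    variations; inputs and outputs are quantized to `bits` precision.
    """
    qmax = 2 ** (bits - 1) - 1
    # Quantize the inputs to signed 8-bit levels
    x_q = np.round(np.clip(x, -1, 1) * qmax) / qmax
    # Additive weight noise stands in for device non-idealities
    w_noisy = weights + rng.normal(0.0, noise_std, size=weights.shape)
    y = w_noisy @ x_q
    # Quantize the (ADC-digitized) output
    scale = np.max(np.abs(y)) or 1.0
    return np.round(y / scale * qmax) / qmax * scale

W = rng.normal(0, 0.5, size=(256, 256))
x = rng.uniform(-1, 1, size=256)
exact = W @ x
approx = aimc_mvm(W, x)
err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
```

With these hypothetical parameters the relative MVM error stays at the few-per-cent level, which is the regime in which hardware-aware-trained networks retain near-software-equivalent accuracy.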


Fig. 1: IBM HERMES project chip overview.
Fig. 2: MVM characterization.
Fig. 3: ResNet-9 on CIFAR-10 measurement results.
Fig. 4: LSTM for character prediction measurement results.
Fig. 5: LSTM for image caption generation measurement results.

Data availability

The data that support the plots within this paper and other findings of this study are available from the corresponding authors upon reasonable request.

References

  1. Murmann, B. Mixed-signal computing for deep neural network inference. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 29, 3–13 (2021).

  2. Shafiee, A. et al. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) 14–26 (IEEE Press, 2016).

  3. Sebastian, A., Le Gallo, M., Khaddam-Aljameh, R. & Eleftheriou, E. Memory devices and applications for in-memory computing. Nat. Nanotechnol. 15, 529–544 (2020).

  4. Yu, S., Jiang, H., Huang, S., Peng, X. & Lu, A. Compute-in-memory chips for deep learning: recent trends and prospects. IEEE Circuits Syst. Magazine 21, 31–56 (2021).

  5. Lanza, M. et al. Memristive technologies for data storage, computation, encryption, and radio-frequency communication. Science 376, eabj9979 (2022).

  6. Wang, Z. et al. Resistive switching materials for information processing. Nat. Rev. Mater. 5, 173–195 (2020).

  7. Xiao, T. P., Bennett, C. H., Feinberg, B., Agarwal, S. & Marinella, M. J. Analog architectures for neural network acceleration based on non-volatile memory. Appl. Phys. Rev. 7, 031301 (2020).

  8. Yu, S. et al. Binary neural network with 16 Mb RRAM macro chip for classification and online training. in 2016 IEEE International Electron Devices Meeting (IEDM) 16.2.1–16.2.4 (IEEE, 2016).

  9. Hu, M. et al. Memristor-based analog computation and neural network classification with a dot product engine. Adv. Mater. 30, 1705914 (2018).

  10. Tsai, H. et al. Inference of long short-term memory networks at software-equivalent accuracy using 2.5M analog phase-change memory devices. in 2019 Symposium on VLSI Technology T82–T83 (IEEE, 2019).

  11. Yao, P. et al. Fully hardware-implemented memristor convolutional neural network. Nature 577, 641–646 (2020).

  12. Joshi, V. et al. Accurate deep neural network inference using computational phase-change memory. Nat. Commun. 11, 2473 (2020).

  13. Biswas, A. & Chandrakasan, A. P. CONV-SRAM: an energy-efficient SRAM with in-memory dot-product computation for low-power convolutional neural networks. IEEE J. Solid-State Circuits 54, 217–230 (2019).

  14. Merrikh-Bayat, F. et al. High-performance mixed-signal neurocomputing with nanoscale floating-gate memory cell arrays. IEEE Trans. Neural Netw. Learn. Syst. 29, 4782–4790 (2018).

  15. Cai, F. et al. A fully integrated reprogrammable memristor–CMOS system for efficient multiply–accumulate operations. Nat. Electron. 2, 290–299 (2019).

  16. Chen, W.-H. et al. CMOS-integrated memristive non-volatile computing-in-memory for AI edge processors. Nat. Electron. 2, 420–428 (2019).

  17. Yin, S., Sun, X., Yu, S. & Seo, J.-S. High-throughput in-memory computing for binary deep neural networks with monolithically integrated RRAM and 90-nm CMOS. IEEE Trans. Electron Devices 67, 4185–4192 (2020).

  18. Khaddam-Aljameh, R. et al. HERMES-Core—a 1.59-TOPS/mm2 PCM on 14-nm CMOS in-memory compute core using 300-ps/LSB linearized CCO-based ADCs. IEEE J. Solid-State Circuits 57, 1027–1038 (2022).

  19. Deaville, P., Zhang, B. & Verma, N. A 22nm 128-kb MRAM row/column-parallel in-memory computing macro with memory-resistance boosting and multi-column ADC readout. in 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits) 268–269 (IEEE, 2022).

  20. Khwa, W.-S. et al. A 40-nm, 2M-cell, 8b-precision, hybrid SLC-MLC PCM computing-in-memory macro with 20.5–65.0 TOPS/W for tiny-AI edge devices. in 2022 IEEE International Solid-State Circuits Conference (ISSCC) 65, 1–3 (IEEE, 2022).

  21. Wan, W. et al. A compute-in-memory chip based on resistive random-access memory. Nature 608, 504–512 (2022).

  22. Hung, J.-M. et al. A four-megabit compute-in-memory macro with eight-bit precision based on CMOS and resistive random-access memory for AI edge devices. Nat. Electron. 4, 921–930 (2021).

  23. Fick, L., Skrzyniarz, S., Parikh, M., Henry, M. B. & Fick, D. Analog matrix processor for edge AI real-time video analytics. in 2022 IEEE International Solid-State Circuits Conference (ISSCC) 65, 260–262 (IEEE, 2022).

  24. Jia, H. et al. Scalable and programmable neural network inference accelerator based on in-memory computing. IEEE J. Solid-State Circuits 57, 198–211 (2022).

  25. Narayanan, P. et al. Fully on-chip MAC at 14 nm enabled by accurate row-wise programming of PCM-based weights and parallel vector-transport in duration-format. IEEE Trans. Electron Devices 68, 6629–6636 (2021).

  26. Dazzi, M. et al. Efficient pipelined execution of CNNs based on in-memory computing and graph homomorphism verification. IEEE Trans. Comput. 70, 922–935 (2021).

  27. Boybat, I. et al. Neuromorphic computing with multi-memristive synapses. Nat. Commun. 9, 2514 (2018).

  28. Khaddam-Aljameh, R. et al. A multi-memristive unit-cell array with diagonal interconnects for in-memory computing. IEEE Trans. Circuits Syst. II, Exp. Briefs 68, 3522–3526 (2021).

  29. Sarwat, S. G. et al. Mechanism and impact of bipolar current voltage asymmetry in computational phase-change memory. Adv. Mater. 2201238 (2022).

  30. Papandreou, N. et al. Programming algorithms for multilevel phase-change memory. in IEEE International Symposium on Circuits and Systems (ISCAS) 329–332 (IEEE, 2011).

  31. Le Gallo, M. et al. Precision of bit slicing with in-memory computing based on analog phase-change memory crossbars. Neuromorp. Comput. Eng. 2, 014009 (2022).

  32. Ielmini, D., Sharma, D., Lavizzari, S. & Lacaita, A. Reliability impact of chalcogenide-structure relaxation in phase-change memory (PCM) cells, part I: experimental study. IEEE Trans. Electron Devices 56, 1070–1077 (2009).

  33. Le Gallo, M., Krebs, D., Zipoli, F., Salinga, M. & Sebastian, A. Collective structural relaxation in phase-change memory devices. Adv. Electron. Mater. 4, 1700627 (2018).

  34. Le Gallo, M., Sebastian, A., Cherubini, G., Giefers, H. & Eleftheriou, E. Compressed sensing with approximate message passing using in-memory computing. IEEE Trans. Electron Devices 65, 4304–4312 (2018).

  35. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Tech. Rep. (Univ. Toronto, 2009).

  36. Marcus, M. P., Santorini, B. & Marcinkiewicz, M. A. Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19, 313–330 (1993).

  37. Rashtchian, C., Young, P., Hodosh, M. & Hockenmaier, J. Collecting image annotations using Amazon’s mechanical turk. in Proc. NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk CSLDAMT ’10 139–147 (Association for Computational Linguistics, 2010).

  38. Rasch, M. J. et al. Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators. Preprint at arXiv https://doi.org/10.48550/arXiv.2302.08469 (2023).

  39. Rasch, M. J. et al. A flexible and fast PyTorch toolkit for simulating training and inference on analog crossbar arrays. in 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS) 1–4 (IEEE, 2021).

  40. Mujika, A., Meier, F. & Steger, A. Fast-slow recurrent neural networks. in Advances in Neural Information Processing Systems Vol. 30 (eds Guyon, I. et al.) (Curran Associates, 2017).

  41. Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. in Proc. 40th Annual Meeting on Association for Computational Linguistics, ACL ’02 311–318 (Association for Computational Linguistics, 2002).

  42. Jain, S. et al. A heterogeneous and programmable compute-in-memory accelerator architecture for analog-AI using dense 2-D mesh. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 31, 114–127 (2022).

  43. Dazzi, M., Sebastian, A., Benini, L. & Eleftheriou, E. Accelerating inference of convolutional neural networks using in-memory computing. Front. Comput. Neurosci. 15, 674154 (2021).

  44. Lin, P. et al. Three-dimensional memristor circuits as complex neural networks. Nat. Electron. 3, 225–232 (2020).

  45. Huo, Q. et al. A computing-in-memory macro based on three-dimensional resistive random-access memory. Nat. Electron. 5, 469–477 (2022).

  46. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE Computer Society, 2016).

  47. Devries, T. & Taylor, G. W. Improved regularization of convolutional neural networks with cutout. Preprint at arXiv https://doi.org/10.48550/arXiv.1708.04552 (2017).

  48. Nandakumar, S. R. et al. Phase-change memory models for deep learning training and inference. in 2019 26th IEEE International Conference on Electronics, Circuits and Systems (ICECS) 727–730 (IEEE, 2019).

  49. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).

  50. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).

  51. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. in 3rd International Conference on Learning Representations, ICLR 2015 (2015).

  52. Vinyals, O., Toshev, A., Bengio, S. & Erhan, D. Show and tell: a neural image caption generator. in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 3156–3164 (IEEE, 2015).

  53. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2818–2826 (IEEE, 2016).

  54. Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2015).

  55. Ioffe, S. & Szegedy, C. Batch normalization: accelerating deep network training by reducing internal covariate shift. in Proc. 32nd International Conference on International Conference on Machine Learning—Volume 37, ICML’15 448–456 (PMLR, 2015).

Acknowledgements

We thank G. W. Burr, M. Bühler, T. Maurer, A. Müller, Y. Kohda, K. Hosakawa, S. Ambrogio, F. L. Lie, F. Liu, T. Levin and T. Gordon for assistance with the chip design; A. Okazaki, H. Mori and M. Bergendahl for assistance with the chip packaging; J. F. Mas, G. Cristiano and J. Paret for chip testing and simulation; F. Odermatt, I. Boybat, S. R. Nandakumar, C. Piveteau, C. Lammie and H. Benmeziane for help with the network deployment on the chip; and A. Pantazi, R. Haas, A. Curioni, S. Tsai, W. Haensch, J. Burns, R. Divakaruni and M. Khare for managerial support. We would also like to thank L. Benini and B. Rajendran for their support with supervising the students. This work was supported by the IBM Research AI Hardware Center. A. Sebastian acknowledges partial funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement nos. 682675 and 966764).

Author information

Contributions

M.L.G. and A. Sebastian defined the neural network inference and compute precision characterization research. R.K.-A. led and performed the analogue design of the chip. M.S. led the digital design of the chip. M.S., M.D., G.K., M.B., A. Singh, S.M.M. and P.A.F. performed the digital design. M.L.G., A.V., B.K., G.K. and A.G. performed the chip testing and wrote the code to operate it. A.V. and B.K. performed the neural network inference and MVM characterization hardware experiments. J.B., X.T., V.J. and M.J.R. performed the hardware-aware training of the neural networks. M.L.G. and U.E. performed the chip performance measurements. U.E. built the chip testing platform. A.P. and T.A. wrote the field-programmable gate array code to interface with the chip. K.B., S.C., I.O., T.P., V.C., C.S., I.A. and N.S. performed the backend integration of the PCM devices and wafer-level testing. M.L.G. wrote the manuscript with input from all authors. V.N., P.A.F., E.E. and A. Sebastian supervised the project.

Corresponding authors

Correspondence to Manuel Le Gallo or Abu Sebastian.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Electronics thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Digital communication fabric.

a, Schematic of the link controller. The dotted line marks the core boundary. The transmitter-side link controller prepends a preamble to the payload being sent on the links. The preamble contains information that uniquely identifies a routing path for a particular type of payload. If the receiving core is enabled in the routing table to receive data from the transmitting core, its link controller samples the incoming data. Furthermore, the link controllers in the transmitting and receiving cores can select a portion of the payload according to another set of routing registers. b, Possible link connections for Core(3,5) and Core(4,5), where the notation Core(r,c) refers to the core located at row r and column c in Fig. 1b. c, Link connections for the entire chip (available connections are denoted in green). The RX and TX connections for Core(3,5) and Core(4,5) shown in b are indicated. d, Link characterization results on one chip for communicating data from the LDPU of one core to the LDPU of another core. A payload of 255 bytes is sent by the transmitter core, and an error is triggered if at least one byte of the payload received in the LDPU of the receiver core does not match the original payload. All links with a Manhattan distance of 1 or 2 cores show no errors when run at 100 MHz, and 98% of them remain error-free at 400 MHz. Links with longer Manhattan distances show more errors, probably because of signal attenuation caused by parasitics in the longer routing metal wires. This issue could be mitigated in a future design by placing buffers at regular intervals along the wires, or by employing a core connectivity matrix that does not rely on long-distance links. All the links used in the experimental demonstrations of this work have a Manhattan distance of 1 core and operate without errors at 400 MHz.
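The routing-table gating described above can be sketched as follows. This is a hypothetical software model, not the RTL: the class and field names (`LinkController`, `routing_table`, the frame layout) are invented for illustration, and only the behaviour of sampling a payload when the transmitting core is enabled is modelled.

```python
def manhattan(a, b):
    """Manhattan distance between two (row, col) core coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

class LinkController:
    """Toy receiver-side link controller with a routing-table gate."""

    def __init__(self, core, routing_table):
        self.core = core                    # (row, col) of this core
        self.routing_table = routing_table  # cores we are enabled to receive from

    def receive(self, frame):
        preamble, payload = frame
        # Sample the incoming data only if the transmitting core is
        # enabled in the routing table; otherwise ignore the frame.
        if preamble["src"] in self.routing_table:
            return payload
        return None

# Core(4,5) is enabled to receive from its neighbour Core(3,5)
tx = (3, 5)
rx_ctrl = LinkController((4, 5), routing_table={tx})
payload = bytes(range(255))  # 255-byte test payload, as in panel d
received = rx_ctrl.receive(({"src": tx}, payload))
```

A disabled core simply never samples the frame, which is how a single physical link fabric supports many logical routing paths.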

Extended Data Fig. 2 PCM crossbar array.

a, Schematic of 8T4R unit-cell. The top electrodes of the conductance pairs of each polarity connect to separate bit lines \(B{L}_{m}^{+}\), \(B{L}_{m}^{-}\) and the sources of their lower access-transistors connect to separate source lines \(S{L}_{n}^{+}\), \(S{L}_{n}^{-}\). Thus, the devices in a conductance pair are weighted with equal significance and the total conductance per unit-cell becomes: \(\left({g}_{1}^{+}+{g}_{2}^{+}\right)-\left({g}_{1}^{-}+{g}_{2}^{-}\right)\). b, Schematic of PCM crossbar array. To program the PCM devices, the dedicated per-core programming FSM instructs the diagonal selection decoder to enable one diagonal of cells that contains the devices that are to be programmed. The diagonal selection decoder controls the \({SEL}_{m,n}^{1}\) and \({SEL}_{m,n}^{2}\) signals in the unit-cell, which are routed diagonally throughout the array. The selected devices are programmed by the current-steering DAC-based programming units located on top of the PCM array. To perform an MVM, the 256 inputs to the crossbar array (IN0 − IN255) are applied via the red source lines (SLs) to the 8T4R cells. The resulting bit line (BL) currents are summed up on the blue wires and read by the ADCs that flank the crossbar array on the left and right. c, Layout of one ADC. The block diagram that is shown below the layout illustrates the various components of the ADC, namely, the read voltage regulator, the current-to-frequency converter, and the 2 × 12-bit ripple counter.
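The differential readout of the 8T4R unit-cell can be sketched numerically. This is an idealized model under assumed units (the array size and conductance ranges below are arbitrary, and real bit-line currents include device and circuit non-idealities); it only verifies that subtracting the two bit-line currents recovers the signed MVM defined by \(\left({g}_{1}^{+}+{g}_{2}^{+}\right)-\left({g}_{1}^{-}+{g}_{2}^{-}\right)\).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8  # toy array size (the chip uses 256 x 256 unit-cells)

# Four non-negative conductances per unit-cell (arbitrary units)
g1p, g2p = rng.uniform(0, 1, (N, N)), rng.uniform(0, 1, (N, N))
g1n, g2n = rng.uniform(0, 1, (N, N)), rng.uniform(0, 1, (N, N))

# Effective signed weight per unit-cell, as in the caption
G = (g1p + g2p) - (g1n + g2n)

v = rng.uniform(0, 1, N)  # input voltages applied on the source lines

# Currents summed on the positive and negative bit lines (Kirchhoff's law)
i_pos = (g1p + g2p) @ v
i_neg = (g1n + g2n) @ v

# The differential read recovers the signed MVM result
assert np.allclose(i_pos - i_neg, G @ v)
```

Because each polarity is carried by two devices of equal significance, the usable conductance range per weight is doubled and device-to-device variations partially average out.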

Extended Data Fig. 3 PCM device.

a, A typical programming curve, showing the programmed device conductance as a function of the programming current. The device conductance is determined by the phase configuration within the PCM device, in particular the size of the amorphous region. Data are presented as mean values ± one standard deviation over 10 repeated measurements on a single device. b, Low-angle annular dark-field (LAADF) scanning transmission electron microscope (STEM) image of a fully RESET PCM device, showing a large amorphous region that fully blocks the bottom electrode. LAADF enables high-resolution imaging of the amorphous region. c, LAADF image of a partially RESET PCM device, showing a much smaller amorphous region. The synaptic weights are stored in an analogue manner in these phase configurations and the resulting conductance values.

Extended Data Fig. 4 Input modulation modes for MVM.

a, Full array read procedure for MVMs, showing the connections between the ADC, the unit-cells and the input modulator switches. Signals PP and PN connect the positive source lines \(S{L}_{1:N}^{+}\) to the positive potential V+ and negative potential V−, respectively. For NP and NN, it is vice versa. b, 1-phase modulation mode. Inputs of positive and negative polarity are applied to weights of positive and negative polarity within a single modulation cycle TPWM. c, 4-phase modulation mode. Inputs of positive and negative polarity are applied individually to weights of positive and negative polarity over four modulation cycles.
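The polarity bookkeeping behind the two modes can be sketched as follows. This is an idealized model (in the absence of circuit non-idealities the two modes give identical results; on hardware the 1-phase mode trades precision for throughput, as the abstract's 16.1 vs 63.1 TOPS figures reflect), and the matrix sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
G = rng.uniform(-1, 1, (4, 4))   # signed effective unit-cell conductances
x = rng.uniform(-1, 1, 4)        # signed PWM-encoded inputs

pos = lambda a: np.maximum(a, 0.0)   # positive-polarity part
neg = lambda a: np.maximum(-a, 0.0)  # negative-polarity part

# 4-phase mode: each input polarity is applied to each weight polarity
# in a separate modulation cycle; the four partial sums are then
# combined with the appropriate signs.
phases = [
    pos(G) @ pos(x), pos(G) @ neg(x),
    neg(G) @ pos(x), neg(G) @ neg(x),
]
y_4phase = phases[0] - phases[1] - phases[2] + phases[3]

# 1-phase mode: all polarity combinations within one modulation cycle
y_1phase = G @ x

assert np.allclose(y_4phase, y_1phase)
```

Expanding \(G x = (G^{+}-G^{-})(x^{+}-x^{-})\) gives exactly the four signed partial products combined above, which is why the 4-phase mode can resolve each polarity pair at higher precision.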

Extended Data Fig. 5 Weight programming procedure.

a, Crossbar array during programming. b, Proposed TDP algorithm to program a target conductance value G on a unit-cell. c, Weight error comparison between the TDP algorithm of this work and previous approaches. TDP Max-fill refers to iteratively programming the two devices up to the maximum conductance \({G}_{\max }\), as proposed in ref. 31. Because of the wide SET distribution shown in Fig. 2a, some devices in the core cannot reach \({G}_{\max }\), whereas others could be programmed to conductance values well above \({G}_{\max }\). The latter approach therefore incurs programming inaccuracies, resulting either from under-utilizing the conductance range of individual devices or from devices that cannot reach \({G}_{\max }\). The proposed TDP algorithm avoids this issue by using the readout SET conductances of the devices in the unit-cell to map the weight.
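The idea of splitting a target conductance across the two devices of a unit-cell can be sketched as below. This is a toy model only: the function names, the pulse gain of 0.5, the noise level and the iteration count are all invented for illustration, and the real TDP algorithm additionally uses the measured SET conductances to choose the split.

```python
import numpy as np

rng = np.random.default_rng(3)

def program_device(target, g_max, n_iter=10, noise=0.02):
    """Closed-loop iterative program-and-verify of one PCM device (toy model)."""
    g = 0.0
    for _ in range(n_iter):
        # Each pulse moves the conductance toward the target,
        # with stochastic programming noise
        g += 0.5 * (min(target, g_max) - g) + rng.normal(0, noise)
        g = float(np.clip(g, 0.0, g_max))  # device cannot exceed its range
    return g

def program_unit_cell(G_target, g_max=1.0):
    """Split a target conductance across the two devices of one polarity."""
    g1 = program_device(min(G_target, g_max), g_max)
    # The second device absorbs the residual left by the first, so the
    # pair can represent targets beyond a single device's range
    g2 = program_device(G_target - g1, g_max)
    return g1, g2

G_target = 1.4  # target above a single device's g_max
g1, g2 = program_unit_cell(G_target)
err = abs((g1 + g2) - G_target)
```

Letting the second device compensate the residual of the first is what makes the combined conductance robust to the wide SET distribution of individual devices.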

Extended Data Fig. 6 Area and power splits.

a, Area breakdown of the main chip components. b, Static power consumed by the different components of the chip measured for the operation of the LSTM unit of the image caption generation task (4 core rows and GDPU active).

Extended Data Table 1 Summary of IBM HERMES Project Chip specifications
Extended Data Table 2 Comparison of IBM HERMES Project Chip with other multi-core AIMC chips demonstrating neural network inference

Supplementary information

Supplementary Video 1

Live demonstration of image captioning using the IBM HERMES project chip.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Le Gallo, M., Khaddam-Aljameh, R., Stanisavljevic, M. et al. A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference. Nat Electron 6, 680–693 (2023). https://doi.org/10.1038/s41928-023-01010-1
