Abstract
Hardware for deep neural network (DNN) inference often suffers from insufficient on-chip memory, requiring accesses to separate memory-only chips. Such off-chip memory accesses incur considerable energy and execution-time costs. Fitting entire DNNs in on-chip memory is challenging, in particular because of the physical size of the memory technology. Here, we report a DNN inference system—termed Illusion—that consists of networked computing chips, each of which contains a minimal amount of local on-chip memory and mechanisms for quick wakeup and shutdown. An eight-chip Illusion hardware system achieves energy consumption and execution time within 3.5% and 2.5%, respectively, of an ideal single chip with no off-chip memory. Illusion is flexible and configurable, achieving near-ideal energy and execution times for a wide variety of DNN types and sizes. Our approach is tailored to on-chip non-volatile memory with resilience to permanent write failures, but is applicable to several memory technologies. Detailed simulations also show that our hardware results could be scaled to 64-chip Illusion systems.
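The core idea above — making several chips with small local memories behave like one chip with a large memory — requires mapping a DNN's layers onto chips so that each chip's share of the weights fits in its on-chip memory. The following is a minimal illustrative sketch, not the paper's actual scheduling and mapping algorithm: a greedy assignment of consecutive layers to chips under a per-chip capacity, with made-up layer sizes and capacity.

```python
# Hedged sketch (not the published Illusion mapping algorithm): greedily
# assign consecutive DNN layers to chips so that each chip's assigned
# weights fit within its small on-chip memory capacity.

def map_layers_to_chips(layer_bytes, chip_capacity_bytes):
    """Assign consecutive layers to chips without exceeding per-chip memory.

    layer_bytes: list of per-layer weight sizes in bytes; for simplicity,
    each layer is assumed to fit on a single chip (real systems may split
    a layer across chips).
    Returns a list of chips, each a list of layer indices.
    """
    chips, current, used = [], [], 0
    for i, size in enumerate(layer_bytes):
        if size > chip_capacity_bytes:
            raise ValueError(f"layer {i} exceeds a single chip's memory")
        if used + size > chip_capacity_bytes:
            # Current chip is full: start filling the next chip.
            chips.append(current)
            current, used = [], 0
        current.append(i)
        used += size
    if current:
        chips.append(current)
    return chips

# Example with hypothetical layer sizes and a 4 MB per-chip capacity.
layers = [1_000_000, 3_000_000, 2_500_000, 500_000, 3_800_000]
print(map_layers_to_chips(layers, 4_000_000))  # → [[0, 1], [2, 3], [4]]
```

Because each chip only executes its own layers, chips outside the active pipeline stage can be shut down between uses, which is where the quick wakeup and shutdown mechanisms mentioned above matter.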
Data availability
The data that support the findings of this work are available at https://github.com/robust-systems-group/illusion_system.
Code availability
The code that supports the findings of this work is available at https://github.com/robust-systems-group/illusion_system.
Acknowledgements
We acknowledge the Defense Advanced Research Projects Agency (DARPA) 3DSoC programme, the NSF/NRI/GRC E2CDA programme, Intel Corporation, CEA-LETI and the Stanford SystemX Alliance. M.M.S.A. is supported in part by the Singapore AME programmatic fund titled Hardware-Software Co-optimization for Deep Learning (project no. A1892b0026). We would also like to acknowledge S. Taheri and the Stanford Prototyping Facility for assistance with the design, test and debugging of the test harness printed circuit boards.
Author information
Contributions
R.M.R. developed the Illusion approach, the system architectural design and the Illusion scheduling and mapping algorithms, and performed all measurements. P.C.J. led DNN implementation and training. R.M.R. and P.T. developed the BILP. T.F.W. and B.Q.L. designed the test chips under the guidance of E.V., P.V., E.N., E.B. and H.-S.P.W. The test harness was developed by R.M.R. and T.F.W. Y.X., A.B. and R.M.R. performed Illusion system simulations under the guidance of M.M.S.A. The modelling of Illusion was performed by Z.F.K. and R.M.R. Distributed ENDURER was developed by Z.F.K., who performed analysis and simulations with M.M.S.A., with M.W. providing guidance. S.M. supervised, advised on and led all aspects of the project.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Figs. 1–15, Tables 1–18 and Sections 1–5.
About this article
Cite this article
Radway, R.M., Bartolo, A., Jolly, P.C. et al. Illusion of large on-chip memory by networked computing chips for neural network inference. Nat Electron 4, 71–80 (2021). https://doi.org/10.1038/s41928-020-00515-3