Spike-based dynamic computing with asynchronous sensing-computing neuromorphic chip

By mimicking the neurons and synapses of the human brain and employing spiking neural networks on neuromorphic chips, neuromorphic computing offers a promising energy-efficient machine intelligence. How to borrow high-level brain dynamic mechanisms to help neuromorphic computing achieve energy advantages is a fundamental issue. This work presents an application-oriented algorithm-software-hardware co-designed neuromorphic system for this issue. First, we design and fabricate an asynchronous chip called “Speck”, a sensing-computing neuromorphic system on chip. With the low processor resting power of 0.42mW, Speck can satisfy the hardware requirements of dynamic computing: no-input consumes no energy. Second, we uncover the “dynamic imbalance” in spiking neural networks and develop an attention-based framework for achieving the algorithmic requirements of dynamic computing: varied inputs consume energy with large variance. Together, we demonstrate a neuromorphic system with real-time power as low as 0.70mW. This work exhibits the promising potentials of neuromorphic computing with its asynchronous event-driven, sparse, and dynamic nature.


I. INTRODUCTION
In order to accelerate the development of automated and intelligent systems, an increasing number of data sources will be utilised. Consequently, the transmission of data between sources needs to be reduced by communicating only relevant and useful information. To achieve this goal, new and radical developments are required in the fields of extreme edge computing and near-sensor processing. The local extraction of relevant information by moving intelligence to the sensory edge poses difficult challenges in real-time processing with low latency and on the smallest of energy budgets. On-demand and sparse computation is a promising solution to reduce computational load and energy consumption [1] but is contrasted with the need for always-on sensory information processing. Event-based processing is a paradigm that can break the trade-off between these two requirements.
The Authors want to thank Giacomo Indiveri, Dylan Muir, and Kynan Eng for the creative process of conceiving the design concept. The Authors want to thank Nicoletta Risi, Madison Cotteret and Hugh Greatorex for their comments, support and advice on the manuscript. Ole Richter, for the time of the manuscript creation, would like to acknowledge the financial support of the CogniGron research center and the Ubbo Emmius Funds (Univ. of Groningen); during the design and testing Ole Richter was solely affiliated to SynSense AG.
Affiliations: 1 SynSense AG, Thurgauerstrasse 60, 8050 Zurich, Switzerland. 2 SynSense, No. 1577, Tianfu Avenue, Chengdu, Sichuan, PR China. 3 Bio-Inspired Circuits and Systems (BICS) Lab, Zernike Institute for Advanced Materials, University of Groningen, Netherlands. 4 Groningen Cognitive Systems and Materials Center (CogniGron), University of Groningen, Netherlands.
In the field of image and video processing, Convolutional Neural Networks (CNNs) have celebrated significant successes [2,3], and edge inference accelerators for CNNs have also been very successful [4,5]. This is achieved by making Application Specific Integrated Circuits (ASICs) with more efficient architectures that can skip zero multiplications to sparsify the computational load, and as such are now the industry standard [6]. To further increase sparsity in the field of event-based computation and sensing, a promising computational method is Spiking Neural Networks (SNNs) [7,8]. To exploit highly sparse CNNs even further, event-driven CNNs, or sCNNs, only process on the availability of individual pixel data and sparsify the activity from layer to layer by using threshold-based neural units [9].
Modern video compression extensively exploits the information redundancy between adjacent frames in order to compress the data stream [10]. By taking this data reduction approach to the sensor level and enforcing that individual pixels operate independently and communicate only changes in intensity [11][12][13], the amount of data produced by the sensor is massively reduced. Furthermore, this reduction also reduces the latency of the sensor to levels usually only seen in specialised high frame rate camera modules. The utility of Temporal Contrast (TC) encoding cameras has been extensively demonstrated [14], and they are now being adopted by industry [15]. Combining such a low latency, high dynamic range and sparse sensor with an event-driven sCNN processor [16] that excels in real-time low latency processing, on a single SoC, is a natural technological step. To complement the architectural advantages of always-on sparse sensing and computation, the SoC is built in a fully asynchronous fashion. The asynchronous data flow architecture provides low latency, high throughput processing when requested by sensory input, while immediately shifting to a low power/idle state when the sensory input is absent. Specifically, no complex or slow wake-up procedures must be implemented to reduce power consumption.
Fig. 1. Photograph of the realised "Speck1" ASIC. The rectangle at the bottom left of the chip is the 128x128 pixel event sensor for machine vision applications, while the rest of the area is occupied by the processing cores and the NoC.
Neuromorphic intelligence intends to solve the aforementioned challenges in the following domains: physical-time operation on always-on sensory signals, reduction of redundant information, and massively parallel computation.
Small-scale event-driven sCNN processors using different architectural approaches have been proposed in the community [17][18][19][20][21][22][23], in contrast to classical large-scale neuromorphic architectures, which are able to run sCNN networks only at a tremendously high synaptic resource cost [24][25][26][27][28][29]. We present for the first time a resource-efficient medium-scale sCNN processor combined with a machine vision event-driven sensor to form a truly novel smart vision sensor on a single ASIC.
We first present the asynchronous methodology and principles used in Sec. II, followed by the ASIC architecture in Sec. III, including the sCNN processing principle in Sec. III-C and the individual blocks in Sec. III-A to III-E. We conclude with a comparison of different sCNN processors and vision sensor processor combinations in Sec. IV.

II. ASYNCHRONOUS LOGIC DESIGN METHODOLOGY
The design flow of the Speck1 ASIC followed a golden model approach. The golden model was verified using fully proven applications and ML tasks. Then extensive individual feature tests were derived. These feature tests were run at every step of the Integrated Circuit (IC) design flow and compared to the golden model, up to silicon validation after fab-out, to verify functionality. Later models have automated production testing support.
The design methodology follows the dataflow concept [30][31][32]; the processing cores are laid out as individual pipeline systems [33,34]. Processing cores are compositions of template dataflow primitives [35]: Buffer (latch) as seen in Fig. 2, Compose (function block, combined join and/or source), NC-Split (non-conditional fork), Split (conditional demux), Conditional True Pass (conditional sink or latch), Merge (non-deterministic, non-conditional) and Valid-trees (also used as a sink). The templates are built from pull-up/pull-down state-holding logic cells [36,37] and are derived from DYNAP-SE2 [38] and DYNAP-SE [39]. DYNAP-SE was built with the Asynchronous Circuit Toolkit (ACT) [40], and the template designs follow the Pre-Charge Full Buffer (PCFB) design of [41,42]. These templates were used due to their low latency design and the prior silicon verification available to us. In addition, some control primitives that do not carry data, such as token latches, forks, muxes, and C-elements, were used to implement data flow control.
The Speck1 ASIC uses a 4-phase handshake and the Quasi Delay Insensitive (QDI) Dual Rail (DR) data encoding inside the pipelines [35] of the sensor, sensor event pre-processing, convolutional cores, NoC as well as the first half of the readout core pipeline. The encoding is converted to Bundled Data (BD) encoding for optional off-chip asynchronous event communication as well as for the Static Random Access Memory (SRAM) interface, which uses self-timed BD encoding. The readout core is divided into an asynchronous part, talking to the NoC, and a synchronous part to ease integration with standard synchronous off-the-shelf components, micro-controllers and infrastructure. The combinatorial logic inside the QDI DR pipelines consists of pull-up/pull-down non-inverting gates, while inversion is modelled by swapping the true and false wires of the DR data bits. These non-inverting state-holding gates only model the positive transition of the true or false wire output, while the negative signal transition occurs in the reset phase of the handshake. This ensures that the logic is hazard free. The Place and Route (P&R) is done by standard industry tools. Performance is ensured by hierarchically detailed automatic floor planning that employs extensive guides and fences for the individual components and pipeline stages.
Fig. 2. The pull-up/pull-down logic of an N-bit buffer (latch) in Speck1. The control section controls the acknowledge signal of the input channel as well as the transparency of the function block. The function block holds the data and clears it automatically on acknowledgement of the output channel. The buffer decouples the input channel handshake from the output channel handshake by using the data validity. The validity is a C-element tree with OR2 gates on the input to signal when all data channels have data or all data channels are neutral. To save space, the output validity circuit is shared with the next data flow element. Later versions use a design which supports automated testing.
Fig. 3. The Speck architecture. The yellow area indicates both the 128x128 event-based vision sensor with its 2D asynchronous readout and the sensor event pre-processing pipeline. The blue area indicates the NoC responsible for all the event routing between all the components. The area indicated in orange incorporates all nine sCNN cores, which handle one convolution and one pooling layer each. The sCNN cores can optionally be operated as fully connected SNN layers with some restrictions. The brown area indicates the decision readout logic. This core enables interfacing to simple synchronous periphery.
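As an illustration of the dual-rail validity detection described for the buffer (OR2 gates per bit feeding a C-element tree), the following is a minimal behavioural sketch in Python. It is not the Speck1 circuit: the class and function names (`ValidityTree`, `dual_rail_encode`, `NEUTRAL`) are hypothetical, and the C-element tree is collapsed into a single N-input state-holding element.

```python
from typing import List, Tuple

Rail = Tuple[int, int]          # (true_wire, false_wire) of one dual-rail bit
NEUTRAL: Rail = (0, 0)          # spacer state between two data tokens


def dual_rail_encode(bit: int) -> Rail:
    """Encode one data bit on two wires; exactly one wire is high."""
    return (1, 0) if bit else (0, 1)


class ValidityTree:
    """Behavioural model of OR2 gates feeding an N-input C-element.

    The output rises only when every bit carries data, falls only when
    every bit is neutral, and holds its previous value otherwise.
    """

    def __init__(self) -> None:
        self.valid = 0

    def update(self, word: List[Rail]) -> int:
        per_bit = [t | f for (t, f) in word]   # OR2 per dual-rail bit
        if all(per_bit):
            self.valid = 1                     # all channels have data
        elif not any(per_bit):
            self.valid = 0                     # all channels are neutral
        return self.valid                      # otherwise: hold state
```

The state-holding behaviour is what makes the 4-phase handshake insensitive to the order in which individual wires switch: a partially arrived or partially cleared word never changes the validity output.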

III. ARCHITECTURE
The architecture comprises four different components: the convolution cores, the sensor, the sensor event pre-processing block, and the readout core [43][44][45]. These components are connected by a unicast event routing system, the NoC, depicted in Fig. 3. The following subsections walk through the processing chain, starting from the incoming photons.

A. Sensor
The sensor of Speck1 consists of 128x128 individually operating event-based vision pixels, also called Dynamic Vision Pixels [14]. These pixels encode the incident light intensity temporally on a logarithmic intensity scale, also known as TC encoding. The analog pixel design follows the design given in [46] and was provided by Chenan Li from IniVation AG as an analog pixel Intellectual Property (IP) macro. Each pixel is attached to a single handshake buffer to decouple the pixel reset and timing from the nanosecond delays of the arbitration readout system. From the handshake pixel buffer, the event is handed to the arbitration system by signalling on a shared pre-charge pull-down bus in the column and, on acknowledge, signalling on a shared bus in the row. The arbitration is built out of one arbiter tree for column arbitration and one for the rows [47,48]. It follows the same design found in DYNAP-SE [39]. The event address is encoded with a QDI DR encoder from the acknowledge signals of the arbitration trees and handed off as a 4-phase handshake Address Event Representation (AER) channel word to the event pre-processing block. A complete arbitration process with ID encoding takes approximately 2.5 to 7.5 ns for a single readout. This can be optimised significantly by known techniques [49,50], but the specification requirements of real-world signals are met with a margin which allows for a more basic arbitration approach. The arbitration endpoint with buffer in the pixel itself is optimised to limit the transistor count and results in a front-illumination fill factor of 45% for each pixel.
The pixel also has a configurable kill switch to eliminate any hot pixel defects due to fabrication at the pixel level by forcing the pixel and buffer into a reset state.
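The TC encoding of a single pixel can be illustrated with a small software model. The sketch below is an idealised assumption, not a description of the analog pixel in [46]: the function name `temporal_contrast_events`, the sampled input and the contrast threshold of 0.2 are all hypothetical, and noise, refractory behaviour and the kill switch are ignored.

```python
import math
from typing import List, Sequence, Tuple


def temporal_contrast_events(
    intensities: Sequence[float], threshold: float = 0.2
) -> List[Tuple[int, int]]:
    """Return (sample_index, polarity) events for one idealised pixel.

    An event is emitted each time the log-intensity has moved by
    `threshold` away from the reference level stored at the last event.
    Polarity is +1 for brighter (ON) and -1 for darker (OFF).
    """
    events: List[Tuple[int, int]] = []
    reference = math.log(intensities[0])
    for index in range(1, len(intensities)):
        delta = math.log(intensities[index]) - reference
        while abs(delta) >= threshold:
            polarity = 1 if delta > 0 else -1
            events.append((index, polarity))
            reference += polarity * threshold   # pixel resets to new level
            delta = math.log(intensities[index]) - reference
    return events
```

Because the comparison happens on a logarithmic scale, a doubling of intensity produces the same events at any absolute brightness, and a constant input produces no events at all, which is the source of the sparsity exploited by the rest of the pipeline.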

B. Sensor event pre-processing core
To conform the raw AER event stream from the sensor to the requirements of the sCNN, a pre-processing stage is required. The image may be flipped, rotated or cropped if only a Region Of Interest (ROI) of the image is required. A lower resolution of the image might be required, or the polarity can be ignored. To accomplish this, the sensor event pre-processing pipeline consists of multiple stages, seen in Fig. 4. The sensor event will travel through the following pipeline stages:
• Sensor interface: the chip can receive pixel events from the built-in sensor and external sources directly via an AER interface and send the sensor events off-chip for monitoring.
• Pooling: sum pooling can be used to scale the 2D input address space of every event word by 1:1, 1:2 and 1:4 on x and y coordinate components individually.
• ROI cutting: cutting can be used to cut a 1x1 to 128x128 size patch out of the 2D input address space that is forwarded to the sCNN.
• Image rotation and mirroring: in case the smart sensor is mounted sideways, on top or is looking through a mirror, the 2D input address space x and y coordinate components can be flipped, inverted and swapped.
• Polarity filtering: polarity selection enables the selection of both polarities as separate channels, to filter one of them or to combine both polarities on a single channel.
• Source mapping: the resulting pre-processed event can be forwarded to up to 2 destination layers via the NoC; a routing header is attached and one event is sent per destination.
The event is then sent to the convolutional cores via the NoC.
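The pipeline stages above can be sketched as a chain of address transformations on a single event. This is a behavioural illustration only: the names `PreprocConfig` and `preprocess`, the field layout, the choice to express the ROI in pooled coordinates and to re-base cropped addresses to the ROI origin are assumptions made for the example, not the register interface of the chip.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PreprocConfig:
    pool_x: int = 1                       # 1, 2 or 4
    pool_y: int = 1                       # 1, 2 or 4
    roi: Tuple[int, int, int, int] = (0, 0, 127, 127)  # inclusive, pooled coords
    flip_x: bool = False
    flip_y: bool = False
    swap_xy: bool = False
    polarity: str = "both"                # "both", "on", "off" or "merge"
    destinations: Tuple[int, ...] = (0,)  # at most 2 destination layers


def preprocess(x: int, y: int, p: int, cfg: PreprocConfig) -> List[Tuple[int, int, int, int]]:
    """Map one sensor event to zero, one or two (dest, channel, x, y) events."""
    # Pooling: scale the address space per axis.
    x, y = x // cfg.pool_x, y // cfg.pool_y
    # ROI cutting: drop events outside the patch, re-base the rest (assumption).
    x0, y0, x1, y1 = cfg.roi
    if not (x0 <= x <= x1 and y0 <= y <= y1):
        return []
    x, y = x - x0, y - y0
    width, height = x1 - x0 + 1, y1 - y0 + 1
    # Rotation and mirroring: flip and/or swap the coordinate components.
    if cfg.flip_x:
        x = width - 1 - x
    if cfg.flip_y:
        y = height - 1 - y
    if cfg.swap_xy:
        x, y = y, x
    # Polarity filtering: keep both as channels, keep one, or merge.
    if cfg.polarity == "both":
        channel = p
    elif cfg.polarity == "merge":
        channel = 0
    elif cfg.polarity in ("on", "off"):
        if p != (1 if cfg.polarity == "on" else 0):
            return []
        channel = 0
    else:
        raise ValueError("unknown polarity mode: " + cfg.polarity)
    # Source mapping: one event per destination, at most two.
    return [(dest, channel, x, y) for dest in cfg.destinations[:2]]
```

For example, with 1:2 pooling on both axes, a 64x64 ROI and horizontal mirroring, the sensor event (127, 10) with OFF polarity becomes the address (0, 5) on channel 0.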

C. Network on Chip
The NoC router follows a star topology. The routing system operates in a non-blocking way for any feed-forward network model and routes events via AER connections. The mapping system allows data to be sent from one convolution core to up to 2 other cores, and for one core to receive events from multiple sources without address superposition, with up to 1024 incoming feature channels. On every incoming channel the routing header of every AER packet is read and the payload is directed to the destination. This is done by establishing separate physical routing channels that are parallel and do not intersect for any network topology that does not contain recurrence. This prevents skew due to other connections and deadlocks caused by loops inside the pipeline structure. In combination with the PCFB method, the First In First Out (FIFO) structures display low latency in routing the AER words to their destination. The routing header information is stripped from the word during transport, and the payload is delivered to its intended destination.
Fig. 5. The convolution core architecture (sCNN core in Fig. 3). An event {c, x, y} enters the convolution core pipeline, with c as the incoming channel/feature, x as the horizontal coordinate and y as the vertical coordinate. After padding, the event is expanded to {c, xp, yp}. The Kernel Anchor determines the anchor in both kernel and neuron space, {c, x_0, y_0, x_{k0}, y_{k0}}, with x_0, y_0 being the anchor in the neuron space and x_{k0}, y_{k0} the anchor in the kernel space. The kernel address sweep now calculates the kernel expansion in x, y and f, the output channels/features, to Z * {(c, f, x_k, y_k), (f, x, y)}, with Z being the synaptic fan-out. The parallel address compression packs the storage addresses compactly to avoid unused storage gaps for the neuron, (f, x, y) => ncomp, and the kernel, (c, f, x_k, y_k) => kcomp. Depending on the core, the kernel memory is split into one or multiple memory blocks for parallel access. The kernel value is read from the storage address kcomp, yielding {w, ncomp}, with w being the signed 8-bit synaptic weight. On a simulation tick, the bias/leak sweep generates a pair {bcomp, ncomp} for every active neuron; the address bcomp is read in the bias/leak memory and forwarded as {w, ncomp} with the kernel events to the neuron. Depending on the core, the neuron unit is split into one or multiple parallel compute units, see Fig. 7. The address space decompression turns {ncomp} back into {f, x, y}. The sum pooling operates on the same event structure, {f, xs, ys}. The channel shift and routing stage prepares it for routing as S * {dx, fs, xs, ys}, with S being the source fan-out of 1 or 2, dx corresponding to the destination id and fs being the arithmetically shifted destination channel.
Fig. 6. The sCNN computation principle. For a single event (black) arriving on the input space, a corresponding anchor coordinate in the kernel space (orange, middle) is calculated and the weight at this position is read. The read weight is applied to the corresponding anchor neuron in the output space (orange, right). The anchor is used to define the starting position in kernel and output space depending on the layer configuration. From this point onward the kernel position is moved away from the anchor, in this case with a stride of 1, so by 2 fields in the x coordinate (blue, middle), while the neuron is moved one field in the x coordinate from the anchor in the opposite direction (blue, right). This is continued until all possible kernel positions have been read, in this case 2 additional ones (brown and yellow), as the stride configuration in this case skips every other position in the x and y input coordinate space. This step is repeated for all output channels/features f with their corresponding kernel. The stride and kernel size are configurable and, together with x, y, c and f (with c being the input channel), result in different sweeps and different affected neurons.

D. Convolution Cores
In contrast to CNNs, event-driven sCNNs do not operate on a full frame basis: for every arriving pixel event, the convolution is computed for only that pixel position. For a given input pixel, all output neurons associated with its convolution are traversed, as opposed to a kernel that is swept pixel-by-pixel over a complete image. An incoming event includes the x and y coordinates of the active pixel as well as the input channel c it belongs to. A step-by-step walk through for an arriving event, as seen in Figs. 5 to 7:
• Zero padding: the event is padded to retain the layer size if needed. The image field, i.e. the address of the events, is expanded by adding pixels to the borders to retain the image size after the convolution if needed.
• Kernel anchor and address sweep: In the kernel mapper, the event is first mapped to an anchor point in the output neuron space and the kernel space. The behaviour is described in Fig. 6. Using this anchor, the kernel, represented by an address, is linked to an address point in the output space. The referenced kernel is swept over the incoming pixel coordinate. The kernel address and the neuron address are swept inversely to each other, as seen in Fig. 6. For every channel in the output neuron space, the kernel anchor address is incremented, so that a new kernel for the new output channel is used. The sweep over the kernel is repeated. In case a stride is configured in either the horizontal or vertical direction, the horizontal and vertical sweeps are adjusted to jump over kernel positions accordingly.
• Address space compression: To effectively use the limited memory space, the verbose kernel address as well as the neuron address are compressed to avoid unused memory locations. Depending on the configuration the address space gets packed, so that there are no avoidable gaps inside the address that are not used by the configuration.
• [...] to the compressed addresses, and the specific signed 8-bit kernel weight is read. The weight and the compressed neuron address are then directed to the parallel neuron compute-in-memory-controller blocks according to the address location. Kernel positions with 0 weight are skipped during reading and are not forwarded to the neuron.
• Neuron compute units: The compute-in-memory-controller blocks model a leaky integrate-and-fire (LIF) neuron with a linear leak for every signed 16-bit memory word. Besides classic read and write, the memory controller has a read-add-check-spike-write operation, as shown in Fig. 7. Whenever the accumulated value reaches a configured threshold, an event is sent out and the neuron state variable has a threshold subtraction or reset written back.
• Bias and leak address sweep and memory: The leak (or bias) is modelled via an additional memory controller. The leak/bias controller has a neuron-individual signed 16-bit weight stored for every output channel map. On a time reference tick, an update event with this bias is sent to all its active neurons. The reference tick is supplied from off-chip and is fully user configurable.
• Pooling: The output events are finally merged onto a pooling stage. The pooling stage operates on the sum pooling principle, i.e., it merges the events from 1, 2 or 4 neurons in both x and y coordinates individually.
• Channel shift and routing: Before entering the routing NoC, the channels are shifted and a prefix with routing information is added; one event is sent per destination, for up to 2 destinations.
Fig. 7. [...] The flow control at the input ensures that the controller always has a bubble and is therefore deadlock-free. The signed 16-bit neuron state variable gets read, modified by the signed synaptic or bias input, compared and written back. In case a threshold condition is met, the {ncomp} is sent out to indicate that the corresponding neuron spiked. The above-threshold condition can trigger either a subtraction operation or a reset of the corresponding neuron state variable to a fixed value. The state variable cannot cross a configured lower bound and will be clamped to that value in case any operation brings the variable below it.
Fig. 8. The readout core, from Fig. 3, is separated into an input FIFO stage (light brown), a variable time sliding average stage (brown) and a readout processing stage (dark brown). The FIFO prevents stalls occurring in the asynchronous to synchronous transition in the following block from affecting the NoC. The stage that writes to the time sliding average variable is an asynchronous operation, whilst in parallel the read in a readout time window is synchronous. The compute stage (dark brown) that does maximum operations and threshold comparisons is fully synchronous to enable easy off-chip interfacing to standard components.
The individual convolution cores can also be used as fully connected layers with up to 65K, 32K and 16K synaptic connections respectively to model final readout decision layers.Fabrication defects in all SRAM memories for kernel, bias and neuron can be blacklisted with a kill bit per word and are skipped during computation.
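The per-event processing of a convolution core can be summarised in a short behavioural model. The sketch below is an assumption-laden illustration rather than the hardware implementation: the class name `EventConvLayer`, the nested-list weight layout `weights[f][c][ky][kx]` and the dictionary used as sparse neuron memory are hypothetical, address compression, leak/bias and pooling are omitted, and only the lower clamp of the state variable is modelled (16-bit overflow is not).

```python
from typing import Dict, List, Tuple

Neuron = Tuple[int, int, int]   # (f, x, y) in the output space


class EventConvLayer:
    """Event-driven convolution with integrate-and-fire neurons (sketch)."""

    def __init__(self, weights, out_w: int, out_h: int, threshold: int,
                 pad: int = 0, stride: int = 1,
                 v_min: int = -32768, reset_on_spike: bool = False) -> None:
        self.w = weights                  # weights[f][c][ky][kx], signed ints
        self.out_w, self.out_h = out_w, out_h
        self.threshold = threshold
        self.pad, self.stride = pad, stride
        self.v_min = v_min                # configured lower bound of the state
        self.reset_on_spike = reset_on_spike
        self.v: Dict[Neuron, int] = {}    # sparse neuron state memory

    def process(self, c: int, x: int, y: int) -> List[Neuron]:
        """Apply one input event {c, x, y}; return the neurons that spiked."""
        spikes: List[Neuron] = []
        xp, yp = x + self.pad, y + self.pad          # zero padding
        for f, per_feature in enumerate(self.w):     # one kernel per feature
            for ky, row in enumerate(per_feature[c]):
                for kx, weight in enumerate(row):
                    if weight == 0:
                        continue                     # zero weights are skipped
                    # Kernel and neuron addresses move in opposite directions.
                    nx, ny = xp - kx, yp - ky
                    if nx % self.stride or ny % self.stride:
                        continue                     # position skipped by stride
                    ox, oy = nx // self.stride, ny // self.stride
                    if not (0 <= ox < self.out_w and 0 <= oy < self.out_h):
                        continue
                    if self._read_add_check_write((f, ox, oy), weight):
                        spikes.append((f, ox, oy))
        return spikes

    def _read_add_check_write(self, neuron: Neuron, weight: int) -> bool:
        v = max(self.v.get(neuron, 0) + weight, self.v_min)   # clamp from below
        spiked = v >= self.threshold
        if spiked:
            v = 0 if self.reset_on_spike else v - self.threshold
        self.v[neuron] = v
        return spiked
```

With a 3x3 kernel of ones, padding 1 and threshold 2, a first event at (1, 1) charges nine neurons without any output, and a second identical event makes all nine spike. Note that no computation takes place unless an event arrives, which is the software analogue of the idle behaviour of the asynchronous pipeline.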

E. Readout core
The readout core transforms the event data into simple readable values and results. It can simultaneously calculate up to 16 spike class counts or moving averages, where the moving average lengths can be configured and can also be set to time bin counting with no averaging. The readout layer optionally compares all values to fixed thresholds, computes the current maximum and makes these results easily accessible on the pins of the chip. All computed values can also be read out for further processing. From the variable time sliding average units, as seen in Fig. 8, onwards, the circuit performs a clock domain crossing from asynchronous to synchronous. After a time reference clock tick, the previous data is presented and held at the output by standard flip-flops and can be interfaced with via standard synchronous processing elements. The time reference tick is supplied from off-chip and controlled by the receiving system.
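A behavioural sketch of this readout is given below. The exact averaging arithmetic of the chip is not specified in the text, so the plain windowed mean used here is an assumption, as are the names `Readout`, `on_spike` and `on_tick`. Spikes arrive asynchronously and only increment counters; all derived values are computed on the externally supplied tick.

```python
from collections import deque
from typing import List, Tuple


class Readout:
    """Per-class spike counting with an optional moving average (sketch)."""

    def __init__(self, n_classes: int = 16, avg_len: int = 1,
                 threshold: float = 0.0) -> None:
        self.counts = [0] * n_classes                 # current time bin
        # avg_len == 1 corresponds to time bin counting with no averaging.
        self.history = [deque(maxlen=avg_len) for _ in range(n_classes)]
        self.threshold = threshold

    def on_spike(self, cls: int) -> None:
        """Asynchronous side: count an incoming spike for one class."""
        self.counts[cls] += 1

    def on_tick(self) -> Tuple[List[float], int, List[bool]]:
        """Synchronous side: close the bin and present the results."""
        for window, count in zip(self.history, self.counts):
            window.append(count)
        self.counts = [0] * len(self.counts)
        values = [sum(window) / len(window) for window in self.history]
        best = max(range(len(values)), key=values.__getitem__)
        above = [value >= self.threshold for value in values]
        return values, best, above
```

A receiving micro-controller would drive `on_tick` at its own pace and read either the winning class index or the per-class threshold flags, mirroring the pin-level interface described above.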

IV. RESULTS AND DISCUSSION
We propose that a medium-scale pipeline architecture integrated with a sensor can offer improvements both in architecture design and in applicability. Matching an event-based sensor with a direct connection to an event-based processor avoids incurring delays caused by event batching, as commonly done in event cameras, as well as delays due to full frame capture and conversion by industry standard image sensors. Each sensory event is transmitted instantaneously after being encoded by our arbitration system. Our pipeline architecture processes a single event with a latency of 3.36µs through a nine-layer convolution with pooling network with kernel size 3 × 3 in each layer. This latency was measured by sending an event to the ASIC and reading the resulting event from it, taking the time between the input and output request edges on the I/O pads. Many events are processed in parallel inside the system as they move step-by-step through the fine-grained pipeline, enabling high throughput, which is set by the neuron compute units, with a measured ≈ 30M events/s per unit. Compared to frame-based systems, the event-based nature of the SoC can give a classification output as soon as enough evidence is accumulated, as opposed to a frame-based camera with CNN accelerator needing to complete the frame processing in its entirety. As a comparison, the fully integrated, higher resolution ISSCC21 Sony [57] needs a fixed 10.1ms for a 2028 × 1520 or 21.3ms for a 4056 × 3040 full-frame image.
To obtain a compelling latency on Speck, the time-to-first-classification (stimulus onset to first classification result) needs to be integrated into the loss function during network training. Looking at classical SNN processors, the time step synchronisation of TrueNorth [25] and Loihi1 [26] and Loihi2 does not allow cores to run controllably out of sync. The latency for each layer is significant when fed with real-time streamed sensory data, as event generation takes one whole simulation time step ∆t, in contrast to our architecture. For those systems, lowering ∆t results in a directly proportional increase of power consumption, a trade-off that is not present in this work. The small architectures of Camunas12 [17] and Yousefzadeh15 [20] display low latency as they are constrained to small toy networks that don't require a SoC and are significantly smaller in specification, reducing power dissipation and transmission delays. SCAMP5 [55] is another fully-integrated smart vision sensor that follows a very different approach, combining the pixel with simple but powerful analog cellular compute units. They can run simple binary CNN-based networks with quite some overhead [56].
A second key point for our presented architecture is the synaptic memory utilisation. Especially for CNN based architectures, the on-the-fly computation of synaptic connections allows for minimising memory requirements. This in turn saves area and energy, which in the case of SRAMs is both dynamic and static energy. Compared to standard SNN implementations, our dedicated sCNN approach allows for many more synaptic connections by using the kernel weights stored in memory and computing all the synapses that share weights, with minimal additional compute required. In the table in Fig. 9, the benefits of a dedicated sCNN approach are evident in the 3x3, 4x4 and 7x7 kernel examples, where there are orders of magnitude of difference between the numbers of formed synaptic connections. The only other large-scale processor that supports kernel convolution by design, Loihi2 [52], restricts its applicable number of formed synapses by its axon-based routing schema with fixed limits on incoming axons per core. For sCNNs, this results in either an underutilisation of the many more neurons Loihi2 has compared to our solution or a significant restriction on the input and output feature maps supported. Axon-based routing shows its strength in other network topologies and is widespread in SNN processors [24,26,39,54] which follow a general purpose SNN approach.
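The memory argument can be made concrete with a small worked calculation. The function below is a sketch under simplifying assumptions (stride 1, border effects ignored, so the synapse count is an upper bound); the name `conv_layer_resources` and the example layer size are hypothetical and are not taken from the table in Fig. 9.

```python
from typing import Tuple


def conv_layer_resources(c_in: int, f_out: int, k: int,
                         out_w: int, out_h: int) -> Tuple[int, int]:
    """Return (stored_weights, formed_synapses) for one conv layer.

    Each of the f_out * out_w * out_h output neurons receives up to
    c_in * k * k inputs, but all neurons of one feature map share the
    same kernel, so only c_in * f_out * k * k weights are stored.
    """
    stored_weights = c_in * f_out * k * k
    formed_synapses = stored_weights * out_w * out_h   # upper bound
    return stored_weights, formed_synapses


# Example: 16 input channels, 16 output features, 3x3 kernel, 32x32 output.
weights, synapses = conv_layer_resources(16, 16, 3, 32, 32)
print(weights, synapses)   # 2304 2359296
```

In this example the ratio between formed synapses and stored weights equals the number of output positions (1024), which is the saving that on-the-fly kernel mapping provides over storing every synapse individually.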
By offering a simple and small end-to-end system comprising the sensor, the processing, and the decision readout, we offer a unique and easily applicable solution. To show the sCNN performance of our SoC, we chose a benchmark that can also be run by the smaller systems presented, to give a comparison: the N-MNIST dataset [60], a spike-converted version of the MNIST dataset [61]. It is recorded using a vibrating ATIS sensor [12] with the original images displayed on an LCD monitor. We train a four layer sCNN with Sinabs [62] using both ANN2SNN [63,64] and Back Propagation Through Time (BPTT) methods. For the training samples, we use the first 250ms of raw data out of 300ms and a timestep of 1ms. For each training method, we carried out five repetitions of training with different random initialisations and five repetitions of testing. The testing experiments are performed offline on a PC and as measurements on Speck1 using a deployed and quantised version of the PC network model. The events are fed into the SoC directly as a raw external sensor event stream, bypassing the inbuilt sensor, for comparable and repeatable results. For ANN2SNN, we obtained an SNN model with an average offline testing accuracy of 94.2% and an on-chip accuracy of 86.17%. For BPTT training, the network demonstrates state-of-the-art recognition results, with an average offline testing accuracy of 99.3075% and an on-chip measured testing accuracy of 98.50%. The samples are presented in real time, so an inference takes the length of a sample, with the classification result arriving during the sample presentation, as the sample is more comparable to a video. The measured mean time-to-first-classification is 18.405ms with a standard deviation of 4.818ms. Please note that the network was purely optimised for accuracy; integrating the latency into the loss function can reduce the time-to-first-classification significantly, as the hardware itself is optimised for low latency responses. In Fig. 10 the results are compared to other SNN capable systems. Most prominent are the effects of dedicated sCNN hardware systems compared to general purpose SNN hardware, so that our proposed system outperforms Loihi1 in accuracy and energy despite a fabrication technology that is multiple generations older (65nm for Speck1 versus 14nm for Loihi1). Additional evaluations on further datasets can be found in [65]. Further applications and demos of Speck with the sensor and sCNN in combined operation can be seen online: one fall detection demo, where the SoC detects whether a human fell, one demo that detects and follows a human [66], as well as a brief overview of possible applications including face and obstacle detection and gesture recognition [67].
Fig. 10. Speck1 performance compared to other architectures. We show that Speck1 matches state-of-the-art performance on SNN hardware systems like Loihi1, while our more dedicated approach consumes less energy despite being fabricated in a much older but significantly more cost effective technology. In such a resource friendly benchmark, significantly smaller solutions like Spoon can perform well with some drop in accuracy while gaining a lot on energy efficiency. Most real world applications include more complex or temporal information and require larger networks that are not accessible to small sCNN systems. The reported power consumption excludes pad frame power consumption for Speck1. The network structure used for this work is 34x34x2-16C5-16C3-P2-8C3-F10 with 300ms sample exposure, 100 training epochs and a learning rate of 1e-3. 16C5 represents a convolutional layer with a 5x5 kernel and 16 channels, P2 represents a 2x2 pooling layer and F represents a fully connected layer.
The training of sCNN networks for Speck is supported by the rich, open-source, high-level framework Sinabs, which is based on PyTorch, and by a full development solution called Samna [68]. Specifically, Sinabs allows optimisations for sparsity and supports the estimation of the synaptic operations (SOP) of its networks.

V. CONCLUSION
We presented a smart sensor fully integrated as a SoC, which shifts an efficient sCNN architecture directly to the sensor edge. By combining both sensor and processing on a single die into a smart sensor, we lower unit production costs significantly while saving energy on high-speed and low-latency data communication, as the raw sensory data never has to leave the chip. The on-the-fly synaptic kernel mapping system lowers the memory resource requirements significantly, making the architecture accessible to larger, more cost-effective fabrication technologies. The event-driven nature of the embedded machine vision sensor delivers high-speed signals in a sparse data manner. The advantages of this combination are taken further by the implemented deep sCNN processing pipeline, which is optimised for low latency and exploits the benefits of highly sparse computation. This ultimately enables low-latency visual processing on a tiny energy budget for edge and end-to-end applications. Finally, we believe the opportunities and market for different types of smart sensors that follow a similar design principle are very promising.
(a) The physical time operation and the processing of always-on sensory signals. The computation speed is matched to natural signals such as bio-signals, visual data, speech, gestures, and a wide range of environmental and industrial signals. (b) The reduction of redundant information by computing on signal change, sparse data availability, and statistical and prediction mismatch, which considerably enhances power efficiency. (c) Massively parallel computation to keep latency to a necessary minimum.

Fig. 4. The sensor event pre-processor ("Sensor event preprocessing" in Fig. 3). The pre-processor can process events from the built-in sensor as well as events from off-chip. The built-in sensor events can also be streamed off-chip for monitoring and further processing. The pipeline stages pool, cut, rotate, mirror, channel filter and shift the input event stream, before forwarding it to one or two destination layers.
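The stage order named in the caption can be illustrated in software. The following is a rough sketch only: the `Event` type, the `preprocess` function and all of its parameters are hypothetical, and the behaviour of each stage (integer-division pooling, rectangular cut, a single 90-degree rotation on a square array, mirroring along x, polarity as the filtered channel) is an assumption rather than the documented behaviour of the chip.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class Event(NamedTuple):
    x: int
    y: int
    polarity: int  # assumed 0 or 1


def preprocess(events: Iterable[Event],
               size: int,
               pool: int = 1,
               roi: Optional[Tuple[int, int, int, int]] = None,
               rotate90: bool = False,
               mirror_x: bool = False,
               keep_polarity: Optional[int] = None,
               shift: Tuple[int, int] = (0, 0)) -> Iterator[Event]:
    """Apply pool, cut, rotate, mirror, channel filter and shift in order.

    `size` is the side length of an assumed square sensor array.
    """
    side = size // pool  # side length after pooling
    for ev in events:
        # pool: merge pool x pool pixel blocks into one address
        x, y, p = ev.x // pool, ev.y // pool, ev.polarity
        # cut: drop events outside the region of interest (inclusive bounds)
        if roi is not None:
            x0, y0, x1, y1 = roi
            if not (x0 <= x <= x1 and y0 <= y <= y1):
                continue
        # rotate: one quarter turn on the pooled array
        if rotate90:
            x, y = y, side - 1 - x
        # mirror: flip along the x axis
        if mirror_x:
            x = side - 1 - x
        # channel filter: keep only one polarity if requested
        if keep_polarity is not None and p != keep_polarity:
            continue
        # shift: add a constant offset to the address
        yield Event(x + shift[0], y + shift[1], p)
```

Because the function is a generator, events are transformed one at a time as they arrive, which loosely mirrors the streaming character of an event pipeline.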

Fig. 7. The neuron compute unit from Fig. 5. It uses in-memory-controller compute to model the LIF neuron model. The flow control at the input ensures that the controller always has a bubble and is therefore deadlock-free. The signed 16-bit neuron state variable is read, modified by the signed synaptic or bias input, compared and written back. If the threshold condition is met, the {ncomp} is sent out to indicate that the corresponding neuron spiked. The above-threshold condition can trigger either a subtraction operation or a reset of the corresponding neuron state variable to a fixed value. The state variable cannot cross a configured lower bound and is clamped to that value if any operation brings the variable below it.
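The read-modify-compare-write cycle described in the caption can be sketched in software. This is a behavioural illustration under stated assumptions, not the hardware implementation: `NeuronConfig` and `update_neuron` are hypothetical names, the comparison is assumed to be "greater than or equal", the subtraction is assumed to remove the threshold value, the state is assumed to saturate at the signed 16-bit limits, and no leak term is modelled.

```python
from dataclasses import dataclass
from typing import Tuple

INT16_MIN, INT16_MAX = -32768, 32767


@dataclass
class NeuronConfig:
    threshold: int
    lower_bound: int
    reset_value: int = 0
    subtract_on_spike: bool = True  # False selects reset to reset_value


def update_neuron(state: int, synaptic_input: int,
                  cfg: NeuronConfig) -> Tuple[int, bool]:
    """One update of a neuron state variable; returns (new_state, spiked)."""
    # read and modify: add the signed synaptic or bias input
    state += synaptic_input
    # assumption: saturate to the signed 16-bit range instead of wrapping
    state = max(INT16_MIN, min(INT16_MAX, state))
    # compare: assumed spike condition is state >= threshold
    spiked = state >= cfg.threshold
    if spiked:
        if cfg.subtract_on_spike:
            state -= cfg.threshold  # assumed subtraction amount
        else:
            state = cfg.reset_value
    # clamp to the configured lower bound before writing back
    state = max(state, cfg.lower_bound)
    return state, spiked
```

With `NeuronConfig(threshold=100, lower_bound=-50)`, a state of 90 receiving an input of 20 spikes and is written back as 10, while a state of 0 receiving an input of -80 is clamped to -50 without spiking.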
The table shows the technical specifications of this and related ASICs with respect to sCNN networks. The first 4 columns belong to the group of medium and large systems, while the last 2 are small resource-constrained systems built for a limited set of toy applications. The SCAMP5 is an exception, as it is an analog cellular processor inside a visual sensor, and ISSCC21 Sony is a fully integrated standard high-resolution image sensor sandwiched together with a CNN tensor processor. Only Speck1 and Loihi2 are specifically built to run larger sCNN models. This can be seen in the resulting synaptic counts, which are a direct translation of how many different kernels can be applied and how often. Compared to the general-purpose Loihi2, our pure sCNN processor achieves similar or higher synaptic numbers while having significantly fewer memory resources. As Speck1 is specifically built to consume real-time data, its latency is significantly lower than that of Loihi1/2 and TrueNorth, which architecturally introduce a latency of one timestep ∆t on event generation in each layer. The small systems have an inherent advantage in latency and area, as they are limited to running only compact toy networks.
Notes: (F) Fully connected layer.
(1) External sensor interface available.
(2) Sensor can be connected with additional hardware.
(3) Readout interpretation via x86 CPU on SoC.
(4) Estimated, information not publicly available.
(5) Bit packing of 8bit synapse words assumed as described in [26]. Formula used: min(#neuron per core • #core • floor(#inaxon / #synapses in kernel) • #synapses in kernel, synmem per core • #core / 8bit).
(6) When connected to a sensor with the simulation time synchronised to real time, the latency per layer is at minimum the simulation time resolution ∆t (1ms for TrueNorth), as event generation takes one ∆t.
(7) Full frame processed.
(8) Measured from input request to output request edge, when using a 3x3 kernel in every layer with stride 1 and padding 1; pooling has no effect on the latency.
(9) For the sCNN on-the-fly mapping, no restrictions on kernel sweep in x, y and f coordinates are assumed. Formula used: #neuron per core • #core • floor(#inaxon / #synapses in kernel) • #synapses in kernel. #inaxon is not mentioned as improved [52,53] and is assumed to be 4096 as for Loihi1.
(10) If #synapses in kernel • 8 ≤ 256 then #neuron per core • #cores per chip • #synapses in kernel, else not implementable.
(11) Formula used: min(lower(√(#kernel memory words) / √(#synapses in kernel))^2, #max kernel) • #synapses in kernel • #neurons
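The synapse-count formulae in notes (5), (9) and (10) can be written out as plain functions. This is a sketch for checking the arithmetic only: the function and parameter names are hypothetical, `synmem_bits_per_core` is assumed to be given in bits, and the numbers in the usage note below are toy values, not the specifications of any chip in the table.

```python
from typing import Optional


def synapses_note_9(neurons_per_core: int, cores: int,
                    in_axons: int, syn_per_kernel: int) -> int:
    """Note (9): count limited only by input axons per core."""
    kernels_per_axon_budget = in_axons // syn_per_kernel  # floor division
    return neurons_per_core * cores * kernels_per_axon_budget * syn_per_kernel


def synapses_note_5(neurons_per_core: int, cores: int, in_axons: int,
                    syn_per_kernel: int, synmem_bits_per_core: int) -> int:
    """Note (5): as note (9), additionally capped by synaptic memory
    when every synapse word occupies 8 bits."""
    memory_limit = synmem_bits_per_core * cores // 8
    axon_limit = synapses_note_9(neurons_per_core, cores,
                                 in_axons, syn_per_kernel)
    return min(axon_limit, memory_limit)


def synapses_note_10(neurons_per_core: int, cores: int,
                     syn_per_kernel: int) -> Optional[int]:
    """Note (10): returns None when the kernel is not implementable."""
    if syn_per_kernel * 8 <= 256:
        return neurons_per_core * cores * syn_per_kernel
    return None
```

With toy values of 10 neurons per core, 2 cores, 100 input axons and a 3x3 kernel (9 synapses), note (9) yields 10 • 2 • floor(100 / 9) • 9 = 1980; adding a synaptic memory of 4000 bits per core caps note (5) at 1000. Under note (10), a 7x7 kernel (49 synapses) exceeds the 256-bit limit and is reported as not implementable.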