Introduction

The human eye can quickly and efficiently perceive visual information in complex environments, including target feature extraction and classification1,2,3,4,5. In human visual sensory transduction, sensory neurons encode visual stimulus information into event-driven neural spikes and transmit this information to the brain for perception. These neural spikes carry multiplexed spatial-temporal information, representing individual visual stimulus variables with rich features and high efficacy6,7,8,9. For example, a stronger light stimulus can lead to both a higher firing frequency (rate coding) and a shorter latency of the first spike after the stimulus (time-to-first-spike, TTFS)10,11. Rate coding is a basic neural coding mechanism in which retinal stimuli are encoded by the number of spikes that occur during a certain encoding window; however, rate coding cannot provide precise temporal information or sufficient features to fully represent the stimulus12. On the other hand, TTFS coding (latency coding) is a fast temporal encoding method that is robust to noise and maximally efficient in terms of spike counts: the stimulus onset is precisely ‘locked’ to the first spike time4. TTFS coding is therefore superior to rate coding and more reliable in urgent situations, such as obstacle avoidance and threat or ally recognition13. With complementary rate and TTFS coding, the natural visual system can efficiently process complex visual information within 150 ms14.

In comparison, complementary metal‒oxide‒semiconductor (CMOS) image sensors operate in a frame-driven manner with high energy budgets15. The large mismatch between natural and machine visual efficacies has inspired the development of artificial visual neurons, which can encode visual information into binary spike trains and be implemented in spiking neural networks (SNN) to process visual data with high efficiency and high biological fidelity16,17,18,19. Rate coding is commonly used in SNN to represent the intensity of external stimuli; however, this approach cannot provide sufficient temporal information, and the processing speed is limited by the average firing rate20,21,22,23. In contrast, TTFS coding, which resembles natural vision, can provide important spatiotemporal information for SNN to process dynamic visual data with high sparsity and high energy efficiency. Hardware-based SNN with TTFS coding are more efficient than SNN with rate coding, demonstrating faster operation and reduced energy consumption24,25,26,27. Although precise temporal encoding has been achieved in artificial visual neuron systems, fusing rate and TTFS coding in a single spike train has not yet been realised in SNN hardware, compromising the capacity of such networks to rapidly and accurately process visual data in complex visual environments28,29,30,31,32,33,34,35.

Here, we report a biomimetic artificial spiking visual neuron that can encode analogue visual stimuli into relevant spike trains with both rate and TTFS coding. The artificial neuron integrates an In2O3 synaptic phototransistor and an NbOx Mott memristor, which resemble biological photoreceptors and retinal ganglion neurons, respectively. With the integration of rate and TTFS coding, the biologically plausible artificial neuron can emulate natural vision. Stronger light intensity leads to incremental firing rates (from 0.35 to 1.85 MHz) and shortened first-spike arrival times (or spike latency, from 13.00 to 1.04 μs), outperforming biological counterparts in terms of spiking frequencies (0–100 Hz) and first spike latencies (≥10 ms). High-frequency event spikes can convey more information in a shorter time, allowing the system to quickly make decisions and execute responses. Moreover, with energy-efficient TTFS coding, the artificial visual neuron can rapidly and precisely detect and encode temporal changes in external stimuli, which is useful in applications requiring high temporal resolution. Furthermore, the artificial neuron is small (the Mott memristor has an active area of 7 × 7 μm²), durable (10¹⁰ operating cycles), and consumes less than 1.06 nJ per spike to accomplish complementary rate and TTFS coding without the need for auxiliary reset circuits. Finally, we implement complementary rate and TTFS coding in a trained SNN, which provides more channels for information transfer and further improves SNN efficiency. The SNN can predict the speed and steering angle of autonomous vehicles in complex environments with a low loss of <0.5. Our results prove that an SNN with the proposed rate-temporal fusion (RTF) encoding scheme can enhance the efficacy of artificial vision in a biologically plausible manner.

Results

Biomimetic signal encoding with artificial spiking visual neurons

An overview of the artificial spiking visual neuron capable of both rate and TTFS coding is shown in Fig. 1. Figure 1a shows a schematic of a retina composed of photoreceptors, synapses, and neurons. In natural visual transduction, photoreceptors detect external optical signals and transform them into graded potentials. These potentials influence synaptic plasticity and thus play critical roles in learning and memory. Subsequently, retinal ganglion cells, acting as neurons, encode the processed graded potentials from synapses as electrical spikes, which are action potentials that convey information to the brain for further processing. As shown in Fig. 1b, in rate coding, stronger stimuli lead to higher firing frequencies in an encoding window and vice versa. Moreover, stronger light stimuli lead to shorter latencies for the first neuron spikes and vice versa (TTFS coding). Furthermore, increasing the input synaptic weights can increase the membrane potential, allowing neurons to rapidly reach or exceed the threshold for spike firing. Thus, neurons fire at higher frequencies and emit their first spikes more quickly.

Fig. 1: Biomimetic signal encoding with an artificial spiking visual neuron.
figure 1

a Schematic illustration of the retina, which is composed of photoreceptors, synapses, and neurons. Photoreceptors respond to external optical signals and convert them into graded potentials. In synapses, synaptic plasticity is responsible for learning and storing memories. Neurons (retinal ganglion cells) encode synapse-processed graded potentials as action potentials (electrical spikes) to be processed by the brain. b Encoding of different input stimuli and synaptic weights by the time-to-first-spike (TTFS) and rate coding schemes in a biological spiking visual neuron. The frequency (F) of rate coding depends on the number of spikes (Nspikes) within the time window (Ttotal), while TTFS coding depends on the first spike latency (T). Low stimulus input (orange) and high stimulus input (blue), along with synaptic weights w1 (black) and w2 (purple), correspond to neural spiking responses. c Schematic and optical image of the artificial neuron device, composed of an integrated In2O3 optoelectronic synaptic transistor and an NbOx Mott neuron (1T1R). d The optoelectronic retina enables synaptic plasticity and rate-temporal fusion coding. Spiking sensory neurons are activated when EPSPs reach a certain threshold. The rate-temporal fusion encoding of spiking neurons represents the characteristics of the stimulus in real time through changes in the first spike latency and the spike frequency.

To emulate the visual phototransduction process, our artificial spiking visual neuron device is composed of an In2O3 optoelectronic synaptic transistor and an NbOx Mott spiking neuron (Fig. 1c). As shown in the schematic (Fig. 1d), the In2O3 synaptic transistor enables optoelectronic synaptic plasticity, and the NbOx Mott neuron encodes the received optoelectronic signals as stimuli-related electrical spikes via multiplexed RTF coding. The spiking visual neuron is activated when the excitatory postsynaptic potential reaches a threshold. With the RTF encoding scheme, the artificial neuron can rapidly and efficiently represent rich stimulus characteristics.

Electrical characteristics of artificial neurons and optoelectronic synapses

Threshold switching (TS) NbOx Mott memristors based on insulator-to-metal transitions (IMTs) can emulate the high-order neural dynamics of biological neurons36,37,38,39. The fabricated NbOx memristor has a crossbar structure with an active area of 7 × 7 μm² (see “Methods” for details of the fabrication processes). A cross-sectional transmission electron microscopy (TEM) image of the device is presented in Fig. 2a, confirming the Pt/Ti/NbOx/Pt/Ti layer stack. Notably, a nanoscale crystalline region corresponding to the tetragonal NbO2 structure is observed in the high-resolution TEM (HRTEM) image after the initial forming process (Supplementary Fig. 1). The NbOx layer in the pristine film is amorphous, and its stoichiometry was determined by X-ray photoelectron spectroscopy (XPS) (Supplementary Fig. 2). The formed NbOx memristors exhibit volatile TS characteristics (Supplementary Fig. 3).

Fig. 2: Device characteristics of the NbOx neurons and In2O3 optoelectronic synaptic transistors.
figure 2

a Cross-sectional TEM image of a NbOx Mott neuron. b Current-voltage characteristics of a NbOx Mott neuron. c Endurance characteristics of the NbOx Mott neuron, which can fire stably for more than 10¹⁰ cycles. Transient electrical measurements show that TS is triggered by a voltage pulse with a width of 1 μs and an amplitude of 1.60 V. d Cross-sectional TEM image of the In2O3 optoelectronic synaptic transistors. e Transfer curves of the In2O3 synapse as a function of light power density (from 1.57 to 3.72 mW/cm², λ = 365 nm). IDS versus VG measured at a drain bias of VD = 3 V. f EPSCs (red line) of the In2O3 synaptic TFT triggered by optical pulses (purple) (λ = 365 nm, 5 ms, 5 Hz).

Figure 2b shows typical current-voltage (I-V) curves of the NbOx memristor with TS characteristics. The device transitions from a high resistance state (HRS) to a low resistance state (LRS) when the applied voltage surpasses a threshold voltage (Vth) of ~1.37 V. Conversely, the device returns to its HRS when the applied voltage falls below the holding voltage (Vhold) of ~1.17 V. RLRS is 293.3 Ω, and RHRS is 5349.5 Ω. A current compliance (ICC) of 4 mA was applied to prevent permanent breakdown. The volatile resistive switching of NbO2 arises from the Mott IMT. Correspondingly, an “S”-shaped negative differential resistance (NDR) behaviour is observed when the device is sourced with a current sweep (Supplementary Fig. 4), which occurs due to thermally induced changes in conductivity36. The Mott device exhibits the best endurance performance among various volatile TS nanodevices, operating for more than 10¹⁰ cycles when driven by consecutive electrical pulses (Fig. 2c). Pulse operation with the endurance characteristics of the NbOx Mott memristors is further illustrated in Supplementary Fig. 5. In addition, the Mott transition enables a fast switching speed: the device needs < 40 ns to switch from the off state to the on state and < 50 ns to return to the off state (Supplementary Fig. 6). The extracted coefficient of variation (Cv) values of different parameters (Vth, Vhold, and Vforming) of 100 NbOx Mott memristors are 0.0349, 0.0303, and 0.0233, respectively, demonstrating low device-to-device variability (Supplementary Fig. 7).
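This volatile TS behaviour can be captured by a minimal two-state behavioural model. The sketch below uses the measured Vth, Vhold, RHRS, RLRS, and ICC quoted above; the abrupt-switching abstraction and the sweep itself are illustrative assumptions rather than the authors' physical (thermal) model:

```python
import numpy as np

# Measured parameters quoted in the text (Fig. 2b).
V_TH, V_HOLD = 1.37, 1.17        # threshold and holding voltages (V)
R_HRS, R_LRS = 5349.5, 293.3     # insulating and metallic resistances (ohm)
I_CC = 4e-3                      # current compliance (A)

def iv_sweep(voltages):
    """Trace the hysteretic I-V curve of a volatile threshold switch."""
    lrs = False                  # device starts in the high-resistance state
    currents = []
    for v in voltages:
        if not lrs and v >= V_TH:
            lrs = True           # insulator-to-metal (Mott) transition
        elif lrs and v <= V_HOLD:
            lrs = False          # volatile relaxation back to the HRS
        r = R_LRS if lrs else R_HRS
        currents.append(min(v / r, I_CC))   # compliance-limited current
    return np.array(currents)

# A triangular 0 -> 1.6 V -> 0 sweep reproduces the hysteresis window.
v = np.concatenate([np.linspace(0, 1.6, 200), np.linspace(1.6, 0, 200)])
i = iv_sweep(v)
```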

The optoelectronic synaptic transistors have a bottom-gate, top-contact (BGTC) configuration with solution-processed In2O3 as the active channel40. A cross-sectional HRTEM image reveals an ~5-nm-thick amorphous In2O3 channel layer (Fig. 2d). The device-to-device variations in mobility (μsat), threshold voltage (Vth), subthreshold swing (SS), and current on/off ratio (Ion/off) of 100 In2O3 phototransistors are 0.19, 0.36, 0.21, and 0.52, respectively, demonstrating low variability (Supplementary Fig. 7). The In2O3 film exhibits strong ultraviolet light absorption (< 400 nm), as shown in the ultraviolet‒visible (UV‒vis) absorbance spectrum (Supplementary Fig. 8). Figure 2e depicts the transfer curves of the In2O3 phototransistor under 365-nm UV radiation at power densities ranging from 1.57 to 3.72 mW/cm², illustrating significantly increased channel conductance due to the distinct UV photoresponse of the oxide semiconductor. With these intrinsic optoelectronic properties, the In2O3 transistor can readily serve as an optoelectronic synapse. As shown in Fig. 2f, the device was illuminated with UV light (365 nm, 1.71 mW/cm², 5 ms, 50 cycles), and a voltage bias (VDS = 3 V) was applied to read the change in the drain current, which corresponds to the excitatory postsynaptic current (EPSC). The EPSC decayed slowly after the UV input ceased due to the persistent photoconductivity (PPC) phenomenon41. With an increasing number of input UV pulses, the synaptic phototransistor showed typical long-term plasticity (LTP) behaviour. In addition, paired-pulse facilitation (PPF), a behaviour related to short-term plasticity (STP), was characterised by applying a pair of optical pulses (Supplementary Fig. 9). These synaptic behaviours demonstrate the applicability of the In2O3 transistors as optoelectronic synapses in artificial visual neural networks.
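The EPSC build-up under pulsed illumination and its slow PPC relaxation can likewise be caricatured with a single accumulate-and-decay state variable; the time constant and current increments below are illustrative placeholders, not values fitted to the device:

```python
import numpy as np

DT = 1e-3          # simulation time step (s)
TAU_PPC = 5.0      # assumed persistent-photoconductivity decay time (s)
DELTA_I = 2e-6     # assumed photocurrent added per optical pulse (A)
I_DARK = 1e-7      # assumed dark-current baseline (A)

def epsc_trace(pulse_times, t_end):
    """EPSC that accumulates with each UV pulse and relaxes slowly (PPC)."""
    t = np.arange(0, t_end, DT)
    i = np.full_like(t, I_DARK)
    pulse_idx = set(np.round(np.asarray(pulse_times) / DT).astype(int))
    extra = 0.0
    for n in range(len(t)):
        extra *= np.exp(-DT / TAU_PPC)    # slow relaxation in the dark
        if n in pulse_idx:
            extra += DELTA_I              # facilitation by one 5-ms pulse
        i[n] += extra
    return t, i

# 50 pulses at 5 Hz, as in Fig. 2f, followed by 20 s in the dark:
t, i = epsc_trace(pulse_times=np.arange(50) * 0.2, t_end=30.0)
```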

Demonstration of rate and TTFS coding

In the natural visual neural system, neurons appear to represent recognised features when they emit spikes6. Spike trains (action potentials) carry information in both the average firing rate and the spike timing10. Figure 3a illustrates the major differences among rate coding, TTFS coding, and RTF coding. In rate coding, the intensity of an external stimulus is represented by the average spiking rate within a sampling window. However, rate coding has a limited range of stimulus changes, a long processing period, and slow information transmission. In TTFS coding, a neuron encodes the real-valued intensity of a stimulus as the latency of its first spike in response to that stimulus. This single-spike coding scheme enables fast and sparse information processing and enhances sensitivity to minor variations in the input. Multiplexed neural coding schemes operating on different timescales can therefore encode complementary stimulus features, meeting the physiological constraints and reaction times observed in humans and animals and thereby enhancing the coding capacity.
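To make the distinction concrete, the three code variables can be written as a single per-stimulus mapping. The sketch below assumes the linear rate law and exponential latency law used throughout this work, with placeholder constants chosen to match the quoted operating ranges:

```python
import numpy as np

def encode(intensity, window=20e-6, f_max=1.85e6, t0=13e-6, tau=0.5):
    """Map a normalised stimulus intensity in [0, 1] to the three codes.

    Rate code:  spike count in the window, linear in intensity.
    TTFS code:  first-spike latency, exponentially shorter for stronger input.
    RTF code:   the (rate, latency) pair multiplexed in one spike train.
    """
    rate = f_max * intensity                  # assumed linear rate law
    n_spikes = int(rate * window)             # rate code: count per window
    latency = t0 * np.exp(-intensity / tau)   # assumed exponential latency law
    return n_spikes, latency, (rate, latency)

print(encode(0.2))   # weak stimulus: few spikes, long latency
print(encode(0.9))   # strong stimulus: many spikes, short latency
```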

Fig. 3: Demonstration of rate-temporal fusion photoencoding.
figure 3

a Schematic representations of biological neural coding models, including rate coding (blue line), TTFS coding (purple line), and RTF coding (red line). b Schematic circuit of the 1T1R synapse and neuron. The optical pulse and the current waveform were regarded as the input and output signals, respectively. An electrical pulse VDD (3 V, 20 μs) was applied to read the spiking behaviour with a VG bias of 5 V. c, d Spiking behaviours of the 1T1R device triggered by various (c) UV illumination intensities and (d) optical pulse numbers. e, f The effect of the light intensity on the (e) spike frequency and (f) first spike latency. g, h The effect of the optical pulse number on the (g) spike frequency and (h) first spike latency. i Inhibitory voltage input can modulate the firing rate and the first-spike times.

To emulate the multiplexed coding scheme observed in the natural visual system, we implement RTF coding in our artificial visual neural system by utilising the spike rate and TTFS to encode visual information. This fast and precise coding scheme could enable biologically plausible neuromorphic hardware with accurate and fast SNN emulation. As described earlier, the NbOx memristor and In2O3 phototransistor have time-dependent neuronal and synaptic functionalities, respectively. Integrating the In2O3 phototransistor in series with the two-terminal NbOx memristor yields a 1-transistor-1-memristor (1T1R) configuration that can fully represent the functionality of the visual neural pathway in the retina for data encoding and processing.

The circuit schematic of the 1T1R artificial spiking visual neuron is shown in Fig. 3b. An optical laser was used to provide light stimuli (365 nm) for the In2O3 phototransistor. After the light stimulus ceased, a single electrical pulse (VDD = 3 V, 20 μs width) was applied to the drain of the phototransistor under a gate bias voltage (VG = 5 V) to measure the encoded current pulses. The optical pulses and the drain current (ID) were regarded as the light stimulus input and artificial neural signal output, respectively. Due to the TS characteristics of NbOx, self-sustained spiking can be obtained in the range RLRS ≪ Rch ≪ RHRS, where Rch is the channel resistance of the In2O3 transistor and RHRS and RLRS are the insulating and metallic resistances of the NbOx memristor, respectively41. When VDD and VG are fixed, Rch is determined by the parameters of the optical pulses (Supplementary Fig. 10). The spiking behaviour of the artificial neuron can be altered by adjusting Rch based on a leaky integrate-and-fire (LIF) model, leading to different integration times τintegration (~RchC) (Supplementary Fig. 11). With this approach, we connect the neural coding behaviour of the artificial visual neuron to the Rch value of the series transistor, which is configured by visual stimuli42.
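A minimal simulation of this LIF picture treats the 1T1R cell as a relaxation oscillator: a node capacitance charges through Rch until the memristor reaches Vth, fires, and relaxes below Vhold. The switching voltages and resistances are the measured values quoted above; the capacitance and the plain two-resistance abstraction are assumptions for illustration:

```python
import numpy as np

V_TH, V_HOLD = 1.37, 1.17        # measured switching voltages (V)
R_HRS, R_LRS = 5349.5, 293.3     # measured memristor resistances (ohm)
C = 1e-9                         # assumed node capacitance (F)
VDD, DT = 3.0, 1e-9              # drive amplitude (V) and 1-ns time step

def fire(r_ch, t_end=20e-6):
    """Return (first-spike latency, firing rate) for a channel resistance."""
    v, lrs, spikes = 0.0, False, []
    for n in range(int(t_end / DT)):
        r_m = R_LRS if lrs else R_HRS
        # Node across the memristor, charged through the In2O3 channel:
        v += DT / C * ((VDD - v) / r_ch - v / r_m)
        if not lrs and v >= V_TH:
            lrs = True
            spikes.append(n * DT)    # Mott transition -> output spike
        elif lrs and v <= V_HOLD:
            lrs = False              # volatile relaxation re-arms the neuron
    return (spikes[0] if spikes else None), len(spikes) / t_end

# Light lowers Rch, so stronger stimuli yield shorter latency and higher rate.
for r_ch in (3e3, 2e3, 1e3):     # illustrative values, R_LRS << R_ch << R_HRS
    print(r_ch, fire(r_ch))
```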

Then, we measured the spike patterns generated by the spiking neuron as a function of light intensity and pulse number. Before pulsed illumination, an initial light exposure (1.57–3.72 mW/cm², 10 s) was applied to adjust the baseline current to ~405 μA and set Rch to a reference value matching the TS characteristics of the NbOx memristor. After each test, an electrical pulse (VG: 10 V, 20 μs) was applied to the gate electrode of the phototransistor to initialise its state. This process restores the ionised oxygen vacancies, stabilising the channel current at its initial value. Figure 3c shows the spiking behaviour of the artificial visual neuron in response to different light intensities at a fixed pulse number of 100. The spike frequency increased monotonically from 0.75 to 1.6 MHz as the light intensity increased from 1.57 to 3.72 mW/cm² (Fig. 3e). This behaviour can be attributed to the light-generated photocurrent, which decreases Rch. Importantly, as the light intensity increased, the first spike latency decreased from 7.29 to 1.2 μs (Fig. 3f). In addition, spiking behaviours related to the pulse number were demonstrated. Figure 3d shows the spike patterns obtained with various optical pulse numbers at a fixed light intensity of 1.71 mW/cm². Similarly, increasing the optical pulse number increases the number of photogenerated carriers due to the optoelectronic synaptic characteristics of the In2O3 phototransistor, resulting in a lower Rch. As a result, the artificial neuron exhibits an increased spiking frequency (0.35–1.85 MHz) and a reduced first spike latency (13–1.04 μs) with an increasing light pulse number (50–500), as shown in Fig. 3g, h, respectively. Thus, the visual stimulus information was successfully encoded into fast and precise electrical spikes by the artificial spiking visual neuron. This resembles the behaviour of biological visual neurons and demonstrates the potential of SNN with integrated rate and TTFS coding. Meanwhile, a NbOx Mott memristor with a smaller footprint (1 μm × 1 μm) and the In2O3 phototransistor could be monolithically integrated on the same chip with a conventional microfabrication process (Supplementary Fig. 12). Scaling down the NbOx Mott memristor could further reduce both the light-induced firing threshold and the power consumption. As a result, RTF photoencoding can be achieved under white light illumination without any initial light exposure (Supplementary Fig. 13).

In addition, the gate-controlled electrical properties of the synaptic phototransistor enable the representation of input stimuli with electrically configurable synaptic potentiation and depression behaviours; thus, the proposed device can mimic the process by which biological neurons recognise excitatory and inhibitory stimuli43. As shown in Fig. 3i, the spiking rate and latency can both be adjusted based on VG. The spiking rate increased from 0.85 to 1.8 MHz, and the latency decreased from 5.55 to 1.16 μs (Supplementary Fig. 14) as VG increased (from 3.5 to 6.0 V) because of the increase in the channel conductance of the phototransistor (Supplementary Fig. 15). Thus, with its multiterminal configurability, the proposed artificial visual neuron could emulate the heterosynaptic plasticity of biological sensory neurons42.

Correlated synaptic plasticity and neural spiking dynamics

In biological sensory neural systems, stimulus information is represented via distributed neurons and synapse networks43. Synaptic plasticity describes the strength of communication between pre- and postsynaptic neurons in response to action potentials and is important in learning, memory, and forgetting44. In the human visual system (Fig. 4a), a visual neuron does not fire spikes until sufficient optical stimuli are received because the accumulated charge leaks away. When the accumulated charges (synaptic weights) contributed by the synapse surpass the threshold of the neuron, the neuron fires stimuli-relevant spike trains, enabling complex sensory and cognitive functions. Biological visual sensitisation occurs when repetitive series of brief stimuli are delivered at constant intensity, which increases perception sensitivity, facilitating precise sensory encoding and high responsiveness during dynamic visual processing45,46,47.

Fig. 4: Rate-temporal fusion scheme encodes synaptic plasticity in real-time for sensing and memory.
figure 4

a Simplified schematic of sensitisation in the retina. b Operation scheme of the optical (red line) and electrical (blue line) inputs. UV illumination (pulse width: 5 ms, frequency: 5 Hz, intensity: 1.71 mW/cm² at 365 nm). Electrical pulses VDD (3 V, 20 μs) are applied to evaluate the spiking behaviour at t1, t2, t3, and t4, corresponding to 0, 10, 30, and 60 s, and at t5, t6, t7, and t8, corresponding to 5, 10, 15, and 25 min. The gate voltage bias VG is 5 V. c Measured output current waveforms of sensitisation (t1–t4) with increasing synaptic weights under UV light illumination. d Measured output current waveforms of memory (t5–t8) with decreasing synaptic weights after the light pulse. e, f The sensing and memory processes have a linear relationship with the (e) spike frequency and an exponential relationship with the (f) first spike latency. g, h Changes in (g) spike frequency and (h) first spike latency under the RTF coding scheme during the memory window after different light intensities. i Image memory with rate coding and TTFS coding for a mushroom pattern at 0, 2, 4, and 6 min after the light stimuli ceased.

To emulate biological synaptic plasticity, we utilise the artificial spiking visual neuron to achieve highly correlated synaptic plasticity and implement spiking neural dynamics at the device level. Based on optoelectronic synaptic plasticity (Supplementary Fig. 16), the In2O3 phototransistor can feasibly emulate the synaptic behaviours related to sensing, memory, and forgetting. Figure 4b illustrates the optical and electrical input schemes used to reveal the correlations between optoelectronic synaptic plasticity and the relevant neural spiking patterns. The optical input pulses were applied for a series of time windows (t0–t4, learning) and then ceased, and the device was kept in the dark during the remaining time windows (t4–t7, forgetting). Moreover, electrical input pulses were applied throughout the process to record the output spike patterns.

The measured results are shown in Fig. 4c. Initially, the device is at rest and does not fire spikes without optical input (t0 = 0 s). When a series of optical pulses is applied (t1 = 10 s), obvious spiking patterns are observed. The LIF neuron fires spikes when the voltage applied to the memristor reaches a threshold, showing activity-dependent RTF coding. As the duration of the input optical pulses increased from 10 to 60 s (t1–t3), the spiking frequency increased from 1 to 1.55 MHz (Fig. 4e), and the first spike latency decayed exponentially from 3.51 to 1.19 μs (Fig. 4f). After the light pulses stopped (t3 = 60 s), the neuron continued spiking due to the LTP properties of the synaptic phototransistor, with a linearly decreasing spike frequency (1.55 to 0.80 MHz) and an exponentially increasing first spike latency (1.19–4.90 μs) over time (t3–t6, Fig. 4d). When the idle time is sufficiently long (t7 = 25 min), the artificial neuron returns to the resting state and stops spiking. The spiking frequency and latency values after light illumination are shown in Fig. 4e, f, respectively. In addition, the ‘forgetting’ behaviour can be adjusted via the input light intensity: during the idle time, a stronger input light intensity leads to neural spiking with both a longer period and a longer latency, as shown in Fig. 4g, h, respectively. Furthermore, the RTF coding properties are related to the number of input light pulses (Supplementary Fig. 17). These correlated synaptic and neural spiking behaviours resemble natural learning and forgetting behaviours in a biologically plausible way at the device level.

To visualise the correlations between synaptic plasticity and the coding scheme, we constructed a simulated mushroom image map with 50 × 50 pixels to demonstrate the rate and TTFS coding results. The simulated mushroom image shows patterns of light illumination with intensities ranging from 1.71 to 3.73 mW/cm² (Fig. 4i). The rate and TTFS encoding data from Fig. 4g, h were used to create the map; the light intensity was encoded as a spike frequency within the range 0–1.6 MHz and a first spike latency within the range 0–10 μs. By incorporating synaptic plasticity, artificial visual neurons can achieve dynamic encoding across a broad range of timescales. With the rate coding scheme, because the spiking frequency related to the opto-synaptic weights decayed linearly (Fig. 4g), the image intensity gradually weakened over time without an obvious contrast change (Fig. 4i, upper panel). In contrast, with the TTFS coding scheme, as the first spike latency increased exponentially, the image exhibited sharpened contrast as the intensity weakened over time (Fig. 4i, lower panel). Thus, by integrating both rate and TTFS coding, the artificial visual neuron can mimic human learning, memory, and forgetting behaviours, as sketched below.
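The construction of such maps can be sketched in a few lines, assuming a linear fade of the frequency code and an exponential growth of the latency code during forgetting (the forgetting constants k_f and tau_l are illustrative, not fitted to Fig. 4g, h):

```python
import numpy as np

F_MAX, LAT_MAX = 1.6e6, 10e-6    # encoding ranges stated in the text

def fading_maps(image, t_min, k_f=0.15, tau_l=3.0):
    """Rate and TTFS maps of a normalised image t_min minutes after stimulus.

    k_f (1/min) and tau_l (min) are illustrative forgetting constants.
    """
    # Rate map: frequency linear in intensity and fading linearly in time,
    # so brightness drops while contrast is preserved (Fig. 4i, upper panel).
    freq = F_MAX * image * max(0.0, 1.0 - k_f * t_min)
    # TTFS map: latency grows exponentially in time, so dim pixels saturate
    # first and the fading image sharpens (Fig. 4i, lower panel).
    latency = np.minimum(LAT_MAX,
                         LAT_MAX * (1.0 - image) * np.exp(t_min / tau_l))
    return freq, latency

image = np.random.rand(50, 50)   # stand-in for the 50 x 50 mushroom pattern
maps = [fading_maps(image, t) for t in (0, 2, 4, 6)]
```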

Implementation of rate and TTFS coding in SNN

We considered an autonomous driving task to demonstrate the advantages of our artificial visual neuron, including its sensory encoding and processing capabilities. Prediction tasks in complex traffic conditions, such as turning and overtaking at high speeds, require fast scene encoding and signal processing abilities2. An SNN with RTF coding and LIF neuron characteristics is proposed to satisfy these requirements. The RTF coding scheme has a high temporal resolution that is two orders of magnitude better than that of conventional image sensors. In addition, LIF neurons with temporal feature extraction abilities allow the SNN to process the input timing sequence signals better than other deep neural networks.

As shown in Fig. 5a, we utilised different coding schemes, namely, rate, TTFS, and RTF coding, to encode the external road conditions as spike trains for the SNN. In the rate coding scheme, the light intensity of each pixel in an image is encoded as a spike train whose spike number is linearly related to the light intensity (Supplementary Fig. 18). In the TTFS coding scheme, the input light intensity is converted into the first spike latency, which follows an exponential decay law (Supplementary Fig. 19): the higher the input intensity, the earlier the first spike fires. In the RTF coding scheme, pixel values are encoded as spike trains with linear frequency and nonlinear temporal information (Supplementary Fig. 20). These three coding schemes are detailed in Supplementary Note 13. The multiplexed coding method has superior spiking temporal resolution and combines the advantages of the rate and TTFS coding schemes, enabling rapid and precise perception of visual stimuli in real-world environments.

Fig. 5: Demonstration of the fused rate and TTFS coding scheme in a SNN to predict the speed and steering angle of a self-driving vehicle.
figure 5

a The intensity of the pixels is encoded in terms of the firing frequency and first spike latency of the artificial visual neuron. Greyscale images were encoded as several series of spike trains by the rate coding, TTFS coding, and RTF coding schemes, and fed into a trained SNN to predict the vehicle speed and steering angle. b Comparison of the steering angle prediction results by the rate coding, TTFS coding, and RTF coding schemes in a scenario with complex road corners. c Loss value per epoch during training of the SNN for steering angle prediction under different coding schemes. d Comparison of three coding schemes in terms of the spike firing rate and loss in predicting the steering angle. e Comparison of the speed prediction results of the rate coding, TTFS coding, and RTF coding schemes in a scenario with smooth road corners. f Loss value per epoch during training of the SNN for speed prediction under different coding schemes. g Comparison of the spike firing rate and loss in the speed prediction task.

As a demonstration, an SNN model based on ResNet-18 with LIF neurons is proposed to predict the steering angle and speed of an autonomous car27,48,49,50,51,52,53. The proposed SNN model is composed of 17 spike layers and one fully connected layer. Each spike layer employs convolution kernels to compute membrane potential updates for the LIF neurons; these updates are integrated into the neurons’ membrane potentials and compared to a threshold to determine spike firing (“Methods”). The external scene information is drawn from the public DAVIS Driving Dataset 2017, which comprises records of driving variables collected under different road conditions over 1000 km (Fig. 5a)54. The data are captured at a resolution of 100 × 140 pixels and encoded by our artificial visual neuron as 14 k spike trains, which serve as inputs to the SNN model. The final output layer contains a single neuron representing either the steering angle or the speed, depending on the prediction task.

Two different road conditions, a turning road and a high-speed driving road, are selected to demonstrate the performance of the hardware-based SNN. As shown in Fig. 5b, the steering angle predicted by the RTF coding scheme is approximately equal to the ground truth. The rate coding scheme predicts a value similar to the ground truth, while the TTFS coding method cannot fit the ground truth. The loss curves of different coding techniques for predicting the steering angle are depicted in Fig. 5c. The final loss of the rate coding scheme is lower than that of the TTFS coding scheme because the TTFS coding scheme emits only one spike per pixel and thus cannot provide sufficient information for the processing algorithm. In comparison, the RTF coding scheme with linear frequency and nonlinear temporal characteristics achieves the best performance, with a loss of 0.5 after 50 epochs.

As shown in Fig. 5d, TTFS coding with a lower spike firing rate leads to a higher mean square loss, which represents the poor fitting performance between the predicted results and the recorded values in the dataset. The rate coding scheme with a multispike encoding mechanism leads to a lower loss between the predicted values and ground truth. The RTF coding scheme with a spike firing rate similar to the rate coding scheme leads to a significantly decreased loss, achieving the best fitting performance for the recorded vehicle parameters of the driving car. In addition, as the change in the speed of the self-driving car is small at the curves in the road, the loss functions differ only slightly among the three encoding methods (Supplementary Fig. 21).

As shown in Fig. 5e, the comparison results indicate that the speed values predicted by the RTF coding scheme are similar to the ground truth when driving at high speeds. The RTF coding scheme has a lower loss than the TTFS and rate coding schemes, which proves the superiority of this fusion coding method with frequency and temporal characteristics (Fig. 5f, g). Similarly, high-speed driving roads have smaller corners, and the loss values of the steering angle predictions with the three coding schemes are approximately the same (Supplementary Fig. 22). Overall, based on the rapid and precise prediction ability of the RTF coding scheme and the temporal feature extraction ability of LIF neurons, our proposed SNN could accurately predict the steering angle and speed of autonomous vehicles in various tasks under different road conditions.

Discussion

In summary, an artificial visual spiking neuron composed of an In2O3 synaptic phototransistor and an NbOx memristor-based LIF neuron was experimentally demonstrated. The artificial neuron enables multiplexed rate and TTFS coding of external visual information. The proposed RTF encoding scheme achieves precise timing and rapidly and accurately encodes the original input information. The artificial neuron features fast spike latency coding (from 13.00 to 1.04 μs), tunable firing frequency coding (from 0.35 to 1.85 MHz), low energy consumption (1.06 nJ/spike), high endurance (> 10¹⁰ cycles), and multiplexed information encoding capabilities. The RTF encoding results are consistent with real-world ground truth data, and an SNN with the proposed RTF coding scheme achieves high accuracy in steering and speed prediction for self-driving vehicles. The complementary coding capability of the artificial neuron ensures rapid and precise perception in complex environments, demonstrating the high efficacy of SNN.

Methods

Preparation of NbOx memristors and In2O3 optoelectronic synaptic transistors

The NbOx memristors were fabricated as follows: After photolithography and lift-off processes, the bottom Ti (5 nm)/Pt (35 nm) electrodes were deposited on a Si/SiO2 substrate by e-beam evaporation. Then, a 35-nm NbOx active layer was deposited by magnetron sputtering at room temperature and patterned with photolithography and lift-off processes. Afterwards, the top Ti (5 nm)/Pt (35 nm) electrodes were grown by e-beam evaporation and patterned by photolithography and lift-off processes. The two-terminal metal-insulator-metal structure was integrated into the source of the transistor. The Mott memristors have a working area of 7 μm × 7 μm.

The In2O3 optoelectronic synaptic transistors were fabricated as follows: A precursor solution of In2O3 with a concentration of 0.1 M was prepared by dissolving indium nitrate in 2-methoxyethanol (2-ME). To enhance the exothermic combustion reaction, acetylacetone (AcAc) and ammonium hydroxide (NH3·H2O) were added to the solution in equimolar quantities with respect to indium nitrate. The solution was vigorously stirred overnight and filtered through a 0.2 μm syringe filter before use. A substrate consisting of a highly doped (p++) silicon wafer with a 100 nm thermally grown SiO2 layer was employed. Following a 10-min UV-ozone treatment, the precursor solution was spin-coated at 3000 rpm for 45 s, and the device was prebaked at 100 °C. The device was then baked at 200 °C for 5 min, and conventional photolithography was performed. The pattern was defined by etching in a mixed solution of diluted hydrochloric acid and deionised water (1:15, v/v) for 5 s. The In2O3 channels were then annealed in air at 300 °C for one hour. Finally, the Ni/Au (8 nm/50 nm) source/drain electrodes were thermally evaporated and patterned via the lift-off process to obtain a channel width/length (W/L) of 400 µm/10 µm.

Device characterisation

Cross-sectional TEM and HRTEM images were obtained with a transmission electron microscope (FEI Tecnai TF-20, UK) and analysed. Room-temperature electrical measurements were carried out using a Summit 1100B-M probe station. DC measurements were performed with an Agilent B1500 semiconductor parameter analyser, and a B1530A fast measurement unit module was used to simultaneously generate the voltage pulses and measure the response current.

SNN processing framework for the autonomous driving task

The processing framework consists of the neural coding techniques and the SNN model. Three coding techniques, including rate, TTFS, and RTF coding, were adopted in this task. The frame signal of the road conditions was obtained from a public dataset (DAVIS Driving Dataset 2017). The rate coding scheme converts each pixel value in the frame into a spike train with a firing rate proportional to the light intensity; the generated spikes follow a Poisson distribution. The TTFS coding scheme uses only one spike to encode the light intensity of each pixel, with a firing time that attenuates exponentially with increasing light intensity; higher light intensity thus leads to an earlier spike firing time. The RTF coding scheme combines the characteristics of the above two methods, and the input pixel grey values are encoded as pulse trains with a firing time (1.2–20.0 μs) and a firing rate (0–1.6 MHz). All three coding schemes are implemented over 15 time steps, as sketched below.
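A minimal sketch of the three encoders under these rules, with Poisson-style rate coding and an exponentially attenuating first-spike time on the 15-step grid (the latency constant tau is a placeholder consistent with the quoted ranges):

```python
import numpy as np

T_STEPS = 15                     # all three schemes use 15 time steps
RNG = np.random.default_rng(0)

def first_spike_step(pixel, tau=0.4):
    """Assumed exponential latency law mapped onto the 15-step grid."""
    return int((T_STEPS - 1) * np.exp(-pixel / tau))

def rate_code(pixel):
    """Poisson-like spike train (Bernoulli per step), rate ~ intensity."""
    return (RNG.random(T_STEPS) < pixel).astype(np.uint8)

def ttfs_code(pixel):
    """A single spike whose time attenuates exponentially with intensity."""
    train = np.zeros(T_STEPS, dtype=np.uint8)
    train[first_spike_step(pixel)] = 1
    return train

def rtf_code(pixel):
    """Fusion: rate-coded spikes gated to begin at the TTFS latency."""
    t0 = first_spike_step(pixel)
    train = rate_code(pixel)
    train[:t0] = 0               # silent before the first-spike time
    train[t0] = 1                # first spike locked to the latency code
    return train

pixel = 0.8                      # normalised grey value of one pixel
trains = [f(pixel) for f in (rate_code, ttfs_code, rtf_code)]
```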

The SNN model used in this task has a ResNet-18 architecture, and LIF neurons are used to process the input spikes. The numbers of LIF neurons in layers 1, 2–5, 6–9, 10–13, and 14–17 are 14 k, 896 k, 448 k, 160 k, and 120 k, respectively. LIF neurons have dynamic features and are thus powerful and energy-efficient in predicting the steering angle and speed. The membrane potential update is computed from the presynaptic spikes \({X}_{i}\) weighted by the synaptic weights \({W}_{i}\), and the membrane potential of the postsynaptic neuron in the next layer is calculated as follows:

$${V}_{{mem}}(t+1)=\beta {V}_{{mem}}(t)+{\sum }_{i=1}^{k}{W}_{i}^{T}{X}_{i}(t)-R\,\left[\beta {V}_{{mem}}(t)+{\sum }_{i=1}^{k}{W}_{i}^{T}{X}_{i}(t)\right]$$
(1)
$$R=\left\{\begin{array}{cc}1,& if\,{V}_{{mem}} > {V}_{th}\\ 0,& otherwise\end{array}\right.$$
(2)

where \(\beta\) and \(k\) are the membrane potential decay rate and the number of neurons in the layer, respectively, and \(T\) denotes the transposition operation. If the membrane potential of an LIF neuron exceeds the threshold, the neuron generates a spike, which is used as the input to the next layer, and the membrane potential is then reset to zero, as implemented in the sketch below.
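Equations (1) and (2) translate directly into a per-time-step update. The PyTorch sketch below uses placeholder layer sizes and decay rate β, and a dense weight matrix stands in for the convolution kernels of the actual spike layers:

```python
import torch

def lif_step(x_t, w, v_mem, beta=0.9, v_th=1.0):
    """One time step of Eqs. (1)-(2): leaky integration, firing, reset to zero.

    x_t:   presynaptic spikes at time t, shape (batch, n_in)
    w:     synaptic weights W, shape (n_in, n_out), applied as W^T X
    v_mem: membrane potentials carried over from time t, shape (batch, n_out)
    """
    v_new = beta * v_mem + x_t @ w       # beta * V_mem(t) + sum_i W_i^T X_i(t)
    spikes = (v_new > v_th).float()      # R = 1 where the threshold is crossed
    v_new = v_new * (1.0 - spikes)       # the R[...] term resets fired neurons
    return spikes, v_new

# Unrolled over the 15 encoding time steps with stand-in Poisson inputs:
batch, n_in, n_out, t_steps = 4, 128, 64, 15
w = torch.randn(n_in, n_out) * 0.1
v = torch.zeros(batch, n_out)
for t in range(t_steps):
    x_t = (torch.rand(batch, n_in) < 0.2).float()
    out, v = lif_step(x_t, w, v)
```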

In this experiment, the mean absolute error (MAE) loss function is adopted to evaluate the difference between the predicted value and the ground truth label in the dataset. The loss is computed as follows:

$$Loss(x,y)=\frac{1}{n}{\sum }_{i=1}^{n}\bigg|{y}_{i}-{x}_{i}\bigg|$$
(3)

Then, to train the proposed SNN, a surrogate gradient descent algorithm, backpropagation through time (BPTT), is used to update the synaptic weights. In the BPTT scheme, the network is unrolled in the time dimension to calculate the weight update value. The details of the weight update process are as follows:

$$\Delta {w}^{l}={\sum }_{t}\frac{\partial {L}_{total}}{\partial {o}_{t}^{l}}\frac{\partial {o}_{t}^{l}}{\partial {V}_{t}^{l}}\frac{\partial {V}_{t}^{l}}{\partial {w}^{l}}$$
(4)
$$\frac{\partial {o}_{t}^{l}}{\partial {V}_{t}^{l}}=\left\{\begin{array}{cc}{H}_{1}^{\prime}\left({V}_{t}^{l}-{V}_{th}\right),& if\,{o}_{t}^{l}={S}_{t}^{l}\\ 1,\hfill & if\,{o}_{t}^{l}={V}_{t}^{l}\end{array}\right.$$
(5)

where \({L}_{total}\) is the total loss between the prediction and the ground truth label in the dataset, and \({o}_{t}^{l}\), \({V}_{t}^{l}\), and \({w}^{l}\) are the output of the neuron, the membrane potential of the LIF neuron at time \(t\), and the weight of the spike layer, respectively. \({S}_{t}^{l}\) is the output spike of the LIF neuron at time \(t\). The shifted Arctan function \({H}_{1}(x)=\frac{1}{\pi }\arctan (\pi x)+\frac{1}{2}\) is used as the surrogate function to replace the Heaviside function of the LIF neuron during the backpropagation process, as sketched below. In addition, the dataset consists of 20 k scene frames, which are divided into training and testing sets: the first 80% of the data are used to train the model, and the remaining 20% are used to validate the model’s performance. Finally, we use dropout and regularisation to prevent overfitting and improve model performance.
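The surrogate spike can be realised as a custom autograd function; a minimal PyTorch sketch, where the backward pass uses the derivative H1′(x) = 1/(1 + (πx)²) of the shifted Arctan given above:

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass; shifted-Arctan gradient backward."""

    @staticmethod
    def forward(ctx, v_minus_th):
        ctx.save_for_backward(v_minus_th)
        return (v_minus_th > 0).float()       # hard threshold, as in Eq. (2)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # H1(x) = arctan(pi * x) / pi + 1/2  =>  H1'(x) = 1 / (1 + (pi * x)^2)
        return grad_output / (1.0 + (torch.pi * x) ** 2)

spike_fn = SurrogateSpike.apply

# Gradients now flow through the non-differentiable spike:
v = torch.randn(8, requires_grad=True)        # example membrane potentials
spike_fn(v - 1.0).sum().backward()            # v.grad holds H1'(v - v_th)
```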