Abstract
Efficient dynamic vision requires capturing instantaneous changes and temporal context, yet existing image and event sensors rely on power-hungry digital processing. Here, we introduce an in-sensor dual-response architecture that concurrently generates analog event spikes and persistent memory tails. A prototype sensor integrates phosphor pairs with silicon photodiodes and transimpedance amplifiers to achieve microsecond- and millisecond-scale dual kinetics. Measurements during light-emitting diode replay reconstruct event frames that match software frame differences, while the slow channel behaves as a linear reservoir of motion history. A single memory frame fed to a convolutional neural network enables accurate classification of human actions (93.1%) and vehicle trajectories (98.0%), as well as speed estimation with errors of 2.15 km/h. Integration with a compressive optical neural network front end mapping 4900 inputs to 16 per frame yields 93.3% action classification accuracy. By eliminating analog-to-digital conversion and digital accumulation, this approach enables ultralow-latency, ultralow-power neuromorphic vision.
Similar content being viewed by others
Introduction
Efficiently perceiving dynamic scenes in real time is critical for applications such as robotics, autonomous driving, and intelligent surveillance1,2,3,4,5,6. Traditional frame-based vision sensors capture full images at fixed intervals (Fig. 1a, top), providing rich spatial context but suffering from high data redundancy: static backgrounds are repeatedly processed, elevating bandwidth requirements, power consumption, and processing latency5,7,8,9.
a Raw, event, and memory frames. Raw frames (RFs; top) are extracted from the Weizmann Human Action dataset. Event frames (EFs; middle) are generated via pixel-wise differencing of successive RFs, while memory frames (MFs; bottom) are computed by accumulating EFs over time. b Conventional dynamic-vision signal processing. Event signals are obtained either by differencing RFs or directly via a DVS. To reconstruct motion persistence, the event signals undergo ADC and are digitally buffered or accumulated, adding latency, power consumption, and hardware complexity. c In-sensor dual-response signal-processing. Each pixel contains two parallel photosensor channels with mismatched fast and slow response dynamics. A real-time analog differential of their outputs (SA–SB) at the sensor plane directly yields instantaneous event spikes (from fast-kinetics mismatch) and slowly decaying memory tails (from persistence luminescence mismatch), eliminating the need for ADC or external buffering.
Event-based vision sensors—also known as neuromorphic or dynamic vision sensors (DVSs)—have emerged as compelling alternatives. Unlike conventional imagers, DVSs asynchronously output pixel-level brightness changes, creating sparse event streams characterized by microsecond latency, high temporal resolution, and high dynamic range (Fig. 1a, middle)10,11,12,13. By filtering out redundant static information, DVSs drastically reduce data throughput and energy consumption while minimizing motion blur during rapid movements12,14. These advantages have positioned event-based sensors as key enabling technologies for energy-constrained, real-time tasks such as robotic navigation and automotive safety.
However, DVS outputs are inherently stateless, limiting their ability to capture persistent context or motion trajectories15,16. In conventional pipelines (Fig. 1a, b), motion persistence is reconstructed by (i) computing pixel-wise differences between raw frames (RFs) to produce event frames (EFs)—or directly capturing EFs with a DVS— (ii) converting those event signals via analog-to-digital conversion (ADC), and (iii) digitally accumulating them into memory frames (MFs). Such multi-step digital processing introduces additional latency, power consumption, and hardware complexity12,15,17,18,19. Hybrid/event-intensity sensors such as Dynamic and Active-pixel Vision Sensor (DAVIS) and Asynchronous Time-based Image Sensor (ATIS) provide both asynchronous events and absolute intensity12,20,21,22. Recovering longer-term temporal context, however, typically requires off-sensor buffering or digital accumulation, which increases system-level latency and complexity.
Recent advances have shown that embedding short-term memory functionality directly within the sensor can greatly enhance dynamic perception9,10,23,24,25. Retinomorphic photomemristor arrays10 and 2-dimensional (2D) heterostructure-based sensors26,27,28 inherently encode temporal motion history, enabling “reservoir-in-sensor” processing that eliminates redundant data transfer and provides instantaneous access to accumulated motion context. Yet most existing implementations rely on luminescent or photoconductive materials (e.g., long-persistent phosphors or cumulative photoconductors) whose memory responses are primarily triggered by externally encoded event spike-like pulses, reflecting per-pixel intensity changes29,30,31,32,33,34. This limitation prevents individual pixels from capturing both instantaneous events and persistent motion context in a single stage. Extending in-sensor temporal processing beyond conventional DVS pipelines, a retinomorphic photodiode consolidates event sensing, band-pass dynamics, and light adaptation within a single diode and reports a dynamic range exceeding 200 dB35, demonstrating retina-like front-end temporal behavior under extreme lighting. Event-driven sensors with in-sensor computation generate programmable, amplitude-encoded spikes using dual-polarity photodiodes for in-sensor spiking-neural-network operations7. These advances validate the promise of in-sensor temporal processing, yet their external interfaces remain predominantly spike-centric, with longer-term temporal context typically reconstructed off-sensor via event accumulation or windowing.
Here, we introduce an in-sensor dual-response pixel architecture that executes instantaneous event detection and temporal integration directly in the analog domain at the pixel level (Fig. 1c). Each pixel comprises two parallel photosensors that differ in their fast-response kinetics and in the amplitude of their persistent luminescence tails (ms scale). A real-time analog differential measurement of their outputs yields (i) event spikes whenever illumination changes, arising from mismatches in fast response kinetics, and (ii) memory tails that persist after each spike, arising from differences in slow-rise and slow-decay dynamics. This approach dramatically reduces system latency, power consumption, and complexity.
We validate our architecture on the Weizmann Human Action dataset, achieving a structural similarity index (SSIM) of 0.94 for reconstructed EFs and good agreement with a linear reservoir model36,37 for memory signals. When fed into lightweight convolutional neural networks (CNNs), our MF input outperforms event- or frame-only baselines in action recognition. We further demonstrate that single MF inputs enable vehicle trajectory classification with 98% accuracy and speed estimation with a mean absolute error (MAE) of 2.15 km/h. Finally, an optical neural network (ONN)-based compressive encoder38,39,40,41,42 (4900 → 16) achieves 93.3 % accuracy, enabling efficient analog feature encoding. This architecture eliminates the need for ADC and digital accumulation, reduces end-to-end latency by more than an order of magnitude, and projects per-channel power below 1 mW—marking a significant step toward ultralow-latency, ultralow-power neuromorphic vision.
Results
Event and memory signal detection by response speed mismatch
Figure 2a illustrates the pixel-level sensor architecture, hereafter referred to as an analog event-memory sensor (AEMS). Sensors A and B incorporate silicate (Sr2SiO4:Eu2+) and garnet (Lu3Al5O12:Ce3+) phosphors, respectively, each coupled to a Si photodiode (PD) and transimpedance amplifier (TIA) (see Methods and Supplementary Note 1). Time-resolved photoluminescence (TRPL) measurements showed that the intrinsic photoluminescence lifetimes of the silicate and garnet are approximately 1.5 μs and 200 ns, respectively (Fig. 2b). In addition, the silicate exhibits a pronounced persistent luminescence (after-glow) tail, whereas garnet’s persistent luminescence is negligible.
a Schematic of a single AEMS pixel, comprising two parallel photosensor channels: Sensor A (silicate phosphor + Si PD + TIA) and Sensor B (garnet phosphor + Si PD + TIA). b TRPL of silicate and garnet phosphors, showing intrinsic lifetimes of approximately 1.5 µs and 0.2 µs, respectively. The silicate channel exhibits a more pronounced, slowly decaying persistent luminescence compared with the garnet channel. c Measured outputs under a 30 ms LED square wave input: VA (blue) and VB (orange) from the individual channels, and their differential signal Vout = VA−VB (red), which reveals both fast event spikes and slowly decaying memory tails. d Full-width at half-maximum (FWHM) of the event spikes with respect to the Rf in the paired TIAs, showing spike durations decreasing from ~100 µs at low-bandwidth Rf to ~2.4 µs at high-bandwidth Rf, with a lower bound set by the silicate phosphor decay. e AEMS response to stepped-intensity LED inputs, confirming that each brightness transition elicits repetitive spikes and cumulative–decaying tails. f, Memory-tail amplitude (inverted sign applied to match spike polarity) versus input transition frequency, demonstrating frequency-dependent fading; dashed lines are fits to the linear reservoir model xi(t + 1) = α·xi(t) + β·ui(t) with α = 0.85, 0.90 and 0.95.
In our standard configuration (Fig. 2c), Sensor A’s TIA (feedback resistance, Rf = 1.5 MΩ) exhibits a 3 dB bandwidth (f3dB) of 9 kHz (\({{{{\rm{\tau }}}}}_{{{{\rm{TIA}}}}}=1/2{{{\rm{\pi }}}}{{{\rm{f}}}}3{{{\rm{dB}}}}=\) ~18 μs), whereas Sensor B’s TIA (Rf = 4.75 MΩ) has a f3dB of 3 kHz (\({{{{\rm{\tau }}}}}_{{{{\rm{TIA}}}}}=\) ~53 μs). In this case, the effective electrical response is dominated by the TIAs’ low-pass characteristics, rather than the much shorter intrinsic PL lifetimes of the phosphors. Under a 100-ms LED on/off square-wave, Sensors A and B show 10–90% rise/fall times tr of approximately 50 μs and 120 μs, respectively, closely matching the estimate tr ≈ 0.35/f3dB (39 μs and 117 μs).
The differential output Vout(t) = VA(t) − VB(t) exhibits an immediate positive spike at LED turn-on owing to Sensor A’s faster kinetics and a negative spike at turn-off (Fig. 2c, red). In this configuration, spikes span approximately 100 μs, but adjusting the TIAs’ feedback resistances can reduce them to as short as 2.4 μs (Supplementary Figs. 3; 2d). Raising the cutoff frequencies above the phosphors’ decay rates could narrow spikes further, approaching the intrinsic PL timescales of the materials.
Following each spike, the silicate’s persistent luminescence, which is nearly absent (less than 5% amplitude) in the garnet channel, produces a slowly decaying tail in Vout (Fig. 2c, blue). Details of the long-tail characterization—combining TRPL for the early (µs) regime and broadband PD-TIA measurements for the long-time (ms) regime—are provided in Supplementary Note 1.3. The convolution of the two TIAs’ time constants with the silicate after-glow kinetics yields analog traces encoding both instantaneous events and temporal memory (dominant decay time τ ~ 45 ms). Stepped-intensity experiments confirm that each brightness transition elicits repetitive spikes, each of which causes an immediate increase or decrease in the tails that gradually decay over time (Fig. 2e).
Under our configuration, EF spikes and MF tails have opposite polarity, enabling real-time separation (see Supplementary Note 2 for the EF/MF recovery method). The opposite case‒when the silicate channel is read with a TIA whose bandwidth is lower than that of the garnet channel‒yields spikes and tails with the same polarity (Supplementary Fig. 5). We also observed memory-fading behavior, in which the degree of tail signal accumulation and decay varied systematically with the temporal profile of input brightness steps, closely matching residual-state dynamics in reservoir computing. Fitting these dynamics with a linear reservoir (leaky-integrator) model36,37
where \({T}_{i}[n]\) is the internal memory tail state of channel i at frame n, and \({S}_{i}[n]\) is the external event spike input. The per-frame retention (decay) factor α is estimated from the after-glow fits (α = ~0.895 for τ = ~45 ms with Δt = 5 ms). We set β = 1−α to enforce unity steady-state gain. Using α ∈ {0.85, 0.90, 0.95} in Fig. 2f brackets the fitted range and faithfully reproduces the observed analog-memory tails. The τ is tunable at the materials stage43 and, in principle, can be reconfigured in operando44, with details provided in Supplementary Note 1.5.
Real video tests with analog event–memory sensors
To validate our analog event + memory sensing approach, we replayed pixel-wise brightness time-traces from dynamic video clips onto AEMSs using controlled LED stimuli. We selected 93 clips from the Weizmann Human Action Dataset45 (each comprising 21 RFs at 70 × 70 pixels across ten action classes; representative frames are shown in Fig. 3a and Supplementary Fig. 11) and converted each pixel’s 8-bit intensity into a corresponding LED intensity sequence (see Supplementary Note 2).
a Representative RFs from the Weizmann Human Action dataset (21 frames per clip, 70 × 70 pixels). b Sample LED drive signals (blue) for selected pixels and corresponding AEMS voltage responses (red), showing rapid differential spikes at brightness transitions and slowly decaying tails. c Spatiotemporal heat map of Vout across column 43 (70 pixels) over 21 LED steps, illustrating per-pixel event and memory dynamics. d Top: EFs reconstructed by aggregating AEMS spikes at each time step. Bottom: MFs reconstructed from tail amplitudes (inverted sign applied). See Supplementary Fig. 13 for full RFs/EFs/MFs. e Extension to the UCSD Pedestrian dataset: RFs (left) and AEMS-reconstructed EFs (right, top) and MFs (right, bottom). See Supplementary Fig. 15 for complete sequences.
Figure 3b presents example LED intensity waveforms for selected pixels alongside the corresponding AEMS voltage responses. Each luminance transition produces a rapid differential spike followed by a slower decaying tail. Figure 3c shows a 2D spatiotemporal map of the AEMS voltage outputs over time for the 70 pixels in the 43rd column of each frame from the dataset in Fig. 3a. By aggregating spike occurrences across all pixels at each time step, we reconstructed EFs (Fig. 3d, top panels), which closely match digitally computed frame differences (see Supplementary Fig. 13 and Supplementary Video 1 for comparisons between AEMS-reconstructed and digitally computed EFs). This process was carried out across the entire dataset, and the system operated stably over the long acquisitions without observable drift in either spike or memory responses (see also Supplementary Fig. 9 for accelerated-cycling data). Consequently, the reconstructed EFs achieve an average SSIM ~ 0.94 and MAE ~ 0.02 (Supplementary Fig. 14).
In parallel, we integrated the tail amplitudes across pixels to form MF images (Fig. 3d, bottom), capturing motion persistence in a manner analogous to frame accumulation. These analog memory images align qualitatively with the outputs of a linear reservoir model applied to the EF inputs (Supplementary Fig. 13), supporting their functional fidelity. Because they reflect gradual decay after spike accumulation, the MFs effectively preserve motion traces over time. Notably, unlike the case where a linear reservoir model is applied directly to RF inputs34, our MFs are generated from EF-based dynamics and therefore preserve spike polarity (±) and suppress redundant background, yielding richer temporal history with reduced data redundancy (see Supplementary Note 3).
We further evaluated our method on the UCSD Pedestrian Dataset46, which features real-world urban scenes with complex visual clutter, including multiple overlapping pedestrians and moving vehicles (Fig. 3e; Supplementary Fig. 15). Despite this complexity, our analog framework enabled robust background suppression and clear delineation of motion trajectories. In the reconstructed EFs, static background elements such as trees and roads were suppressed, while dynamic objects exhibited sharp intensity transitions (SSIM of 0.95 when compared to digitally computed frame differences), confirming spatial fidelity. The corresponding MFs revealed smooth decaying trails consistent with the underlying motion paths.
Learning and evaluation for dynamic image classification
To quantify the benefit of each information channel, we trained lightweight CNNs on three synthetic modalities derived from the Weizmann Human Action Dataset (93 clips, 21 frames each; Fig. 4a). Each clip was augmented twentyfold, and only the last 10 frames—where sufficient temporal context exists—were used to generate RFs, EFs, and MFs. This yielded 18,600 samples per modality, split 80/20 for training and validation. Training used hardware-calibrated EF/MF datasets that embed measured AEMS characteristics—noise statistics, short-/long-term drift, nonlinear responses captured by look-up table (LUT)-based calibration curves, and inter-frame spatiotemporal correlations—and testing used independent AEMS-measured RFs and EFs/MFs (Supplementary Note 2.3).
a Architecture of the lightweight CNN used for all experiments: a single convolution–pooling block with 64 kernels of size 3 × 3, followed by two fully connected layers and a softmax output over ten classes. b Plots of training, validation, and test accuracies (left) and losses (right), for EF-only, MF, and EF + MF fusion (red) on the augmented Weizmann dataset. c Confusion matrix for the MF-based model on the AEMS test set.
In single-modality experiments, RF-based models exhibited the poorest performance, with both training and validation accuracies remaining below 45% after 20 epochs (Fig. 4b; Supplementary Fig. 16a). EF benefited modestly from calibration—training/validation behavior improved—but test accuracy remained 69% (Supplementary Fig. 16b), consistent with EF’s frame-difference-like nature and ambiguity in single-frame cues. MF showed the largest gain with calibration, improving from pre-calibration accuracies of about 91%, 85%, and 82% (training/validation/test) to 97%, 96%, and 93%, with stable convergence by epoch 17 (Supplementary Fig. 16c). This reflects the calibrated inclusion of accumulated noise, drift, LUT nonlinearity, and inter-frame correlation absent in purely digital simulations.
For a minimal two-channel fusion, EF plus MF reached accuracies of about 96%, 96%, and 92% (training/validation/test)—an improvement of about 3.3 percentage points over the pre-calibration fusion—yet slightly below MF alone (Fig. 4b; Supplementary Fig. 16d). Confusion matrices (Fig. 4c) show that action pairs ambiguous under EF are effectively separated when MF is used, underscoring MF’s strength in encoding accumulated, continuous spatiotemporal patterns. Overall, progressing from RF to EF to MF under hardware-calibrated training narrows the train/validation/test gap and improves generalization, with MF delivering the highest test accuracy on measured data.
Evaluation of vehicle dynamics in intersection environments
For an object moving at speed v, the continuous-space memory trail along its trajectory, A(ξ), follows an exponential profile A(ξ)∝exp(−ξ/ℓ), where ξ denotes the arc length along motion, and ℓ=vτ is the characteristic trail length. This property implies that the AEMS-based memory filter (MF) provides a natural cue for trajectory and speed estimation (see Supplementary Note 4 for the quantitative EF and MF framework). To evaluate our in-sensor event-plus-memory (EF + MF) pipeline under deployment-relevant intersection scenarios, we generated a synthetic dataset of 510 short video clips depicting vehicles traversing a four-way intersection (Fig. 5a). Vehicles entered from the top, left, or right roads and exited via straight, left-turn, or right-turn trajectories (Fig. 5b shows 50 representative paths), with speeds varying from 30 to 60 km/h by adjusting per-frame displacements. From each clip, we extracted a 36 × 36 pixels patch covering the exit region (red box) and sampled 21 frames at 10 ms intervals (see Supplementary Note 5.1 for details).
a Synthetic intersection layout (100 m × 100 m) with three entry roads (top, left, right) and a single exit road (bottom). b Overlay of 50 representative trajectories, where dots indicate vehicle centres sampled at 50 ms intervals, illustrating the three movement types and speed variations. c Reconstructed final-frame EFs (left) and MFs (right) for vehicles moving at 30, 45, and 60 km/h. EFs exhibit only subtle differences across speeds, whereas MFs clearly reveal motion trails encoding both trajectory and velocity. d Trajectory classification accuracy using only the final-frame AEMS-derived EFs and MFs on the train and test sets. e Speed estimation results comparing EF-based (left) and MF-based (right) regression models. The shaded regions represent ±1σ (dark band, ~68% confidence level) and ±2σ (light band, ~95% confidence level) around the predicted mean.
Per-pixel intensity time series from each patch were replayed through an AEMS, yielding analog responses from which we reconstructed EFs and MFs. For both classification and regression, we split these EF and MF sequences into 90% training and 10% unseen test sets. Figure 5c shows final-frame EFs and MFs for three representative combinations of speeds and trajectory (see Supplementary Fig. 23 for the full 21-frame sequences). While EFs vary only subtly with speed and direction, MFs—by integrating the analog tails—clearly reveal motion trails encoding both trajectory and velocity.
Under a strict single-frame, equal-latency setting (decision at the end of the last frame), a lightweight CNN on the final frame shows that the EF-only model overfits (58.8% test accuracy), whereas the MF-only model achieves 98.0% (Fig. 5d; Supplementary Fig. 25). For speed estimation, a CNN regressor trained with Huber loss47,48 attains MAE (MAPE) of 4.77 km h⁻¹ (10.6%) with EF and 2.15 km h⁻¹ (4.9%) with MF (Fig. 5e; Supplementary Fig. 25c), corresponding to about ±5.3 km h⁻¹ at 95% confidence for a single MF frame. All speed-estimation experiments use intersection scenes with high-contrast crosswalk backgrounds, in which alternating bright/dark stripes overlap the MF tail. To contextualize background effects, Supplementary Fig. 24 profiles the MF tail with and without crosswalks—showing near-exponential decay without crosswalks and local envelope distortion with crosswalks—and demonstrates that background-aware training mitigates this bias. While multi-frame optical-flow and hybrid optical-flow-plus-event pipelines can achieve lower MAE when given longer temporal windows, they entail higher decision latency and memory; detailed accuracy–latency trade-offs and model/training specifics are provided in Supplementary Note 5.3.
Efficient motion analysis via ONN-AEMS hybrid pipeline
High-resolution dynamic image sequences impose a significant computational burden when processed digitally. To address this, we propose integrating an ONN with our AEMS-based analog processing pipeline, enabling most computationally intensive operations to be performed optically and in the analog domain (Fig. 6a). Each 70 × 70 pixels frame is compressed to a 16D feature vector via a single-pass matrix–vector multiplication (MVM). In our implementation, the RF is replicated into 16 parallel channels (optical “fan-out”), each modulated by a pre-trained 4900 × 16 weight mask encoded in greyscale. The modulated outputs of all 4900 pixels per channel are summed onto a photodetector (optical “fan-in”), completing a full 4900 × 16 MVM in one shot (see Supplementary Note 6.1).
a Schematic of the hybrid front end: each 70 × 70 raw frame undergoes a single-pass 4900 × 16 optical MVM in the ONN to yield a 16-D feature vector, which drives LEDs. Parallel LED illumination of the AEMS produces analog spike-and-tail responses that feed the classifier. b Sequence of 16-D ONN outputs over 21 frames (rows = feature channels; columns = frame indices). c Measured AEMS spike-and-tail voltage waveforms for all 16 channels when replaying ONN outputs via LEDs. d Classifier input: from each AEMS output, 13 cEFs and 13 cMFs are selected and flattened into a 416-D classifier input. e Confusion matrix showing 93.3 % test accuracy on the AEMS-measured dataset. f Test accuracy versus compression ratio as paired cEF + cMF inputs are reduced from 40 to 2 (compression from 160× to 3216×). Accuracy remains > 90 % up to ~300× (red), closely tracking idealized end-to-end simulations (blue), with minor deviations attributable to hardware limitations.
Over 21 consecutive frames, the ONN produces a 16D output vector per frame (Fig. 6b). These are converted into a 16-channel LED drive, and the AEMS’s analog spike-and-tail responses are recorded (Fig. 6c). From these recordings, we extract: (i) compressed EFs (cEFs)—the spike amplitudes at each of the 20 inter-frame transitions—and (ii) compressed MFs (cMF)—the tail amplitudes immediately following those transitions. Both cEFs and cMFs are thus represented as 20-frames × 16-channel matrices.
From the AEMS outputs of 4650 augmented video sequences (90% training including 20% for validation, and 10% testing), we select 13 cEFs and 13 cMFs (26 frames total × 16 channels = 416D) per sequence (Fig. 6d). This corresponds to a 247-fold reduction from the original 102,900D input (21 frames × 70 × 70 pixels).
A lightweight classifier (64-unit ReLU dense layer followed by a 10-unit softmax) was trained to predict one of ten action classes. After training (Supplementary Fig. 28), we achieved ~93.3% classification accuracy on the test set (Fig. 6e). Most action classes are well separated, except for some confusion between “Two-hands wave” and “One-hand wave,” likely due to their similar spatiotemporal signatures in the compressed 16D feature space.
We next investigated the impact of frame count on classification accuracy. Starting from the full set of 20 cEFs and 20 cMFs, we progressively reduced the number of paired inputs from 40 to 2 (i.e., n cEF + n cMF, for n = 20 to 1). Figure 6f shows the resulting test accuracy as a function of compression ratio, which increases from 160-fold (n = 20) to 3216-fold (n = 1). While accuracy declines moderately beyond approximately one-thousand-fold compression, it remains above 90% even at three-hundred-fold (red dots). Compared to idealized simulations (blue dots), our measured accuracy is a few percentage points lower—likely due to optical misalignments, sensor noise, and the fact that only the classifier (not the ONN compressor) is retrained on real data. These simulations suggest that with improved alignment, hardware fidelity, and full end-to-end training, accuracy could exceed 96% even under extreme compression.
By offloading MVM to the ONN and extracting event and memory signals using AEMS, our hybrid pipeline dramatically reduces computational load and enables real-time, low-power action recognition at the sensor front end. In our prototype, ONN outputs are still digitally stored and replayed via LEDs due to OLED and camera refresh limitations, which prevent fully real-time optical–analog processing. These constraints also dominate the end-to-end latency—OLED refresh (16.7 ms at 60 Hz) and camera integration (10–20 ms)—for a total of about 27–37 ms. In a production-ready system, the OLED fan-out would be replaced by a microlens array41,49 (or a diffractive optical element, DOE), and high-speed photodetector arrays41,50 would capture each channel directly, eliminating all digital storage and replay steps. The weight mask would be implemented as a fixed-pattern passive transmissive element (e.g., chromium-on-glass attenuation, etched-glass phase, or a metasurface), so no periodic refresh would be required during operation. The remaining task—amplifying, offsetting, and rescaling each of the 16 continuous-valued ONN outputs—would likewise be performed entirely in the analog domain via integrated banks of low-noise TIAs and simple sample-and-hold (S/H) or level-shifter circuits co-packaged with the pixel array. Under this architecture, the latency would be set by the PD-TIA bandwidth and S/H timing, enabling μs-scale operation. A detailed description of the deployable architecture and latency assumptions, along with process and packaging considerations for monolithic (or highly integrated) implementations, is provided in Supplementary Note 6.2 (Supplementary Table 3).
Discussion
A comparative analysis against existing dynamic vision architectures highlights the novelty and system-level strengths of our analog spike-and-tail framework. Unlike conventional DVS (events only) and hybrid sensors such as DAVIS/ATIS (events + absolute intensity), which typically reconstruct longer-term temporal context off-sensor via buffering or digital accumulation, AEMS preserves temporal information as an analog state at the pixel plane. This removes per-frame ADC and external frame-memory accesses from the latency-critical path. As a result, latency and power are reduced, and the data path is simplified. This dual modality (“event + local memory”) simplifies tracking, separation, and prediction in scenes containing both fast and slow objects while maintaining DVS-class timing (see Supplementary Note 8, Supplementary Table 5 for a quantitative comparison with DAVIS/ATIS)51,52. With event detection and memory integration performed directly at the pixel plane, our system achieves 50–100 μs event latency in prototype measurements and is feasible in less than 2 μs with higher-bandwidth TIAs. Array-level scalability and high–frame-rate temporal fidelity follow from the complementary EF–MF scaling with the frame interval Δt. EF fidelity benefits from decreasing Δt until bounded by spike width w, frame-boundary splitting, and operator choice, whereas MF signal-to-noise ratio increases with larger Δt (see Supplementary Note 4.3 and Supplementary Figs. 21 and 22). This complementarity enables array-scale operation that maintains DVS-class timing while providing local analog memory for temporal context.
The analog approach further offers major advantages in power efficiency and hardware simplicity. While the current prototype draws less than 5 W (including drivers and interfaces), ASIC projections53,54—assuming 180-nm analog complementary metal–oxide–semiconductor (CMOS), VDD = 1.2 V, effective closed-loop bandwidths of approximately 30 kHz (EF) and 3 kHz (MF), and average activity of less than 10 kevents s−1pixel−1—indicate about 0.3 mW per pixel, with average system power well below 1 W under realistic duty cycling (see Supplementary Note 8.1 and Supplementary Table 4). By removing per-frame ADC and off-sensor frame-memory accesses from the latency-critical path—and pushing any necessary digitization to low-rate summaries at the system boundary—integration is simplified, form factor is reduced, and manufacturing complexity is lowered10 (see Supplementary Notes 8.2, 8.3). A core innovation is the analog “tail” signal, which encodes temporal information locally at the sensor plane, avoiding the latency and energy overheads of digital memory accumulation.
Regarding area and manufacturability, we quantify the pixel-layout overhead of the dual-channel EF/MF pixel and its CMOS compatibility. Under a 10–15 μm pitch assumption, logic sharing—reusing the comparator, address-event representation (AER) interface, and column periphery—limits the area overhead to approximately 20–50%, avoiding a naïve two-fold penalty (see Supplementary Note 9.1). We also detail a CMOS-compatible back-end-of-line (BEOL) integration flow for the persistent-luminescence phosphor using a photopatternable polymer–phosphor composite with a thermal budget of no more than 150 °C, akin to CMOS image-sensor color-filter-array and microlens processing, with a practical scaling path from 100 μm prototype pitch to 10–15 μm (see Supplementary Notes 9.1 and 9.3). In summary, AEMS delivers high-fidelity event plus local-memory streams at DVS-class speeds, while improving efficiency, hardware compactness, and data richness. These attributes enable more accurate downstream perception and simplify end-to-end system design.
Methods
Fabrication of fluorescent PDMS films
To enable analog optical sensing, two types of phosphors—Sr2SiO4:Eu2+ (silicate) and Lu3Al5O12:Ce3+ (garnet)—were embedded in a PDMS matrix (Silgard™ 184, Dow Corning) to form flexible fluorescent films. The PDMS base and curing agent were mixed at a 10:1 weight ratio, and phosphor powders were added at a 1:8 phosphor-to-PDMS ratio and homogenized. The mixture was degassed under vacuum for 30 minutes, cast into a 60 mm petri dish, and pre-cured at 70 °C for 1 hour. TRPL decays were measured using a Horiba DeltaDiode-375L pulsed laser diode (λ = 375 nm) and a time-correlated single-photon counting (TCSPC) module integrated into a Fluoromax-4 system.
AEMS assembly and photoelectrical response measurement
Each AEMS unit consisted of two parallel optical channels, each comprising a phosphor-PDMS film, a long-pass filter (>495 nm, Thorlabs), and a silicon photodiode with integrated TIA (PDA100A2, Thorlabs). To ensure identical responsivity to input light, the two sensor arms (silicate and garnet) were carefully aligned in both position and angle. Analog illumination was driven by a 255-nm UV LED modulated by a National Instruments USB-6423 DAQ, amplified using an HP 6827 A amplifier. The differential voltage Vout = VA − VB was sampled at 40 kS/s using the same DAQ device.
EF/MF image construction and motion analysis
To evaluate dynamic visual sensing, video clips from the Weizmann Human Action dataset (21 frames, 70 × 70 pixels) and the Crossroad dataset (21 frames, 36 × 36 pixels) were replayed frame-by-frame into the AEMS via UV LEDs. Each 8-bit pixel intensity (0–255) was mapped to a corresponding LED brightness level. The AEMS output signals were processed to extract differential spike responses (event data) and decaying tail responses (memory data), which were used to construct EFs and MFs.
These EF/MF representations then served as training data for CNN-based models. For human action classification (Weizmann), we used hardware-calibrated, simulation-generated EF/MF datasets parameterized by measured device characteristics. For vehicle trajectory and speed prediction (Crossroad), we used EF and MF images reconstructed from AEMS measurements. The CNN architecture consisted of a single convolutional layer with 64 filters (3 × 3), ReLU activation, 2 × 2 max-pooling, and fully connected layers. The classifier output layer predicted either one of ten action classes (Weizmann) or three trajectory types (Intersection). Training used the Adam optimizer (learning rate = 0.001, batch size = 32), with early stopping based on validation accuracy.
ONN–AEMS hybrid pipeline implementation
To demonstrate optical–electronic hybrid computing, we constructed a pipeline integrating an ONN with the AEMS array. We first trained a simulation-based neural network with the same architecture as the ONN-AEMS pipeline—comprising a fully connected compressor (FCNN), temporal differencing, reservoir integration, and a dense classifier. The trained FCNN weights (4900 × 16) were then implemented optically.
Each RF (70 × 70 pixels) was split into 16 replicated images and displayed on an OLED panel. These were optically projected onto 16 distinct greyscale weight masks encoded on an LCD panel, enabling one-shot 4900 × 16 optical MVM. The modulated light was focused onto a scientific camera (Thorlabs CC215MU), and each of the 16 projected images was integrated over its 4900 pixels to yield a 16-D compressed output vector.
This feature vector was replayed via LED into the AEMS array to extract dynamic temporal signals. Each sample consisted of 13 cEFs and 13 cMFs, flattened into a 416-D input vector (13 × 16 × 2). A lightweight classifier with one hidden layer (64 ReLU units) and a final softmax layer was trained to classify human actions from these temporally compressed analog signals.
Data availability
The data supporting the findings of this study are available within this article and its Supplementary Information. Source data are provided with this paper.
Code availability
The code used for data analysis and figure generation in this study is available from the corresponding author upon request.
References
Lee, J. et al. An asynchronous wireless network for capturing event-driven data from large populations of autonomous sensors. Nat. Electron. 7, 313–324 (2024).
D’Angelo, G. et al. Event-driven figure-ground organisation model for the humanoid robot iCub. Nat. Commun. 16, 1874 (2025).
Chai, Y. In-sensor computing for machine vision. Nature 579, 32–33 (2020).
Yang, Y. et al. In-sensor dynamic computing for intelligent machine vision. Nat. Electron. 7, 225–233 (2024).
Gehrig, D. & Scaramuzza, D. Low-latency automotive vision with event cameras. Nature 629, 1034–1040 (2024).
Wu, H., Li, Y., Xu, W., Kong, F. & Zhang, F. Moving event detection from lidar point streams. Nat. Commun. 15, 345 (2024).
Zhou, Y. et al. Computational event-driven vision sensors for in-sensor spiking neural networks. Nat. Electron. 6, 870–878 (2023).
Lin, S. et al. Embodied neuromorphic synergy for lighting-robust machine vision to see in extreme bright. Nat. Commun. 15, 10781 (2024).
Kaiser, M. A.-A. et al. Neuromorphic-P2M: processing-in-pixel-in-memory paradigm for neuromorphic image sensors. Front. Neuroinform. 17, 1144301 (2023).
Tan, H. & van Dijken, S. Dynamic machine vision with retinomorphic photomemristor-reservoir computing. Nat. Commun. 14, 2169 (2023).
Colonnier, F., Della Vedova, L. & Orchard, G. ESPEE: Event-based sensor pose estimation using an extended Kalman filter. Sensors 21, 7840 (2021).
Gallego, G. et al. Event-based vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 44, 154–180 (2020).
Everding, L. & Conradt, J. Low-latency line tracking using event-based dynamic vision sensors. Front. Neurorobot. 12, 4 (2018).
Rebecq, H., Horstschäfer, T., Gallego, G. & Scaramuzza, D. Evo: A geometric approach to event-based 6-dof parallel tracking and mapping in real time. IEEE Robot. Autom. Lett. 2, 593–600 (2016).
Posch, C., Serrano-Gotarredona, T., Linares-Barranco, B. & Delbruck, T. Retinomorphic event-based vision sensors: bioinspired cameras with spiking output. Proc. IEEE 102, 1470–1484 (2014).
Vidal, A. R., Rebecq, H., Horstschaefer, T. & Scaramuzza, D. Ultimate SLAM? Combining events, images, and IMU for robust visual SLAM in HDR and high-speed scenarios. IEEE Robot. Autom. Lett. 3, 994–1001 (2018).
Tenzin, S., Rassau, A. & Chai, D. Application of event cameras and neuromorphic computing to VSLAM: A survey. Biomimetics 9, 444 (2024).
Chakravarthi, B., Verma, A. A., Daniilidis, K., Fermuller, C. & Yang, Y., in European Conference on Computer Vision. 342-376 (Springer).
Hsu, T.-H. et al. A 0.5-V real-time computational CMOS image sensor with programmable kernel for feature extraction. IEEE J. Solid-State Circuits 56, 1588–1596 (2020).
Haessig, G. et al. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3963-3972.
Liao, F., Zhou, F. & Chai, Y. Neuromorphic vision sensors: Principle, progress and perspectives. J. Semicond. 42, 013105 (2021).
Fischer, T. & Milford, M. Event-based visual place recognition with ensembles of temporal windows. IEEE Robot. Autom. Lett. 5, 6924–6931 (2020).
Wang, H., Sun, B., Ge, S. S., Su, J. & Jin, M. L. On non-von Neumann flexible neuromorphic vision sensors. npj Flex. Electron. 8, 28 (2024).
Feng, G., Zhang, X., Tian, B. & Duan, C. Retinomorphic hardware for in-sensor computing. InfoMat 5, e12473 (2023).
Lao, J. et al. Ultralow-power machine vision with self-powered sensor reservoir. Adv. Sci. 9, 2106092 (2022).
Jo, H. et al. Physical Reservoir Computing Using Tellurium-Based Gate-Tunable Artificial Photonic Synapses. ACS Nano 18, 30761–30773 (2024).
Liao, F. et al. Bioinspired in-sensor visual adaptation for accurate perception. Nat. Electron. 5, 84–91 (2022).
Zhang, Z. et al. All-in-one two-dimensional retinomorphic hardware device for motion detection and recognition. Nat. Nanotechnol. 17, 27–32 (2022).
Hong, S. et al. Neuromorphic active pixel image sensor array for visual memory. ACS Nano 15, 15362–15370 (2021).
Vats, G., Hodges, B., Ferguson, A. J., Wheeler, L. M. & Blackburn, J. L. Optical memory, switching, and neuromorphic functionality in metal halide perovskite materials and devices. Adv. Mater. 35, 2205459 (2023).
Gao, C. et al. Toward grouped-reservoir computing: organic neuromorphic vertical transistor with distributed reservoir states for efficient recognition and prediction. Nat. Commun. 15, 740 (2024).
Marunchenko, A. et al. Memlumor: A Luminescent Memory Device for Energy-Efficient Photonic Neuromorphic Computing. ACS Energy Lett. 9, 2075–2082 (2024).
Wi, S. et al. Multi-Color Synaptic Luminescence in RE-Doped Ca2SnO4 (RE= Sm3+, Er3+, and La3+). Adv. Funct. Mater. 2414860 (2025).
Talanti, S. et al. CMOS-integrated organic neuromorphic imagers for high-resolution dual-modal imaging. Nat. Commun. 16, 1–9 (2025).
Lin, Q. et al. Event-driven retinomorphic photodiode with bio-plausible temporal dynamics. Nat. Nanotechnol. 1–8 (2025).
Jaeger, H., Lukoševičius, M., Popovici, D. & Siewert, U. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Netw. 20, 335–352 (2007).
Lukoševičius, M. & Jaeger, H. Reservoir computing approaches to recurrent neural network training. Comput. Sci. Rev. 3, 127–149 (2009).
Bernstein, L. et al. Single-shot optical neural network. Sci. Adv. 9, eadg7904 (2023).
Kim, M. et al. Overcoming Hardware Imperfections in Optical Neural Networks Through a Machine Learning-Driven Self-Correction Mechanism. IEEE Photonics J. 16, 1–8 (2024).
Kim, B. et al. Optical convolution operations with optical neural networks for incoherent color image recognition. Opt. Lasers Eng. 185, 108740 (2025).
Wang, T. et al. Image sensing with multilayer nonlinear optical neural networks. Nat. Photonics 17, 408–415 (2023).
Kim, M., Kim, Y. & Park, W. I. Image processing with Optical matrix vector multipliers implemented for encoding and decoding tasks. Light Sci. Appl. 14, 248 (2025).
Dutczak, D. et al. Yellow persistent luminescence of Sr2SiO4: Eu2+, Dy3+. J. Lumin. 132, 2398–2403 (2012).
Feng, A., Joos, J. J., Du, J. & Smet, P. F. Revealing trap depth distributions in persistent phosphors with a thermal barrier for charging. Phys. Rev. B 105, 205101 (2022).
Gorelick, L., Blank, M., Shechtman, E., Irani, M. & Basri, R. Actions as space-time shapes. IEEE Trans. Pattern Anal. Mach. Intell. 29, 2247–2253 (2007).
Li, W., Mahadevan, V. & Vasconcelos, N. Anomaly detection and localization in crowded scenes. IEEE Trans. Pattern Anal. Mach. Intell. 36, 18–32 (2013).
Kumar, K. & Kostina, E. Machine learning in parameter estimation of nonlinear systems. Eur. Phys. J. B 98, 60 (2025).
Wang, Q., Ma, Y., Zhao, K. & Tian, Y. A comprehensive survey of loss functions in machine learning. Ann. Data Sci. 9, 187–212 (2022).
Müller, M. et al. Mixed photonic/electronic neural network based on microLED arrays. Neuromorp. Comput. Eng. 5, 024005 (2025).
Song, A., Murty Kottapalli, S. N., Goyal, R., Schölkopf, B. & Fischer, P. Low-power scalable multilayer optoelectronic neural networks enabled with incoherent light. Nat. Commun. 15, 10692 (2024).
iniVation. DAVIS 346, https://inivation.com/wp-content/uploads/2019/08/DAVIS346.pdf (2019).
Lichtsteiner, P., Posch, C. & Delbruck, T. A. 128 × 128 120 dB 15 μs latency asynchronous temporal contrast vision sensor. IEEE J. Solid-state Circuits 43, 566–576 (2008).
Xiao, T. P., Bennett, C. H., Feinberg, B., Agarwal, S. & Marinella, M. J. Analog architectures for neural network acceleration based on non-volatile memory. Appl. Phys. Rev. 7 (2020).
Chen, Y.-H., Krishna, T., Emer, J. S. & Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-state Circuits 52, 127–138 (2016).
Acknowledgements
This work was supported by the National Research Foundation (NRF) of Korea, funded by the Ministry of Science, ICT, and Future Planning (MSIP) of Korea (Nos. RS-2021-NR060087 and RS-2024-00353762). S.N. gratefully acknowledges support from the Office of Naval Research (N000142412533) and the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (RS-2024-00408180).
Author information
Authors and Affiliations
Contributions
Y.L.K. and W.I.P. conceived the concept and designed the experiments. Y.L.K. fabricated the phosphors-PDMS films and assembled the AEMS setup. Y.L.K., H.S.P., and M.J.K. developed the ONN–AEMS hybrid system and data acquisition. Y.L.K., H.S.P., and S.W.N. performed the optoelectronic signal measurements and analysis. Y.L.K., H.S.P., and W.I.P. implemented the classification and regression frameworks and performance evaluations. S.H.J., D.Y.J., L.S.H., H.C.Y., and S.Y.K. analyzed the experimental data. N.Y.K., N.R.O., and S.I.Y. carried out PL and TRPL measurements and analyzed the phosphor materials. Y.L.K. and W.I.P. co-wrote the manuscript. All authors discussed the results and provided critical feedback.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Weida Hu and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kim, Y., Park, H., Kim, M. et al. In-sensor analog optoelectronic processing of concurrent event and memory signals for dynamic vision sensing. Nat Commun 17, 1250 (2026). https://doi.org/10.1038/s41467-025-68013-8
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41467-025-68013-8








