In-sensor analog optoelectronic processing of concurrent event and memory signals for dynamic vision sensing

Kim, Yelim; Park, Hyeonsu; Kim, Minjoo; Jang, Suhee; Jeong, Dae Yeop; Handriani, Lia Saptini; Yun, Hyuncheol; Gwak, Namyoung; Oh, Nuri; Yang, Sung Ik; Kwon, Soyeong; Nam, SungWoo; Park, Won Il

doi:10.1038/s41467-025-68013-8

Article
Open access
Published: 26 December 2025

Nature Communications volume 17, Article number: 1250 (2026)

Abstract

Efficient dynamic vision requires capturing instantaneous changes and temporal context, yet existing image and event sensors rely on power-hungry digital processing. Here, we introduce an in-sensor dual-response architecture that concurrently generates analog event spikes and persistent memory tails. A prototype sensor integrates phosphor pairs with silicon photodiodes and transimpedance amplifiers to achieve microsecond- and millisecond-scale dual kinetics. Measurements during light-emitting diode replay reconstruct event frames that match software frame differences, while the slow channel behaves as a linear reservoir of motion history. A single memory frame fed to a convolutional neural network enables accurate classification of human actions (93.1%) and vehicle trajectories (98.0%), as well as speed estimation with errors of 2.15 km/h. Integration with a compressive optical neural network front end mapping 4900 inputs to 16 per frame yields 93.3% action classification accuracy. By eliminating analog-to-digital conversion and digital accumulation, this approach enables ultralow-latency, ultralow-power neuromorphic vision.

Introduction

Efficiently perceiving dynamic scenes in real time is critical for applications such as robotics, autonomous driving, and intelligent surveillance^1,2,3,4,5,6. Traditional frame-based vision sensors capture full images at fixed intervals (Fig. 1a, top), providing rich spatial context but suffering from high data redundancy: static backgrounds are repeatedly processed, elevating bandwidth requirements, power consumption, and processing latency^5,7,8,9.

**Fig. 1: In-sensor spike-and-tail paradigm versus conventional dynamic-vision pipelines.**

Event-based vision sensors—also known as neuromorphic or dynamic vision sensors (DVSs)—have emerged as compelling alternatives. Unlike conventional imagers, DVSs asynchronously output pixel-level brightness changes, creating sparse event streams characterized by microsecond latency, high temporal resolution, and high dynamic range (Fig. 1a, middle)^10,11,12,13. By filtering out redundant static information, DVSs drastically reduce data throughput and energy consumption while minimizing motion blur during rapid movements^12,14. These advantages have positioned event-based sensors as key enabling technologies for energy-constrained, real-time tasks such as robotic navigation and automotive safety.

However, DVS outputs are inherently stateless, limiting their ability to capture persistent context or motion trajectories^15,16. In conventional pipelines (Fig. 1a, b), motion persistence is reconstructed by (i) computing pixel-wise differences between raw frames (RFs) to produce event frames (EFs), or directly capturing EFs with a DVS; (ii) digitizing those event signals via analog-to-digital conversion (ADC); and (iii) digitally accumulating them into memory frames (MFs). Such multi-step digital processing introduces additional latency, power consumption, and hardware complexity^12,15,17,18,19. Hybrid/event-intensity sensors such as Dynamic and Active-pixel Vision Sensor (DAVIS) and Asynchronous Time-based Image Sensor (ATIS) provide both asynchronous events and absolute intensity^12,20,21,22. Recovering longer-term temporal context, however, typically requires off-sensor buffering or digital accumulation, which increases system-level latency and complexity.

Recent advances have shown that embedding short-term memory functionality directly within the sensor can greatly enhance dynamic perception^9,10,23,24,25. Retinomorphic photomemristor arrays¹⁰ and 2-dimensional (2D) heterostructure-based sensors^26,27,28 inherently encode temporal motion history, enabling “reservoir-in-sensor” processing that eliminates redundant data transfer and provides instantaneous access to accumulated motion context. Yet most existing implementations rely on luminescent or photoconductive materials (e.g., long-persistent phosphors or cumulative photoconductors) whose memory responses are primarily triggered by externally encoded event spike-like pulses, reflecting per-pixel intensity changes^29,30,31,32,33,34. This limitation prevents individual pixels from capturing both instantaneous events and persistent motion context in a single stage. Extending in-sensor temporal processing beyond conventional DVS pipelines, a retinomorphic photodiode consolidates event sensing, band-pass dynamics, and light adaptation within a single diode and reports a dynamic range exceeding 200 dB³⁵, demonstrating retina-like front-end temporal behavior under extreme lighting. Event-driven sensors with in-sensor computation generate programmable, amplitude-encoded spikes using dual-polarity photodiodes for in-sensor spiking-neural-network operations⁷. These advances validate the promise of in-sensor temporal processing, yet their external interfaces remain predominantly spike-centric, with longer-term temporal context typically reconstructed off-sensor via event accumulation or windowing.

Here, we introduce an in-sensor dual-response pixel architecture that executes instantaneous event detection and temporal integration directly in the analog domain at the pixel level (Fig. 1c). Each pixel comprises two parallel photosensors that differ in their fast-response kinetics and in the amplitude of their persistent luminescence tails (ms scale). A real-time analog differential measurement of their outputs yields (i) event spikes whenever illumination changes, arising from mismatches in fast response kinetics, and (ii) memory tails that persist after each spike, arising from differences in slow-rise and slow-decay dynamics. This approach dramatically reduces system latency, power consumption, and complexity.

We validate our architecture on the Weizmann Human Action dataset, achieving a structural similarity index (SSIM) of 0.94 for reconstructed EFs and good agreement with a linear reservoir model^36,37 for memory signals. When fed into lightweight convolutional neural networks (CNNs), our MF input outperforms event- or frame-only baselines in action recognition. We further demonstrate that single MF inputs enable vehicle trajectory classification with 98% accuracy and speed estimation with a mean absolute error (MAE) of 2.15 km/h. Finally, an optical neural network (ONN)-based compressive encoder^38,39,40,41,42 (4900 → 16) achieves 93.3% accuracy, enabling efficient analog feature encoding. This architecture eliminates the need for ADC and digital accumulation, reduces end-to-end latency by more than an order of magnitude, and projects per-channel power below 1 mW, marking a significant step toward ultralow-latency, ultralow-power neuromorphic vision.

Results

Event and memory signal detection by response speed mismatch

Figure 2a illustrates the pixel-level sensor architecture, hereafter referred to as an analog event-memory sensor (AEMS). Sensors A and B incorporate silicate (Sr₂SiO₄:Eu²⁺) and garnet (Lu₃Al₅O₁₂:Ce³⁺) phosphors, respectively, each coupled to a Si photodiode (PD) and transimpedance amplifier (TIA) (see Methods and Supplementary Note 1). Time-resolved photoluminescence (TRPL) measurements showed that the intrinsic photoluminescence lifetimes of the silicate and garnet are approximately 1.5 μs and 200 ns, respectively (Fig. 2b). In addition, the silicate exhibits a pronounced persistent luminescence (after-glow) tail, whereas garnet’s persistent luminescence is negligible.

**Fig. 2: Concurrent extraction of event spikes and memory tails using an AEMS.**

In our standard configuration (Fig. 2c), Sensor A’s TIA (feedback resistance, R_f = 1.5 MΩ) exhibits a 3 dB bandwidth (f₃dB) of 9 kHz ($\tau_{\mathrm{TIA}} = 1/(2\pi f_{3\mathrm{dB}}) \approx 18$ μs), whereas Sensor B’s TIA (R_f = 4.75 MΩ) has an f₃dB of 3 kHz ($\tau_{\mathrm{TIA}} \approx 53$ μs). In this case, the effective electrical response is dominated by the TIAs’ low-pass characteristics, rather than the much shorter intrinsic PL lifetimes of the phosphors. Under a 100-ms LED on/off square-wave, Sensors A and B show 10–90% rise/fall times t_r of approximately 50 μs and 120 μs, respectively, closely matching the estimate t_r ≈ 0.35/f₃dB (39 μs and 117 μs).
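The two estimates quoted above follow from treating each TIA as a single-pole low-pass filter. The short sketch below reproduces that arithmetic from the stated bandwidths; the function names are illustrative and not taken from the authors' code.

```python
import math

def tia_time_constant(f_3db_hz):
    """First-order time constant tau = 1 / (2*pi*f_3dB), in seconds."""
    return 1.0 / (2.0 * math.pi * f_3db_hz)

def rise_time_10_90(f_3db_hz):
    """10-90% rise time of a single-pole low-pass, t_r ~= 0.35 / f_3dB."""
    return 0.35 / f_3db_hz

# 3 dB bandwidths quoted in the text for Sensors A and B.
for name, f_3db in (("A", 9e3), ("B", 3e3)):
    tau_us = tia_time_constant(f_3db) * 1e6
    tr_us = rise_time_10_90(f_3db) * 1e6
    print(f"Sensor {name}: tau = {tau_us:.1f} us, t_r = {tr_us:.1f} us")
```

The computed values (about 17.7 μs and 53.1 μs for τ, and about 38.9 μs and 116.7 μs for t_r) agree with the rounded figures in the text.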

The differential output V_out(t) = V_A(t) − V_B(t) exhibits an immediate positive spike at LED turn-on owing to Sensor A’s faster kinetics and a negative spike at turn-off (Fig. 2c, red). In this configuration, spikes span approximately 100 μs, but adjusting the TIAs’ feedback resistances can reduce them to as short as 2.4 μs (Supplementary Fig. 3; Fig. 2d). Raising the cutoff frequencies above the phosphors’ decay rates could narrow spikes further, approaching the intrinsic PL timescales of the materials.

Following each spike, the silicate’s persistent luminescence, which is nearly absent (less than 5% amplitude) in the garnet channel, produces a slowly decaying tail in V_out (Fig. 2c, blue). Details of the long-tail characterization—combining TRPL for the early (µs) regime and broadband PD-TIA measurements for the long-time (ms) regime—are provided in Supplementary Note 1.3. The convolution of the two TIAs’ time constants with the silicate after-glow kinetics yields analog traces encoding both instantaneous events and temporal memory (dominant decay time τ ~ 45 ms). Stepped-intensity experiments confirm that each brightness transition elicits repetitive spikes, each of which causes an immediate increase or decrease in the tails that gradually decay over time (Fig. 2e).
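To make the spike-and-tail mechanism concrete, the following is a minimal numerical sketch, not the authors' model. It assumes each TIA is a single-pole low-pass filter with the time constants quoted above, that both channels have equal steady-state gain, and that a fraction `P_SLOW` of the silicate emission follows a single 45-ms exponential. `P_SLOW = 0.2` is a hypothetical value chosen for illustration, and the garnet after-glow is neglected.

```python
import math

def low_pass(signal, tau_s, dt_s):
    """Single-pole low-pass filter; exact update for piecewise-constant input."""
    k = 1.0 - math.exp(-dt_s / tau_s)
    y, out = 0.0, []
    for x in signal:
        y += k * (x - y)
        out.append(y)
    return out

DT = 2e-6                  # simulation step: 2 us
N = 20000                  # 40 ms of simulated time
I_ON, I_OFF = 500, 10500   # LED on at 1 ms, off at 21 ms

TAU_A, TAU_B = 18e-6, 53e-6  # TIA time constants quoted in the text
TAU_TAIL = 45e-3             # dominant after-glow decay time quoted in the text
P_SLOW = 0.2                 # ASSUMED slow fraction of the silicate emission

led = [1.0 if I_ON <= i < I_OFF else 0.0 for i in range(N)]

# Silicate channel: prompt emission plus a slow persistent component.
slow = low_pass(led, TAU_TAIL, DT)
light_a = [(1.0 - P_SLOW) * u + P_SLOW * s for u, s in zip(led, slow)]

v_a = low_pass(light_a, TAU_A, DT)   # Sensor A (silicate, faster TIA)
v_b = low_pass(led, TAU_B, DT)       # Sensor B (garnet, slower TIA)
v_out = [a - b for a, b in zip(v_a, v_b)]

spike_on = max(v_out[I_ON:I_ON + 100])     # within 200 us of turn-on
tail_on = v_out[I_ON + 1000]               # 2 ms after turn-on
spike_off = min(v_out[I_OFF:I_OFF + 100])  # within 200 us of turn-off
tail_off = v_out[I_OFF + 1000]             # 2 ms after turn-off
print(f"on:  spike {spike_on:+.3f}, tail {tail_on:+.3f}")
print(f"off: spike {spike_off:+.3f}, tail {tail_off:+.3f}")
```

Under these assumptions the sketch gives a positive spike followed by a negative tail at turn-on and the reverse at turn-off, which is the qualitative behavior described for Fig. 2c. The amplitudes depend entirely on the assumed `P_SLOW` and should not be read as measured values.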

Under our configuration, EF spikes and MF tails have opposite polarity, enabling real-time separation (see Supplementary Note 2 for the EF/MF recovery method). The opposite case, in which the silicate channel is read with a TIA whose bandwidth is lower than that of the garnet channel, yields spikes and tails with the same polarity (Supplementary Fig. 5). We also observed memory-fading behavior, in which the degree of tail signal accumulation and decay varied systematically with the temporal profile of input brightness steps, closely matching residual-state dynamics in reservoir computing. We fit these dynamics with a linear reservoir (leaky-integrator) model^36,37:

$${T}_{i}[n]=\alpha \cdot {T}_{i}[n-1]+\beta \cdot {S}_{i}[n-1]$$

where ${T}_{i}[n]$ is the internal memory tail state of channel i at frame n, and ${S}_{i}[n]$ is the external event spike input. The per-frame retention (decay) factor α is estimated from the after-glow fits as α = exp(−Δt/τ), giving α ≈ 0.895 for τ ≈ 45 ms with Δt = 5 ms. We set β = 1 − α to enforce unity steady-state gain. Using α ∈ {0.85, 0.90, 0.95} in Fig. 2f brackets the fitted range and faithfully reproduces the observed analog-memory tails. The decay time τ is tunable at the materials stage⁴³ and, in principle, can be reconfigured in operando⁴⁴, with details provided in Supplementary Note 1.5.
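The leaky-integrator update can be sketched in a few lines. The code below assumes the retention factor follows α = exp(−Δt/τ), which reproduces the quoted α ≈ 0.895 for τ = 45 ms and Δt = 5 ms, and it assumes a zero initial state T[0] = 0. The function names are illustrative.

```python
import math

def retention_factor(tau_s, dt_s):
    """Per-frame retention alpha = exp(-dt / tau) for an exponential decay."""
    return math.exp(-dt_s / tau_s)

def leaky_integrator(spikes, alpha, beta=None):
    """T[n] = alpha * T[n-1] + beta * S[n-1], with T[0] = 0 (assumed)."""
    if beta is None:
        beta = 1.0 - alpha  # unity steady-state gain, as in the text
    tail = [0.0]
    for n in range(1, len(spikes)):
        tail.append(alpha * tail[-1] + beta * spikes[n - 1])
    return tail

alpha = retention_factor(45e-3, 5e-3)
print(f"alpha = {alpha:.3f}")

# A single unit spike at n = 0 leaves a geometrically decaying tail.
impulse_tail = leaky_integrator([1.0] + [0.0] * 9, alpha)
print([round(t, 4) for t in impulse_tail[:4]])
```

With β = 1 − α, a constant input of amplitude 1 drives the state toward 1, which is the unity steady-state gain mentioned above.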

Real video tests with analog event–memory sensors

To validate our analog event + memory sensing approach, we replayed pixel-wise brightness time-traces from dynamic video clips onto AEMSs using controlled LED stimuli. We selected 93 clips from the Weizmann Human Action Dataset⁴⁵ (each comprising 21 RFs at 70 × 70 pixels across ten action classes; representative frames are shown in Fig. 3a and Supplementary Fig. 11) and converted each pixel’s 8-bit intensity into a corresponding LED intensity sequence (see Supplementary Note 2).

**Fig. 3: Real-video validation of in-sensor event + memory sensing.**

Figure 3b presents example LED intensity waveforms for selected pixels alongside the corresponding AEMS voltage responses. Each luminance transition produces a rapid differential spike followed by a slower decaying tail. Figure 3c shows a 2D spatiotemporal map of the AEMS voltage outputs over time for the 70 pixels in the 43rd column of each frame from the dataset in Fig. 3a. By aggregating spike occurrences across all pixels at each time step, we reconstructed EFs (Fig. 3d, top panels), which closely match digitally computed frame differences (see Supplementary Fig. 13 and Supplementary Video 1 for comparisons between AEMS-reconstructed and digitally computed EFs). This process was carried out across the entire dataset, and the system operated stably over the long acquisitions without observable drift in either spike or memory responses (see also Supplementary Fig. 9 for accelerated-cycling data). Consequently, the reconstructed EFs achieve an average SSIM ~ 0.94 and MAE ~ 0.02 (Supplementary Fig. 14).

In parallel, we integrated the tail amplitudes across pixels to form MF images (Fig. 3d, bottom), capturing motion persistence in a manner analogous to frame accumulation. These analog memory images align qualitatively with the outputs of a linear reservoir model applied to the EF inputs (Supplementary Fig. 13), supporting their functional fidelity. Because they reflect gradual decay after spike accumulation, the MFs effectively preserve motion traces over time. Notably, unlike the case where a linear reservoir model is applied directly to RF inputs³⁴, our MFs are generated from EF-based dynamics and therefore preserve spike polarity (±) and suppress redundant background, yielding richer temporal history with reduced data redundancy (see Supplementary Note 3).
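For reference, the digital counterpart that the analog EFs and MFs are compared against can be sketched as signed frame differencing followed by per-pixel leaky integration. This is a toy illustration on a 1 × 4 frame sequence. The helper names, the zero initial state, and the indexing convention (each MF is the state just after the corresponding transition) are assumptions made here, not the authors' implementation.

```python
def event_frames(raw_frames):
    """Signed frame differences: one EF per inter-frame transition."""
    return [
        [[c - p for c, p in zip(cur_row, prev_row)]
         for cur_row, prev_row in zip(cur, prev)]
        for prev, cur in zip(raw_frames, raw_frames[1:])
    ]

def memory_frames(efs, alpha, beta=None):
    """Per-pixel leaky integration of EFs; polarity is preserved."""
    if beta is None:
        beta = 1.0 - alpha
    state = [[0.0] * len(row) for row in efs[0]]  # assumed zero initial state
    mfs = []
    for ef in efs:
        state = [[alpha * s + beta * e for s, e in zip(s_row, e_row)]
                 for s_row, e_row in zip(state, ef)]
        mfs.append(state)
    return mfs

# Toy sequence: a single bright pixel moving one step to the right per frame.
raw = [
    [[1, 0, 0, 0]],
    [[0, 1, 0, 0]],
    [[0, 0, 1, 0]],
]
efs = event_frames(raw)
mfs = memory_frames(efs, alpha=0.5)
print(efs[-1])  # newest transition only
print(mfs[-1])  # newest transition plus a decayed trace of the previous one
```

In the final MF the newest position carries the largest positive value, while earlier positions carry smaller negative residues. This illustrates how a single MF retains signed motion history that a single EF does not.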

We further evaluated our method on the UCSD Pedestrian Dataset⁴⁶, which features real-world urban scenes with complex visual clutter, including multiple overlapping pedestrians and moving vehicles (Fig. 3e; Supplementary Fig. 15). Despite this complexity, our analog framework enabled robust background suppression and clear delineation of motion trajectories. In the reconstructed EFs, static background elements such as trees and roads were suppressed, while dynamic objects exhibited sharp intensity transitions (SSIM of 0.95 when compared to digitally computed frame differences), confirming spatial fidelity. The corresponding MFs revealed smooth decaying trails consistent with the underlying motion paths.

Learning and evaluation for dynamic image classification

To quantify the benefit of each information channel, we trained lightweight CNNs on three synthetic modalities derived from the Weizmann Human Action Dataset (93 clips, 21 frames each; Fig. 4a). Each clip was augmented twentyfold, and only the last 10 frames—where sufficient temporal context exists—were used to generate RFs, EFs, and MFs. This yielded 18,600 samples per modality, split 80/20 for training and validation. Training used hardware-calibrated EF/MF datasets that embed measured AEMS characteristics—noise statistics, short-/long-term drift, nonlinear responses captured by look-up table (LUT)-based calibration curves, and inter-frame spatiotemporal correlations—and testing used independent AEMS-measured RFs and EFs/MFs (Supplementary Note 2.3).

**Fig. 4: Dynamic image classification using single and fused modalities.**

In single-modality experiments, RF-based models exhibited the poorest performance, with both training and validation accuracies remaining below 45% after 20 epochs (Fig. 4b; Supplementary Fig. 16a). EF benefited modestly from calibration—training/validation behavior improved—but test accuracy remained 69% (Supplementary Fig. 16b), consistent with EF’s frame-difference-like nature and ambiguity in single-frame cues. MF showed the largest gain with calibration, improving from pre-calibration accuracies of about 91%, 85%, and 82% (training/validation/test) to 97%, 96%, and 93%, with stable convergence by epoch 17 (Supplementary Fig. 16c). This reflects the calibrated inclusion of accumulated noise, drift, LUT nonlinearity, and inter-frame correlation absent in purely digital simulations.

For a minimal two-channel fusion, EF plus MF reached accuracies of about 96%, 96%, and 92% (training/validation/test)—an improvement of about 3.3 percentage points over the pre-calibration fusion—yet slightly below MF alone (Fig. 4b; Supplementary Fig. 16d). Confusion matrices (Fig. 4c) show that action pairs ambiguous under EF are effectively separated when MF is used, underscoring MF’s strength in encoding accumulated, continuous spatiotemporal patterns. Overall, progressing from RF to EF to MF under hardware-calibrated training narrows the train/validation/test gap and improves generalization, with MF delivering the highest test accuracy on measured data.

Evaluation of vehicle dynamics in intersection environments

For an object moving at speed v, the continuous-space memory trail along its trajectory, A(ξ), follows an exponential profile A(ξ)∝exp(−ξ/ℓ), where ξ denotes the arc length along motion, and ℓ=vτ is the characteristic trail length. This property implies that the AEMS-based memory frame (MF) provides a natural cue for trajectory and speed estimation (see Supplementary Note 4 for the quantitative EF and MF framework). To evaluate our in-sensor event-plus-memory (EF + MF) pipeline under deployment-relevant intersection scenarios, we generated a synthetic dataset of 510 short video clips depicting vehicles traversing a four-way intersection (Fig. 5a). Vehicles entered from the top, left, or right roads and exited via straight, left-turn, or right-turn trajectories (Fig. 5b shows 50 representative paths), with speeds varying from 30 to 60 km/h by adjusting per-frame displacements. From each clip, we extracted a 36 × 36 pixel patch covering the exit region (red box) and sampled 21 frames at 10 ms intervals (see Supplementary Note 5.1 for details).
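The relation ℓ = vτ can be illustrated with an idealized, noise-free example: given a trail profile A(ξ), a log-linear least-squares fit recovers ℓ and hence v. This sketch only demonstrates why the MF trail encodes speed. It is not the CNN regressor used in this work, and it assumes τ = 45 ms and a purely exponential trail on a uniform background.

```python
import math

TAU_S = 45e-3  # dominant after-glow decay time quoted in the text

def trail_length_m(speed_kmh, tau_s=TAU_S):
    """Characteristic trail length l = v * tau, in metres."""
    return speed_kmh / 3.6 * tau_s

def speed_from_trail(xi_m, amplitude, tau_s=TAU_S):
    """Fit ln A = c - xi / l by least squares and return v = l / tau in km/h."""
    logs = [math.log(a) for a in amplitude]
    n = len(xi_m)
    mean_x = sum(xi_m) / n
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xi_m, logs))
    sxx = sum((x - mean_x) ** 2 for x in xi_m)
    ell = -sxx / sxy  # slope of the fit is -1 / l
    return ell / tau_s * 3.6

# Synthetic trail for a vehicle at 45 km/h, sampled every 5 cm.
xi = [0.05 * k for k in range(20)]
ell_true = trail_length_m(45.0)
trail = [math.exp(-x / ell_true) for x in xi]

print(f"l(30 km/h) = {trail_length_m(30.0):.3f} m")
print(f"l(60 km/h) = {trail_length_m(60.0):.3f} m")
print(f"recovered speed = {speed_from_trail(xi, trail):.2f} km/h")
```

For the 30 to 60 km/h range used in the dataset, these assumptions give trail lengths of roughly 0.375 m to 0.75 m.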

**Fig. 5: Evaluation of vehicle trajectory classification and speed estimation in a dynamic intersection task.**

Per-pixel intensity time series from each patch were replayed through an AEMS, yielding analog responses from which we reconstructed EFs and MFs. For both classification and regression, we split these EF and MF sequences into 90% training and 10% unseen test sets. Figure 5c shows final-frame EFs and MFs for three representative combinations of speeds and trajectory (see Supplementary Fig. 23 for the full 21-frame sequences). While EFs vary only subtly with speed and direction, MFs—by integrating the analog tails—clearly reveal motion trails encoding both trajectory and velocity.

Under a strict single-frame, equal-latency setting (decision at the end of the last frame), a lightweight CNN on the final frame shows that the EF-only model overfits (58.8% test accuracy), whereas the MF-only model achieves 98.0% (Fig. 5d; Supplementary Fig. 25). For speed estimation, a CNN regressor trained with Huber loss^47,48 attains MAE (MAPE) of 4.77 km h⁻¹ (10.6%) with EF and 2.15 km h⁻¹ (4.9%) with MF (Fig. 5e; Supplementary Fig. 25c), corresponding to about ±5.3 km h⁻¹ at 95% confidence for a single MF frame. All speed-estimation experiments use intersection scenes with high-contrast crosswalk backgrounds, in which alternating bright/dark stripes overlap the MF tail. To contextualize background effects, Supplementary Fig. 24 profiles the MF tail with and without crosswalks—showing near-exponential decay without crosswalks and local envelope distortion with crosswalks—and demonstrates that background-aware training mitigates this bias. While multi-frame optical-flow and hybrid optical-flow-plus-event pipelines can achieve lower MAE when given longer temporal windows, they entail higher decision latency and memory; detailed accuracy–latency trade-offs and model/training specifics are provided in Supplementary Note 5.3.

Efficient motion analysis via ONN-AEMS hybrid pipeline

High-resolution dynamic image sequences impose a significant computational burden when processed digitally. To address this, we propose integrating an ONN with our AEMS-based analog processing pipeline, enabling most computationally intensive operations to be performed optically and in the analog domain (Fig. 6a). Each 70 × 70 pixel frame is compressed to a 16D feature vector via a single-pass matrix–vector multiplication (MVM). In our implementation, the RF is replicated into 16 parallel channels (optical “fan-out”), each modulated by a pre-trained 4900 × 16 weight mask encoded in greyscale. The modulated outputs of all 4900 pixels per channel are summed onto a photodetector (optical “fan-in”), completing a full 4900 × 16 MVM in one shot (see Supplementary Note 6.1).
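Numerically, the fan-out, mask modulation, and fan-in steps amount to one matrix–vector product. The sketch below mimics that sequence for a flattened 70 × 70 frame. The random masks stand in for the pre-trained weights, and restricting them to transmittances in [0, 1] is an assumption made here, since the handling of signed weights is not described in this section.

```python
import random

N_PIXELS, N_CHANNELS = 70 * 70, 16

def optical_mvm(frame_flat, masks):
    """Emulate fan-out, greyscale mask modulation, and fan-in summation."""
    outputs = []
    for mask in masks:  # fan-out: every channel receives a copy of the frame
        modulated = [p * w for p, w in zip(frame_flat, mask)]  # mask modulation
        outputs.append(sum(modulated))  # fan-in: the detector sums all pixels
    return outputs

rng = random.Random(0)
# Placeholder masks (ASSUMED): one 4900-element transmittance pattern per channel.
masks = [[rng.random() for _ in range(N_PIXELS)] for _ in range(N_CHANNELS)]
frame = [rng.random() for _ in range(N_PIXELS)]  # stand-in for a flattened RF

features = optical_mvm(frame, masks)
print(len(features))  # one value per channel
```

Each frame of 4900 pixel values is thereby reduced to 16 analog values, which is the 4900 → 16 compression referred to in the text.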

**Fig. 6: Efficient dynamic motion analysis via ONN–AEMS hybrid pipeline.**

Over 21 consecutive frames, the ONN produces a 16D output vector per frame (Fig. 6b). These are converted into a 16-channel LED drive, and the AEMS’s analog spike-and-tail responses are recorded (Fig. 6c). From these recordings, we extract: (i) compressed EFs (cEFs), the spike amplitudes at each of the 20 inter-frame transitions, and (ii) compressed MFs (cMFs), the tail amplitudes immediately following those transitions. Both cEFs and cMFs are thus represented as 20-frame × 16-channel matrices.

From the AEMS outputs of 4650 augmented video sequences (90% training including 20% for validation, and 10% testing), we select 13 cEFs and 13 cMFs (26 frames total × 16 channels = 416D) per sequence (Fig. 6d). This corresponds to a 247-fold reduction from the original 102,900D input (21 frames × 70 × 70 pixels).

A lightweight classifier (64-unit ReLU dense layer followed by a 10-unit softmax) was trained to predict one of ten action classes. After training (Supplementary Fig. 28), we achieved ~93.3% classification accuracy on the test set (Fig. 6e). Most action classes are well separated, except for some confusion between “Two-hands wave” and “One-hand wave,” likely due to their similar spatiotemporal signatures in the compressed 16D feature space.

We next investigated the impact of frame count on classification accuracy. Starting from the full set of 20 cEFs and 20 cMFs, we progressively reduced the number of paired inputs from 40 to 2 (i.e., n cEF + n cMF, for n = 20 to 1). Figure 6f shows the resulting test accuracy as a function of compression ratio, which increases from 160-fold (n = 20) to 3216-fold (n = 1). While accuracy declines moderately beyond approximately one-thousand-fold compression, it remains above 90% even at three-hundred-fold (red dots). Compared to idealized simulations (blue dots), our measured accuracy is a few percentage points lower—likely due to optical misalignments, sensor noise, and the fact that only the classifier (not the ONN compressor) is retrained on real data. These simulations suggest that with improved alignment, hardware fidelity, and full end-to-end training, accuracy could exceed 96% even under extreme compression.
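The compression ratios quoted above follow directly from the input and feature dimensions: 21 frames of 70 × 70 pixels on the input side, and n cEFs plus n cMFs of 16 channels each on the output side. A small sketch of that arithmetic, with an illustrative function name:

```python
FRAMES, HEIGHT, WIDTH, CHANNELS = 21, 70, 70, 16

def compression_ratio(n_pairs):
    """Ratio of raw input size to the size of n cEFs plus n cMFs."""
    raw_dim = FRAMES * HEIGHT * WIDTH       # 102,900 values per sequence
    feature_dim = 2 * n_pairs * CHANNELS    # n cEFs + n cMFs, 16 channels each
    return raw_dim / feature_dim

for n in (20, 13, 1):
    print(f"n = {n:2d}: {2 * n * CHANNELS:3d}D features, "
          f"ratio = {compression_ratio(n):.1f}")
```

The ratios come out at about 160.8 for n = 20, 247.4 for n = 13, and 3215.6 for n = 1, matching the 160-fold, 247-fold, and 3216-fold figures in the text up to rounding.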

By offloading MVM to the ONN and extracting event and memory signals using AEMS, our hybrid pipeline dramatically reduces computational load and enables real-time, low-power action recognition at the sensor front end. In our prototype, ONN outputs are still digitally stored and replayed via LEDs due to OLED and camera refresh limitations, which prevent fully real-time optical–analog processing. These constraints also dominate the end-to-end latency—OLED refresh (16.7 ms at 60 Hz) and camera integration (10–20 ms)—for a total of about 27–37 ms. In a production-ready system, the OLED fan-out would be replaced by a microlens array^41,49 (or a diffractive optical element, DOE), and high-speed photodetector arrays^41,50 would capture each channel directly, eliminating all digital storage and replay steps. The weight mask would be implemented as a fixed-pattern passive transmissive element (e.g., chromium-on-glass attenuation, etched-glass phase, or a metasurface), so no periodic refresh would be required during operation. The remaining task—amplifying, offsetting, and rescaling each of the 16 continuous-valued ONN outputs—would likewise be performed entirely in the analog domain via integrated banks of low-noise TIAs and simple sample-and-hold (S/H) or level-shifter circuits co-packaged with the pixel array. Under this architecture, the latency would be set by the PD-TIA bandwidth and S/H timing, enabling μs-scale operation. A detailed description of the deployable architecture and latency assumptions, along with process and packaging considerations for monolithic (or highly integrated) implementations, is provided in Supplementary Note 6.2 (Supplementary Table 3).

Discussion

A comparative analysis against existing dynamic vision architectures highlights the novelty and system-level strengths of our analog spike-and-tail framework. Unlike conventional DVS (events only) and hybrid sensors such as DAVIS/ATIS (events + absolute intensity), which typically reconstruct longer-term temporal context off-sensor via buffering or digital accumulation, AEMS preserves temporal information as an analog state at the pixel plane. This removes per-frame ADC and external frame-memory accesses from the latency-critical path. As a result, latency and power are reduced, and the data path is simplified. This dual modality (“event + local memory”) simplifies tracking, separation, and prediction in scenes containing both fast and slow objects while maintaining DVS-class timing (see Supplementary Note 8, Supplementary Table 5 for a quantitative comparison with DAVIS/ATIS)^51,52. With event detection and memory integration performed directly at the pixel plane, our system achieves 50–100 μs event latency in prototype measurements and is feasible in less than 2 μs with higher-bandwidth TIAs. Array-level scalability and high–frame-rate temporal fidelity follow from the complementary EF–MF scaling with the frame interval Δt. EF fidelity benefits from decreasing Δt until bounded by spike width w, frame-boundary splitting, and operator choice, whereas MF signal-to-noise ratio increases with larger Δt (see Supplementary Note 4.3 and Supplementary Figs. 21 and 22). This complementarity enables array-scale operation that maintains DVS-class timing while providing local analog memory for temporal context.

The analog approach further offers major advantages in power efficiency and hardware simplicity. While the current prototype draws less than 5 W (including drivers and interfaces), ASIC projections^53,54—assuming 180-nm analog complementary metal–oxide–semiconductor (CMOS), V_DD = 1.2 V, effective closed-loop bandwidths of approximately 30 kHz (EF) and 3 kHz (MF), and average activity of less than 10 kevents s⁻¹pixel⁻¹—indicate about 0.3 mW per pixel, with average system power well below 1 W under realistic duty cycling (see Supplementary Note 8.1 and Supplementary Table 4). By removing per-frame ADC and off-sensor frame-memory accesses from the latency-critical path—and pushing any necessary digitization to low-rate summaries at the system boundary—integration is simplified, form factor is reduced, and manufacturing complexity is lowered¹⁰ (see Supplementary Notes 8.2, 8.3). A core innovation is the analog “tail” signal, which encodes temporal information locally at the sensor plane, avoiding the latency and energy overheads of digital memory accumulation.

Regarding area and manufacturability, we quantify the pixel-layout overhead of the dual-channel EF/MF pixel and its CMOS compatibility. Under a 10–15 μm pitch assumption, logic sharing—reusing the comparator, address-event representation (AER) interface, and column periphery—limits the area overhead to approximately 20–50%, avoiding a naïve two-fold penalty (see Supplementary Note 9.1). We also detail a CMOS-compatible back-end-of-line (BEOL) integration flow for the persistent-luminescence phosphor using a photopatternable polymer–phosphor composite with a thermal budget of no more than 150 °C, akin to CMOS image-sensor color-filter-array and microlens processing, with a practical scaling path from 100 μm prototype pitch to 10–15 μm (see Supplementary Notes 9.1 and 9.3). In summary, AEMS delivers high-fidelity event plus local-memory streams at DVS-class speeds, while improving efficiency, hardware compactness, and data richness. These attributes enable more accurate downstream perception and simplify end-to-end system design.

Methods

Fabrication of fluorescent PDMS films

To enable analog optical sensing, two types of phosphors, Sr₂SiO₄:Eu²⁺ (silicate) and Lu₃Al₅O₁₂:Ce³⁺ (garnet), were embedded in a PDMS matrix (Sylgard™ 184, Dow Corning) to form flexible fluorescent films. The PDMS base and curing agent were mixed at a 10:1 weight ratio, and phosphor powders were added at a 1:8 phosphor-to-PDMS ratio and homogenized. The mixture was degassed under vacuum for 30 minutes, cast into a 60 mm petri dish, and pre-cured at 70 °C for 1 hour. TRPL decays were measured using a Horiba DeltaDiode-375L pulsed laser diode (λ = 375 nm) and a time-correlated single-photon counting (TCSPC) module integrated into a Fluoromax-4 system.

AEMS assembly and photoelectrical response measurement

Each AEMS unit consisted of two parallel optical channels, each comprising a phosphor-PDMS film, a long-pass filter (>495 nm, Thorlabs), and a silicon photodiode with integrated TIA (PDA100A2, Thorlabs). To ensure identical responsivity to input light, the two sensor arms (silicate and garnet) were carefully aligned in both position and angle. Analog illumination was provided by a 255-nm UV LED, whose drive signal was generated by a National Instruments USB-6423 DAQ and amplified using an HP 6827A amplifier. The differential voltage V_out = V_A − V_B was sampled at 40 kS/s using the same DAQ device.

EF/MF image construction and motion analysis

To evaluate dynamic visual sensing, video clips from the Weizmann Human Action dataset (21 frames, 70 × 70 pixels) and the Crossroad dataset (21 frames, 36 × 36 pixels) were replayed frame-by-frame into the AEMS via UV LEDs. Each 8-bit pixel intensity (0–255) was mapped to a corresponding LED brightness level. The AEMS output signals were processed to extract differential spike responses (event data) and decaying tail responses (memory data), which were used to construct EFs and MFs.

These EF/MF representations then served as training data for CNN-based models. For human action classification (Weizmann), we used hardware-calibrated, simulation-generated EF/MF datasets parameterized by measured device characteristics. For vehicle trajectory and speed prediction (Crossroad), we used EF and MF images reconstructed from AEMS measurements. The CNN architecture consisted of a single convolutional layer with 64 filters (3 × 3), ReLU activation, 2 × 2 max-pooling, and fully connected layers. The classifier output layer predicted either one of ten action classes (Weizmann) or three trajectory types (Crossroad). Training used the Adam optimizer (learning rate = 0.001, batch size = 32), with early stopping based on validation accuracy.

ONN–AEMS hybrid pipeline implementation

To demonstrate optical–electronic hybrid computing, we constructed a pipeline integrating an ONN with the AEMS array. We first trained a simulation-based neural network with the same architecture as the ONN-AEMS pipeline—comprising a fully connected compressor (FCNN), temporal differencing, reservoir integration, and a dense classifier. The trained FCNN weights (4900 × 16) were then implemented optically.

Each RF (70 × 70 pixels) was split into 16 replicated images and displayed on an OLED panel. These were optically projected onto 16 distinct greyscale weight masks encoded on an LCD panel, enabling one-shot 4900 × 16 optical MVM. The modulated light was focused onto a scientific camera (Thorlabs CC215MU), and each of the 16 projected images was integrated over its 4900 pixels to yield a 16-D compressed output vector.
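Numerically, this masking-and-summing operation is a matrix-vector product: each of the 16 outputs is the sum over 4900 pixels of intensity times mask transmission. The sketch below is a minimal software analogue with random placeholder data; the function name is hypothetical, and because greyscale masks are non-negative, the handling of signed weights (which is not described here) is not modelled.

```python
import random

N_PIXELS = 70 * 70   # 4900 inputs per frame (from the text)
N_OUT = 16           # 16 weight masks and compressed outputs (from the text)


def optical_mvm(frame, masks):
    """Mask the replicated frame element-wise, then sum each masked copy."""
    return [sum(p * w for p, w in zip(frame, mask)) for mask in masks]


rng = random.Random(0)
# Placeholder normalised pixel intensities in [0, 1)
frame = [rng.random() for _ in range(N_PIXELS)]
# Placeholder greyscale mask transmissions in [0, 1), one mask per output
masks = [[rng.random() for _ in range(N_PIXELS)] for _ in range(N_OUT)]

compressed = optical_mvm(frame, masks)   # 16-D compressed output vector
```

Each entry of `compressed` corresponds to one region of the camera image integrated over its 4900 pixels.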

This feature vector was replayed via LED into the AEMS array to extract dynamic temporal signals. Each sample consisted of 13 cEFs and 13 cMFs, flattened into a 416-D input vector (13 × 16 × 2). A lightweight classifier with one hidden layer (64 ReLU units) and a final softmax layer was trained to classify human actions from these temporally compressed analog signals.
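The input assembly and classifier forward pass can be sketched in plain Python as below. The dimensions (13 × 16 × 2 = 416 inputs, 64 ReLU units, softmax output) follow the text, and the ten output classes are taken from the Weizmann description above. The concatenation order, the random placeholder weights and inputs, and all function names are assumptions; no trained parameters are reproduced.

```python
import math
import random

N_STEPS, N_FEATURES, N_HIDDEN, N_CLASSES = 13, 16, 64, 10


def flatten_sample(cefs, cmfs):
    """Concatenate 13 cEFs and 13 cMFs (16 values each) into one 416-D vector."""
    return [v for frame in cefs for v in frame] + [v for frame in cmfs for v in frame]


def dense(x, weights, biases):
    """Fully connected layer with one weight row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def forward(x, w1, b1, w2, b2):
    """Hidden layer with ReLU followed by a softmax output layer."""
    hidden = [max(0.0, h) for h in dense(x, w1, b1)]
    logits = dense(hidden, w2, b2)
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)


def rand_matrix(rows, cols):
    """Small random values standing in for data or untrained weights."""
    return [[rng.gauss(0.0, 0.05) for _ in range(cols)] for _ in range(rows)]


cefs = rand_matrix(N_STEPS, N_FEATURES)   # placeholder compressed event frames
cmfs = rand_matrix(N_STEPS, N_FEATURES)   # placeholder compressed memory frames
x = flatten_sample(cefs, cmfs)
probs = forward(
    x,
    rand_matrix(N_HIDDEN, len(x)), [0.0] * N_HIDDEN,
    rand_matrix(N_CLASSES, N_HIDDEN), [0.0] * N_CLASSES,
)
```

With these sizes the classifier has 416 × 64 + 64 = 26,688 hidden-layer parameters and 64 × 10 + 10 = 650 output-layer parameters.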

Data availability

The data supporting the findings of this study are available within this article and its Supplementary Information. Source data are provided with this paper.

Code availability

The code used for data analysis and figure generation in this study is available from the corresponding author upon request.

References

Lee, J. et al. An asynchronous wireless network for capturing event-driven data from large populations of autonomous sensors. Nat. Electron. 7, 313–324 (2024).
D’Angelo, G. et al. Event-driven figure-ground organisation model for the humanoid robot iCub. Nat. Commun. 16, 1874 (2025).
Chai, Y. In-sensor computing for machine vision. Nature 579, 32–33 (2020).
Yang, Y. et al. In-sensor dynamic computing for intelligent machine vision. Nat. Electron. 7, 225–233 (2024).
Gehrig, D. & Scaramuzza, D. Low-latency automotive vision with event cameras. Nature 629, 1034–1040 (2024).
Wu, H., Li, Y., Xu, W., Kong, F. & Zhang, F. Moving event detection from lidar point streams. Nat. Commun. 15, 345 (2024).
Zhou, Y. et al. Computational event-driven vision sensors for in-sensor spiking neural networks. Nat. Electron. 6, 870–878 (2023).
Lin, S. et al. Embodied neuromorphic synergy for lighting-robust machine vision to see in extreme bright. Nat. Commun. 15, 10781 (2024).
Kaiser, M. A.-A. et al. Neuromorphic-P2M: processing-in-pixel-in-memory paradigm for neuromorphic image sensors. Front. Neuroinform. 17, 1144301 (2023).
Tan, H. & van Dijken, S. Dynamic machine vision with retinomorphic photomemristor-reservoir computing. Nat. Commun. 14, 2169 (2023).
Colonnier, F., Della Vedova, L. & Orchard, G. ESPEE: Event-based sensor pose estimation using an extended Kalman filter. Sensors 21, 7840 (2021).
Gallego, G. et al. Event-based vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 44, 154–180 (2020).
Everding, L. & Conradt, J. Low-latency line tracking using event-based dynamic vision sensors. Front. Neurorobot. 12, 4 (2018).
Rebecq, H., Horstschäfer, T., Gallego, G. & Scaramuzza, D. Evo: A geometric approach to event-based 6-dof parallel tracking and mapping in real time. IEEE Robot. Autom. Lett. 2, 593–600 (2016).
Posch, C., Serrano-Gotarredona, T., Linares-Barranco, B. & Delbruck, T. Retinomorphic event-based vision sensors: bioinspired cameras with spiking output. Proc. IEEE 102, 1470–1484 (2014).
Vidal, A. R., Rebecq, H., Horstschaefer, T. & Scaramuzza, D. Ultimate SLAM? Combining events, images, and IMU for robust visual SLAM in HDR and high-speed scenarios. IEEE Robot. Autom. Lett. 3, 994–1001 (2018).
Tenzin, S., Rassau, A. & Chai, D. Application of event cameras and neuromorphic computing to VSLAM: A survey. Biomimetics 9, 444 (2024).
Chakravarthi, B., Verma, A. A., Daniilidis, K., Fermuller, C. & Yang, Y. In European Conference on Computer Vision. 342–376 (Springer).
Hsu, T.-H. et al. A 0.5-V real-time computational CMOS image sensor with programmable kernel for feature extraction. IEEE J. Solid-State Circuits 56, 1588–1596 (2020).
Haessig, G. et al. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3963–3972.
Liao, F., Zhou, F. & Chai, Y. Neuromorphic vision sensors: Principle, progress and perspectives. J. Semicond. 42, 013105 (2021).
Fischer, T. & Milford, M. Event-based visual place recognition with ensembles of temporal windows. IEEE Robot. Autom. Lett. 5, 6924–6931 (2020).
Wang, H., Sun, B., Ge, S. S., Su, J. & Jin, M. L. On non-von Neumann flexible neuromorphic vision sensors. npj Flex. Electron. 8, 28 (2024).
Feng, G., Zhang, X., Tian, B. & Duan, C. Retinomorphic hardware for in-sensor computing. InfoMat 5, e12473 (2023).
Lao, J. et al. Ultralow-power machine vision with self-powered sensor reservoir. Adv. Sci. 9, 2106092 (2022).
Jo, H. et al. Physical Reservoir Computing Using Tellurium-Based Gate-Tunable Artificial Photonic Synapses. ACS Nano 18, 30761–30773 (2024).
Liao, F. et al. Bioinspired in-sensor visual adaptation for accurate perception. Nat. Electron. 5, 84–91 (2022).
Zhang, Z. et al. All-in-one two-dimensional retinomorphic hardware device for motion detection and recognition. Nat. Nanotechnol. 17, 27–32 (2022).
Hong, S. et al. Neuromorphic active pixel image sensor array for visual memory. ACS Nano 15, 15362–15370 (2021).
Vats, G., Hodges, B., Ferguson, A. J., Wheeler, L. M. & Blackburn, J. L. Optical memory, switching, and neuromorphic functionality in metal halide perovskite materials and devices. Adv. Mater. 35, 2205459 (2023).
Gao, C. et al. Toward grouped-reservoir computing: organic neuromorphic vertical transistor with distributed reservoir states for efficient recognition and prediction. Nat. Commun. 15, 740 (2024).
Marunchenko, A. et al. Memlumor: A Luminescent Memory Device for Energy-Efficient Photonic Neuromorphic Computing. ACS Energy Lett. 9, 2075–2082 (2024).
Wi, S. et al. Multi-Color Synaptic Luminescence in RE-Doped Ca₂SnO₄ (RE = Sm³⁺, Er³⁺, and La³⁺). Adv. Funct. Mater. 2414860 (2025).
Talanti, S. et al. CMOS-integrated organic neuromorphic imagers for high-resolution dual-modal imaging. Nat. Commun. 16, 1–9 (2025).
Lin, Q. et al. Event-driven retinomorphic photodiode with bio-plausible temporal dynamics. Nat. Nanotechnol. 1–8 (2025).
Jaeger, H., Lukoševičius, M., Popovici, D. & Siewert, U. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Netw. 20, 335–352 (2007).
Lukoševičius, M. & Jaeger, H. Reservoir computing approaches to recurrent neural network training. Comput. Sci. Rev. 3, 127–149 (2009).
Bernstein, L. et al. Single-shot optical neural network. Sci. Adv. 9, eadg7904 (2023).
Kim, M. et al. Overcoming Hardware Imperfections in Optical Neural Networks Through a Machine Learning-Driven Self-Correction Mechanism. IEEE Photonics J. 16, 1–8 (2024).
Kim, B. et al. Optical convolution operations with optical neural networks for incoherent color image recognition. Opt. Lasers Eng. 185, 108740 (2025).
Wang, T. et al. Image sensing with multilayer nonlinear optical neural networks. Nat. Photonics 17, 408–415 (2023).
Kim, M., Kim, Y. & Park, W. I. Image processing with Optical matrix vector multipliers implemented for encoding and decoding tasks. Light Sci. Appl. 14, 248 (2025).
Dutczak, D. et al. Yellow persistent luminescence of Sr₂SiO₄:Eu²⁺, Dy³⁺. J. Lumin. 132, 2398–2403 (2012).
Feng, A., Joos, J. J., Du, J. & Smet, P. F. Revealing trap depth distributions in persistent phosphors with a thermal barrier for charging. Phys. Rev. B 105, 205101 (2022).
Gorelick, L., Blank, M., Shechtman, E., Irani, M. & Basri, R. Actions as space-time shapes. IEEE Trans. Pattern Anal. Mach. Intell. 29, 2247–2253 (2007).
Li, W., Mahadevan, V. & Vasconcelos, N. Anomaly detection and localization in crowded scenes. IEEE Trans. Pattern Anal. Mach. Intell. 36, 18–32 (2013).
Kumar, K. & Kostina, E. Machine learning in parameter estimation of nonlinear systems. Eur. Phys. J. B 98, 60 (2025).
Wang, Q., Ma, Y., Zhao, K. & Tian, Y. A comprehensive survey of loss functions in machine learning. Ann. Data Sci. 9, 187–212 (2022).
Müller, M. et al. Mixed photonic/electronic neural network based on microLED arrays. Neuromorph. Comput. Eng. 5, 024005 (2025).
Song, A., Murty Kottapalli, S. N., Goyal, R., Schölkopf, B. & Fischer, P. Low-power scalable multilayer optoelectronic neural networks enabled with incoherent light. Nat. Commun. 15, 10692 (2024).
iniVation. DAVIS 346, https://inivation.com/wp-content/uploads/2019/08/DAVIS346.pdf (2019).
Lichtsteiner, P., Posch, C. & Delbruck, T. A 128 × 128 120 dB 15 μs latency asynchronous temporal contrast vision sensor. IEEE J. Solid-State Circuits 43, 566–576 (2008).
Xiao, T. P., Bennett, C. H., Feinberg, B., Agarwal, S. & Marinella, M. J. Analog architectures for neural network acceleration based on non-volatile memory. Appl. Phys. Rev. 7 (2020).
Chen, Y.-H., Krishna, T., Emer, J. S. & Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-state Circuits 52, 127–138 (2016).

Acknowledgements

This work was supported by the National Research Foundation (NRF) of Korea, funded by the Ministry of Science, ICT, and Future Planning (MSIP) of Korea (Nos. RS-2021-NR060087 and RS-2024-00353762). S.N. gratefully acknowledges support from the Office of Naval Research (N000142412533) and the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (RS-2024-00408180).

Author information

Authors and Affiliations

Division of Materials Science and Engineering, Hanyang University, Seoul, Republic of Korea
Yelim Kim, Hyeonsu Park, Minjoo Kim, Suhee Jang, Dae Yeop Jeong, Lia Saptini Handriani, Hyuncheol Yun, Namyoung Gwak, Nuri Oh & Won Il Park
Department of Applied Chemistry, Kyung Hee University, Yongin, Republic of Korea
Sung Ik Yang
Department of Mechanical and Aerospace Engineering, University of California, Irvine, Irvine, CA, USA
Soyeong Kwon & SungWoo Nam
Department of Optical Engineering, Kongju National University, Cheonan, Republic of Korea
Soyeong Kwon

Contributions

Y.L.K. and W.I.P. conceived the concept and designed the experiments. Y.L.K. fabricated the phosphor-PDMS films and assembled the AEMS setup. Y.L.K., H.S.P., and M.J.K. developed the ONN–AEMS hybrid system and data acquisition. Y.L.K., H.S.P., and S.W.N. performed the optoelectronic signal measurements and analysis. Y.L.K., H.S.P., and W.I.P. implemented the classification and regression frameworks and performance evaluations. S.H.J., D.Y.J., L.S.H., H.C.Y., and S.Y.K. analyzed the experimental data. N.Y.K., N.R.O., and S.I.Y. carried out PL and TRPL measurements and analyzed the phosphor materials. Y.L.K. and W.I.P. co-wrote the manuscript. All authors discussed the results and provided critical feedback.

Corresponding author

Correspondence to Won Il Park.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Weida Hu and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (PDF)

Description of Additional Supplementary Files (PDF)

Supplementary Video 1 (MP4)

Transparent Peer Review file (PDF)

Source data

Source data (XLSX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Kim, Y., Park, H., Kim, M. et al. In-sensor analog optoelectronic processing of concurrent event and memory signals for dynamic vision sensing. Nat Commun 17, 1250 (2026). https://doi.org/10.1038/s41467-025-68013-8

Received: 11 July 2025
Accepted: 15 December 2025
Published: 26 December 2025
Version of record: 02 February 2026
DOI: https://doi.org/10.1038/s41467-025-68013-8