An event-oriented diffusion-refinement method for sparse events completion

Event cameras or dynamic vision sensors (DVS) record asynchronous response to brightness changes instead of conventional intensity frames, and feature ultra-high sensitivity at low bandwidth. The new mechanism demonstrates great advantages in challenging scenarios with fast motion and large dynamic range. However, the recorded events might be highly sparse due to either limited hardware bandwidth or extreme photon starvation in harsh environments. To unlock the full potential of event cameras, we propose an inventive event sequence completion approach conforming to the unique characteristics of event data in both the processing stage and the output form. Specifically, we treat event streams as 3D event clouds in the spatiotemporal domain, develop a diffusion-based generative model to generate dense clouds in a coarse-to-fine manner, and recover exact timestamps to maintain the temporal resolution of raw data successfully. To validate the effectiveness of our method comprehensively, we perform extensive experiments on three widely used public datasets with different spatial resolutions, and additionally collect a novel event dataset covering diverse scenarios with highly dynamic motions and under harsh illumination. Besides generating high-quality dense events, our method can benefit downstream applications such as object classification and intensity frame reconstruction.


Introduction
As a novel bio-inspired sensor, the event camera works differently from conventional intensity cameras by sensing asynchronous pixel-wise brightness changes. This working principle gives the sensor unique characteristics such as high sensitivity, low latency and high temporal resolution, which provide reliable visual information and enable wide applications in extreme environments, e.g. fast avoidance [1], low-light/high dynamic range perception [2] and high-speed imaging [3,4]. However, these features are traded for spatial and temporal sparsity in the data stream. Firstly, only brightness changes exceeding a threshold are recorded, so the outputs mostly locate salient moving edges and patterns. This issue can be alleviated by raising the sensitivity level, but doing so brings more noise and imposes great challenges on subsequent processing and analysis. Secondly, even if the camera itself works in full-capacity mode, the readout speed may be limited by the bandwidth of the host hardware (e.g. drone, PC), which causes missing entries in the data stream and is especially severe in high-resolution and busy scenes. These issues degrade or even break many off-the-shelf event analysis algorithms that work well on event streams in ordinary scenarios. To fully exploit the advantages of event cameras in challenging cases (e.g. low-light, high-speed), recovering the missing signals from the sparse recorded event streams is of crucial importance, but remains an under-explored area.
Analogous to other event quality enhancement tasks, such as super-resolution [5][6][7][8] and joint denoising and super-resolution [5], one can convert raw events into a 2D grid-based representation for algorithm development. This intuitive solution facilitates adapting algorithms designed for conventional image/video frames, but faces limitations in multiple aspects: assigning no or random timestamps to the output events loses the temporal ordering information; the output event frames are non-binary, which deviates from the format of event data; and the mismatch between the grid-based representation and the intrinsic structure of events might further lead to artifacts in the recovered event streams and even harm downstream analysis. In comparison, Li et al. [7] proposed an inspiring strategy to super-resolve events while maintaining the temporal information. However, the temporal precision is limited to milliseconds when a spiking neural network (SNN) is used for simulation, which is far from sufficient for the microsecond responses of event cameras. An efficient algorithm that makes extensive use of the unique structure of event sequences and conforms to their format is in high demand.
An event sequence over a given time duration can be formulated as binary 3D data (left of Fig. 1), with a four-element tuple (x, y, t, p) denoting the location, time instant and polarity of each event. In other words, the event occurrences compose a cloud, similar to the point cloud in 3D vision. Built on this representation, we propose an event data completion method based on the powerful generative denoising diffusion probabilistic model (DDPM), and develop an event-oriented deep network as its cornerstone. Our method works in a coarse-to-fine manner: it first predicts a coarse distribution conditioned on the sparse event sequence, and then refines the generated events, again with the conditional input, using a second sub-network. The final output of the network is a completed set of 3D events that can be transformed back to the sequential format without losing temporal ordering information (middle column, Fig. 1). To validate the effectiveness of our method in diverse scenarios, we collect a new dataset consisting of diverse challenging scenes and conduct experiments on it together with three public datasets with different spatial resolutions. Furthermore, we show that our method can benefit downstream applications, including object classification and frame reconstruction.
In sum, this paper contributes in the following aspects:
• We propose to represent an event stream as a cloud and recover the dense event signals underlying the recorded sparse event streams via a diffusion-based generative model.
• We develop an event-oriented network as the cornerstone of the diffusion model, which outputs complete dense events with better visual quality while maintaining temporal ordering information.
• We validate the advantageous performance of our approach and its wide applicability to diverse scenes on three public datasets with different resolutions, and show our superior performance on real challenging cases with a self-captured dataset.
• We conduct two downstream tasks using the completed events, i.e. object classification and intensity frame reconstruction, and obtain satisfying results, demonstrating the wide applications of our method.

Problem Statement
Event Formation.
As an asynchronous sensor, an event camera senses pixel-wise illuminance changes, with each pixel independently responding to changes in the logarithmic brightness $L(\mathbf{q}_k, t_k) \doteq \log I(\mathbf{q}_k, t_k)$. Specifically, at pixel $\mathbf{q}_k = (x_k, y_k)^\top$ and time $t_k$, an event occurs when the brightness change since the last event at this location reaches a threshold $\pm T$ ($T > 0$):

$$\Delta L(\mathbf{q}_k, t_k) = L(\mathbf{q}_k, t_k) - L(\mathbf{q}_k, t_k - \Delta t_k) = p_k T, \tag{1}$$

where $\Delta t_k$ is the time since the last event at $\mathbf{q}_k$ and $p_k$ is a binary value (1 or -1) indicating the sign of the change in brightness. An event sequence can be represented as a set of four-element tuples $e = \{(x_k, y_k, t_k, p_k)\}_{k=1}^{N}$.

Event Completion Formulation.
Sparsity is an intrinsic characteristic of event data, but overly sparse events contain limited information for any application. The event completion task arises when event cameras capture insufficient events in challenging environments such as high-speed and dark scenarios, especially for sensors with a large pixel count, which also encounter extreme spatial sparsity. In Eq. 1, the threshold $T$ corresponds to the sensitivity of the sensor. Physically raising the sensitivity can raise the density of the sensed events but also induces more noise, which is a fundamental trade-off for event cameras.
For this task, suppose $e_L$ and $e_H$ denote the events captured for the same scene at the sensitivity settings $T_L$ and $T_H$, where $T_L < T_H$. Both derive from the latent clean dense events $e_{CD}$:

$$e_L = C(e_{CD}, T_L, \delta) \quad \text{or} \quad e_H = C(e_{CD}, T_H, \delta), \tag{2}$$

where $C$ is the capture process of the sensor and $\delta$ denotes the camera settings other than sensitivity.
As analyzed above, $e_L$ contains clean sparse events while $e_H$ contains noisy dense events. The objective of enhancing event quality is to recover the clean dense events $e_{CD}$ from either of the captured degraded inputs, i.e. $e_L$ or $e_H$. The event denoising task is defined as recovering $e_{CD}$ from $e_H$. Similarly, the event completion task can be defined as the recovery of $e_{CD}$ from the clean sparse events $e_L$:

$$\hat{e}_{CD} = F(e_L; \theta), \tag{3}$$

where $F$ denotes the reconstruction algorithm and $\theta$ represents its parameters. Most of the time, paired $e_{CD}$ and $e_L$ cannot be acquired simultaneously, so we use a random sampling strategy to simulate sparse events from real dense events. Given a complete event set $e_{CD}$, the task is to reconstruct $e_{CD}$ from the down-sampled event set $S(e_{CD})$. Therefore, Eq. 3 turns into

$$\hat{e}_{CD} = F(S(e_{CD}); \theta). \tag{4}$$
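The random sampling strategy $S(\cdot)$ used to simulate sparse inputs can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `random_downsample`, the uniform sampling without replacement, and the toy event array (columns x, y, t, p) are all hypothetical choices made for the example.

```python
import numpy as np

def random_downsample(events, num_keep, seed=0):
    """Keep `num_keep` events chosen uniformly at random (assumed strategy).

    `events` is an (N, 4) array with columns x, y, t, p, sorted by timestamp.
    Sorting the chosen indices preserves the original temporal ordering.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(events), size=num_keep, replace=False)
    idx.sort()
    return events[idx]

# Toy dense slice on a 34 x 34 sensor: 1024 events with microsecond timestamps.
rng = np.random.default_rng(1)
n = 1024
dense = np.column_stack([
    rng.integers(0, 34, n),                # x
    rng.integers(0, 34, n),                # y
    np.sort(rng.integers(0, 100_000, n)),  # t, ascending
    rng.choice([-1, 1], n),                # polarity
])
sparse = random_downsample(dense, 256)     # simulated sparse input
```

Here `dense` plays the role of $e_{CD}$ and `sparse` the role of $S(e_{CD})$; every retained event keeps its exact timestamp and polarity.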

Related work
Event Representation and Quality Enhancement
Event Representation.
Event signals have been proven to provide auxiliary help in video deblurring and frame interpolation [9][10][11], image reconstruction and super-resolution [8,12,13], and downstream applications such as object recognition [14,15] and detection [16,17]. With the rapid development of deep learning, many network architectures and representations that embed event streams for either image restoration or pattern recognition have been proposed, such as HFirst [18], event frame [19], event histograms [20], event-based time surfaces [21], event spike tensor [22] and event volume [23].
Event Quality Enhancement.
Raw event signals suffer from severe noise and spatio-temporal sparsity, which challenges visualization, analysis and downstream applications. If the camera operates in extreme cases, the quality decreases dramatically further. Considering that noise would hamper super-resolution, Duan et al. [5] proposed a deep-learning method to jointly denoise and super-resolve neuromorphic events using an encoder-decoder network, which takes temporally binned events as input and allocates random timestamps to the output high-resolution events. Such an irreversible practice loses temporal ordering information in the output and may harm downstream applications. As the first attempt to super-resolve events while keeping timestamps, Li et al. [6] proposed a two-stage scheme that first acquires a spatial event-count map and a temporal rate function, and then obtains the events of each new pixel with a thinning-based event sampling algorithm. Further, Li et al. [7] proposed a spatio-temporal constraint learning method that optimizes the spatial and temporal event distribution based on an SNN model and a simple three-layer CNN. This method achieves pleasing visual quality but requires sufficient events in the sparse input to learn the spatio-temporal distribution, and is limited to millisecond resolution due to the numerical simulation of the SNN [31]. Therefore, an event-to-event recovery method that maintains the spatial distribution and sharp details, high temporal resolution and ordering information is highly desired.

Point Cloud Completion
With the maturity of 3D sensors, point clouds have become an important form of modeling 3D scenes. A high-quality point cloud is essential for downstream tasks, and significant progress has been made in generating a complete point cloud from a degraded input. In the past decades, many algorithms have been proposed using 3D CNNs [32,33], graph CNNs [34,35] and transformers [36]. These methods learn a complete point cloud representation under direct supervision of ground truth data.
As a new generative model, the denoising diffusion probabilistic model (DDPM) [40,41] decomposes the generation process into multiple steps by learning to steadily denoise the random input noise.
Due to its powerful generation capability, the diffusion model has been applied to point cloud completion [39,42,43] and has achieved state-of-the-art performance. As found in [43], … and event cloud, we introduce a conditional DDPM with an event-oriented encoder-decoder network to generate a dense event sequence with fine details in a coarse-to-fine manner.

Method
In this section, we introduce the event-oriented diffusion refinement (EDR) method, a conditional denoising diffusion probabilistic model for event completion, with the overview illustrated in Fig. 2 and the key modules described in the following subsections.

Event Cloud Representation
Raw event data takes the form of a sequence of four-element tuples $e_r = (x, y, t, p)$, which are converted into binary points in a 3D coordinate system before being fed into the network for training or inference. Firstly, we cut the event stream sequentially into slices of $N$ events, $e_r = \{e_r^i;\ i = 0, \cdots, N\}$, ranked by timestamp. Then, each event slice is normalized by the sensor's pixel count along the spatial dimensions and by the time duration along the temporal dimension.
After this step, a sample $e$ consists of $N$ events whose x, y and t values lie between −1 and 1. These processed events can be regarded as a set of event entries in a 3D coordinate system (similar to a point cloud), as shown in Fig. 1. In addition, the polarity of each point is attached as its feature. The network that completes this set of events is built on this representation, and its output conforms exactly to the same form, so it can be converted back to a set of four-element tuples ordered by timestamp.
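The conversion between raw event tuples and the normalized event cloud can be sketched as below. The exact normalization formula is not given in the text, so the linear mapping to [−1, 1] used here is an assumption, and the helper names `slice_events`, `to_event_cloud` and `from_event_cloud` are hypothetical.

```python
import numpy as np

def slice_events(events, n):
    """Cut a time-sorted (M, 4) event array into consecutive slices of n events.
    Trailing events that do not fill a slice are dropped (an assumption)."""
    return [events[i * n:(i + 1) * n] for i in range(len(events) // n)]

def to_event_cloud(events, width, height):
    """Map x, y, t linearly to [-1, 1] (assumed form); polarity becomes a feature."""
    ev = np.asarray(events, dtype=np.float64)
    x, y, t, p = ev[:, 0], ev[:, 1], ev[:, 2], ev[:, 3]
    t0 = t.min()
    duration = max(t.max() - t0, 1e-9)  # guard against a zero-length slice
    coords = np.stack([
        2 * x / (width - 1) - 1,
        2 * y / (height - 1) - 1,
        2 * (t - t0) / duration - 1,
    ], axis=1)
    return coords, p, t0, duration

def from_event_cloud(coords, p, width, height, t0, duration):
    """Invert the mapping and return tuples (x, y, t, p) ordered by timestamp."""
    x = np.rint((coords[:, 0] + 1) / 2 * (width - 1))
    y = np.rint((coords[:, 1] + 1) / 2 * (height - 1))
    t = (coords[:, 2] + 1) / 2 * duration + t0
    out = np.stack([x, y, t, p], axis=1)
    return out[np.argsort(out[:, 2], kind="stable")]

# Toy example on a 34 x 34 sensor with microsecond timestamps.
events = np.array([
    [0, 0, 1000, 1],
    [33, 10, 1500, -1],
    [16, 33, 2000, 1],
    [5, 5, 2500, -1],
    [20, 7, 3000, 1],
])
slices = slice_events(events, 2)
coords, polarity, t0, duration = to_event_cloud(events, 34, 34)
restored = from_event_cloud(coords, polarity, 34, 34, t0, duration)
```

Because the temporal coordinate is kept as a continuous value rather than a bin index, the round trip recovers the original timestamps up to floating-point precision.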
Despite the high similarity with point clouds, the event cloud differs in multiple aspects. First of all, the t-dimension has a different metric from the spatial x- and y-dimensions, so the event entries are unevenly distributed in the 3D space. Secondly, the normalized event points cannot form a 3D shape with smooth surfaces, and instead have discontinuous and even scattered details. Besides, each event carries polarity information, which has a specific physical meaning. Therefore, we need to develop networks that match this unique representation well.

Revisiting Conditional DDPM
The denoising diffusion probabilistic model consists of two processes: diffusion and reverse. In the diffusion process (the blue left-arrow in Fig. 2), Gaussian noise is added to the clean complete events step by step. In the reverse process (the blue right-arrow in Fig. 2), the noise is predicted by the proposed event diffusion network and the clean complete events are gradually recovered from the degraded version.

The diffusion process.
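Denoting the index of time steps as $t$, the Markov diffusion process from clean complete events $e_0$ to $e_T$ is defined as

$$q(e_1, \cdots, e_T \mid e_0) = \prod_{t=1}^{T} q(e_t \mid e_{t-1}),$$

where $q(e_t \mid e_{t-1}) = \mathcal{N}(e_t; \sqrt{1 - \beta_t}\, e_{t-1}, \beta_t I)$, with the Gaussian noise values $\beta_t$ being pre-defined small positive constants. Following [41], let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$; the diffusion process then has the closed form $q(e_t \mid e_0) = \mathcal{N}(e_t; \sqrt{\bar{\alpha}_t}\, e_0, (1 - \bar{\alpha}_t) I)$. When $T$ is large enough, $\bar{\alpha}_T$ approaches zero, and $q(e_T \mid e_0)$ gets close to the latent distribution, which is a Gaussian prior. Then, $e_t$ can be sampled with the simplified equation

$$e_t = \sqrt{\bar{\alpha}_t}\, e_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,$$

where $\epsilon$ is standard Gaussian noise.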
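The closed-form sampling of a noisy event cloud can be sketched as follows. The linear schedule from 1e-4 to 0.02 over 1000 steps is the common choice of Ho et al. and is only an assumption here, since the text does not state the schedule; `make_schedule` and `q_sample` are hypothetical names.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed) and the cumulative products alpha_bar."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def q_sample(e0, step, alpha_bars, rng):
    """Draw e_t ~ q(e_t | e_0) in one shot; `step` is 1-indexed as in the text."""
    eps = rng.standard_normal(e0.shape)
    a_bar = alpha_bars[step - 1]
    e_t = np.sqrt(a_bar) * e0 + np.sqrt(1.0 - a_bar) * eps
    return e_t, eps

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
e0 = rng.uniform(-1.0, 1.0, size=(128, 3))     # toy normalized event cloud (x, y, t)
e_t, eps = q_sample(e0, 1000, alpha_bars, rng)  # almost pure Gaussian noise at t = T
```

With this schedule the last cumulative product is far below 1e-3, which illustrates why $e_T$ is close to a sample from the Gaussian prior.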
The reverse process.
The reverse process is also a Markov process, in which the added noise is predicted and then removed. Conditioned on the input sparse events $c$, the reverse process from noisy $e_T$ to clean events $e_0$ is defined as

$$p_\theta(e_0, \cdots, e_{T-1} \mid e_T, c) = \prod_{t=1}^{T} p_\theta(e_{t-1} \mid e_t, c),$$

where $p_\theta(e_{t-1} \mid e_t, c) = \mathcal{N}(e_{t-1}; \mu_\theta(e_t, c, t), \sigma_t^2 I)$, with $\mu_\theta(e_t, c, t)$ and $\sigma_t^2$ denoting the shape predicted by our generative model and the variance, respectively. To generate a sample conditioned on sparse events $c$, we start by sampling $e_T$ from a Gaussian distribution, then progressively sample $e_{t-1}$ from $p_\theta(e_{t-1} \mid e_t, c)$ for $t = T, \cdots, 1$, and finally obtain $e_0$.
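The sampling loop of the reverse process can be sketched as below. The mean predictor `toy_mean_fn` is a hypothetical stand-in for $\mu_\theta$ (it simply pulls points toward the centroid of the condition) and is not the paper's network; skipping the noise at the final step is a common convention assumed here.

```python
import numpy as np

def reverse_sample(mean_fn, cond, num_points, sigmas, rng):
    """Ancestral sampling: start from Gaussian noise and apply T reverse steps.

    mean_fn(e_t, cond, t) returns the predicted mean of p(e_{t-1} | e_t, c);
    sigmas[t - 1] is the standard deviation used at step t.
    """
    num_steps = len(sigmas)
    e_t = rng.standard_normal((num_points, 3))  # e_T drawn from the Gaussian prior
    for t in range(num_steps, 0, -1):
        mu = mean_fn(e_t, cond, t)
        # No noise is injected at the last step (assumed convention).
        noise = rng.standard_normal(e_t.shape) if t > 1 else np.zeros_like(e_t)
        e_t = mu + sigmas[t - 1] * noise
    return e_t

def toy_mean_fn(e_t, cond, t):
    """Placeholder for the learned mean: move halfway toward the condition centroid."""
    return 0.5 * e_t + 0.5 * cond.mean(axis=0)

rng = np.random.default_rng(0)
cond = rng.uniform(-1.0, 1.0, size=(32, 3))   # toy sparse conditional events
sigmas = np.full(20, 0.01)                    # toy per-step standard deviations
sample = reverse_sample(toy_mean_fn, cond, 16, sigmas, rng)
```

In a real system `mean_fn` would wrap the trained noise-prediction network, and the number of output points can exceed the number of conditional points, which is what allows a dense cloud to be generated from a sparse input.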
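The training process.
To simplify the training objective, we follow the parameterization of Ho et al. [41],

$$\mu_\theta(e_t, c, t) = \frac{1}{\sqrt{\alpha_t}} \left( e_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(e_t, c, t) \right),$$

in which $\epsilon_\theta$ is a neural network estimating the noise from the noisy event cloud $e_t$, the diffusion step $t$ and the conditioner $c$. The objective reduces to

$$L(\theta) = \mathbb{E}_{t \sim \mathcal{U}([T]),\, \epsilon \sim \mathcal{N}(0, I)} \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, e_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ c,\ t \right) \right\|^2,$$

where $\mathcal{U}([T])$ is the uniform distribution over $1, \cdots, T$ and $\epsilon$ is the added standard Gaussian noise.
The neural network $\epsilon_\theta$ is thus parameterized to predict the noise added to the clean event set $e_0$, which can be used to denoise the noisy event set:

$$\hat{e}_0 = \frac{e_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(e_t, c, t)}{\sqrt{\bar{\alpha}_t}}.$$

During training, we use the $\ell_2$ loss to penalize the difference between the model's output $\epsilon_\theta(e_t, c, t)$ and the true noise $\epsilon$.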
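The relation between the predicted noise, the training loss and the denoised estimate can be checked numerically. In this sketch the trained network is replaced by an oracle that returns the true noise, and the linear schedule is again an assumption; `noise_loss`, `estimate_e0` and `reverse_mean` are hypothetical helper names.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def noise_loss(eps_pred, eps):
    """l2 (mean squared) difference between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

def estimate_e0(e_t, eps_pred, step):
    """Invert e_t = sqrt(a_bar) e_0 + sqrt(1 - a_bar) eps for e_0."""
    a_bar = alpha_bars[step - 1]
    return (e_t - np.sqrt(1.0 - a_bar) * eps_pred) / np.sqrt(a_bar)

def reverse_mean(e_t, eps_pred, step):
    """Mean of p(e_{t-1} | e_t, c) under the parameterization of Ho et al."""
    coef = betas[step - 1] / np.sqrt(1.0 - alpha_bars[step - 1])
    return (e_t - coef * eps_pred) / np.sqrt(alphas[step - 1])

rng = np.random.default_rng(0)
e0 = rng.uniform(-1.0, 1.0, size=(8, 3))   # toy clean event cloud
step = 500
eps = rng.standard_normal(e0.shape)
e_t = np.sqrt(alpha_bars[step - 1]) * e0 + np.sqrt(1.0 - alpha_bars[step - 1]) * eps

eps_pred = eps                              # oracle stand-in for the trained network
e0_hat = estimate_e0(e_t, eps_pred, step)   # recovers e0 when the noise is exact
mu = reverse_mean(e_t, eps_pred, step)
```

A perfect noise prediction gives zero loss and an exact reconstruction of the clean cloud, which is why minimizing the noise-prediction error is a valid training target.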

Event Diffusion-Refinement Network
The network design.
Considering the high similarity between event clouds and point clouds for shape representation, we make event-oriented adaptations to the point-based encoder-decoder network PointNet++ [44] and use it as the backbone of two sub-networks, i.e. the event diffusion network (EDN) and the event refinement network (ERN) in Fig. 2, which complete event clouds at coarse and fine scales respectively. The detailed architecture is shown in Fig. 3. The backbone is composed of three main modules: set abstraction (SA), feature propagation (FP) and feature transfer (FT). Specifically, the SA module subsamples the input event points and propagates the input features. It consists of a grouping layer to query neighbors for each point, a set of shared multilayer perceptrons (MLPs) to extract features, and a reduction layer to aggregate features within the neighbors. The FP module consists of a PA-Deconv module to upsample the intermediate event cloud representation, a set of shared MLPs to process features, and an attention mechanism to aggregate features. The FT module transmits information from the conditional cloud to help denoise the noisy input; it also consists of a grouping layer, a shared MLP and an attention mechanism to extract and aggregate features from the condition. Besides, we embed the diffusion step in the SA and FP modules.
As introduced in the Event Cloud Representation and Revisiting Conditional DDPM sections, we pre-process a set of events (x, y, t, p) by normalizing the first three elements to the range of -1 to 1 and treating the polarity (-1 or 1) as a feature of each event point. The DDPM first generates 3D Gaussian noise with a random polarity feature; during model training, the noise is gradually removed and the polarity is predicted as a feature for each generated point. The main structure is the same for the EDN and the ERN, but the diffusion step is not used in the ERN.
To match the metric difference between the spatial and temporal dimensions of the event cloud, we first propose to use a cube query instead of the ball query or KNN query, encouraging the network to aggregate the events in a cube rather than a ball. In this way, the aggregated events resemble the overall distribution of all events and the network is expected to learn a better representation. Further, we lengthen the cube query along the t-dimension into a cuboid query, as shown in Fig. 4, to let the network pay more attention to temporally neighboring events than to those along the x- and y-dimensions, because temporally adjacent events are more informative for event completion.
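The difference between the ball query and the cuboid query can be sketched as below. This is an illustrative neighbor search on a single center, not the authors' grouping layer: treating `r` as the spatial edge length of the cuboid and `t_scale` as the relative lengthening along t are assumptions, and both function names are hypothetical.

```python
import numpy as np

def ball_query(points, center, radius):
    """Indices of points within Euclidean distance `radius` of `center`."""
    dist = np.linalg.norm(points - center, axis=1)
    return np.flatnonzero(dist <= radius)

def cuboid_query(points, center, r, t_scale=1.5):
    """Indices of points inside an axis-aligned cuboid centered at `center`.

    The cuboid has edge length r along x and y and t_scale * r along t,
    so it gathers more neighbors in the temporal direction.
    """
    diff = np.abs(points - center)
    mask = (
        (diff[:, 0] <= r / 2)
        & (diff[:, 1] <= r / 2)
        & (diff[:, 2] <= t_scale * r / 2)
    )
    return np.flatnonzero(mask)

# Toy normalized event points (x, y, t).
points = np.array([
    [0.00, 0.0, 0.00],   # the center itself
    [0.04, 0.0, 0.00],   # spatial neighbor
    [0.00, 0.0, 0.07],   # temporal neighbor, outside the ball but inside the cuboid
    [0.00, 0.0, 0.20],   # too far in time
    [0.06, 0.0, 0.00],   # too far in space
])
center = np.zeros(3)
in_ball = ball_query(points, center, radius=0.05)
in_cuboid = cuboid_query(points, center, r=0.1, t_scale=1.5)
```

In this toy case the cuboid query picks up the temporally adjacent event at t = 0.07 that the ball query of comparable spatial extent misses.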
The network learning.
We use the proposed EDN to generate coarse complete events and the ERN for refinement. The latter predicts a relative displacement and adds it to the coarse events to obtain the refined version. We use the Chamfer Distance (CD) loss between the refined event set $x$ and the ground truth $e$ to supervise the learning of the ERN $\epsilon_f$:

$$L_{CD}(x, e) = \frac{1}{|x|} \sum_{p \in x} \min_{q \in e} \|p - q\|^2 + \frac{1}{|e|} \sum_{q \in e} \min_{p \in x} \|p - q\|^2,$$

where $|x|$ denotes the number of events in $x$. As the generation process is slow, we adopt a fast sampling algorithm [45] to generate and save the coarse events in advance. This practice incurs a small performance drop compared with 1000-step generation but offers a 99.7% speedup.
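The refinement step and its Chamfer Distance supervision can be sketched in NumPy as follows. The symmetric squared-distance form used here is the common definition and is assumed rather than taken from the authors' code; the displacement is a random placeholder for the ERN output.

```python
import numpy as np

def chamfer_distance(x, e):
    """Symmetric Chamfer Distance between point sets x (n, d) and e (m, d).

    Uses squared Euclidean distances, averaged over each set (assumed form).
    The two sets may contain different numbers of points.
    """
    d2 = np.sum((x[:, None, :] - e[None, :, :]) ** 2, axis=-1)  # (n, m) pairwise
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

rng = np.random.default_rng(0)
ground_truth = rng.uniform(-1.0, 1.0, size=(64, 3))
coarse = ground_truth + 0.05 * rng.standard_normal(ground_truth.shape)

# Placeholder for the ERN output: a small per-point displacement.
displacement = 0.01 * rng.standard_normal(coarse.shape)
refined = coarse + displacement

loss = chamfer_distance(refined, ground_truth)

# Hand-checkable case: one point against two points.
small = chamfer_distance(np.array([[0.0, 0.0, 0.0]]),
                         np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]))
```

In the hand-checkable case the single point matches its nearest neighbor exactly (first term 0), while the two reference points are at squared distances 1 and 0 from it (second term 0.5).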

Experiments

Datasets
To quantitatively evaluate our method and the baseline methods, we perform extensive experiments on three public event datasets at different spatial resolutions, i.e. N-MNIST [46], Event Camera Dataset [47] and 1Mpx Detection Dataset [25]. We also collect a dataset to test the performance in diverse real challenging scenarios.

N-MNIST. N-MNIST is an event version of the MNIST dataset, which contains around 50000 training samples and 10000 test samples with 10 classes of digits, and its spatial resolution is 34 × 34. We use 1024 events as the ground truth and 256/128 events as the incomplete input.
Event Camera Dataset. Event Camera Dataset is composed of events captured in daily scenarios with 180 × 240 resolution. To avoid repetitive scenes, we select 11 snippets (50552 samples) for training and 7 (45388 samples) for test, following [7]. Since the scenes have complex structures and rich semantic information, we use a 50% sampling rate to down-sample the 8192-point ground-truth events to a 4096-point sparse input.
1Mpx Detection Dataset. 1Mpx Detection Dataset is captured in a driving environment with a 720 × 1280 spatial resolution sensor and contains complex scenes. Since the original dataset is very large, we use 80000 and 20000 samples for training and test respectively. The sparse input contains 4096 points and the dense output 16384 points.
To evaluate the methods in real challenging scenarios, we capture a new dataset using an iniVation DVXplorer with resolution 480 × 640, consisting of rich scenes including a moving camera, highly dynamic objects, dim illumination, etc. We include data with various challenges for training, and aim to recover 16384 events from the down-sampled 4096 events during training. The training set contains 21355 samples. We test on a continuous sequence of 4096 events to qualitatively validate the effectiveness of our approach in real scenarios.

Baselines and Metrics
Since, to the best of our knowledge, there is no published work on event completion, we compare our approach with a couple of closely related methods, including the event super-resolution algorithm STCL [7] and the point cloud completion algorithms PoinTr [36] and VRCNet [38]. STCL was originally proposed for event super-resolution, and we modify its last layer to obtain output events with the same resolution as the input. Besides, since STCL only has millisecond resolution, we set the simulation duration to 25 ms for 1Mpx Detection Dataset and 50 ms for the other datasets. PoinTr and VRCNet are easier to adapt for event completion. Considering that they cannot learn the polarity of event points, we assign the polarity of each entry in the completed event set according to its nearest neighbor in the input sparse events.
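The nearest-neighbor polarity assignment used for the point cloud baselines can be sketched as follows. The function name `assign_polarity` and the brute-force distance computation are illustrative assumptions; a spatial index would be preferable for large clouds.

```python
import numpy as np

def assign_polarity(completed_xyz, sparse_xyz, sparse_polarity):
    """Give each completed point the polarity of its nearest sparse input event.

    completed_xyz: (n, 3) completed points (x, y, t), normalized.
    sparse_xyz:    (m, 3) input sparse points.
    sparse_polarity: (m,) polarities in {-1, 1}.
    """
    d2 = np.sum((completed_xyz[:, None, :] - sparse_xyz[None, :, :]) ** 2, axis=-1)
    nearest = d2.argmin(axis=1)          # index of the closest sparse event
    return sparse_polarity[nearest]

completed = np.array([[0.0, 0.0, 0.0],
                      [1.0, 1.0, 1.0],
                      [0.1, 0.0, 0.0]])
sparse_points = np.array([[0.0, 0.0, 0.05],
                          [0.9, 1.0, 1.0]])
sparse_pol = np.array([1, -1])
completed_pol = assign_polarity(completed, sparse_points, sparse_pol)
```

The first and third completed points lie closest to the first sparse event and inherit polarity 1, while the second inherits polarity -1.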
Since the CD loss is sensitive to outliers and cannot reflect the overall distribution, we also use the Earth Mover's Distance (EMD) to evaluate the quality of the completed events. The EMD loss penalizes the distribution discrepancy between the predicted events $x$ and the ground-truth version $e$ by solving a transportation problem. Specifically, it estimates a bijection $\phi: x \leftrightarrow e$ between $x$ and $e$:

$$L_{EMD}(x, e) = \min_{\phi: x \leftrightarrow e} \sum_{p \in x} \|p - \phi(p)\|_2.$$

Comparatively, EMD is more appropriate for measuring the distance between two distributions.
CD and EMD were originally designed for measuring point cloud distances, and as far as we know there is currently no perfect metric for event sequences. Nevertheless, both are able to measure the distance between 3D event data (x, y, t), since the raw four-element tuples have already been converted to 3D event points with a 1/-1 polarity feature after normalization. By definition, the CD loss does not require the two event sets to contain the same number of events and can measure data of any dimensionality. Therefore, the use of CD for training is feasible in our method. As a supplement, we use EMD to penalize the overall distribution of the 3D event points.
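The EMD between two equally sized event sets can be illustrated with an exhaustive search over bijections. This brute-force sketch is only practical for a handful of points and is an assumption-laden illustration, not the evaluation code used in the experiments; the unnormalized sum is also an assumed convention. For realistic sizes an assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
import itertools
import numpy as np

def emd_bruteforce(x, e):
    """Earth Mover's Distance between equally sized point sets by exhaustive search.

    Tries every bijection between x and e and returns the smallest total
    Euclidean transport cost. Cost grows factorially, so keep the sets tiny.
    """
    assert len(x) == len(e), "a bijection needs sets of equal size"
    n = len(x)
    dist = np.linalg.norm(x[:, None, :] - e[None, :, :], axis=-1)  # (n, n)
    best = min(
        sum(dist[i, perm[i]] for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return float(best)

x = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
e_same = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])   # same set, reordered
e_shift = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])  # one point shifted in t

emd_same = emd_bruteforce(x, e_same)
emd_shift = emd_bruteforce(x, e_shift)
```

Unlike CD, the bijection forces every predicted point to be matched to a distinct ground-truth point, so clustered predictions that cover only part of the ground truth are penalized.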

Implementation Details
We learn our model in a coarse-to-fine manner. Firstly, we train the coarse network for 120 epochs on the three public datasets and 300 epochs on the self-captured one, with a learning rate of $2 \times 10^{-4}$ using the Adam optimizer. Since generating a sample with 1000 steps is too time-consuming, we adopt a fast sampling method, DPMSolver [45], for acceleration and generate a sample after only 27 steps with slight performance degradation. Afterwards, we feed the generated coarse event clouds into the refinement network, which takes 30 epochs to converge. We empirically found that the optimal t-length of the proposed cuboid query varies across datasets. Let the bottom edge length be r; the optimal length along the t-dimension is r, r, 1.2r and 1.5r for the four datasets respectively.

Figure 5: The event completion results on the N-MNIST dataset from input with 256 events (a) and 128 events (b). STCL leads to overly dense events which may lose local shape, e.g. '7' in (b), while the results of PoinTr and VRCNet tend to suffer from missing entries. Our method maintains both overall event completeness and local shape.

Event Completion Results
Quantitative Results.
In Table 1, we report the CD and EMD losses of our method and the baseline algorithms. STCL leads to high CD and EMD compared with the point-based methods. This is attributed to the fact that STCL is limited to millisecond resolution in its SNN simulation, so it cannot learn the latent structure of events sparsely distributed in both the spatial and temporal dimensions. Instead, it is more appropriate for recovering high-resolution event points from dense data. As modern point completion networks, PoinTr and VRCNet yield low CD loss on the three datasets, since they use the CD loss for optimization. However, their EMD losses are very high, which indicates that they encounter difficulty in completing complex event data with high spatial resolution and sharp details. In contrast, our event-oriented method achieves the best EMD across all groups of experiments on all datasets and a CD loss comparable to that of the second-best competitor, which validates the feasibility of the generative model in predicting missing events and demonstrates that event-specific modules can better represent the distribution of event data. In sum, the superiority of our method is attributed to both its generative nature and its event-oriented modules.

Qualitative Results.
We visualize the completed event data by accumulating the completed events into 2D frames, as shown in Figs. 5, 6, 7 and 8. STCL leads to visually pleasing results in most cases, but it tends to generate occluded structures. In many situations, the predicted events gather densely at certain locations, losing the sharp thin structures. Although PoinTr and VRCNet obtain low CD loss, their visual results are poor. The visualization indicates that PoinTr fails to learn the data shape or distribution of the whole event set; instead, the adopted point-wise loss misleads the completed points into adhering to the input sparse events. VRCNet learns the coarse distribution but fails to recover sharp details. In comparison, our proposed model can recover details and sharp edges on all datasets. Especially on the self-captured dataset, as shown in Fig. 8, the baseline methods fail to complete informative and sharp shapes, but our method achieves promising visual results in challenging high-speed and low-illumination environments. For example, the legs of the tea table and the contents and frame of the painting are all clearly reconstructed. Based on the platform for data collection, our method supports the completion of event data captured at a al

Ablation Study.
We conduct an ablation study by replacing the event-oriented cuboid query with a ball query.The EMD loss rises by 0.

Downstream Applications
To further demonstrate the benefits that our event completion method with precise timestamps brings to downstream tasks, we conduct two downstream experiments on the completed event streams: object classification and intensity frame reconstruction.

Object Classification.
We test the completed results of N-MNIST for digit classification. We … of Event Camera Dataset using E2VID [3] and report the PSNR and SSIM [48] of the results from completed events, compared with the reference from ground-truth events, in Table 3. Our method achieves the best PSNR for most of the settings and the best SSIM in all settings.

Conclusion
In this paper, we address the insufficient density of event streams in challenging cases (e.g., high-speed and low-light conditions) by introducing an event-oriented diffusion-refinement method for event completion that rebuilds the missing events. We formulate an event stream as a 3D cloud and design an event-oriented conditional diffusion probabilistic model to generate the completed event points in a coarse-to-fine manner. To the best of our knowledge, this is the first work defining and exploring this task. We compare our method with relevant algorithms to validate its superiority both quantitatively and visually. Furthermore, the performance on two downstream applications, i.e. object classification and intensity frame reconstruction, demonstrates the usability of our method. Our approach would unlock the potential of event cameras and broaden their applications.
Due to the multi-step sampling process during inference, the generation of coarse events is rather slow, so the training/inference of the proposed method cannot be realized on the fly.In the future, faster and better sampling mechanism can be applied to enable end-to-end training/inference, which will further permit real-time event completion such as on-board deployment.

Figure 1: An exemplar demonstration of our event completion performance, in terms of the 3D spatiotemporal cloud (upper) and the accumulated 2D image (lower). Left: the sub-sampled sparse sequence consisting of 128 events; middle: the completed counterpart; right: the ground truth.

Figure 2: The overview of the diffusion-based coarse-to-fine event completion pipeline. First, we use an event-oriented network to generate coarse distributions of events based on the conditional sparse events. Then, we use a second network to yield the final completed dense events.

Figure 3: The architecture of the EDR network. The upper branch extracts features from the conditional input, which are absorbed into the lower branch to denoise the noisy input. The proposed event-inspired cuboid query is extensively used in the three main modules: event-oriented set abstraction, feature propagation and feature transfer.

Figure 4: Illustration of the original ball query (left) and the proposed cuboid query (right). The cuboid query consumes more events in the temporal dimension, which is important for the 3D event cloud representation.

Figure 6: The event completion results on examples from 1Mpx Detection Dataset. STCL leads to completed events that tend to gather around certain positions. The results of PoinTr are too sparse, while VRCNet can only learn the coarse distribution. In comparison, our method can recover dense events while maintaining sharp details.

Figure 7: Visual illustration of event completion results and the reconstructed intensity frames on two examples from Event Camera Dataset. STCL still tends to generate unevenly distributed events gathering together, and STCL and PoinTr are inclined to generate coarse structures and lose details. Our method is free of such artifacts. The intensity frame reconstruction results also validate the superiority of our method.

Table 1: Performance comparison between our method and baseline methods. CD loss indicates event-to-event difference and is multiplied by $10^3$. EMD loss penalizes distribution discrepancy and is multiplied by $10^2$. Bold denotes the best score.

Table 2: Comparison of classification accuracy on the N-MNIST dataset, using the model trained on high-resolution events to test the completed clouds and LR input. Bold denotes the best score.

Table 3: Comparison of reconstructions on the Event Camera Dataset in terms of PSNR and SSIM. Bold denotes the best score.