Video reconstruction from a single motion-blurred image using learned dynamic phase coding

Video reconstruction from a single motion-blurred image is a challenging problem that can enhance the capabilities of existing cameras. Recently, several works have addressed this task using conventional imaging and deep learning. Yet, such purely digital methods are inherently limited due to direction ambiguity and noise sensitivity. Some works attempt to address these limitations with non-conventional image sensors; however, such sensors are extremely rare and expensive. To circumvent these limitations by simpler means, we propose a hybrid optical-digital method for video reconstruction that requires only simple modifications to existing optical systems. We use learned dynamic phase coding in the lens aperture during image acquisition to encode motion trajectories, which serve as prior information for the video reconstruction process. The proposed computational camera generates a sharp frame burst of the scene at various frame rates from a single coded motion-blurred image, using an image-to-video convolutional neural network. We present advantages and improved performance compared to existing methods, with both simulations and a real-world camera prototype. We also extend our optical coding to video frame interpolation and present robust, improved results for noisy videos.


Introduction
Modern cameras must satisfy two conflicting requirements: providing excellent imaging performance while decreasing the size and weight of the system. To address this inherent contradiction, novel design methods attempt to harness fundamental imaging limitations and leverage them as a design advantage. One such example is motion blur, a known limitation in photography of dynamic scenes. It is caused by objects' movements during the exposure, whose duration is set according to the lighting conditions and noise requirements. As most scenes are dynamic, light from moving objects is accumulated by the sensor in several consecutive pixels along their trajectory, resulting in image blur. Although blur is an undesirable effect, in this work we use it for video generation from a single image.
In contrast to motion deblurring methods that aim at sharp image reconstruction, in video generation the goal is to exploit this 'artifact' to reconstruct a sharp video frame burst that represents the scene at different times during the acquisition. Yet, since signal averaging in the acquisition process eliminates the motion direction in the captured image, this task is highly ill-posed. The pioneering work of [22] suggests a pairwise frame-order-invariant loss to mitigate this ambiguity. Yet, as the global motion direction is lost in the acquisition, the processing stage can only assume the direction of the motion for the video reconstruction, but cannot truly resolve the global direction ambiguity.
To overcome this deficiency, some works suggested capturing multiple frames with different exposures during the acquisition process [44], or alternatively replacing the sensor with coded two-bucket [49,50] or event measurements [40]. Yet, these solutions either do not fit a standard optical system or require capturing multiple images.
Contribution. To overcome the limitations of conventional cameras in dynamic scene acquisition, we suggest a computational coded-imaging approach (see Figs. 1 and 2) that can be easily integrated in many conventional cameras (equipped with a focusing mechanism) by just adding a phase-mask to their lens (which is a simple process). The joint operation of the phase-mask and focus variation during exposure generates a dynamic phase coding, which encodes scene motion information in the intermediate image as chromatic cues. The cues are generated by the PSF nature of our solution (plotted in Fig. 4), which encodes the beginning of the movement in blue and the end in red; e.g., see the zoomed left and right edges of the moving flower in Fig. 1b (enhanced for visualization). These cues serve as guidance for generating a video of the scene motion by post-processing the captured coded image (see Fig. 1c).

Figure 2. Overview of our suggested method. Acquiring a dynamic scene with our dynamically phase-coded camera provides an intermediate image B_C, which contains scene-dynamics cues in its coded motion blur. We reconstruct sharp video frames of the scene at desired timesteps t from the single coded-blurred image B_C using a time-dependent CNN. The optical coding parameters are jointly optimized with the reconstruction network weights using end-to-end learning.
Our method is capable of generating a sharp frame at any user-controlled time in the exposure interval. Therefore, a video burst at any user-desired frame rate can be produced from a single coded image. The proposed coding and reconstruction approach is based on a learnable imaging layer and a convolutional neural network (CNN), which are jointly optimized end-to-end; the learnable imaging layer simulates the physical image acquisition process by applying the coded spatiotemporal point spread function (PSF), and the CNN reconstructs the sharp frames from the coded image.
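The joint optical-digital pipeline described above can be sketched as a single differentiable module. Everything below (class and method names, the Gaussian stand-in for the Fourier-optics PSF model, the toy reconstruction head) is an illustrative assumption, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodedImagingModel(nn.Module):
    """Sketch of the joint optical-digital pipeline: a differentiable
    imaging layer with learnable defocus parameters, followed by a
    reconstruction CNN. Names and shapes are illustrative assumptions."""

    def __init__(self, n_frames=49, kernel_size=7):
        super().__init__()
        # Learnable defocus trajectory psi(t), one value per sub-frame,
        # initialized as a linear focus sweep (see the PSF design section).
        self.psi = nn.Parameter(torch.linspace(-1.0, 1.0, n_frames))
        self.kernel_size = kernel_size
        # Stand-in reconstruction network (the paper uses a time-conditioned UNet).
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),
        )

    def psf(self, psi_t):
        # Placeholder for the Fourier-optics PSF model h(psi(t)):
        # here just a normalized Gaussian whose width depends on |psi|.
        k = self.kernel_size
        ax = torch.arange(k, dtype=torch.float32) - k // 2
        xx, yy = torch.meshgrid(ax, ax, indexing="ij")
        sigma = 0.5 + psi_t.abs()
        kern = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
        return kern / kern.sum()

    def forward(self, frames, t):
        # frames: (N, 3, H, W) sharp sub-frames; simulate the coded blur
        # by convolving each sub-frame with its instantaneous PSF and averaging.
        coded = []
        for i, f_i in enumerate(frames):
            kern = self.psf(self.psi[i])[None, None].repeat(3, 1, 1, 1)
            coded.append(F.conv2d(f_i[None], kern,
                                  padding=self.kernel_size // 2, groups=3))
        b_c = torch.cat(coded).mean(dim=0, keepdim=True)
        # Concatenate the normalized time t as an extra channel and reconstruct.
        t_map = torch.full_like(b_c[:, :1], t)
        return self.net(torch.cat([b_c, t_map], dim=1))
```

Because the imaging layer is expressed in ordinary tensor operations, gradients flow from the reconstruction loss back into `psi`, which is exactly what makes the end-to-end optimization of the optical code possible.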
The main contributions of our method are:
• A learning-based coding method, which requires only a conventional sensor and a lens with a focusing mechanism, equipped with a simple add-on optical phase-mask.
• The learned code allows using low-power neural networks while achieving high-quality video reconstruction from a single image.
• A novel neural architecture that provides flexible and modular single-image-to-video reconstruction, which can be easily adjusted to any desired output frame rate by a simple change of a neural network parameter, without requiring re-training.
• End-to-end optimization framework of optical and digital processing parameters for dynamic scene acquisition, which accurately models a spatio-temporal acquisition process.
• Improved video-from-blur reconstruction, with unambiguous directionality, higher accuracy, and lower noise sensitivity, validated in both simulation and real-world experiments.
• A novel video perceptual loss for training neural networks and a metric for video reconstruction evaluation that takes into account both spatial and temporal information.
• A method for video frame interpolation using the proposed coding that achieves improved results for real-world dynamic scenes.
• SNR tradeoff analysis for dynamic scenes acquisition and frame interpolation.

Related Work
Given a motion-blurred image, various methods have attempted to reconstruct a sharp image of the scene from it. Some techniques were developed for conventional imaging, where the reconstruction is purely computational and recently is usually based on training a neural network for this task [26,38,63,72]. Other holistic design approaches utilize a computational imaging strategy to encode motion information in an intermediate image and recover the sharp image using corresponding post-processing methods [6,9,27,36,43,56].

| Method | Acquisition | Input | Output size |
|---|---|---|---|
| Computational only [22,42,73] | conventional | image | fixed |
| Multiple exposures [44] | short-long-short exposures | 3 images | dynamic |
| Coded two-bucket [49,50] | C2B sensor | two coded images | fixed |
| Event camera [40] | event sensor | 50-100 events vector | dynamic |
| Proposed method | dynamic phase coding | coded image | dynamic |

Table 1. Overview of existing solutions for video reconstruction from a motion-blurred scene.
The problem of video reconstruction from a single image takes motion deblurring a step forward by attempting to reconstruct a frame burst of the dynamic scene that resulted in the blurred image, and not only the central sharp frame (see Tab. 1 for an overview). Some works are based on images taken using a conventional camera and apply processing-only methods to obtain a frame burst of the scene. However, without optical coding this problem is highly ill-posed, as even if the edges and textures are reconstructed perfectly, various motion permutations can generate the same motion-blurred image (e.g., see Fig. 2 in [22]). Thus, coded imaging approaches were proposed to acquire additional information about the scene dynamics and achieve higher-quality results.
Conventional imaging based methods. Generating a video sequence from a single motion-blurred image is a challenging task: since the temporal order of the reconstructed frames is ambiguous, the problem is highly ill-posed. [22] address the temporal order ambiguity and present a pioneering approach for this task using several reconstruction networks and a novel pairwise frame-order-invariant loss. Their method iteratively generates seven sequential frames of the scene, starting from the central frame reconstruction and proceeding to the edge frames of the dynamic scene using the preceding reconstruction results. The architecture limits the reconstruction to only seven frames within the exposure interval, and it uses three different trained models for the reconstruction process. [42] present a solution for video reconstruction using motion representations of the scene learned by a recurrent video autoencoder network. [73] suggested a detail-aware network using a cascaded generator. [33] suggest a solution for rendering a sharp video of a face from new viewpoints given a single motion-blurred image. All of these methods suffer from the inherent motion direction ambiguity, and their
reconstruction performance is more sensitive to noise (as discussed in [10,32] and empirically shown in Sec. 4).
Coded imaging based methods. To handle the inherent limitations of conventional imaging, some works adopted computational photography methods for image deblurring and video frame recovery. [43] introduced an amplitude-coded exposure technique using a fluttered shutter for motion deblurring. This method performs a temporal binary amplitude coding, resulting in a wider frequency response, which is utilized for improved motion deblurring results. [9,27] presented a parabolic-motion camera with a motion-invariant PSF utilized for non-blind motion deblurring. Both of these approaches are limited to the reconstruction of a single image. [15] presented a spatial-temporal coding exploiting the rolling shutter of a CMOS sensor. Dynamic phase coding in the lens aperture for motion coding was presented by [13] for motion deblurring. This coding embeds motion cues in the intermediate image for improved deblurring performance. For video restoration from a single coded-blurred image, several approaches have been presented, such as using an event camera [40] or a coded two-bucket (C2B) sensor [49,50,68], both of which require a non-conventional sensor; or lensless imaging exploiting the rolling-shutter effect [2], which omits the lens and therefore changes the entire imaging concept even for static scenes. We adopt the coding method of [13], which is based on a commercial sensor and lens (with a focusing mechanism) equipped with a simple add-on optical element, allowing unambiguous motion cue encoding. Different from [13], which performs only image deblurring, we aim at reconstructing a video of the motion in the scene from a single motion-blurred image, which is a broader and more ill-posed task.
A closely related problem is the reconstruction of a sharp, high-frame-rate video from a motion-blurred, low-frame-rate video, using either processing of conventional camera videos [21,44,52,74] or computational imaging methods [17,28,30,31]. These methods require a video input (which enables resolving the direction ambiguity) and are not applicable to single-image input.
Deep optics. As the end-to-end backpropagation-based optimization of deep models has proved very efficient for various tasks, its power has also been harnessed for optical design, either for a standalone optical-system design process or jointly with a post-processing algorithm (for a recent review on these topics see [5,69]). Specifically for enhanced optical imaging applications, this scheme has been presented for extended depth of field [1,12,55,61], depth estimation [7,8,16,70], high dynamic range [35,58], ray tracing [59,65], other tasks [11,41,45,51,60], and several microscopy applications [24,39,48,71], to name a few. Yet, it was not considered for the problem of video from blur.

Method
As our goal is to reconstruct video frames from a motion-blurred image of the scene, we engineer the camera's PSF to encode cues in the motion blur of dynamic objects. The coded PSF is achieved using spatiotemporal dynamic phase coding in the lens aperture, which results in motion-coded blur. The coded blur serves as prior information for the image-to-frames CNN, trained to generate sharp video frames from the coded image. Utilizing the end-to-end optimization ability, the optical coding process is modeled as a layer in the model, and its physical parameters are optimized along with the conventional CNN layers in a supervised manner. The learned optical coding is then implemented in a prototype camera, and images taken with it are processed using the digital processing layers of the CNN.

Camera Dynamic Phase Coding
Moving objects in a scene during exposure result in motion blur, as the light from a moving object is integrated in different pixels along the motion trajectory. In addition, both static and dynamic objects are blurred by the lens PSF, which is never perfect (due to aberrations, diffraction, etc.). This imaging process is formulated in Eq. (1): the two-dimensional PSF is spatially convolved with the instantaneous scene at every instant dt and integrated over the exposure,

B = (1/T) ∫_0^T h ∗_sp S(t) dt,    (1)

where B is the acquired blurred image, T is the exposure time, S(t) and h denote the instantaneous sharp scene and the PSF respectively, and (∗_sp) denotes the spatial convolution operator (the spatial coordinates are omitted for ease of notation).
The averaging nature of image sensors results in the loss of the motion direction, which introduces an inherent ambiguity. Also, as every object moves independently of the others, general motion blur is shift-variant. Thus, video reconstruction from undirected motion blur is a highly ill-posed task.
To address both issues, we implement a coded lens designed to embed motion cues in the acquired image, and the prior knowledge of the camera's time-variant behavior serves as guidance for the reconstruction of the video burst. We adopt dynamic phase coding in the lens aperture, similar to the motion deblurring method presented by [13]. This method is based on a spatiotemporally coded PSF that encodes motion information in the intermediate image without attenuating the signal, thereby improving the signal-to-noise ratio (SNR) compared to amplitude coding methods such as [17,43]. Since video reconstruction from a blurred image is a more ill-posed task than deblurring, we improve the imaging method by optimizing the coding parameters for our task, learning them end-to-end with the reconstruction network.
The camera PSF is generated using a conventional camera equipped with a simple add-on phase-mask; the temporal coding is achieved by the joint operation of the static phase-mask, designed to introduce color-focus cues, and a dynamic focus sweep performed during exposure (using a simple focusing mechanism). The phase-mask (originally designed for depth estimation [16] and extended depth-of-field imaging [12]) introduces a predesigned chromatic aberration to the lens, generating a controlled dependence between the defocus condition and the color distribution of the PSF. Based on Fourier optics, the PSF of the camera is computed conditioned on the mask specifications, the wavelength, and the defocus condition, as in previous works (detailed in Appendix A). To get a time-varying PSF, the defocus condition (denoted as ψ) is changed during exposure, and a temporally coded PSF (denoted as h(ψ(t))) is achieved. The instantaneous scene S(t) is spatially convolved with the corresponding PSF h(ψ(t)), resulting in the motion-coded image B_c described in the following formula:

B_c = (1/T) ∫_0^T h(ψ(t)) ∗_sp S(t) dt.    (2)

Using the proposed spatiotemporally coded imaging scheme, the dynamics of the scene are encoded in the intermediate image acquired by the camera. Moving objects are smeared in the image with color cues along their trajectories, based on the spatiotemporal PSF h(ψ(t)). The acquired coded image is then fed to the reconstruction network, trained to decode these cues as guidance for improved video reconstruction. Fig. 2 presents these steps visually.
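A minimal discrete approximation of Eq. (2) — N sub-frames, each convolved with its instantaneous PSF and then averaged over the exposure — can be sketched as follows (single channel for brevity, with a plain correlation loop standing in for the optical simulation):

```python
import numpy as np

def conv2_same(img, kern):
    # Correlation with symmetric padding; equivalent to convolution for
    # the symmetric kernels used in this sketch.
    kh, kw = kern.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="symmetric")
    out = np.zeros_like(img, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += kern[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def coded_blur(frames, psfs):
    """Discrete approximation of Eq. (2): each instantaneous sharp frame
    S(t_n) is convolved with its time-dependent PSF h(psi(t_n)) and the
    results are averaged over the exposure (single channel for brevity;
    the real system uses per-color PSFs)."""
    assert len(frames) == len(psfs)
    acc = np.zeros_like(frames[0], dtype=np.float64)
    for s_n, h_n in zip(frames, psfs):
        acc += conv2_same(s_n, h_n)
    return acc / len(frames)
```

With identical delta PSFs this reduces to plain temporal averaging (conventional motion blur); time-varying PSFs are what embed the directional color cues.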

PSF Design
The time-varying PSF conditioned on the defocus parameter ψ is computed based on Fourier optics and described in Appendix A. To achieve optimal motion cue encoding in the intermediate image, the imaging process (Appendix A) is modeled as a learnable layer (with corresponding forward and backward models using automatic differentiation). The dynamic phase coding acquisition is simulated using the phase-mask characteristics and the focus variation parameters. The defocus parameters are optimized in the end-to-end training process along with the CNN layers, while the phase-mask design is fixed (constant) in our setup and described in Appendix C. The initialization of the optical parameters was tested using different approaches within the acceptable physical range, including linear (as in [13]), random, and even approaches combined with periodic functions (i.e., sine). The linear initialization produced the best convergence results over all other attempts. Note that since we use temporal phase coding, we do not change the intensity during the focus change, and we optimize only the focus as a function of time.

Figure 3 (caption fragment). The decoder part is controlled by the time parameter (using AdaIN [19]) to set the relative time of the reconstructed frame.
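The learnable focus-sweep parameterization might look like the following sketch, assuming an illustrative admissible defocus range (`PSI_MIN`/`PSI_MAX` are not values from the paper):

```python
import torch
import torch.nn as nn

# Illustrative physical limits for the defocus parameter (assumed values).
PSI_MIN, PSI_MAX = -4.0, 4.0

class FocusSweep(nn.Module):
    """Learnable defocus trajectory: one psi value per simulated sub-frame,
    initialized as a linear sweep (the initialization that converged best
    in the paper's experiments)."""

    def __init__(self, n_frames=49):
        super().__init__()
        self.psi = nn.Parameter(torch.linspace(PSI_MIN, PSI_MAX, n_frames))

    def forward(self):
        # Keep the optimized sweep inside the acceptable physical range.
        return self.psi.clamp(PSI_MIN, PSI_MAX)
```

During training, each clamped psi value would parameterize the Fourier-optics PSF computation of Appendix A for its sub-frame.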

Reconstruction Network
Our proposed model for video frame reconstruction from a coded motion-blurred image is based on a single time-dependent convolutional neural network (CNN) with an AdaIN mechanism [19]. The model inputs are the coded-blurred intermediate image and a normalized time parameter t ∈ [−1, 1]. The time parameter controls the relative time of the generated sharp frame within the normalized exposure interval. The output of the model is the estimated sharp scene frame at time t, denoted as S(t). Hence, the architecture is designed to reconstruct the scene at any desired instant in the exposure interval, and thus to create a video at any desired frame rate. Our reconstruction CNN (presented in Fig. 3) is based on the UNet architecture [46], consisting of a four-level encoder-decoder structure with skip connections between the encoder and the decoder at each level. The double-convolution blocks (of the original UNet architecture) are improved by adding skip connections, modifying them into dense blocks [18]. The output of the last layer of the model is added to the input image, such that the network learns only the residual correction required to reconstruct the desired frame.
The time parameter is used to reconstruct the frame corresponding to the desired normalized time in the exposure interval. It controls the network both through the AdaIN mechanism [19] and by concatenating it to the input as an additional channel (expanding the scalar value to the image dimensions). To bridge between the shift-invariant convolutional operations of the CNN and the shift-variant (and scene-dependent) motion blur in our target application, we leverage positional encoding to add image-position dependency to the model. We provide additional details on the architecture and these changes in the following.
Positional encoding. Assuming a general scene in which every object might move in a different direction and velocity, an intermediate image captured using our proposed coded lens will contain a shift-variant blur kernel, which is a composition of the color-temporal PSF coding and the spatial movement of the objects. Since convolutions are shift-invariant, we add a position dependency to the model, such that it can utilize the local information of the coding in the surrounding area relating to the same object, with the same motion characteristics and blurring profile. We adopt Fourier features to get a better representation of the position coordinates [62]. Similar to [34], we add a positional dependency to the model by concatenating the Fourier features of the pixel coordinates as additional channels to the input. Five log-linearly spaced frequencies {w_j}, j = 1..5, were sampled in the range [1, 20] to generate 20 positional features in total, where each frequency w_j contributes four positional features per pixel, computed from the pixel coordinates (u, v) normalized to the range [0, 1].
Time encoding. To achieve a time-dependent CNN, the batch-normalization layers in the UNet architecture are replaced with AdaIN layers [19] controlled by a normalized time parameter. The exposure time interval is normalized to the
range of [−1, 1] such that t = 0 corresponds to the middle of exposure time.The time parameter t ∈ [−1, 1] is mapped to a higher dimension vector w ∈ R 64 , using an MLP network consisting of two sequential blocks of a linear layer followed by a leaky-ReLU activation function.The encoded time-representation vector w is shared across all AdaIN layers and controls the mean and standard deviation of the features in each AdaIN layer.
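The two input encodings described above — the Fourier positional features and the time-embedding MLP — can be sketched as follows; the exact sin/cos arrangement per frequency is an assumption:

```python
import math
import torch
import torch.nn as nn

def positional_features(h, w, n_freqs=5, f_min=1.0, f_max=20.0):
    """Fourier-feature positional encoding: n_freqs log-linearly spaced
    frequencies in [f_min, f_max], each contributing sin/cos features for
    both normalized pixel coordinates -> 4 * n_freqs channels of shape
    (h, w). The exact sin/cos arrangement is an assumption."""
    freqs = torch.logspace(math.log10(f_min), math.log10(f_max), n_freqs)
    u = torch.linspace(0.0, 1.0, w).expand(h, w)           # normalized x
    v = torch.linspace(0.0, 1.0, h)[:, None].expand(h, w)  # normalized y
    feats = []
    for wj in freqs:
        for coord in (u, v):
            feats.append(torch.sin(wj * math.pi * coord))
            feats.append(torch.cos(wj * math.pi * coord))
    return torch.stack(feats)  # (20, h, w) for the defaults

# Time embedding: map the scalar t in [-1, 1] to w in R^64 with a small
# MLP of two linear + leaky-ReLU blocks, as described in the text.
time_mlp = nn.Sequential(
    nn.Linear(1, 64), nn.LeakyReLU(),
    nn.Linear(64, 64), nn.LeakyReLU(),
)
```

The 20 positional channels and the broadcast time channel are concatenated to the 3-channel coded image before the first convolution, while the 64-dimensional embedding `w` feeds the AdaIN layers.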
In each AdaIN layer with an input x of p feature channels, the mean β ∈ R^p and standard deviation γ ∈ R^p are obtained from w by designated MLP mapping networks with two layers of the same structure mentioned above. The AdaIN transformation (Eq. (5)) is performed along the feature dimension,

AdaIN(x) = γ · (x − µ(x)) / σ(x) + β,    (5)

where µ(x) and σ(x) are computed across the spatial dimensions (instance normalization).
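A sketch of such a time-conditioned AdaIN layer (the mapping-network widths follow the text; the layer names are assumed):

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Time-conditioned AdaIN sketch: instance-normalize the features,
    then scale/shift with gamma, beta predicted from the shared time
    embedding w by small per-layer MLPs."""

    def __init__(self, channels, w_dim=64):
        super().__init__()
        self.to_gamma = nn.Sequential(
            nn.Linear(w_dim, w_dim), nn.LeakyReLU(), nn.Linear(w_dim, channels))
        self.to_beta = nn.Sequential(
            nn.Linear(w_dim, w_dim), nn.LeakyReLU(), nn.Linear(w_dim, channels))

    def forward(self, x, w):
        # Instance normalization: mean/std per channel over spatial dims.
        mu = x.mean(dim=(2, 3), keepdim=True)
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-5
        gamma = self.to_gamma(w)[:, :, None, None]
        beta = self.to_beta(w)[:, :, None, None]
        return gamma * (x - mu) / sigma + beta
```

Since gamma and beta are functions of the time embedding, the same decoder weights can render the scene at any relative time t.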
As our scheme is designed to utilize the optically encoded motion cues to generate a sharp frame at a relative time t, the encoder part of the UNet is generic, and we apply the temporally controlled AdaIN only in the decoder part of the architecture (as in Fig. 3). We make the encoder part of the UNet time-independent by performing instance normalization followed by a learnable affine transformation instead of the AdaIN blocks. In this setting, the encoder is optimized to encode more general information about the image and scene dynamics, regardless of the normalized time parameter. The generic-encoder and time-specific-decoder design enables the network to converge better. Note though that we concatenate the time parameter to the input channels, which contain the input image and the positional encoding features. This improves reconstruction performance, as shown in the ablation in Sec. 4.3.
Dataset. To train our network and evaluate its performance quantitatively, we used the REDS dataset [37], consisting of scenes captured at 120 frames per second (FPS). To achieve a smoother motion-blur simulation we applied ×8 frame interpolation using the DAIN method [4] (similarly to the process presented in [37]), obtaining video frames at 960 FPS. An inverse camera response function (CRF) was applied to the frames to convert them from gamma space to signal space, using the inverse CRF transform provided with the dataset. To simulate the acquisition of a dynamic scene by our coded camera, the spatiotemporal PSF was applied to 49 consecutive frames in signal space, which were then averaged along the time axis as in Eq.
(2) (where N = 49). For performance comparison with [22], conventional camera images were simulated by only averaging the frames, without applying the PSF. Due to the applied frame interpolation, not all of the 49 frames are true images; therefore, only the seven real frames (at indices n = 8k, k ∈ [0, 6]) are used as our GT images for the training/validation/test metrics. For improved generalization, we add additive white Gaussian noise (AWGN) to the simulated blurred images in the signal space, which partially simulates the imaging process noise and improves the robustness of our model and its generalization to the camera prototype (different noise levels were set according to the application, as discussed in Sec. 4).
Loss functions. We use a linear combination of three losses for training: a pixel-wise smooth-L1 loss (l_L1), a perceptual loss (l_percep) using VGG features [23], and a video-consistency perceptual loss (l_vid). Thus, our loss is

l_total = α_L1 · l_L1 + α_percep · l_percep + α_vid · l_vid.    (6)

The perceptual loss is a known practice for image reconstruction tasks [23]. In this loss, we compute the smooth-L1 distance between the VGG [54] features of the reconstructed image and the ground-truth image.
To improve temporal consistency and perceptual quality between consecutive reconstructed video frames, we developed a video loss using a 3D convolution network over the video time-space volume. We use 3D-ResNet [64], a spatiotemporal convolution network for video action recognition, and compare the network-extracted feature maps of the reconstructed and GT videos. We use the outputs of the first three convolution layers of the 3D-ResNet network and average the smooth-L1 loss between the features of the ground-truth video and the reconstructed video. More details are provided in Appendix B.
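The video-consistency loss can be sketched as follows. The paper uses a pretrained 3D-ResNet as the feature extractor; this self-contained sketch substitutes a small frozen stand-in 3-D CNN, since the structure of the feature comparison is the point, not the extractor:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoPerceptualLoss(nn.Module):
    """Sketch of the video-consistency loss: compare feature maps of the
    reconstructed and ground-truth clips at several early layers of a
    frozen 3-D CNN (a stand-in for the pretrained 3D-ResNet [64])."""

    def __init__(self):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Conv3d(3, 16, 3, padding=1),
            nn.Conv3d(16, 32, 3, padding=1),
            nn.Conv3d(32, 64, 3, padding=1),
        ])
        for p in self.parameters():
            p.requires_grad_(False)  # frozen feature extractor

    def features(self, video):
        # video: (B, 3, T, H, W); collect each stage's activations.
        feats, x = [], video
        for stage in self.stages:
            x = F.relu(stage(x))
            feats.append(x)
        return feats

    def forward(self, pred, target):
        # Average the smooth-L1 distance over the three feature levels.
        losses = [F.smooth_l1_loss(fp, ft)
                  for fp, ft in zip(self.features(pred), self.features(target))]
        return torch.stack(losses).mean()
```

Because the comparison happens on spatiotemporal features rather than per-frame pixels, flickering and temporal inconsistencies between neighboring reconstructed frames are penalized directly.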

Experiments
As an experimental validation of our proposed approach, we first train our system (optical coding layer and reconstruction network) and evaluate our results quantitatively (while the optical coding process is simulated), comparing the performance to the previous work by [22]. Following the successful simulation experiment, we built a prototype camera implementing our spatiotemporal coding and examined our method qualitatively (as pixel-wise GT sharp frame bursts are almost impossible to acquire). Lastly, we present an ablation study of our architecture and methods. Some of the results are presented below, and additional results appear in Appendix G and the video.
Training details. We train our model on a training set consisting of 9,680 scenes for 40 epochs, with a batch size of 72 patches of size 128×128×3 each. We used the Adam optimizer [25] with a learning rate of 10^-3 and a weight decay of 10^-8. The loss weightings (as defined in Eq. (6)) are α_L1 = 1, α_percep = 0.1, α_vid = 0.1. An additional 2,460 scenes are dedicated to validation/testing, such that the quantitative reconstruction performance (Sec. 4.1) was evaluated using 1,968 scenes dedicated to testing. In the optical coding layer we define a learnable defocus condition vector ψ ∈ R^49. We optimize these focus-sweep parameters of the camera, which define the camera's time-varying PSF following the computation presented in Appendix A, with the result obtained following the imaging equation (Eq. (2)). These parameters are initialized linearly, as discussed in Sec. 3.2. To improve robustness we apply flip augmentations and add AWGN to the input image (1% as in [22] for Sec. 4.1, and 3% for Sec. 4.2).
The optimized focus-sweep parameters (of the imaging simulation layer) are presented in Fig. 5, and result in the coding compactly demonstrated in Fig. 4b (the individual PSF kernels are presented in Fig. 17). In this example, the motion blur of a white dot moving right is simulated with a coding based on either a linear or a learned focus sweep (Fig. 4a and 4b respectively). Compared to the white trace that would have been captured by a conventional camera, the color coding of the motion profile is clearly visible. The learned pattern provides improved coding for video reconstruction, thanks to the end-to-end optimization with the image-to-video CNN. Following the different initialization methods (discussed in Sec. 3.2), we infer that a clearly changing code with prominent characteristics is required for the reconstruction, and an injective code function (such that each color appears once) helps the reconstruction and provides better results. The learned coding is also validated experimentally on a moving point source (Fig. 4c).

Simulative Experiment
To evaluate the reconstruction results we used a test dataset consisting of simulated motion-blurred images (both conventional and coded). We compare our results to the performance of [22], who presented a method for video reconstruction from conventional-camera (uncoded) motion-blurred images. We evaluate the models with respect to the GT sharp scene images using PSNR, the structural similarity index measure (SSIM) [67], and our VID metric, which we designed to assess the reconstruction quality of a video frame sequence.
Figure 6. Per-frame performance evaluation. PSNR and SSIM averaged per-frame reconstruction performance for a 7-frame burst, for our method and [22]. Since the motion blur of a conventional camera is undirected, we also evaluate the reverse order of the frames reconstructed by [22] (compared to the ground truth) for each input scene, and consider the higher results for the 'best order' evaluation.

Figure 7. Noise sensitivity analysis. Averaged PSNR results vs. noise level (as a percent of the image dynamic range) for our method and [22] (in both predicted and best order). Our method has better noise robustness, due to the optically embedded cues.

The VID metric uses the outputs of the first three 3D-convolutional layers of the 3D-ResNet network [64] (a similar approach to the video loss), and is computed by taking their average in log scale (in [dB], higher is better; more details are provided in Appendix B). Indeed, the VID metric is similar to the video loss that we use. Yet, we believe that using VID as a performance measure is fair for the following reasons: (i) we observed visually that better VID correlates with improved visual quality, which confirms the use of this loss; (ii) in the same way that it is valid to train a network using an MSE loss and report performance in terms of PSNR, it is valid to use the video loss in our training (which is not the only loss used) and report performance in terms of the VID metric.
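A heavily hedged sketch of the log-scale averaging behind the VID metric; the exact normalization and sign convention are assumptions (Appendix B of the paper holds the definitive definition):

```python
import numpy as np

def vid_db(feature_dists):
    """Sketch of the VID metric: average, in dB, over the distances
    between 3D-CNN feature maps of the reconstructed and GT videos.
    Since higher VID should mean better reconstruction, the negative
    log of each distance is used here -- an assumed convention."""
    return float(np.mean([-10.0 * np.log10(d) for d in feature_dists]))
```

In this convention, halving every feature distance raises the score by about 3 dB, mirroring how PSNR responds to MSE.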
A visual example of our reconstruction performance is presented in Fig. 8, where improved results along the entire frame burst can be clearly seen. Figure 6 presents the per-frame performance in PSNR and SSIM (for a 7-frame burst, as [22] is limited to this burst length) averaged over all the test scenes. Tab. 2 shows the overall statistics of the evaluated metrics of the reconstructions. Since the motion direction is lost in conventional motion blur, the frames of [22] may be reconstructed in the reversed order, i.e., in the opposite motion direction. Thus, each reconstructed scene was compared to the GT in both the predicted order and the reverse order, and the higher result (PSNR-wise) was selected for the 'best order' average. Note that in ~50% of the cases higher performance is achieved in the reversed order, which shows that the order ambiguity is prominent. Since the coded blur in our camera is designed to provide direction cues, our method is expected to reconstruct the frames in the correct order. Therefore, we do not need to reverse the order for it.

Table 2. The PSNR and SSIM metrics are averaged over all the reconstruction timesteps during the acquisition interval (7 timesteps), while the VID metric evaluates the whole scene sequence internally. The reconstruction quality of our method and [22] is compared, with [22] evaluated in both predicted-order and best-order sequences (due to the direction ambiguity).
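The 'best order' evaluation can be sketched as: score the predicted burst against the ground truth in both temporal orders and keep the higher average PSNR:

```python
import numpy as np

def psnr(x, y, data_range=1.0):
    # Standard PSNR in dB; infinite for a pixel-perfect match.
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse) if mse > 0 else np.inf

def best_order_psnr(pred_frames, gt_frames):
    """Evaluation used for the direction-ambiguous baseline: compare the
    predicted burst to the ground truth in both the predicted and the
    reversed temporal order, and keep the higher average PSNR."""
    fwd = np.mean([psnr(p, g) for p, g in zip(pred_frames, gt_frames)])
    bwd = np.mean([psnr(p, g) for p, g in zip(pred_frames[::-1], gt_frames)])
    return max(fwd, bwd)
```

A directionless baseline that recovers the motion perfectly but backwards still scores perfectly under this protocol, which is why the coded camera's unambiguous ordering is evaluated without the order flip.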
To assess the benefit in noise robustness of the encoded optical cues, a noise sensitivity analysis is carried out by evaluating the reconstruction results of our method vs. the work in [22] for different noise levels (Fig. 7). Similar to the performance analysis in Fig. 6, the reconstruction performance of [22] is evaluated in both the predicted order and the best order. The prominent gap is achieved due to the optically encoded motion information, which allows reconstruction with much better noise robustness.

Figure 9. Prototype camera. The dynamic phase-coded camera prototype is based on a commercial camera and a lens with a focusing mechanism, where our phase-mask is incorporated in the lens aperture. The camera flash signal is utilized to trigger the focus variation, controlled using a micro-controller (located near the camera).

Prototype Camera Results
To assess our method on real-world scenes, a prototype camera with dynamic phase coding was implemented. The color-focus phase-mask is incorporated in the lens aperture, and the lens defocus setting ψ(t) is varied during exposure following the desired learned code, using a liquid lens.
The joint operation of the phase-mask and the focus variation temporally manipulates the PSF h(ψ(t)), as presented in Sec. 3.1. Our prototype camera (see Figs. 9 and 16) is based on a standard C-mount lab camera (IDS UI-3590CP) equipped with a 4912 x 3684-pixel (1.25[µm] pixel pitch) color CMOS sensor [20]. The camera is mounted with a fixed focal length f = 12[mm] C-mount lens with a focusing mechanism based on a liquid lens (Edmund Cx C-mount lens #33-632 [29]); additional details on its design are provided in Appendix C. Several dynamic scenes were captured using the prototype camera and processed using our image-to-frames CNN for different t values, thus creating short videos of the moving scenes. For comparison, we took motion-blurred images of the same scenes with a conventional camera (i.e., with constant focus and a clear aperture). The results are presented in Figs. 1 and 10. Note how the truck moves and its back wheel rotates (the front wheel is fixed) in Fig. 10. Our method provides sharp results and a higher frame rate video. Note also that [22] reconstructs the motion in the opposite direction.

Ablation Study
We conduct an ablation study of the proposed method and architecture to evaluate the contribution of each component of our system. Tab. 3 presents the different experiments and tested configurations, and Tab. 4 presents the contributions of the different codings and architectures. As presented in Tab. 3, we first started from a UNet architecture controlled by a time parameter using AdaIN modules, as described in Sec. 3.3, where the input is the blurred image only and the output is the sharp frame at the desired relative time in the exposure interval, trained without the video-perceptual loss. Keeping the encoder part of the UNet uncontrolled by the time parameter (using instance normalization instead of AdaIN) enables better reconstruction results (config-a in Tab. 3) compared to the network with AdaIN in both encoder and decoder (config-0 in Tab. 3). The following configurations add image-coordinate positional-encoding features (config-b) and the time parameter concatenated to the input image (config-c). These features improve PSNR while the similarity measure slightly decreases; however, when testing the models on the prototype camera images we noticed better generalization to real-world images with these additions. Adding the video-frames perceptual loss (by setting α_vid = 0.1, see config-d), we get an improvement in both PSNR and SSIM. To quantify the improvement brought by our optics and computational imaging method, we train our best network (config-d) on uncoded images (i.e., temporal averaging only) and evaluate the results (config-e). Without the phase coding we observe a significant performance degradation, which validates the benefit of the optical coding for the reconstruction ability. Using the learned temporal coding we gain an improvement in both reconstruction metrics (config-f), and we consider it our proposed model. From the VID metric evaluation, which represents the perceptual quality of the video reconstruction, we observe a significant improvement using our learned coding compared to the linear one (config-f and config-d, respectively).

To better understand the improvement achieved by our learned code (compared to the linear coding), we evaluate models of different sizes, as presented in Fig. 11. With smaller models (e.g., under limited-resource conditions) the improvement of the learned code becomes more significant, and it contributes to better reconstruction results for the degraded models. We conclude that while a large network is capable of solving the harder task (linear code), for a smaller (and weaker) network the learned code is more meaningful for achieving better results. For the smaller networks, we used a UNet with an encoding depth of two levels and Mobile-Net blocks [47] (additional details about the models' architecture are described in Appendix E).
In Tab. 4 we evaluate the contribution of the coding methods by the central-frame performance (namely, the deblurring performance). The uncoded exposure with the reconstruction method of [22] achieves inferior results. The performance of the naive UNet architecture with linear coding is presented as well, alongside the suggested model. It is noticeable that the improved UNet improves the results, while the learned PSF achieves an additional improvement. We also trained our UNet model only on the central frame (as a deblurring task) to examine the flexibility-quality tradeoff; we used the linear code (equivalent to [13]) and our learned code. Even though our proposed method suffers a small performance drop on the middle frame, it allows the flexibility of generating a video sequence of the scene with any desired number of frames. For these models we replaced AdaIN with group normalization (GN) for better stability (additional details in Appendix E). We also present another variant of our model using AdaGN instead of AdaIN (config-g in Tab. 3); such a model achieves better results in PSNR and SSIM, but similar results in the VID metric (architecture details in Appendix E). Even so, we consider config-f in Tab. 3 as our suggested model for the comparisons and the presented visual results.
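The time-conditioned normalization discussed above can be sketched as follows. This is a minimal AdaIN-style module in which the per-channel scale and shift are predicted from the relative time parameter t; the embedding size and the small MLP are our own illustrative choices, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TimeAdaIN(nn.Module):
    """Adaptive instance normalization conditioned on a scalar time t:
    normalize per instance and channel, then apply a scale and shift
    predicted from t (a sketch; layer sizes are illustrative)."""
    def __init__(self, channels, t_embed_dim=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_scale_shift = nn.Sequential(
            nn.Linear(1, t_embed_dim), nn.ReLU(),
            nn.Linear(t_embed_dim, 2 * channels))

    def forward(self, x, t):
        # x: (B, C, H, W) features; t: (B, 1) relative time in the exposure.
        gamma, beta = self.to_scale_shift(t).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta
```

Placing such modules only in the decoder (and plain instance normalization in the encoder) corresponds to config-a above.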

VID Loss/Metric Validation
In this section we validate the proposed video loss and present the spatial- and temporal-consistency-preserving behavior of the metric. Since the loss is based on a pretrained 3D-ResNet architecture, explaining the model's behavior is challenging. The video loss is reference-based, namely, it requires the ground-truth video along with the distorted (reconstructed) video, and compares their deep-feature difference. The well-known MSE loss is also reference-based in this sense, but it compares pixel values pixel-wise. The MSE loss has neither spatial nor temporal dependencies between pixels, unlike our video loss. The first observation is that both losses typically decrease or increase together, depending on the similarity between the distorted and ground-truth videos. To present the spatio-temporal consistency of the video loss we use the following test scheme: we choose a fraction p of the video pixels to be distorted. Note that for p = 1 we get the worst loss value, since all the video pixels are distorted, and for p = 0 we get zero loss (identical to GT). We choose the pixels to distort in three ways: (i) randomly over space and time; (ii) sparing some spatial blocks, which preserves spatial consistency (a rectangle in each frame, randomly located along the time axis); and (iii) sparing some spatio-temporal blocks, which preserves consistency in both time and space (a rectangle in each frame, in the same location along the time axis). The results are presented in Fig. 12.
Table 5. Consistency significance for the VID loss. We tested the effect of spatial/temporal consistency (denoted spat./temp., respectively) on the VID and MSE losses (for p = 0.5). The VID loss is strongly affected by the consistency of the distorted data, and thus encourages the models toward consistent predictions (the loss values are in 10^x scale, according to the value in brackets in each line).

It is noticeable that for the spatio-temporally consistent distorted video the video loss gives the lowest value (for each p value). Due to the MSE loss's lack of spatial and temporal dependence, that metric behaves the same regardless of the pixel sampling (and distortion) method.
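The three pixel-sampling schemes used in this test can be sketched as follows; the function name and the way the spared-rectangle area is sized are our own choices:

```python
import numpy as np

def distortion_mask(shape, p, scheme, rng):
    """Boolean mask of pixels to distort in a (T, H, W) video volume.
    scheme: 'random'         - i.i.d. over space and time;
            'spatial'        - spare one rectangle per frame, randomly
                               placed along the time axis (spatially consistent);
            'spatiotemporal' - spare the same rectangle in every frame
                               (consistent in both time and space)."""
    T, H, W = shape
    if scheme == 'random':
        return rng.random(shape) < p
    # Rectangle sized so the distorted fraction is roughly p.
    keep = 1.0 - p
    h = int(round(H * np.sqrt(keep)))
    w = int(round(W * np.sqrt(keep)))
    mask = np.ones(shape, dtype=bool)
    if scheme == 'spatiotemporal':
        y, x = rng.integers(0, H - h + 1), rng.integers(0, W - w + 1)
        mask[:, y:y + h, x:x + w] = False
    else:  # 'spatial': independent rectangle location per frame
        for t in range(T):
            y, x = rng.integers(0, H - h + 1), rng.integers(0, W - w + 1)
            mask[t, y:y + h, x:x + w] = False
    return mask
```

Applying a distortion only where the mask is True, and sweeping p, reproduces the experiment structure of Fig. 12.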
For an additional observation, we fixed p = 0.5 and tested the temporal and spatial consistency for two video distortion types: spatial Gaussian blur (σ = 1) and pixel shift (3 pixels in both axes). For the inconsistent temporal/spatial case, we set every other frame/row to be distorted, while for the consistent case we set the first half of the frames/rows to be distorted. As presented in Tab. 5, the VID loss is much lower in the consistent case (for both spatial and temporal distortions), while the MSE loss is hardly affected, due to its lack of axial correlation. Hence, as a training loss, the VID loss encourages spatial and temporal consistency.

Video Frame Interpolation Extension
We extend our encoding method to video frame interpolation. Different from the single-image task, in this case we use consecutive blurred video frames as input and perform both deblurring and video frame interpolation. Previous works suggested methods for this task, such as [52, 74], while other works designed solutions for interpolating sharp input video frames, e.g., [4, 53]. In our method, each frame in the input video is encoded using our learned spatio-temporal PSF. For the sharp-frame interpolation methods [4, 53], we first applied a video deblurring algorithm using [66]. We compare the results of the mentioned methods on the REDS dataset and the Adobe240 dataset [57], which is a different domain from the data we used for training and validation. We synthesized blurred frames in two timing setups: a baseline exposure and two-thirds of the baseline (see Appendix F). The performance of the methods was evaluated as a function of noise level by PSNR (Figs. 13 and 14) and by SSIM-3D and VID (Figs. 15 and 18). It is noticeable that our method is more robust and performs better for noisy images, while some of the other methods perform better for clean images, which are impractical for real camera images. We also tested the method of [52] but omit it from the graphs due to low performance (4 dB PSNR lower than the other methods). Note that since we optimized our model for noise with σ = 1, there is a slight drop in performance for lower noise values. In Appendix F we elaborate on training details and present additional results for the REDS dataset. Visual results of the reconstructed videos are presented in Fig. 21 and in the supplementary material.

Figure 12. Video loss validation by consistent distortions. For the VID loss, it is noticeable that the spatio-temporally consistent distortions achieve the lowest loss for all distortion rates, and the loss is affected by video consistency. On the other hand, the MSE loss has no spatial correlation and is thus not affected by the spatio-temporal consistency.

Figure 13. Video frame interpolation: PSNR performance for different noise levels on the REDS dataset [37]. Comparison of PSNR for two exposure intervals: baseline and two-thirds of the baseline. The noise axis was normalized with respect to the exposure interval based on the SNR behavior (more information in Sec. 4.6).

SNR Tradeoff In Dynamic Scenes
While capturing dynamic scenes there is a tradeoff between signal intensity and motion blur. For lower noise, a long exposure is required, and thus a long motion trajectory is obtained in the captured image. On the other hand, a sharp image is obtained by a short exposure, but the signal level is low and we get low SNR. Hence, there is a tradeoff between the two, and capturing images/frames without noise is impractical. It is therefore important to have a noise-robust method for the frame interpolation task. In the noise evaluation figures for the frame interpolation task (Figs. 13, 14, 15 and 18), we scaled the noise-level axis of the "two-thirds" timing-setup data relative to the baseline noise levels, for the SNR tradeoff evaluation. Namely, since the exposure time in this timing setup is two-thirds of the baseline setup, the motion blur is lower and the noise should be higher according to the tradeoff; thus, adding 1.5% noise to the short (two-thirds) exposure frames should be compared to 1% noise added to the baseline exposure frames. From the results presented in the mentioned figures we can conclude that for low noise levels (brighter scenes) it is preferable to use a shorter exposure (and get sharper images), while for high noise levels it is preferable to use longer exposures.

Limitations. Despite the improved performance achieved, our method still suffers from several limitations. The most prominent are scenes with high-speed or accelerating objects; as our coding method is a composition of the dynamic phase coding and the object movement, there is an underlying assumption that this movement (and specifically its acceleration) is not too acute. In such cases, the resulting coded information will be too obscure, with a limited benefit. Another limitation relates to the imaging scenario: the temporal part of the coding is a focus variation. Therefore, the underlying assumption in such a design is that the entire scene is in the same focus condition (either in- or out-of-focus). Such a design limits our solution to infinite-conjugate lenses (e.g., GoPro cameras). This limitation is more prominent in outdoor scenes with depth, where the image is not in the same focus condition. In addition, since our coding is color-based, we assume that objects are not monochromatic, since in such a case the coding ability is degraded. This assumption is acceptable, since almost all natural materials are not monochromatic, and in practice even some wavelength bandwidth can suffice. Textures are important for indicating motion in general, and they are required to achieve the coded blur in the image for the reconstruction. However, if there are no textures, the reconstruction becomes an almost trivial task (a single-color object, where motion and blur are less apparent). Minor artifacts of the model might be observed by a careful analysis of the result videos: the reconstructed result might appear smooth, since the blurring process may deteriorate small details in the image that the model struggles to recover. It is worth noting that since our training dataset is generated from hand-held camera video, there is camera movement in the synthesized motion blur, and for such small movements the model learns to reconstruct the motion correctly.

Conclusion
A spatio-temporally coded camera for video reconstruction from motion blur is proposed and analyzed. Motivated by the ongoing requirement to improve the imaging capabilities of cameras, the motion blur limitation is utilized as an advantage, to encode motion cues allowing reconstruction of a frame burst from a single coded image. The coding is performed using a phase-mask and a learnable focus variation, resulting in color-motion cues encoded in the acquired image. This image, along with a relative time parameter t, is fed to a CNN trained to reconstruct a sharp frame at time t within the exposure interval. By choosing a sequence of t values, a frame burst of the scene is reconstructed. Simulation and real-world results are presented, with improved performance compared to existing methods based on conventional imaging, both in reconstruction performance and in handling the inherent direction ambiguity.
Moreover, we present a vast ablation study, noise robustness analysis, learned code contribution (including model size dependency), and central frame performance with a flexibility-quality tradeoff assessment.
Our method can assist in balancing the various trade-offs that a camera designer has to handle. For example, the promising results achieved hold the potential to extend the method to convert low-frame-rate blurred video to high-frame-rate sharp video, achieved with a lower sampling rate and improved light efficiency. This may extend existing photography capabilities with simple and minor hardware changes.
where f is the focal length, R is the exit pupil radius, λ is the illumination wavelength, z_img is the sensor plane, and z_i is the ideal image plane for an object located at z_o. The in-focus circular pupil function is denoted P(ρ, θ). By adding a coded pattern (amplitude, phase or both) at the exit pupil, the PSF of the system can be manipulated by a pre-designed pattern. The coding phase-mask located at the aperture is denoted C(ρ, θ); it is a circularly symmetric, piece-wise constant function representing the mask's phase-shift rings, such that for each ring k with r_k1 < ρ < r_k2 it holds that C(ρ, θ) = exp{jφ_k}, where φ_k is the phase shift of the ring. The specific parameters are presented in Appendix C. The defocus parameter ψ measures the maximum quadratic phase error at the aperture edge, such that we get:

ψ = (πR²/λ)(1/z_img + 1/z_o − 1/f) = (πR²/λ)(1/z_img − 1/z_i).

Following [14], the PSF of an incoherent imaging system is defined as:

h(ψ) = |F{P(ρ, θ) C(ρ, θ) exp(jψ(ρ/R)²)}|²,

where F denotes the Fourier transform. We compute the PSF for the RGB colors (λ ∈ {610, 535, 455}nm) to simulate the camera acquisition with RGB images by Eq. (2). This is an approximation of the real imaging system, which applies a specific PSF for each wavelength in the full spectrum of light. Under the assumption that the PSF changes slowly in λ compared to the bandwidth of the camera's color filter array (a Bayer filter, for each of the RGB colors), the approximation holds.
The PSF is a two-dimensional continuous function in the spatial coordinates of the image. Due to the focus change during the exposure, ψ changes and thus the PSF changes continuously in time. For the acquisition simulation by Eq. (2), the spatio-temporal PSF was discretized in time and space, as presented in Fig. 17.
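The incoherent PSF computation described above can be sketched numerically as the squared magnitude of the Fourier transform of the coded pupil with a quadratic defocus phase. This is a simplified single-wavelength sketch on a normalized unit-radius pupil; the grid size and the callable-mask interface are our own choices:

```python
import numpy as np

def incoherent_psf(psi, phase_mask, n=256):
    """Sketch of h(psi) = |F{P * C * exp(j*psi*rho^2)}|^2 on a normalized
    pupil grid (rho in [0, 1]); `phase_mask` maps normalized radius to the
    complex mask value C. Grid size n is illustrative."""
    x = np.linspace(-1, 1, n)
    xx, yy = np.meshgrid(x, x)
    rho2 = xx ** 2 + yy ** 2
    pupil = (rho2 <= 1.0).astype(complex)     # clear circular aperture P
    defocus = np.exp(1j * psi * rho2)         # quadratic defocus phase
    field = pupil * phase_mask(np.sqrt(rho2)) * defocus
    h = np.abs(np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(field)))) ** 2
    return h / h.sum()                        # normalize to unit energy
```

Evaluating this for the discretized ψ(t) values and the three RGB wavelengths yields a stack of time-varying color kernels, in the spirit of Fig. 17.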

B. Video loss and video metric
For a video loss and a video metric (for training and evaluation, respectively) we used an 18-layer ResNet3D model [64], which performs 3D spatio-temporal convolutions on the video space-time volume. We use the outputs of the first three convolution layers (namely conv1, conv2 and conv3) for the reconstructed frame sequence and for the ground-truth frames (each consisting of 7 frames, as described in Sec. 3.3). We compute the Smooth-L1 loss between the two feature sets for each layer output, obtaining three scalar values. These values (denoted l_k) represent similarity in both the spatial and temporal dimensions. For training, we average the three components to a single loss value (denoted l_vid in Sec. 3.3). For video sequence evaluation using the video metric, we compute each of the three components in log scale (as in PSNR) and average them to a single value (denoted VID in Sec. 3.3).

C. Dynamic phase-coded camera prototype

Our method is based on a dynamic phase-coded camera, designed to embed color-motion cues in the intermediate image, and a corresponding CNN trained to decode these cues and reconstruct a sharp frame burst. After achieving satisfying simulation results (i.e., with simulated coded images), we assembled a prototype camera implementing our proposed dynamic phase coding. As mentioned in the paper, our coding method is relatively simple and based mostly on conventional commercial parts. As such, it can easily be integrated into any camera equipped with a focusing mechanism.
The coding is achieved jointly by a phase-mask in the lens aperture and by performing a focus sweep during the exposure time. Following the work in [13], we use a similar phase-mask, comprised of two phase rings. The phase-mask aperture diameter is D = 2.3[mm]; the first ring (inner-to-outer) radii are r = [0.633, 0.92]mm and its phase shift is φ = 6.5[rad]; the second ring radii are r = [0.92, 1.15]mm and its phase shift is φ = 13.2[rad] (the phase shifts are measured with respect to λ = 455[nm], which is the peak wavelength of the camera's blue channel). The phase-mask is fabricated using a conventional photo-lithography and wet etching process.
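The two-ring mask above is a piece-wise constant phase function of the radius; it can be written down directly from the stated parameters (the function name is ours, and radii are taken in mm):

```python
import numpy as np

def ring_phase_mask(r_mm):
    """Two-ring phase mask C(rho) with the prototype parameters:
    ring 1 over [0.633, 0.92] mm with a 6.5 rad shift, ring 2 over
    [0.92, 1.15] mm with a 13.2 rad shift (both at 455 nm)."""
    phase = np.zeros_like(r_mm, dtype=float)
    phase[(r_mm >= 0.633) & (r_mm < 0.92)] = 6.5
    phase[(r_mm >= 0.92) & (r_mm <= 1.15)] = 13.2
    return np.exp(1j * phase)
```

Radii beyond the 1.15 mm aperture radius fall outside the pupil and are clipped by the aperture itself.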
The dynamic PSF encoding is achieved by applying the learned focus variation during the exposure time. The focus change is performed electronically, using the camera focusing mechanism, controlled by a dedicated micro-controller [3]. The micro-controller contains the learned focus-sweep parameters and triggers the required coding in synchronization with the exposure (utilizing the camera flash signal, designed to indicate the start of the exposure). Note that although various components were used in our implementation, the coding can also be implemented easily on existing cameras (assuming the availability of an API to the focusing and exposure mechanisms).

D. Learned code and PSF kernels visualization
A visualization of a single horizontally moving point light source is presented in Fig. 4b. It is obtained as described in Sec. 3.1 with the time-varying camera PSF, denoted h(ψ(t)). The values of the vector ψ ∈ R^49 (the exposure time discretized into 49 steps) are presented in Fig. 5, and the resulting 49 spatial color kernels are presented in Fig. 17, as discussed in Appendix A. Note that the color at the center of each kernel was used for the h(ψ(t)) points in Fig. 5. The numerical values of the learned ψ vector are provided in the additional materials with the submitted code.

E. Additional models architectures E.1. Deblurring models
For the deblurring task, we train the same model architecture for both the linear code and the learned code (presented in Tab. 4). We used the proposed UNet model with a few minor adaptations due to the different task. Since the time parameter is irrelevant (always t = 0), we did not concatenate a time channel to the input image. Moreover, we replaced AdaIN with group normalization, since it is more stable during training, and set the number of groups to 16.

E.2. AdaGN instead of AdaIN video model
In Tab. 3 we present a model with adaptive group normalization (AdaGN) instead of AdaIN (config-g). In this architecture, each instance normalization was replaced by group normalization with 16 groups. Following the normalization, we perform an affine transform, the same as in AdaIN.
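A minimal sketch of such a time-conditioned AdaGN module follows; the embedding size and the small MLP predicting the affine parameters are our own illustrative choices:

```python
import torch
import torch.nn as nn

class TimeAdaGN(nn.Module):
    """Group normalization (16 groups, no learned affine) followed by a
    scale and shift predicted from the relative time t (a sketch)."""
    def __init__(self, channels, groups=16, t_embed_dim=64):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.affine = nn.Sequential(
            nn.Linear(1, t_embed_dim), nn.ReLU(),
            nn.Linear(t_embed_dim, 2 * channels))

    def forward(self, x, t):
        # x: (B, C, H, W); t: (B, 1). Broadcast the affine over H and W.
        gamma, beta = self.affine(t).chunk(2, dim=1)
        return (1 + gamma[..., None, None]) * self.norm(x) + beta[..., None, None]
```

Compared to AdaIN, the statistics are computed over channel groups rather than single channels, which tends to stabilize training.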

E.3. Degraded architectures
For the degraded models presented in Fig. 11 we used a UNet architecture [46] with a depth of 2 for both the encoder and the decoder (two down-/up-sampling stages), and with Mobile-Net convolution blocks [47]. We added a skip connection from the input to the output, and added neither the concatenated time channel nor the positional encoding. For the B2D2 architecture we used double convolution blocks, while for B1D2 we used a single convolution block. Each convolution consists of an Inverted-Residual block [47] followed by a Leaky-ReLU activation. The batch normalization in the Inverted-Residual block was replaced by AdaGN, as described before. All other details remain the same as in the proposed architecture.

F. Video Frame Interpolation Details
For the video frame interpolation model we used three consecutive coded blurred frames as input, and reconstruct a single frame at time t ∈ [−1, 1]. The other training details remain as in the image-to-video case. We set two video timing setups for training and evaluation. Using a 960 fps sharp video dataset, we generated "48-8" blurred video data by averaging 48 frames in linear space into a single blurred frame (equivalent to a 50 ms exposure) and skipping 8 frames as the camera reset time (intervals of 56 frames in total). This timing setup is considered our baseline. In the second timing setup, denoted two-thirds of the baseline, we generate a "32-16" dataset accordingly; the exposure interval in this setup is 33.3 ms, which is two-thirds of the baseline. During training, we use batches of both timing setups to generalize the reconstruction to the different timings. Note that the t parameter represents both intra- and inter-blur-frame reconstruction for the specified interval, i.e., a sharp frame which is part of the blurred input image, or a sharp frame that lies between blurred frames (during the camera reset time).
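The "48-8" synthesis above can be sketched as follows. A simple gamma curve stands in for the inverse CRF (an assumption; the actual CRF handling is described in Appendix F.1), and the function name is ours:

```python
import numpy as np

def synthesize_blurred_video(sharp_frames, expose=48, reset=8, gamma=2.2):
    """Sketch of the '48-8' timing setup: average `expose` consecutive
    960 fps frames in linear intensity into one blurred frame, then skip
    `reset` frames as camera reset time. `gamma` approximates the CRF."""
    blurred = []
    step = expose + reset  # 56-frame interval per output frame
    for s in range(0, len(sharp_frames) - expose + 1, step):
        linear = np.stack(sharp_frames[s:s + expose]) ** gamma  # inverse CRF
        blurred.append(linear.mean(axis=0) ** (1.0 / gamma))    # re-apply CRF
    return blurred
```

Setting expose=32 and reset=16 yields the "32-16" (two-thirds) setup with the same 48-frame period per cycle scaled accordingly.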

F.1. Adobe240 Dataset
To verify our method on a dataset other than the one used for training and validation, we tested the models on the Adobe240 dataset [57] and present our results in the main paper. Since this dataset is oriented toward hand-held camera video deblurring, we picked 10 videos of dynamic scenes with less camera shake for our evaluation. As was done for the REDS dataset, before creating the blurred frames by averaging consecutive frames, we performed frame interpolation by a factor of 4 using [4] to obtain 960 fps video. An inverse CRF was applied to the frames prior to the PSF convolution and temporal averaging (blurring) for the blurred video simulation.

F.2. Prototype Camera Video Acquisition
We captured videos using our prototype camera, with the temporal phase coding applied to each frame. We follow the baseline timing, namely 6/7 of the cycle is exposure time and 1/7 is reset time. We capture at 4 fps, and the exposure time was set to ∼214 ms accordingly. The reset time in this case is 35.7 ms, which is enough for the liquid lens to return to its initial state before the next frame capture. Although a high fps was not our goal for the prototype, the more prominent bottleneck in our case was the camera's data transfer, due to the high-resolution frame acquisition (3200x2400), and not the liquid lens. As with the single-image models, for the prototype reconstructions we trained a model with 3% noise, since real-world images are noisier than the 1% noise used for evaluation and comparisons. We placed colorful images on a rotating wheel to control the rotation speed; the images are under a free-use Creative Commons license from the "pixnio" website. The captured blurred videos and the reconstructed videos are presented in the supplementary material.
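The timing above follows from the frame rate with a short calculation (the helper name is ours):

```python
def frame_timing(fps, exposure_fraction=6 / 7):
    """Split a frame cycle into exposure and reset durations in ms,
    given the frame rate and the exposed fraction of the cycle."""
    cycle_ms = 1000.0 / fps
    return exposure_fraction * cycle_ms, (1 - exposure_fraction) * cycle_ms
```

At 4 fps with a 6/7 exposure fraction this gives roughly 214.3 ms of exposure and 35.7 ms of reset per cycle, matching the values quoted above.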

G. Results
In addition to the image-to-video results presented in the paper, we present additional results here and in the supplementary video. The reconstructed videos were generated with 25 frames, since we can choose any number of frames using our time-dependent CNN. The frame-rate difference compared to [22] (which is limited to 7 frames only) is clearly noticeable in the supplementary video. A comparison between the frames of our results and those of [22] is presented in Fig. 19, Fig. 20 and Fig. 22. Our method achieves improved results along the entire frame burst.

Figure 1. Method demonstration. (a) A flower moving left captured using our dynamic phase-coded camera, which embeds (b) color-motion cues in the intermediate image. These cues guide our image-to-video reconstruction CNN, resulting in (c) a sharp video of the scene (play the video by clicking on (c) in Adobe Reader).

Figure 2. Overview of our suggested method. An acquisition of a dynamic scene using our dynamically phase-coded camera provides an intermediate image BC, which contains scene-dynamics cues in its coded motion blur. We reconstruct sharp video frames of the scene at desired timesteps t from the single coded-blurred image BC using a time-dependent CNN. The optical coding parameters are jointly optimized with the reconstruction network weights using end-to-end learning.

Figure 3. Network architecture. Our CNN is based on the UNet [46] model, with the coded blurred image and a time parameter as inputs and the sharp reconstructed frame as output (see Eq. (3)). The decoder part is controlled by the time parameter (using AdaIN [19]) to set the relative time of the reconstructed frame.

Figure 4. PSF coding. The spatio-temporal PSF coding of (a) the linear focus sweep [13], (b) the learned focus variation (simulation), and (c) the same PSF in an experiment. The PSF visualizations represent the blur of a point light source moving horizontally (left to right) during the exposure time. The joint effect of the phase-mask and the focus variation during exposure results in a different wavelength (color) being in-/out-of-focus as the point moves.

Figure 5. Learned defocus vector ψ. The learned parameters of the defocus change during the acquisition interval. The color of each sample represents the color response of the corresponding PSF kernel (at its center), as presented in Fig. 17.

Figure 8. Reconstruction performance (simulation). (top row) GT image and zoom-in for a 7-frame burst, (middle row) conventional blur and [22] results, and (bottom row) our coded input and reconstruction results. Our method achieves improved results along the entire burst and also provides a higher frame rate video. Click on the blurred input images (left) to play the result videos.

Figure 10. Real-world results. (a) Blurred image from (top) a conventional camera and (bottom) our coded camera; click on the blurred images to play the output videos. (b) Zoom-ins on 7 reconstructed frames of (top) [22] and (bottom) our results. Our method achieves improved results along the entire burst, reconstructs the correct motion direction and also provides a higher frame rate video.

Figure 11. Reconstruction PSNR vs. model size. PSNR reconstruction results for three different-size models: our proposed model and two lighter UNet models with an encoding depth of two and double/single convolution blocks (B2D2 and B1D2 respectively; details in Appendix E). The improvement of our learned code is more significant as the model becomes more degraded and less powerful.

Figure 14. Video frame interpolation: PSNR performance for different noise levels on the Adobe240 dataset [57]. Comparison of PSNR for two exposure intervals: baseline and two-thirds of the baseline. The noise axis was normalized with respect to the exposure interval based on the SNR behavior (more information in Sec. 4.6).

Figure 15. Video frame interpolation: SSIM-3D and VID performance for different noise levels on the Adobe dataset [57]. SSIM-3D and VID performance for two exposure intervals: baseline and two-thirds of the baseline. The noise axis was normalized with respect to the exposure interval based on the SNR behavior (more information in Sec. 4.6).

Figure 16. Prototype camera diagram. The flash signal from the camera initiates the learned focus variation during the exposure using a micro-controller, such that the designed dynamic phase coding is performed and a motion-coded image is acquired.

Figure 17. Learned PSF kernels. The time-variant camera PSF represented by 49 RGB color kernels, starting at the upper-left kernel (blue) and ending at the bottom-right kernel (red) in row-major order. These are a simulation of the camera PSF, computed as discussed in Appendix A.

Figure 18. Video frame interpolation: SSIM-3D and VID performance for different noise levels on the REDS dataset [37]. Comparison of the SSIM-3D metric and our VID metric for two exposure intervals: baseline and two-thirds of the baseline. The noise axis was normalized with respect to the exposure interval based on the SNR behavior (more information in Sec. 4.6).

Table 2. Quantitative comparison. PSNR, SSIM and VID metrics on the entire test set.

Table 3. Ablation study. To assess the contribution of each feature of our method, we performed a gradual performance evaluation. The PSNR and SSIM metrics were averaged over all the reconstruction timesteps during the acquisition interval, while the VID metric evaluates the whole scene sequence internally. All metrics were averaged over the test-set scenes.

Table 4. Central frame performance. Averaged PSNR/SSIM metrics on the central frame (on the test dataset) for different coding methods: uncoded, linear and learned (the letter in parentheses indicates the entry in Tab. 3). We also present models trained only for the deblurring task (central-frame reconstruction) for the flexibility-quality tradeoff assessment.