Ultrafast light field tomography for snapshot transient and non-line-of-sight imaging

Cameras with extreme speeds are enabling technologies in both fundamental and applied sciences. However, existing ultrafast cameras are incapable of coping with extended three-dimensional scenes and fall short for non-line-of-sight imaging, which requires a long sequence of time-resolved two-dimensional data. Current non-line-of-sight imagers therefore need to perform extensive scanning in the spatial and/or temporal dimension, restricting their use to imaging only static or slowly moving objects. To address these long-standing challenges, we present here ultrafast light field tomography (LIFT), a transient imaging strategy that offers a temporal sequence depth of over 1,000 frames and enables highly efficient light field acquisition, allowing snapshot acquisition of the complete four-dimensional space-time. With LIFT, we demonstrated three-dimensional imaging of light-in-flight phenomena with a <10 picoseconds resolution and non-line-of-sight imaging at a 30 Hz video rate. Furthermore, we showed how LIFT can benefit from deep learning for improved and accelerated image formation. LIFT may facilitate broad adoption of time-resolved methods in various disciplines.

We analyze the working principle of LIFT in detail here. Although relay systems are usually added for different applications, light field acquisition by a cylindrical lenslet array remains the same for all embodiments. We use the two-plane parameterization for light field analysis and do not consider occlusions here. For clarity, only four lenslets are shown in Supplementary Figure 1a, where the spatial axis (x) coincides with the sensor plane and the angular axis (u) resides on the lenslet-array plane. Each lenslet is also assigned a local coordinate system (in green), whose origin is the image of a point source located at infinity (indicated by the dashed blue lines).
The image formation onto a 1D sensor by a cylindrical lenslet is artificially decomposed into three steps here: (1) pin-hole image formation, (2) PSF convolution, and (3) resampled projection.

Step 1: pin-hole image formation. This is the classical imaging process. Consider a point source located at [x₀, z₀]; the pin-hole model predicts its local coordinate on the sub-image of the lenslet at angular position u as:

x_c = (u − x₀)·d/z₀,  (1)

where d is the lenslet-to-sensor distance.

Step 2: PSF convolution. A cylindrical lenslet differs from a perfect spherical lens in lacking optical power along one axis, which we refer to as the invariant axis. For a point source, it forms a finite line along the invariant axis at the image plane. The line length l is determined by the image magnification M of the system and the lenslet size as l = (1 + 1/M)·D_l, where D_l is the lenslet diameter. Such a line-shaped PSF disperses each point in the image space onto a pixel on a 1D sensor, as illustrated by the transition from step 1 to step 2 in Supplementary Figure 1b. Therefore, an individual pixel integrates the image along the PSF line, and a parallel-beam projection of the image is obtained on the 1D sensor along the angle of the invariant axis.
Step 3: resampled projection. For a fixed 1D sensor, the projection along different angles is acquired by rotating the cylindrical lenslet. As a result, the 1D sensor is generally not perpendicular to the projection direction. This is illustrated in step 3 of Supplementary Figure 1b, where solid black lines indicate the projection direction and the green axis represents the 1D sensor. To relate the unknown image to the acquired projection data via the Fourier slice theorem, it is necessary to make the projection perpendicular to the 1D sensor, which can be done by a computational resampling process. Denoting the angle between the projection and the y axis as θ, one can establish a local coordinate system [x′, y′], shown in red dashed lines, to obtain a virtual sensor line x′ that is perpendicular to the projection direction. The two coordinate systems are related by a rotation matrix:

[x′; y′] = [cos θ, sin θ; −sin θ, cos θ]·[x; y].  (2)

Combining Supplementary Equations (1) and (2), the image point in the auxiliary coordinate system is obtained as:

x′ = x·cos θ + y·sin θ.  (3)

The projection onto the virtual line sensor is done by simply dropping the y′ component:

x′_c = x_c·cos θ + y_c·sin θ.  (4)

Substituting the result back into Supplementary Equation (2), the experimentally recorded projection data is obtained as x = x′/cos θ. We dub the term cos θ the resampling factor: it resamples the experimentally recorded projection data (on the sensor line x) onto the desired recording line x′. In other words, each cylindrical lenslet performs a resampled projection onto the 1D sensor x. Ultimately, the LIFT image acquisition can be summarized into a single equation:

x = −(d/z₀)·(x₀·cos θ + y₀·sin θ)/cos θ + u·(1 + d/z₀).  (5)

The first two terms describe the projection process, and the third term is the light field component contributed by different lenslets, which enables post-capture refocusing and depth retrieval, as discussed in Supplementary Note 3.
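The resampled-projection forward model can be sketched numerically (a nearest-neighbor binning sketch under the parallel-beam assumption; not the authors' code — function and variable names are illustrative):

```python
import numpy as np

def lift_projection(img, theta_deg):
    """Parallel-beam projection of a 2D image along a cylindrical lenslet's
    invariant axis (angle theta from the y axis), recorded on a 1D sensor
    fixed along x; the 1/cos(theta) factor resamples the virtual detector
    coordinate back onto the physical sensor line."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    theta = np.deg2rad(theta_deg)
    # coordinate along the virtual detector (perpendicular to the projection)
    xp = (xs - w / 2) * np.cos(theta) + (ys - h / 2) * np.sin(theta)
    # resample onto the physical 1D sensor and accumulate (nearest neighbor)
    bins = np.round(xp / np.cos(theta) + w / 2).astype(int)
    proj = np.zeros(w)
    valid = (bins >= 0) & (bins < w)
    np.add.at(proj, bins[valid], img[valid])
    return proj
```

At theta = 0 this reduces to plain column sums; larger angles push part of the image off a fixed-length sensor, shrinking the usable field of view.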

Supplementary Figure 2. LIFT data sampling analysis in the Fourier domain. a, Illustration of the Fourier slice theorem and the limited view problem. b-c, Experimental images captured by the current LIFT camera for a fiber oriented at horizontal and vertical directions. d, Projection angle and field of view tradeoff for recording projection data at different angles on a 1D sensor by rotating a cylindrical lenslet. e, LIFT implementation using a Dove prism to span the projection angular range to [0°, 180°], eliminating the limited view problem. Four lenslets are shown for illustration purposes. f-g, Rotation of a LIFT camera or a LIFT camera array eliminates the limited view problem and enriches the number of projections and light field data. Only three lenslets are shown in the 1D camera for simplicity.

Supplementary Note 2. LIFT data sampling analysis

A rule of thumb for tomographic sampling states that to reconstruct an N × N image, ~N projections with ~N pixels of resolution are needed. Using 1D sensors with a limited pixel count (several thousand) for an image resolution over 100 × 100, practical implementations of LIFT usually restrict the number of projections to the order of ten. This casts LIFT as a sparse-view CT problem. Using n lenslets, the compression factor in LIFT for sampling an N × N image is therefore N/n, which is on the order of ten for most implementations. To minimize the correlations in the projection data in LIFT and therefore maximize the information content for reconstruction, it is also beneficial to arrange the projection angles uniformly.

The projection angle also trades off against the field of view (FOV), as shown in Supplementary Figure 2d: the line-shaped PSF is finite in length (l, as predicted in Supplementary Note 1). The maximum height of a point detectable by the 1D sensor is thence limited to h = l·cos θ/2, implying an achievable field of view of 2h = l·cos θ. As a result, one must strike a balance between the FOV and the angular range. In practical implementations, the angular range is limited to [θ₁, θ₂], leading to a missing cone in k-space, as indicated by the gray area in Supplementary Figure 2a.
Tomographic reconstruction in this case results in degraded image quality, which is referred to as the limited view problem. Our current LIFT implementation suffers from the limited view problem since the angular range of projection is about [−45°, 45°] with respect to the y axis. Supplementary Figure 2b-c show two experimentally acquired images of a fiber oriented at vertical and horizontal directions, respectively: due to the limited view problem, the horizontal fiber shows about 2-3 times lower resolution.
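The compression-factor bookkeeping from the sampling analysis can be sketched as a back-of-the-envelope helper (not the authors' code; the 7-lenslet count mirrors the implementation discussed later):

```python
def compression_factor(n_image, n_lenslets):
    """Ratio of unknowns (an N x N image) to measurements
    (n_lenslets projections of ~N pixels each)."""
    unknowns = n_image * n_image
    measurements = n_lenslets * n_image
    return unknowns / measurements

# A 128 x 128 image sampled by 7 cylindrical lenslets:
cf = compression_factor(128, 7)  # on the order of ten: sparse-view regime
```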

Remedying the limited view problem.
There are several methods for mitigating this problem. The first resorts to deep learning: by training a neural network for the system with enough data afflicted by the limited view problem, the network can learn the pattern (or statistical distribution) of imperfections in the reconstructed images and correct them thereafter. This solution is system-specific and can substantially mitigate, but not eliminate, the limited view problem. The second method is to insert a Dove prism after a relay lens that projects the image of the original object to infinity, as diagrammed in Supplementary Figure 2e. The Dove prism is rotated by 45 degrees so that the image passing through it is rotated by 90 degrees, allowing the cylindrical lenslets behind it to fill in the missing cone and thus eliminating the limited view problem. The downside of using a Dove prism is that it introduces astigmatism for non-collimated light and chromatic aberrations for broadband scenes, compromising the 3D imaging performance of LIFT. Another practical method is to rotate the camera or, equivalently, build a camera array, as shown in Supplementary Figure 2f-g. This requires the camera to be compact and, if using rotation, the intended applications to be repeatable, such as NLOS imaging using compact SPAD cameras. For instance, rotating a LIFT camera with 7 lenslets three times will not only enrich the projections to 21 for eliminating the limited view problem but also extend the light field to 2D. A similar gain can be obtained by a camera array implementation. Because the deep learning method has the advantage of simplicity and faster image reconstruction, it is the method of choice for the current demonstration.
Supplementary Note 3. LIFT light field imaging capabilities

3.1 Refocusing. As depicted in Supplementary Figure 3a, to focus on a different plane P₂, the 1D sensor needs to be moved by Δd. The light field at the new virtual sensor plane is calculated as:

L′(x, u) = L((x + s·u)/(1 + Δd/d), u),  (6)

where s = −Δd/(Δd + d). Ignoring the magnification factor (1 + Δd/d), which is constant across the whole image area when computationally refocusing, one can rewrite Supplementary Equation (6) as:

L′(x, u) = L(x + s·u, u).  (7)

This is exactly the refocusing formula in the ray space for light field cameras 2, except that LIFT captures only the angular information along one axis (u) instead of two. Hence, refocusing onto different depths can be achieved in LIFT by shearing-and-reconstructing: shear the acquired projection data and then perform image reconstruction. This is also clear from Supplementary Equation (5), which describes the light field at the sensor plane when the nominal focal plane is at infinity.
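The shear step of this shearing-and-reconstructing recipe can be sketched as follows (a minimal numpy sketch under the assumption that each lenslet's 1D projection is shifted in proportion to its angular coordinate u; not the authors' code):

```python
import numpy as np

def shear_projections(projections, u_coords, s):
    """Shift lenslet i's 1D projection by s * u_i (linear interpolation),
    i.e., shear the light field before tomographic reconstruction."""
    n_pix = projections.shape[1]
    x = np.arange(n_pix)
    out = np.zeros_like(projections, dtype=float)
    for i, u in enumerate(u_coords):
        # new(x) = old(x - s*u); values shifted off the sensor are dropped
        out[i] = np.interp(x - s * u, x, projections[i], left=0.0, right=0.0)
    return out
```

Refocusing at a different depth then amounts to re-running the same reconstruction on the sheared sinogram.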
Two points on the light field characteristics of LIFT are in order. The first concerns the dimensionality of the angular information. While the angular information can be extended to 2D as shown in Supplementary Note 2, our current implementation records a 1D light field: there is no angular component (or disparity) along the other axis. However, LIFT still produces a 2D defocus blur, rather than the 1D blur of conventional 1D light field cameras (inset of Supplementary Figure 3b). This is attributed to the tomographic reconstruction nature of LIFT. We illustrate in Supplementary Figure 3b the back-projection reconstruction of a point source from three projections. For the in-focus point, the three projections intersect at a single point and reconstruct the point correctly. When the projection data is sheared to refocus on a different plane, the three back-projected lines intersect at three points that spread over a 2D area instead of falling on the same line (which would require the three projection directions to be collinear). The second point is the generation of ghost images when an image part is heavily defocused. Due to the sparse-view acquisition, the three points in Supplementary Figure 3b are clustered together for small defocus but become well separated under heavy defocus. With more projections (views), there will be more intersection points that ultimately fill the voids in between, producing a blurring bokeh as in conventional photography. This is validated by a simulation of LIFT imaging of a point source using 7 and 127 views under heavy defocus (a 7-pixel disk size for a 128×128 image), where the convergence of the ghost parts to a defocus blur is clearly observed. Such behavior is not peculiar to LIFT: light field cameras with low angular resolution will also produce ghost images when refocusing far away from an image's actual focal plane 3.
2D angular information would benefit LIFT with an enhanced 2D reconstruction, as it yields a larger number of projections, and would consequently improve 3D reconstruction as well.
Supplementary Figure 3. Light field imaging process for LIFT. a, Light field propagation through different planes for refocusing. b, Illustration of ghost image generation in LIFT reconstruction. Defocus leads to ghost image generation in sparse-view tomographic reconstruction. However, with more views, it eventually converges to a blurring bokeh as in conventional imaging methods. c, Processing pipeline for 2D (x, y) and 3D (x, y, z) imaging in LIFT. For 4D (x, y, z, t) imaging, the 3D image processing is individually applied at each time instance. d, Experimental lenslet arrangement for the depth-of-field and depth-sense versions of LIFT, with the black solid line representing the invariant axis of each cylindrical lenslet. The angles underneath each lenslet are with respect to the y axis (counterclockwise being the positive direction) and are listed with an accuracy of 1 degree for clarity.

Computationally extending depth of field.
To extend the depth of field in LIFT, the measurement data is processed to reconstruct an image set computationally refocused on all different depths. Next, the sharpest feature around a region of interest (ROI) for each pixel is identified across the image set, and an all-in-focus image is subsequently assembled by combining the sharpest parts via graph cut algorithms 4. Such an extended depth of field is obtained at the expense of processing time. Also, it requires the image to show enough features. The depth-of-field version of LIFT described below sidesteps these two drawbacks altogether.
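A per-pixel sketch of this assembly (using a simple argmax of a Laplacian focus measure in place of the graph-cut algorithm the text describes; names and the focus operator are illustrative):

```python
import numpy as np

def all_in_focus(stack):
    """stack: (n_depths, H, W) focal stack refocused at different depths.
    Returns the composited all-in-focus image and the per-pixel depth index."""
    # |Laplacian| as a crude sharpness measure for each refocused slice
    lap = np.stack([np.abs(4 * s - np.roll(s, 1, 0) - np.roll(s, -1, 0)
                           - np.roll(s, 1, 1) - np.roll(s, -1, 1)) for s in stack])
    idx = lap.argmax(axis=0)                           # sharpest slice per pixel
    aif = np.take_along_axis(stack, idx[None], axis=0)[0]
    return aif, idx
```

A graph cut would additionally enforce spatial smoothness of the per-pixel labels, avoiding speckle-like label noise in textureless regions.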

Lenslet arrangement.
LIFT can achieve an extended depth of field without resorting to computational refocusing: by arranging the cylindrical lenslets judiciously, an all-in-focus image can be obtained automatically. Adding a shearing term to Supplementary Equation (5) for refocusing, one obtains a shift Δx in the object space due to refocusing. Notably, there is an interaction between the lenslet projection angle θ and the angular component u of the light field. If the cylindrical lenslets are arranged in such a manner that the angle as a function of the angular coordinate satisfies

θ(u) ≈ a₁·u,  (9)

then Δx will be a constant independent of the angular variable u, meaning no blurring in the resultant image because all lenslets contribute the same shift for a specific object plane (indexed by the shearing term s). The object plane at different depths is indeed shifted from the nominal center by an amount determined by its defocus distance, but each plane is well focused. This makes the reconstructed image automatically all-in-focus. This configuration is dubbed the depth-of-field version (Supplementary Figure 3d). Still, it is feasible to undo this automatic all-in-focus effect and computationally defocus by changing the shearing term s from a constant to a lenslet-specific number. The cost of this arrangement is degraded depth retrieval accuracy, because only residual defocus errors are left, contributed by the approximation error in Supplementary Equation (9). To optimize the lenslet configuration for 3D imaging, we expand θ as a function of u into a Taylor series:

θ(u) = a₀ + a₁·u + a₂·u² + ⋯.  (10)

Substituting into Δx, it becomes evident that maximizing the first-order coefficient a₁ leads to the depth-of-field lenslet configuration, whereas minimizing it to zero will maximize the defocus error and hence optimize depth retrieval. It is straightforward to perform a search over the permutations of a pre-determined set of projection angles to find the near-optimum configuration for depth retrieval.
The resultant configuration is dubbed the depth-sense version (Supplementary Figure 3d).
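The permutation search described above can be sketched as follows (illustrative angle set and lenslet coordinates; the real system would score permutations by the residual defocus error rather than the bare linear fit used here):

```python
import numpy as np
from itertools import permutations

def rank_arrangements(angles_deg, u_coords):
    """Try every ordering of a fixed set of projection angles and rank
    them by |a1|, the first-order coefficient of a linear fit theta(u)."""
    scored = []
    for perm in permutations(angles_deg):
        a1 = np.polyfit(u_coords, perm, 1)[0]   # slope of theta vs. u
        scored.append((abs(a1), perm))
    scored.sort(key=lambda t: t[0])
    return scored[0][1], scored[-1][1]          # (depth-sense, depth-of-field)

# Hypothetical 5-lenslet example: minimizing |a1| favors depth retrieval,
# maximizing it favors the all-in-focus (depth-of-field) arrangement.
dsense, dof = rank_arrangements([-30, -15, 0, 15, 30], np.arange(5) - 2)
```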
Supplementary Figure 4 demonstrates experimentally that refocusing in the depth-of-field version of the LIFT camera only induces an image shift. When computationally sweeping the focus from near (a) to far (d), the reconstructed helical fiber is shifted upwards from left to right. However, the helical fiber structure is well resolved in all cases despite its large depth range.

Depth retrieval.
LIFT can extract depth via the depth-from-focus (DfF) method 5, thereby yielding a 3D image at each time instance. In DfF, the camera captures a sequence of images of the scene at different focal settings, producing a focal stack. To infer depth, a focus measure (the sum of modified Laplacian 6) is computed for each pixel of the image, and the focal setting giving rise to the maximum focus measure is identified, which can be mapped to a depth value. For light field cameras like LIFT, the focal stack is captured in a single snapshot and produced computationally by refocusing the image at different depths. The processing pipeline of LIFT for reconstructing multi-dimensional images (2D, 3D, and 4D) is summarized in Supplementary Figure 3c. Each 1D measurement is ordered into a sinogram (the projection data), which can be directly reconstructed into a 2D (x, y) image or put through a shear-and-reconstruct process to refocus on different depths, producing a focal stack. Afterwards, the focal stack is co-registered because refocusing can induce image shifts, as explained in the previous Supplementary Note. The denoising algorithm VBM3D is then applied to attenuate the refocusing artefacts in the focal stack, which substantially improves the robustness of depth retrieval. Finally, the focus measure is computed for each pixel, and a quick-sort algorithm identifies the correct focal setting and maps that pixel to the corresponding depth, yielding the 3D image (x, y, z). Owing to the decoupled space-time acquisition in LIFT, the 2D and 3D image processing is independently performed at each time instance to produce the final 3D (x, y, t) or 4D (x, y, z, t) results.
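The per-pixel focus measure and depth lookup can be sketched as follows (a minimal version of the sum-of-modified-Laplacian; the window aggregation, co-registration, and VBM3D denoising steps are omitted):

```python
import numpy as np

def sum_modified_laplacian(img):
    """Sum-of-modified-Laplacian focus measure: unlike the plain Laplacian,
    the second derivatives along x and y are rectified before summing."""
    return (np.abs(2 * img - np.roll(img, 1, 0) - np.roll(img, -1, 0))
            + np.abs(2 * img - np.roll(img, 1, 1) - np.roll(img, -1, 1)))

def depth_from_focus(stack, depths):
    """stack: (n_depths, H, W) focal stack; depths: matching depth values.
    Returns the per-pixel depth map via argmax of the focus measure."""
    fm = np.stack([sum_modified_laplacian(s) for s in stack])
    return np.asarray(depths)[fm.argmax(axis=0)]
```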
The focus-to-depth mapping for LIFT is illustrated in Supplementary Figure 5a, where the image relay system is not included. A relay system with a magnification of M changes the depth retrieval accuracy by M². For a plane at [0, 0, z], the distance between the leftmost and rightmost sub-images is:

D = B + B·d/z,  (11)

where B is the baseline length of the lenslet array and d is its distance to the sensor. To connect depth with the refocusing parameter s, it is noted that the distance at infinity is D∞ = B, and refocusing from infinity onto depth z involves shearing the light field, which leads to D = D∞ + s·(u_r − u_l) = B + s·B, where u_l and u_r indicate the leftmost and rightmost angular components, respectively. Solving the above equations yields s = d/z.
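Under the geometry just described (sub-image separation D(z) = B·(1 + d/z), with a one-pixel change of D on a sensor with N pixels across the baseline setting the accuracy), the numbers can be checked with a small helper (a sketch, not the authors' code; all lengths in the same unit):

```python
def refocus_parameter(d, z):
    """Shear parameter s that refocuses from infinity onto depth z,
    given lenslet-to-sensor distance d (s = d / z)."""
    return d / z

def depth_accuracy(z, d, baseline, n_pixels):
    """Minimum resolvable depth change: a one-pixel change of the
    sub-image separation D(z) = B * (1 + d / z) on a sensor with
    n_pixels across the baseline B."""
    dD = baseline / n_pixels           # one-pixel change of D
    return z**2 * dD / (baseline * d)  # from |dD/dz| = B * d / z**2
```

The quadratic growth of depth_accuracy with z reflects the usual rapid loss of depth sensitivity for distant planes.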
The depth retrieval accuracy Δz is the minimum depth change that causes a one-pixel variation in the distance D. Given a linear sensor with N pixels across the baseline, a one-pixel change is ΔD = B/N. Taking the derivative of Supplementary Equation (11) with respect to z yields |dD/dz| = B·d/z², and therefore Δz = z²·ΔD/(B·d).

Supplementary Figure 5. Focus-to-depth mapping in LIFT. a, Illustration of the focus-to-depth mapping. b, Fitting with Supplementary Equation (11) can extract the parameters a and D. c-e, Computationally generated focal stack, extracted depth map, and 3D rendering of a slanted plane displaying a grid pattern of dots. The slanted shape of the plane is well reproduced. It is also noted that defocus causes the dots to become dimmer in the focal stack as energy gets spread over a larger area.

Supplementary Note 4. Modeling of non-ideal effects in LIFT system
In practical system implementations, there will be misalignments between the 1D sensor and the individual cylindrical lenslets, which will affect image quality if not accounted for. As the misalignment of a lenslet shifts its sub-image from the ideal position by a vector r⃗, it can be modeled as a convolution with a shifted Dirac delta function δ(x − r⃗). The forward model in LIFT can then incorporate the non-ideal effects as:

b = A·C·g = A·g′,  (12)

where g′ is the uncorrected image vector and C is the convolution matrix C = diag(C₁, …, C_n), with C_i being the block Toeplitz matrix of the point spread function of lenslet i. By calibrating with an arbitrary point source, indicated as vector e, one can reconstruct the point spread function of the non-ideal system as g′ = C·e, which recovers the matrix C. The true image can then be recovered by deconvolving the uncorrected reconstruction with the calibrated PSF using the Richardson-Lucy algorithm. Supplementary Figure 6 illustrates experimentally the benefits of modeling the non-ideal effects in LIFT: the calibrated point spread function and the iteratively reconstructed images of the helical fiber without and with deconvolution are shown in Supplementary Figure 6.
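A minimal 1D Richardson-Lucy deconvolution sketch (the calibrated PSF plays the role of `psf`; not the authors' implementation):

```python
import numpy as np

def richardson_lucy(measured, psf, n_iter=30):
    """Iteratively recover an image from its PSF-blurred measurement.
    `measured` and the recovered image share the same length; `psf`
    is the calibrated (non-negative, normalized) system PSF."""
    estimate = np.full_like(measured, 0.5, dtype=float)
    psf_mirror = psf[::-1]
    for _ in range(n_iter):
        forward = np.convolve(estimate, psf, mode="same")
        ratio = measured / np.maximum(forward, 1e-12)
        estimate = estimate * np.convolve(ratio, psf_mirror, mode="same")
    return estimate
```

The multiplicative update preserves non-negativity, which suits photon-count data.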

Supplementary Note 5. LIFT imaging quality
The image quality of LIFT depends on both the compression factor, or equivalently, the number of projections, and the signal to noise ratio of the measurement data.

Compression factor.
The sampling analysis in the previous Supplementary Note indicates that a larger number of projections will fill the k-space more densely and therefore lead to better reconstruction quality in LIFT, owing to the reduced compression factor. This is illustrated in Supplementary Figure 7, which shows the recovered images for a Shepp-Logan phantom and a cluttered cameraman photograph using different numbers of projections at a resolution of 128×128. The sampling angular range is [0°, 180°], and the transform function is chosen as total variation (TV) to encourage sparsity in the image gradient. Sampled at the Nyquist rate, the images recovered with 128 projections serve as the ground truth references for calculating the peak signal to noise ratio (PSNR) of the other reconstructed images. It is noted that, as the compression factor gets larger (i.e., fewer projections), the PSNR of the reconstructed images becomes smaller and fine image details gradually get washed out. Moreover, the cluttered cameraman photograph renders a smaller PSNR than the Shepp-Logan phantom at the same compression factor. Therefore, the number of projections must be appropriately scaled to accommodate scenes of different complexity. This is expected and conforms to the general observations in sparse-view CT reconstruction. Both the data fidelity and regularization terms in the optimization-based formulation contribute to the improved noise robustness of LIFT reconstruction over filtered backprojection. The data fidelity is a least-squares term that tends to suppress noise at the expense of resolution. The regularization term is critical for noise attenuation as it denoises intermediate reconstructions in each iteration, which is particularly evident under the framework of regularization by denoising (RED) for inverse problems.
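The PSNR figure of merit used above follows the standard definition (a sketch; here the Nyquist-sampled, 128-projection reconstruction plays the role of the reference):

```python
import numpy as np

def psnr(reference, test):
    """Peak signal-to-noise ratio in dB, using the reference's peak value."""
    mse = np.mean((reference.astype(float) - test.astype(float)) ** 2)
    return 10.0 * np.log10(reference.max() ** 2 / mse)
```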
Supplementary Note 6. NLOS imaging via LIFT

6.1 Experimental setup. We summarize below the equipment used in the LIFT system.

6.2 Compressibility of the NLOS (x, y, t) datacube. Mediated by a relay wall, NLOS imaging shows drastically different characteristics from natural photographs: the instantaneous (and steady state) images on the wall are generally smooth and highly compressible, even for complex hidden scenes. We show the instantaneous images on the wall for hidden scenes of different complexity, which were simulated by a transient renderer 7 on publicly accessible datasets 8. For these synthetic datasets, the time bin is 10 ps, and the laser incident point is at the center of the wall. The camera (point scanning or parallel detectors) samples the wall at a resolution fixed at 128×128, regardless of the grid size. To simulate the (x, y, t) datacube recoverable by the LIFT camera, we employed a two-step process: 1) changing the PSF of the camera in the synthetic dataset to line-shaped PSFs to model the LIFT camera, and 2) reconstructing the datacube with the iterative FISTA algorithm, owing to its greater flexibility in handling LIFT models using different numbers of projections. As NLOS imaging typically employs compact SPAD sensors and does not require snapshot acquisition of all projections, the projection data can be enriched by a few rotations of a 1D SPAD-based LIFT camera, as discussed in Supplementary Figure 2f-g. The hidden scenes are reconstructed at a volumetric resolution of 128×128×128 using the extended phasor-field method.

Supplementary Figure 10a depicts the instantaneous images on the wall at several representative time instants for a hidden resolution target placed 0.5 m away from the wall (grid size: 1 m × 1 m). The recovered instantaneous images using different numbers of projections are shown in the second to fourth rows. The (x, t) slices at different y of the datacube and the reconstructed hidden scene (maximum intensity projection along the depth direction) are compared in a tabular format in Supplementary Figure 10b and c, respectively. It is noted that the instantaneous images are highly structured and smooth. As a result, LIFT can recover well both the instantaneous images and the hidden scene using only 14 projections, corresponding to ~10% of the data load of the point-scanning method. Compared with the ground truth images, LIFT results tend to be smoother, particularly for the cases using a small number of projections. This is attributed to LIFT's radial sampling pattern in k-space: high spatial frequencies are more sparsely sampled, as indicated in Supplementary Figure 2a. Nonetheless, LIFT using 7 to 14 projections can still detect, though not resolve, the smallest strips in the resolution target. The reconstruction by LIFT using 7 projections rendered the main shapes of the resolution target despite some artefacts.

Supplementary Figure 10. Compressibility of NLOS imaging for the hidden scene of a resolution target. a, (x, y) images: the instantaneous images on the wall, acquired by point scanning (ground truth) and the LIFT camera using different numbers of projections. b, (x, t) slices at different y of the corresponding spatiotemporal datacube on the wall. c, The reconstructed hidden scene. PS: point-scanning data acquisition. LIFT-N: LIFT data acquisition using N projections. Scale bar: 140 mm.

Supplementary Figure 11 shows the results for a complex bookshelf scene ~2 m away from the wall (grid size: 1.8 m × 1.8 m). The same observations can be made: LIFT using only 14 projections can recover the hidden scene decently, although there is a slight resolution degradation and an increase in artefacts/noise. Using 7 projections (compression factor ~20) for data acquisition, LIFT can recover the main shapes of the scene despite some background noise.

6.3 Noise robustness. The robustness to noise was studied for the bookshelf scene, which suffers from strong inter-reflections. The global maximum photon count in the datacube is varied as in previous works 9. For LIFT, the maximum photon counts are in the projection measurements rather than the reconstructed (x, y, t) datacube. Supplementary Figure 12a shows, in a tabular arrangement, the reconstructed bookshelf using a maximum photon count ranging from 2000 down to only 50 along the row direction and a varying number of projections along the column dimension. The reconstruction results of the phasor-field method using point scanning are given in Supplementary Figure 12b as references, with maximum photon counts varied from 200 to 5. While the point-scanning method recovers the bookshelf with a maximum photon count of 10, LIFT using 21 projections needs 100 counts to recover the main shapes of the bookshelf. This indicates that LIFT using 21 projections is about 10 times noisier than the point-scanning method. Fewer projections in LIFT require more photons to recover the hidden scene and tend to produce smoother results. This is expected, as fewer projections produce stronger reconstruction artefacts and noise. However, LIFT can readily compensate for its noisier reconstruction by allowing a longer exposure time while still maintaining a 30 Hz video rate.
With 7 projections in a 1D SPAD camera, LIFT can acquire 21 projections using only three rotations, leading to an exposure time of 10 ms at each rotation for imaging at 30 Hz. In contrast, point scanning 10 at 32×32 resolution is still ~10 times away from 30 Hz, even using an exposure time as short as ~250 µs. Scanning a 1D SPAD array along one spatial axis can reach 30 Hz at a resolution of 100×100, but only at an exposure time of 300 µs (30 ms/100) for each line, which is 30 times shorter than that of LIFT. Compared with 2D SPAD cameras, LIFT using a 1D SPAD array benefits from a ~10 times larger fill factor, which currently floats around 10% in state-of-the-art 2D designs. Therefore, LIFT can collect over ten times more photons to compensate for its higher noise level while offering unique advantages: compressive data acquisition and full-fledged light field capabilities. Given an (x, y, t) datacube of 128×128×1000 acquired with 8-bit precision, the resultant data load is 16 megabytes per frame, more than twice that of a 4K ultra-high-definition camera. Reliably streaming such data at 30 Hz typically requires nontrivial compression algorithms. Instead, LIFT with 21 projections reduces the data load during acquisition by more than six times. Moreover, the light field capability of LIFT is inherently challenging to implement in scanning-based methods or 2D SPAD cameras without incurring a substantial increase in system complexity and data load. Lastly, we summarize the photon counts used in the simulation. For the bookshelf scene, the time bin is 10 ps, and the global maximum photon count is 100 for LIFT reconstruction, leading to an average of ~20 photons per time bin in the window of [0, 24 ns]. This corresponds to reconstruction in the point-scanning method using a maximum of 10 photons (1.2 average photons per bin in the same time window).
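The data-load arithmetic above can be checked with a small helper (the 128-pixel projection length is an assumption matching the image resolution):

```python
def datacube_mb(nx, ny, nt, bits=8):
    """Raw size of an (x, y, t) datacube in megabytes."""
    return nx * ny * nt * bits / 8 / 1e6

def lift_mb(n_proj, n_pix, nt, bits=8):
    """Raw size of LIFT projection data: n_proj 1D projections of n_pix pixels."""
    return n_proj * n_pix * nt * bits / 8 / 1e6

full = datacube_mb(128, 128, 1000)   # ~16.4 MB per datacube
lift = lift_mb(21, 128, 1000)        # ~2.7 MB for 21 projections
```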
After accounting for the time bin difference, this is on par with previous experimental data of a complex hidden scene that has a maximum of 6 photons with a 4 ps time bin, acquired using 1 ms exposure 9 . This validates that LIFT using 1D SPAD cameras holds promise for large scale NLOS imaging at 30 Hz video rate.

Resolution of NLOS imaging.
The resolution of NLOS imaging is primarily determined by the camera temporal resolution τ (including system jitter if temporal scanning/averaging is employed) and the wall size 11. For the current NLOS demonstration, the temporal resolution τ (jitter-limited) is about 60 ps, leading to an axial resolution of Δz = c·τ/2 ≅ 9 mm. The theoretical lateral resolution is derived by O'Toole et al. 11 as Δx = (c·τ/w)·√((w/2)² + z²), with w being the wall size and z the distance of the hidden scene to the wall. A similar resolution bound is also found by Liu et al. 9. However, LIFT reconstruction through either iterative methods or a deep adjoint neural network does not achieve a perfect recovery of the ground truth images, particularly for the high-frequency details, as shown in Supplementary Figures 10-11. Also, the image reconstruction of LIFT tends to render worse resolution in scenarios with noisy measurement data (see Supplementary Note 5). These factors degrade the lateral resolution of LIFT for NLOS imaging in practice.
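These bounds can be evaluated numerically (a sketch; the lateral bound follows the O'Toole-style formula quoted above, with all lengths in mm):

```python
C_MM_PER_S = 2.998e11  # speed of light in mm/s

def axial_resolution(tau_s):
    """Axial resolution from the temporal resolution tau: dz = c*tau/2."""
    return C_MM_PER_S * tau_s / 2

def lateral_resolution(tau_s, wall_mm, z_mm):
    """Lateral bound dx = (c*tau/w) * sqrt((w/2)**2 + z**2)."""
    return (C_MM_PER_S * tau_s / wall_mm) * ((wall_mm / 2) ** 2 + z_mm ** 2) ** 0.5

dz = axial_resolution(60e-12)             # ~9 mm for tau = 60 ps
dx = lateral_resolution(60e-12, 400, 250)
```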
Supplementary Figure 13 shows the characterization of the NLOS imaging resolution using measurement data with a low SNR. The hidden scene consists of two 20-mm-wide strips separated by 100 mm, placed ~250 mm away from the wall. The raw temporal signals with maximum intensity in each lenslet (Supplementary Figure 13a) are very noisy, causing the reconstruction to suffer from artefacts and a degraded resolution. As 20 mm is smaller than the practical lateral resolution, the strip image renders the line spread function of the system. The resolution is then estimated to be ~40 mm by averaging the two strips' full width at half maximum (FWHM) in Supplementary Figure 13c. The theoretical lateral resolution is ~18 mm in this case, with the effective wall size (containing NLOS signals) being ~400 mm, while the camera FOV on the wall is ~600 mm. The resolution can be improved by using more projections (via camera rotation or a camera array) and a higher laser power for an improved SNR.

Supplementary Note 7. Benchmark LIFT against state-of-the-art transient and NLOS imaging methods
To better understand the strengths and limitations of LIFT for transient imaging in general and NLOS imaging in particular, we compare it with other ultrafast cameras (Supplementary Table 1) and NLOS imaging methods (Supplementary Table 2) below. It is noted that, given the same signal to noise ratio, image acquisition at or over the Nyquist sampling rate generally represents an upper bound on the imaging quality (contrast, spatiotemporal resolution), a condition that compressive imaging methods asymptotically approach. While LIFT image acquisition is scalable in the compression factor and can attain up to the Nyquist sampling rate (as discussed in Supplementary Note 2.2), its practical implementations mostly fall into the compressive regime. As a result, we only compare the imaging metrics based on our current compressive LIFT camera. For NLOS imaging, the comparison excludes imaging metrics obtained using retro-reflective objects, since they are less common and typically render orders of magnitude stronger signals than diffusive targets. Also, the camera's spatial resolution on the wall is taken as 100×100 (except for edge-resolved transient imaging (ERTI), which only involves a one-dimensional angular scan). Since the spatial resolution of NLOS imaging degrades linearly with the distance to the wall, the listed resolution is accompanied by the distance at which it was evaluated. One notable exception is ERTI, which has a constant angular resolution; its lateral resolution, which equals the angular resolution times the distance to the wall, thus degrades at a faster rate, as in phased-array radar imaging.

LIFT features a unique light field capability with the deepest sequence depth yet, and manages to use a small compression factor for snapshot 2D transient imaging at a resolution over 120×120. While SPAD cameras 16,17 can acquire high-resolution images at the Nyquist rate and therefore accommodate cluttered natural scenes better, the need for spatial scanning and repeated illumination leads to prolonged acquisition. Interestingly, the transient images at each time instant obtained by SPAD cameras 16,17 also show notable compressibility: they are far simpler than the static photograph of the cluttered scene, a property that will be accentuated with a higher temporal resolution. The snapshot acquisition enables LIFT to achieve drastically faster NLOS imaging with a resolution and quality close to those of dense point-scanning methods, allowing a low laser power to be used for imaging over a 1 m scale. By scaling according to the r⁴ photon decay law in NLOS imaging, LIFT is expected to reach an imaging volume of around 3 m × 3 m × 3 m with an average laser power of 160 mW. Its light field capabilities will also be an important ingredient in translating NLOS imaging to field deployment.