Introduction

The technique of imaging objects out of the direct line of sight has attracted increasing attention in recent years1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26. A typical non-line-of-sight (NLOS) imaging scenario is looking around the corner with a relay surface, where the target is obscured from the vision of the observer. NLOS imaging aims to recover the albedo and surface normal of the hidden targets with the measured photon information. Potential applications of NLOS imaging include but are not limited to robotic vision, autonomous driving, rescue operations, remote sensing and medical imaging.

To achieve NLOS reconstruction, laser pulses of high temporal resolution are used to illuminate several points on the relay surface, where the first diffuse reflection occurs. After that, photons enter the NLOS domain and are bounced back to the visible surface by the unknown targets. The hidden targets can then be reconstructed from the time-resolved photon intensity measured at several detection points on the visible surface. Commonly used time-resolved detectors are single-photon avalanche diodes (SPADs)27. The imaging system is confocal if the illumination point coincides with the detection point for each spatial measurement, and non-confocal otherwise. Besides, we call the measurements regular if the illumination and detection points are uniformly distributed in a rectangular region.
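As a small illustration of this terminology, a scan can be checked for confocality as follows (a hypothetical helper written for this exposition; it is not part of any published NLOS code):

```python
def is_confocal(pairs, tol=1e-9):
    """Return True if every illumination point coincides with its detection
    point, i.e. the scan is confocal; otherwise the scan is non-confocal.

    pairs: iterable of ((xi, yi, zi), (xd, yd, zd)) measurement pairs."""
    return all(all(abs(i - d) <= tol for i, d in zip(illum, det))
               for illum, det in pairs)
```
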

According to the representation of the hidden surface, existing imaging algorithms are divided into three categories: point-cloud-based28, mesh-based29 and voxel-based methods1,8,9,30,31,32,33,34,35. Among these categories, voxel-based algorithms prove to be the most efficient, offering low time complexity32 and fine reconstruction results34. For voxel-based methods, the reconstruction domain is discretized with three-dimensional grid points and the albedo is represented as a grid function.

The first voxel-based NLOS reconstruction method is the back-projection algorithm proposed by Velten et al.1. The measured photon intensity is modeled as a linear operator applied to the albedo, and the targets are reconstructed by applying the adjoint operator to the measured data. Further improvements of the back-projection method include rendering approaches for fast implementation2,16 and filtering techniques33,36 for noise reduction. The light-cone transform30 (LCT) proposed by O’Toole et al. describes the physical process as a convolution of the light-cone kernel and the hidden target. In this way, the reconstruction is formulated as a deblurring problem and can be computed efficiently using the fast Fourier transform. The directional light-cone transform31 (D-LCT) generalizes this method and simultaneously reconstructs the albedo and surface normal of the hidden target. The frequency-wavenumber migration8 (F-K) method uses the wave equation to reconstruct the albedo and can also be implemented efficiently in the frequency domain. The LCT, D-LCT and F-K methods work directly only under confocal settings. Although it is possible to transform data collected in non-confocal setups into confocal form, the approximation error cannot be neglected34. To reconstruct the hidden object under non-confocal settings, the phasor field32 (PF) method formulates the NLOS detection process as one of diffractive wave propagation and provides a direct solution with low time complexity. Its recent extension with SPAD arrays reconstructs live low-latency videos of NLOS scenes37. The signal-object collaborative regularization34 (SOCR) method considers priors on both the reconstructed target and the measured signal, which leads to high-quality reconstruction with little background noise. For scenarios with non-planar relay surfaces, the F-K and back-projection type methods can be used directly. Algorithms designed only for planar relay settings can be applied using signal shifting techniques8,14.

Despite these breakthroughs, two major obstacles of existing methods toward practical applications are the need for a large relay surface and dense measurements. If the relay surface is irregular or small, these algorithms may fail due to the lack of data. Besides, dense measurement results in a long acquisition time, which poses a significant challenge for applications such as autonomous driving, where the observer may move at high speed. In recent works, it was reported that sparse measurements could be used to reconstruct the hidden scenes. Isogawa et al. showed that the target could be reconstructed with confocal and circular NLOS scans38. Sparse measurements from square grids scanned on the relay surface could also be used by incorporating the compressed sensing technique35. Besides, a single shot can be used to track a moving hidden target17, although the reconstruction fails when the target is stationary due to the ill-posedness of the inverse problem.

In this work, we propose a Bayesian framework for NLOS reconstruction that is applicable to any spatial pattern of the illumination and detection points. By introducing a virtual confocal signal at rectangular grid points, we design joint regularizations for the measured signal, the virtual confocal signal and the hidden target. We put forward a confocal complemented signal-object collaborative regularization (CC-SOCR) framework, which reconstructs both the albedo and surface normal of the hidden target. The proposed method allows regular and irregular measurement patterns in both confocal and non-confocal scenarios. Besides, our approach provides faithful reconstructions with negligible background noise, even in cases with very coarse and noisy measurements. Notably, the proposed method suggests a paradigm shift, freeing NLOS imaging research from its heavy reliance on the assumption of a large relay surface covering an entire region (wall, ground). Our method demonstrates high-quality NLOS reconstructions in various scenarios where the relay surface has discrete scattering regions, an arbitrary irregular shape, or a very limited size, enabling hidden object reconstruction with far more types of realistic relay surfaces such as window shutters, window frames, and fences, which significantly broadens the scope of NLOS imaging applications. As shown in Fig. 1, the illumination and detection patterns are irregular but ubiquitous in daily life. Reconstruction results of the bunny with synthetic confocal signals39, detected at the entire relay surface and in these four scenarios, are provided in Supplementary Figs. 1–5.

Fig. 1: Irregular illumination and detection patterns for NLOS imaging.
figure 1

a The relay is a fence. b The relay is a horizontal shutter. c The relay is an array of window edges. d The relay is a set of several sticks sparsely and randomly distributed.

Results

The NLOS physical model

The goal of NLOS imaging is to take a collection of measured transient data and find the target that comes closest to fitting these measured signals. In this work, we adopt the physical model proposed in SOCR34. Let \({x}_{i}^{{\prime} }\) and \({x}_{d}^{{\prime} }\) be the illumination and detection points on the visible surface, and we call \(({x}_{i}^{{\prime} },{x}_{d}^{{\prime} })\) an active measurement pair, or simply a pair in the following. The photon intensity measured at time t is given by

$$\tau ({x}_{i}^{{\prime} },{x}_{d}^{{\prime} },t)={\int }_{\varOmega }\frac{({x}_{d}^{{\prime} }-x)\cdot {{{{{\bf{n}}}}}}(x)}{{|{x}_{i}^{{\prime} }-x|}^{2}{|{x}_{d}^{{\prime} }-x|}^{3}}f(x)\delta (|{x}_{i}^{{\prime} }-x |+|{x}_{d}^{{\prime} }-x|-ct)dx$$
(1)

in which Ω is the three-dimensional reconstruction domain, f(x) denotes the albedo value of the point x, n(x) is the unit surface normal at x that points towards the visible surface. The unit vector n(x) can be arbitrarily chosen for points with zero albedo value. By denoting \({{{{{\bf{u}}}}}}=f{{{{{\bf{n}}}}}}\), Eq. (1) is written equivalently as

$$\tau ({x}_{i}^{{\prime} },{x}_{d}^{{\prime} },t)={\int }_{\varOmega }\frac{({x}_{d}^{{\prime} }-x)\cdot {{{{{\bf{u}}}}}}(x)}{{|{x}_{i}^{{\prime} }-x|}^{2}{|{x}_{d}^{{\prime} }-x|}^{3}}\delta (|{x}_{i}^{{\prime} }-x |+|{x}_{d}^{{\prime} }-x|-ct)dx$$
(2)

Noting that the intensity is linear in u, the physical model can be written in discrete form as \({{{{{\boldsymbol{\tau }}}}}}=A{{{{{\bf{u}}}}}}\). The albedo and surface normal can be obtained directly from u. Indeed, the albedo of a voxel x is given by the norm of the vector u(x), and the surface normal is obtained by normalizing u(x). The surface normal is not defined where the albedo is zero.
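For concreteness, the discrete forward model of Eq. (2) and the recovery of albedo and surface normal from u can be sketched as follows (a minimal illustration in Python, assuming a voxelized domain and a delta kernel binned into fixed-width time bins; the constants C and DT and the function names are ours, not the authors' released implementation):

```python
import numpy as np

C = 3e8      # speed of light (m/s)
DT = 32e-12  # time-bin width (s), matching the 32 ps resolution used later

def forward_pair(x_i, x_d, voxels, u, T):
    """Simulate Eq. (2) for one (illumination, detection) pair.

    voxels: (V, 3) voxel centers; u: (V, 3) directional albedo f*n.
    The delta kernel is binned into T time bins of width DT."""
    r_i = np.linalg.norm(voxels - x_i, axis=1)           # |x_i' - x|
    d_vec = x_d - voxels                                 # x_d' - x
    r_d = np.linalg.norm(d_vec, axis=1)                  # |x_d' - x|
    weight = np.einsum('vj,vj->v', d_vec, u) / (r_i**2 * r_d**3)
    bins = np.floor((r_i + r_d) / (C * DT)).astype(int)  # arrival time bin
    tau = np.zeros(T)
    valid = bins < T
    np.add.at(tau, bins[valid], weight[valid])           # accumulate photons
    return tau

def albedo_and_normal(u):
    """Albedo is |u(x)|; the normal is u(x)/|u(x)| where the albedo is nonzero."""
    f = np.linalg.norm(u, axis=1)
    n = np.divide(u, f[:, None], out=np.zeros_like(u), where=f[:, None] > 0)
    return f, n
```

Stacking `forward_pair` over all measurement pairs yields the discrete linear operator A applied to u.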

The measured signal

To reconstruct the hidden target, we consider a collection of M measurements. Let \({p}_{m}=({x}_{m}^{p},{y}_{m}^{p},{z}_{m}^{p})\) be the coordinates of the mth illumination point, in which \({x}_{m}^{p}\), \({y}_{m}^{p}\) and \({z}_{m}^{p}\) are the coordinates in the horizontal, vertical and depth directions. We denote by \({q}_{m}=({x}_{m}^{q},{y}_{m}^{q},{z}_{m}^{q})\) the coordinates of the mth detection point, and call \(({p}_{m},{q}_{m})\) a measurement pair. For each pair, the photon counts of the first T time bins are collected. The coordinates of all measurement pairs are written as \({C}_{meas}=\{({x}_{m}^{p},{y}_{m}^{p},{z}_{m}^{p},{x}_{m}^{q},{y}_{m}^{q},{z}_{m}^{q})|m\in [M]\}\), in which we denote by [M] the set \(\{1,2,\ldots,M\}\). Let \(\tilde{{{{{{\bf{b}}}}}}}\) be the noisy signal measured at Cmeas. In practice, various types of noise inevitably corrupt the measured signals and significantly degrade the quality of the reconstruction. To mitigate the effects of noise and improve the reconstruction quality, we introduce the estimated signal b as an approximation of the ideal signal considered at the measured locations. The variable b is treated as a random vector so that it can be determined under the Bayesian framework. Besides, we denote the simulated signal considered at the set Cmeas as \({A}_{{{{{{\bf{b}}}}}}}{{{{{\bf{u}}}}}}\), in which Ab is the discrete physical model defined in Eq. (2).

The virtual confocal signal

We discretize the reconstruction domain Ω with \(V= \{({x}_{i},{y}_{j},{z}_{k}) |i\in [I],j\in [J],k\in [K]\}\), in which xi, yj and zk are coordinates of the voxel in the horizontal, vertical and depth directions, respectively. When the number of measurement pairs is small, the solution to the least-squares reconstruction problem may not be unique due to the lack of data. To overcome the rank deficiency of the measurement matrix, we introduce the virtual confocal signal d considered at the regular focal points \(({x}_{i},{y}_{j},0)\), in which \(i\in [I]\) and \(j\in [J]\). The set of measurement pairs of the virtual confocal signal is denoted as \({C}_{virt}=\{({x}_{i},{y}_{j},0,{x}_{i},{y}_{j},0) |i\in [I],j\in [J]\}\). The simulated signal generated with Eq. (2) at the set Cvirt is denoted by \({A}_{{{{{{\bf{d}}}}}}}{{{{{\bf{u}}}}}}\). The variable d is treated as an optimization variable and obtained together with the reconstruction by solving the optimization problem introduced in the next subsection. Let \({C}_{common}={C}_{meas}\cap {C}_{virt}\); we denote by \({R}_{{{{{{\bf{b}}}}}}}({{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}})\) the subset of b that is spatially located at the set \({C}_{common}\). We also write \({R}_{{{{{{\bf{d}}}}}}}({{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}})\) for the subset of the signal d considered at the set \({C}_{common}\). When \({C}_{common}\) is empty, both \({R}_{{{{{{\bf{b}}}}}}}({{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}})\) and \({R}_{{{{{{\bf{d}}}}}}}({{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}})\) are empty sets.
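The restriction operators Rb and Rd amount to indexing b and d at the shared pairs, which can be sketched as follows (our illustration of the definition above, with pairs encoded as 6-tuples of coordinates as in Cmeas; not the authors' code):

```python
def common_pairs(C_meas, C_virt):
    """Indices into b and d for the pairs in C_common = C_meas ∩ C_virt.

    C_meas, C_virt: lists of 6-tuples (x_p, y_p, z_p, x_q, y_q, z_q).
    Returns (idx_b, idx_d) such that b[idx_b] plays the role of R_b(b, d)
    and d[idx_d] the role of R_d(b, d); both lists are empty when
    C_common is empty."""
    virt_index = {pair: k for k, pair in enumerate(C_virt)}
    idx_b, idx_d = [], []
    for m, pair in enumerate(C_meas):
        if pair in virt_index:
            idx_b.append(m)
            idx_d.append(virt_index[pair])
    return idx_b, idx_d
```
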

The Bayesian framework

We treat the reconstructed target u, the measured signal \(\tilde{{{{{{\bf{b}}}}}}}\), the approximated signal b, and the virtual confocal signal d as random vectors and formulate the imaging task as an optimization problem using Bayesian inference. The target and signals are obtained simultaneously by maximizing the joint posterior probability.

$$({{{{{{\bf{u}}}}}}}^{\ast },\,{{{{{{\bf{b}}}}}}}^{\ast },\,{{{{{{\bf{d}}}}}}}^{\ast })=\mathop{{{\arg }}\,\max }\limits_{{{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}}}{\mathbb{P}}({{{{{\bf{u}}}}}},\,{{{{{\bf{b}}}}}},\,{{{{{\bf{d}}}}}}|\tilde{{{{{{\bf{b}}}}}}})$$
(3)

Three assumptions are made to formulate this as a concrete optimization problem. Firstly, the conditional distribution of the measured signal \(\tilde{{{{{{\bf{b}}}}}}}\) given u, b and d is

$${\mathbb{P}}(\tilde{{{{{{\bf{b}}}}}}}|{{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}})={\mathbb{P}}(\tilde{{{{{{\bf{b}}}}}}}|{{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}})=\exp (-{|{{{{{\bf{b}}}}}}-\tilde{{{{{{\bf{b}}}}}}}|}^{2}-\varUpsilon ({{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}},\tilde{{{{{{\bf{b}}}}}}}))$$
(4)

in which ϒ is related to the joint prior distribution of u, b and \(\tilde{{{{{{\bf{b}}}}}}}\). With this assumption, d does not provide additional information to predict \(\tilde{{{{{{\bf{b}}}}}}}\) when b is known. Secondly, the joint prior distribution of u and b is

$${\mathbb{P}}({{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}})=\exp (-{|{A}_{{{{{{\bf{b}}}}}}}{{{{{\bf{u}}}}}}-{{{{{\bf{b}}}}}}|}^{2}-\varGamma ({{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}}))$$
(5)

in which \(\varGamma\) describes the prior distribution of u and b. The estimated signal b is less noisy than the measured data and is closer to the ideal signal of certain real-world targets, which helps to enhance the reconstruction quality. Thirdly, the conditional distribution of d given u and b is

$${\mathbb{P}}({{{{{\bf{d}}}}}}|{{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}})=\exp (-{|{R}_{{{{{{\bf{b}}}}}}}({{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}})-{R}_{{{{{{\bf{d}}}}}}}({{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}})|}^{2}-{|{A}_{{{{{{\bf{d}}}}}}}{{{{{\bf{u}}}}}}-{{{{{\bf{d}}}}}}|}^{2}-\varXi ({{{{{\bf{u}}}}}},{{{{{\bf{d}}}}}}))$$
(6)

in which Rb(b,d) and Rd(b,d) are the subsets of the signals b and d that share the same measurement pairs. \(\varXi ({{{{{\bf{u}}}}}},{{{{{\bf{d}}}}}})\) is related to the joint prior distribution of the target u and the virtual confocal signal d.

With these assumptions, we derive a concrete optimization problem using the Bayesian formula.

$$({{{{{{\bf{u}}}}}}}^{\ast },{{{{{{\bf{b}}}}}}}^{\ast },{{{{{{\bf{d}}}}}}}^{\ast }) =\mathop{{{\arg }}\,\max }\limits_{{{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}}}\,{\mathbb{P}}({{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}}|\tilde{{{{{{\bf{b}}}}}}})\\= \mathop{{{\arg }}\,\max }\limits_{{{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}}}\,{\mathbb{P}}(\tilde{{{{{{\bf{b}}}}}}}|{{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}}){\mathbb{P}}({{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}})\\= \mathop{{{\arg }}\,\max }\limits_{{{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}}}\,{\mathbb{P}}(\tilde{{{{{{\bf{b}}}}}}}|{{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}}){\mathbb{P}}({{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}})\\= \mathop{{{\arg }}\,\max }\limits_{{{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}}}\,{\mathbb{P}}(\tilde{{{{{{\bf{b}}}}}}}|{{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}}){\mathbb{P}}({{{{{\bf{d}}}}}}|{{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}}){\mathbb{P}}({{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}})\\= \mathop{{{\arg }}\,\min }\limits_{{{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}}}{|{{{{{\bf{b}}}}}}-\tilde{{{{{{\bf{b}}}}}}}|}^{2}+{|{R}_{{{{{{\bf{b}}}}}}}({{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}})-{R}_{{{{{{\bf{d}}}}}}}({{{{{\bf{b}}}}}},{{{{{\bf{d}}}}}})|}^{2}+{|{A}_{{{{{{\bf{d}}}}}}}{{{{{\bf{u}}}}}}-{{{{{\bf{d}}}}}}|}^{2}\,\\ +{|{A}_{{{{{{\bf{b}}}}}}}{{{{{\bf{u}}}}}}-{{{{{\bf{b}}}}}}|}^{2}+\varUpsilon ({{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}},\tilde{{{{{{\bf{b}}}}}}})+\varXi ({{{{{\bf{u}}}}}},{{{{{\bf{d}}}}}})+\varGamma ({{{{{\bf{u}}}}}},{{{{{\bf{b}}}}}})$$
(7)

in which the third equality follows from Eq. (4) and the last equality holds by Eqs. (4), (5) and (6). By designing appropriate regularization terms \(\varUpsilon\), \(\varXi\) and \(\varGamma\), we obtain high-quality reconstructions of the targets even in scenarios with highly incomplete measurements. The proposed framework and the designed collaborative regularizations are illustrated in Fig. 2a. Concrete expressions of the regularizations are provided in the Methods section. We term the proposed method confocal complemented signal-object collaborative regularization (CC-SOCR) due to the introduced virtual confocal signal d and the regularizations imposed on the signals and the target.
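The final minimization in Eq. (7) can be evaluated term by term as follows (an illustrative sketch: the forward operators, index sets and regularizers are passed in as callables and placeholders; the concrete regularizers are given in the Methods section):

```python
import numpy as np

def cc_socr_objective(u, b, d, b_tilde, A_b, A_d, idx_b, idx_d,
                      Upsilon, Xi, Gamma):
    """Evaluate the CC-SOCR objective of Eq. (7).

    A_b, A_d: callables implementing the discrete forward model at
    C_meas and C_virt; idx_b/idx_d index the shared pairs C_common;
    Upsilon, Xi, Gamma: regularization terms (placeholder callables)."""
    data_fit = np.sum((b - b_tilde) ** 2)           # |b - b~|^2
    common_fit = np.sum((b[idx_b] - d[idx_d]) ** 2) # |R_b - R_d|^2
    virt_fit = np.sum((A_d(u) - d) ** 2)            # |A_d u - d|^2
    meas_fit = np.sum((A_b(u) - b) ** 2)            # |A_b u - b|^2
    return (data_fit + common_fit + virt_fit + meas_fit
            + Upsilon(u, b, b_tilde) + Xi(u, d) + Gamma(u, b))
```
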

Fig. 2: The proposed CC-SOCR method.
figure 2

a The CC-SOCR framework. For high quality reconstructions, the measured signal, estimated signal and the virtual confocal signal are treated as random variables and solved simultaneously using the Bayesian inference method. b The measured signal, the estimated signal, the virtual confocal signal and the reconstructed target are shown from left to right. The confocal measured data for the instance of the statue is provided in the Stanford dataset8. The relay region consists of four letters N, L, O, and S.

In the following, we compare the reconstruction results of the proposed method with the Laplacian of Gaussian filtered back-projection33 (LOG-BP), F-K, LCT, PF and SOCR methods. For the LCT method, we adopt the D-LCT31 extension that reconstructs both the albedo and surface normal. For the PF method, we adopt the implementation with the back-projection (PF-BP) algorithm9 and the Rayleigh–Sommerfeld diffraction (PF-RSD) algorithm32. Performance comparisons of all these methods are shown in Table 1. To bring existing methods into comparison in scenarios with incomplete measurements, we interpolate the signal with the nearest neighbor method8,35, which generates better results than zero padding32 (see Supplementary Fig. 24).
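The nearest neighbor interpolation used to extend incomplete measurements onto a full grid can be sketched as follows (our illustration; each grid point simply copies the transient of its closest measured scan position):

```python
import numpy as np

def nn_interpolate(points, signals, grid_points):
    """Nearest-neighbor interpolation of transients onto a full grid.

    points: (M, 2) measured scan positions with transients signals (M, T);
    grid_points: (G, 2) positions of the full regular grid.
    Returns a (G, T) array where each grid point takes the transient
    of its nearest measured point."""
    d2 = ((grid_points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)   # index of the closest measurement
    return signals[nearest]
```
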

Table 1 Comparisons of eight NLOS reconstruction algorithms

Results on synthetic data

Instead of using an entire planar visible surface, we assume the relay to be a square box that simulates the four edges of a window. The hidden object is a regular quadrangular pyramid, whose base length and height are 1 m and 0.2 m, respectively. The central axis of the pyramid is perpendicular to the plane in which the relay square box lies, and the distance of the pyramid to this plane is 0.5 m. The albedo of the pyramid is assumed to be constant. As shown in Fig. 3a, we simulate the signal measured at 36 points with Eq. (1). The points are scanned exhaustively: one point is illuminated at a time while signals are detected at all points. The dataset contains signals measured at 36 confocal and 1260 non-confocal pairs. The time resolution is set to 32 ps. Note that the LCT, D-LCT, F-K, PF-RSD and SOCR methods do not work directly in this scenario. We compare the reconstruction result of the proposed method with LOG-BP. The maximum intensity projections are shown in Fig. 3c and Fig. 3d. The reconstructed albedo is normalized to the range [0,1]. Albedo values less than 0.25 are thresholded to zero. The LOG-BP method fails to locate the target correctly and contains misleading artifacts near the boundary of the reconstruction domain. The proposed method locates the target correctly and does not contain noise in the background. The maximum depth error of the CC-SOCR reconstruction is 0.02 m, which is much smaller than that of the LOG-BP reconstruction (0.12 m). The absolute depth errors are shown in Fig. 3f. Classification error, defined as the percentage of excessive and missing voxels of the reconstruction, is used to assess how well the methods locate the target. The classification error of the CC-SOCR reconstruction is 2.86%, nearly one order of magnitude smaller than that of the LOG-BP reconstruction (21.75%).
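The classification error metric can be computed as follows (an illustrative sketch; the normalization of the percentage, here by the number of occupied ground-truth voxels, is our assumption since the text does not specify the denominator):

```python
import numpy as np

def classification_error(recon, truth, threshold=0.25):
    """Percentage of excessive plus missing voxels after thresholding.

    recon: reconstructed albedo volume, normalized to [0, 1] and
    thresholded at `threshold`, as described in the text;
    truth: ground-truth albedo volume (nonzero = occupied).
    NOTE: normalizing by the occupied ground-truth voxel count is an
    assumption of this sketch."""
    r = recon / recon.max()
    occ_r = r >= threshold          # reconstructed occupancy
    occ_t = truth > 0               # ground-truth occupancy
    excessive = np.logical_and(occ_r, ~occ_t).sum()
    missing = np.logical_and(~occ_r, occ_t).sum()
    return 100.0 * (excessive + missing) / occ_t.sum()
```
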

Fig. 3: Reconstruction results of the pyramid (non-confocal, synthetic signal).
figure 3

a The illumination and detection points are shown in yellow. b Ground truth. The albedo as well as the depth, horizontal and vertical components of the directional albedo are shown from left to right. c The reconstructed albedo of the LOG-BP algorithm. d Reconstructed albedo and surface normal of the proposed CC-SOCR method. The albedo as well as the depth, horizontal and vertical components of the directional albedo are shown from left to right. e The depth of the LOG-BP, CC-SOCR reconstructions and the ground truth are shown from left to right. f The absolute depth error of the LOG-BP and CC-SOCR reconstructions are shown from left to right. The background is shown in black. Excessive voxels reconstructed are shown in white.

Results on measured data

For confocal experiments, we use the instance of the statue in the Stanford dataset8 to test the performance of the proposed method. The target is 1 m away from the visible planar surface. In the original dataset, 512 × 512 focal points are raster-scanned in a square region of size 2 × 2 m2. The time resolution is 32 ps and the total exposure time is 60 min. An evenly distributed 64 × 64 dataset is sub-sampled from the original dataset; measuring this sub-sampled signal would take an exposure time of 56.25 s. The oracles shown in Fig. 4a and Fig. 5a are generated with the SOCR method using this sub-sampled signal. To simulate the case where the relay surface is a horizontal shutter, we extract only the signals measured at 21 rows of the downsampled data, as shown in the yellow region of Fig. 4b. From bottom to top, the five equispaced regions contain 3, 5, 5, 5 and 3 rows of measurements, respectively. The dataset contains signals measured at 1344 focal points, which would take 18.46 s for data acquisition. Reconstruction results are shown in Fig. 4. The LOG-BP reconstruction is noisy. The reconstruction results of the F-K, D-LCT and SOCR algorithms are blurry and contain artifacts. The proposed method reconstructs the target faithfully.

Fig. 4: Reconstructions of the statue with the relay surface in the shape of a horizontal shutter (confocal, measured signal).
figure 4

a The oracle is generated with the SOCR method with 64 × 64 measurements34. b Confocal signals are measured in the yellow region. c Reconstructed albedo of the LOG-BP algorithm. d Reconstructed albedo of the F-K algorithm. e Reconstructed albedo and surface normal of the D-LCT method. The albedo as well as the depth, horizontal and vertical components of the directional albedo are shown from left to right. f Reconstructed albedo and surface normal of the SOCR method. The albedo as well as the depth, horizontal and vertical components of the directional albedo are shown from left to right. g Reconstructed albedo and surface normal of the CC-SOCR method. The albedo as well as the depth, horizontal and vertical components of the directional albedo are shown from left to right.

Fig. 5: Reconstructions of the statue with 10 × 10 confocal measurements (confocal, measured signal).
figure 5

a The oracle is generated with the SOCR method with 64 × 64 measurements34. b Confocal signals are measured at the yellow points. c Reconstructed albedo of the LOG-BP algorithm. d Reconstructed albedo of the F-K algorithm. e Reconstructed albedo and surface normal of the D-LCT method. The albedo as well as the depth, horizontal and vertical components of the directional albedo are shown from left to right. f Reconstructed albedo and surface normal of the SOCR method. The albedo as well as the depth, horizontal and vertical components of the directional albedo are shown from left to right. g Reconstructed albedo and surface normal of the CC-SOCR method. The albedo as well as the depth, horizontal and vertical components of the directional albedo are shown from left to right.

Figure 5 shows the reconstruction results of the statue with signals detected at 10 × 10 uniformly distributed focal points in a square region of size 2 × 2 m2, which would take 1.37 s for the measurements. The points scanned are shown in Fig. 5b. The LOG-BP reconstruction contains heavy background noise and the target cannot be clearly identified. The F-K and D-LCT reconstructions are blurry and also contain background noise. The SOCR reconstruction contains artifacts, indicating that the error of the signal introduced in the nearest neighbor interpolation process cannot be neglected. In contrast, the proposed method locates the target correctly and reconstructs more details than other methods. More reconstruction results with different numbers of uniformly distributed confocal measurements are compared in Supplementary Figs. 6–10.

Figure 6 shows the reconstruction results of the statue obtained with signals measured at different regions of the relay surface: a set of 200 randomly distributed focal points in an area of size 2 × 2 m2; a region consisting of 5 equispaced vertical bars with 1344 focal points; a region that consists of four letters N, L, O and S with 825 focal points; a region made up of several sticks sparsely and randomly distributed with 1229 focal points; and a heart-shaped region with 258 focal points. These results indicate the capability of the proposed method in reconstructing the hidden target under various relay settings. For the case of the heart-shaped relay, the CC-SOCR method locates the target correctly, while all other methods fail. The measured signal, approximated signal and virtual confocal signal of the scenario with measurements at the four letters N, L, O and S are shown in Fig. 2b. The virtual confocal signal plays an important role for high-quality reconstruction. The three views and surface normal of the reconstructions as well as more comparisons under different relay settings are provided in Supplementary Figs. 11–17.

Fig. 6: Reconstructions of the statue under representative cases with different relays (confocal, measured signal).
figure 6

The illumination regions are shown in yellow in the first column. The reconstructed albedo of F-K, D-LCT, SOCR and CC-SOCR methods are compared in the second to fifth columns.

For non-confocal experiments, we use the measured data of the instance of the figure 4 provided by the phasor field method32. The hidden object is 1 m away from the visible wall. The temporal resolution is 16 ps. We pick out the signal measured at 64 × 64 illumination points in a square region of size 1.27 × 1.27 m2. The detection point is 0.64 m to the left of and 0.55 m below the illumination region. In addition to the selected signal, we also use four subsets of the signal to reconstruct the target: signals measured at five equispaced vertical bars that contain 3, 5, 5, 5, and 3 columns of focal points from left to right; signals measured at five equispaced horizontal bars that contain 3, 5, 5, 5, and 3 rows of focal points from bottom to top; signals measured at 14 × 14 uniformly distributed focal points in an area of 1.27 × 1.27 m2; and signals measured at 200 randomly chosen focal points. To bring the PF-RSD and SOCR methods into comparison, the nearest neighbor interpolation technique is applied to extend the signal to 64 × 64 illuminations. As shown in Fig. 7, the LOG-BP and PF-BP reconstructions are noisy and contain artifacts. The PF-RSD reconstructions also contain artifacts. Both the SOCR and CC-SOCR methods reconstruct the target successfully. However, the SOCR reconstructions contain artifacts (the third row) or lose some details (the fourth and fifth rows). These results also indicate that the bias of the signal obtained from the nearest neighbor interpolation leads to non-negligible reconstruction error. The proposed CC-SOCR method provides faithful reconstructions in all cases. The three views and surface normal of the reconstructions are provided in Supplementary Figs. 18–22.

Fig. 7: Reconstruction results of the instance of the figure 4 (non-confocal, measured signal).
figure 7

The illumination regions are shown in yellow in the first column. Reconstructed albedo of the LOG-BP, PF-BP, PF-RSD, SOCR and CC-SOCR methods are shown in the second to sixth columns.

For scenarios with non-planar relay surfaces, we use the measured data in the Stanford dataset8 to test the proposed method. The original dataset contains confocal signals measured at 128 × 128 focal points and is sub-sampled to 64 × 64. The NLOS scene contains two retroreflective letters, which introduces a bias relative to the physical model used. We extract subsets of the sub-sampled dataset to construct confocal and non-planar signals with irregular measurement patterns, shown in opaque in the first column of Fig. 8. The proposed CC-SOCR method works directly under these settings and the results are shown in the last column. The LOG-BP method also works directly under these settings, but the reconstructions are of low quality and contain heavy background noise (see Supplementary Fig. 28). To bring the F-K, D-LCT and SOCR methods into comparison, we shift the signal in the temporal dimension with the technique provided by the code of the F-K method. The shifted signals are then interpolated to 64 × 64 in the spatial dimensions using the nearest neighbor method and serve as inputs to the conventional imaging methods. As shown in the last row of Fig. 8, the proposed method locates the targets correctly with the oval-shaped non-planar illumination region, while all other methods fail.

Fig. 8: Reconstructions of the letters N and T with irregular and non-planar relay settings (confocal, measured signal).
figure 8

The illumination regions are shown in opaque in the first column. The F-K, D-LCT, SOCR and CC-SOCR reconstructions are shown in the second to the fifth columns, respectively.

Discussion

We have proposed a framework for the general setting of NLOS imaging. In this section, we discuss its relationship with the original SOCR method, the complexity of the algorithm and possible directions for further improvements.

The SOCR method reconstructs the albedo and surface normal of the hidden targets under both confocal and non-confocal settings. However, the experimental setup is still quite limited. As demonstrated in the original paper34, it only deals with signals measured at regular grid points. This is due to the spatial correlation of the signals in the regularization term.

The proposed CC-SOCR method generalizes the SOCR method to the most general setup, where no limitations on the measurement pairs are required. CC-SOCR differs from SOCR in three aspects. Firstly, the introduced virtual confocal signal overcomes the rank deficiency of the measurement matrix, making the method capable of reconstructing targets under more general settings. Secondly, CC-SOCR does not include spatial correlations of the measured signal in the regularization term. As discussed in the Methods section, in CC-SOCR, the Wiener filter is applied only along the temporal dimension of the measured signal. Thirdly, the priors imposed on the target are related not only to the measured data but also to the introduced virtual confocal signal. Concrete expressions of these regularization terms are provided in the Methods section.

The proposed optimization problem can be solved efficiently using the alternating iteration method. In Supplementary Note 2, we decompose the problem into several sub-problems and discuss in detail the solution to each sub-problem. We also provide a guide for choosing parameters in Supplementary Note 3. Convergence of all sub-problems is guaranteed, as discussed in the work of the SOCR method34 and in Supplementary Note 2. However, global convergence is not guaranteed because the sub-problem of updating the reconstructed target is solved approximately. Nonetheless, extensive results in Supplementary Note 1 demonstrate the capability of the proposed method to provide high-quality reconstructions in various scenarios.
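The alternating scheme can be sketched as follows (a structural illustration only: the per-variable sub-problem solvers are passed in as callables, and both the update order and the function names are our assumptions; the actual solvers are derived in Supplementary Note 2):

```python
def cc_socr_iterate(u0, b_tilde, update_u, update_b, update_d, n_iters=10):
    """Skeleton of an alternating iteration for Eq. (7).

    Each sweep minimizes the objective over one variable with the
    others held fixed:
      update_d(u, b)          -> virtual confocal signal sub-problem
      update_b(u, d, b_tilde) -> estimated-signal sub-problem
      update_u(b, d)          -> target sub-problem (solved approximately)"""
    u = u0
    b = b_tilde.copy()
    d = None
    for _ in range(n_iters):
        d = update_d(u, b)
        b = update_b(u, d, b_tilde)
        u = update_u(b, d)
    return u, b, d
```
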

When the reconstruction domain is discretized with N×N×N voxels and the signal is detected at M measurement pairs, the memory complexity of the CC-SOCR algorithm is \({{{{{\rm{O}}}}}}(\max \{{N}^{3},MN\})\). The time complexity per iteration is \({{{{{\rm{O}}}}}}(\max \{{N}^{5},M{N}^{3}\})\), which is also the overall computational complexity. In Supplementary Note 4, we provide a detailed discussion of the complexity and report the running time for the instance of the statue with 200 randomly distributed confocal measurements. For the special case of N×N confocal measurements, the time and memory complexities are \({{{{{\rm{O}}}}}}({N}^{5})\) and \({{{{{\rm{O}}}}}}({N}^{3})\), respectively, the same as for the SOCR algorithm. To reduce the computational complexity, the virtual confocal signal can be defined on coarser grids. The time complexity reduces to \({{{{{\rm{O}}}}}}({N}^{4})\) in scenarios with \({{{{{\rm{O}}}}}}(N)\) measurement pairs if the virtual confocal signals are considered at \(\sqrt{N}\times \sqrt{N}\) focal points. In Supplementary Fig. 23, we compare the reconstruction results of the statue with virtual confocal signals of sizes 64 × 64, 32 × 32, 16 × 16 and 8 × 8 in an area of 2 × 2 m2, respectively. The execution times are provided in Supplementary Tables 3–6. Besides, the CC-SOCR algorithm can be implemented using the embarrassingly parallel paradigm, and the imaging process can be accelerated with GPU implementations of the code on large-scale parallel computing platforms. In the future, we would like to implement an octree representation of the reconstruction domain to reduce the complexity of the proposed method.

In CC-SOCR, virtual confocal signals observed at planar rectangular grid points complement the reconstruction process in the case of incomplete measurements. It is also possible to consider virtual non-confocal signals for stronger regularization, and virtual confocal signals at several planes may be introduced to exploit additional spatial correlation. However, the time and memory complexities increase accordingly.

With sufficient measurements, both the SOCR method and the CC-SOCR method provide high-quality reconstructions (see Supplementary Figs. 1, 11, 18). When the number of measurement pairs is small, however, the reconstruction problem is ill-posed. Although a complete signal can be obtained with interpolation techniques, existing methods still fail due to the bias introduced into the signal (Supplementary Fig. 27). The introduced virtual confocal signal benefits from the regularization guided by the simulated signal of the target and leads to faithful reconstructions. In the absence of the virtual confocal signal, the reconstructions may be blurry (Supplementary Fig. 25) or contain background artifacts (Supplementary Fig. 26). Besides, the CC-SOCR algorithm provides a robust way to convert measured non-confocal NLOS signals to their confocal counterparts; the generated confocal signal for the instance of Fig. 4 is provided in the supplementary code.

Methods

The joint regularizations

In Eq. (7), we formulate the CC-SOCR framework as an optimization problem. Here we show how the regularization terms \(\varGamma(\mathbf{u},\mathbf{b})\), \(\varXi(\mathbf{u},\mathbf{d})\) and \(\varUpsilon(\mathbf{u},\mathbf{b},\tilde{\mathbf{b}})\) are designed. To grasp the idea behind these terms, a basic understanding of the data-driven tight frame algorithm40, the block matching and 3D filtering (BM3D) algorithm41 and the SOCR method34 is helpful.

\(\varGamma(\mathbf{u},\mathbf{b})\) describes the prior distribution of the reconstructed target and of the approximated signal at the measurement pairs. For the reconstructed target, we consider sparsity and non-local self-similarity priors, and we use the zero norm to impose sparseness on the approximated signal b. We set

$$\varGamma(\mathbf{u},\mathbf{b})=s_u|\mathbf{L}|_1+\lambda_u\sum_i\left[|B_i(\mathbf{L})-D_sC_iD_n^T|^2+\lambda_{pu}|C_i|_0\right]+s_b|\mathbf{b}|_0$$
(8)

in which \(s_u\), \(\lambda_u\), \(\lambda_{pu}\) and \(s_b\) are fixed parameters. L is the albedo of u, and \(B_i\) is the block matching operator, with i the index of a reference block; the summation runs over all possible blocks. \(D_s\) and \(D_n\) are two orthogonal matrices that capture the local structure and the non-local correlations of the 3D albedo blocks. \(C_i\) is the matrix of transform coefficients of the ith block. \(|\cdot|_0\) denotes the number of nonzero entries of a tensor.
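A simplified evaluation of \(\varGamma\) may look as follows. This is a sketch under strong assumptions: true block matching groups mutually similar blocks, whereas here the matched blocks, the dictionaries \(D_s\), \(D_n\) and the coefficients \(C_i\) are simply passed in:

```python
import numpy as np

# Simplified evaluation of the Gamma regularizer in Eq. (8). The block
# matching operator B_i is replaced by precomputed `blocks`; Ds, Dn are the
# orthogonal dictionaries, Cs the per-block coefficient matrices.
def gamma_term(L, b, blocks, Ds, Dn, Cs, s_u, lam_u, lam_pu, s_b):
    val = s_u * np.sum(np.abs(L))                # l1 sparsity of the albedo
    for Bi, Ci in zip(blocks, Cs):
        # data fit in the transform domain plus l0 penalty on coefficients
        val += lam_u * (np.sum((Bi - Ds @ Ci @ Dn.T) ** 2)
                        + lam_pu * np.count_nonzero(Ci))
    val += s_b * np.count_nonzero(b)             # l0 sparsity of the signal
    return val
```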

For the term \(\varUpsilon(\mathbf{u},\mathbf{b},\tilde{\mathbf{b}})\), we set

$$\varUpsilon(\mathbf{u},\mathbf{b},\tilde{\mathbf{b}})=\sum_i|P_i(\tilde{\mathbf{b}})-DS_i|^2+\sum_{i,j}\left(\frac{\sigma_{\mathbf{b}}}{d_j^TP_i(A_{\mathbf{b}}\mathbf{u})}S_i(j)\right)^2+\lambda_{sb}\sum_i|P_i(\mathbf{b})-DS_i|^2$$
(9)

in which \(\lambda_{sb}\) is a fixed parameter and \(P_i\) is the patch extracting operator, with i the index of a local patch. Since the signals may not be measured at regular grid points, \(P_i\) acts only along the temporal direction of the signals. \(\tilde{\mathbf{b}}\) is the measured signal. D is the discrete cosine transform matrix, whose jth filter is denoted by \(d_j\). \(A_{\mathbf{b}}\) is the measurement matrix. \(S_i\) is the vector of Wiener coefficients of the ith patch, with jth element \(S_i(j)\). \(\sigma_{\mathbf{b}}\) is the noise level. The summations run over all possible patches and over the filters of the discrete cosine matrix.
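The term can be evaluated patch by patch. The sketch below is a simplified rendering of Eq. (9) for 1D temporal patches, with an explicitly constructed orthonormal DCT matrix; patch extraction is assumed to have been done beforehand, and we follow the paper's \(P_i-DS_i\) notation literally:

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis; row j acts as the filter d_j in Eq. (9).
    k = np.arange(n)
    D = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    D[0] /= np.sqrt(2)
    return D * np.sqrt(2 / n)

# Simplified evaluation of the Upsilon regularizer in Eq. (9): btilde/b/Abu
# patches come from the measured, approximated and simulated signals, and S
# holds the per-patch Wiener coefficient vectors S_i.
def upsilon_term(btilde_patches, b_patches, Abu_patches, S, sigma_b, lam_sb):
    D = dct_matrix(btilde_patches[0].size)
    val = 0.0
    for Pt, Pb, Pa, Si in zip(btilde_patches, b_patches, Abu_patches, S):
        val += np.sum((Pt - D @ Si) ** 2)              # fit measured patch
        val += np.sum((sigma_b * Si / (D @ Pa)) ** 2)  # noise-weighted term
        val += lam_sb * np.sum((Pb - D @ Si) ** 2)     # fit approx. patch
    return val
```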

For the regularization term \(\varXi(\mathbf{u},\mathbf{d})\), the prior of the virtual confocal signal d is constructed under the guidance of the target u and the physical model \(A_{\mathbf{d}}\). Since the confocal signal d is considered at rectangular grid points, both spatial and temporal correlations can be exploited. Letting \(P_i\) be the 3D patch extracting operator (2D in space and 1D in time), we seek a data-driven orthogonal dictionary \(\varPsi\) that sparsely represents the local patches of both the approximated signal d and the simulated signal \(A_{\mathbf{d}}\mathbf{u}\). For simplicity, we abuse the notation \(P_i\) to denote either a 1D temporal patch of the measured signal b or a 3D patch of the virtual confocal signal; the meaning is clear from the variable to which it applies. Letting \(Q_i\) be the matrix of transform coefficients of the ith patch, the regularization term is given by

$$\varXi(\mathbf{u},\mathbf{d})=\sum_i\left[|Q_i-\varPsi^TP_i(\mathbf{d})|^2+\lambda_{sd}|Q_i-\varPsi^TP_i(A_{\mathbf{d}}\mathbf{u})|^2+\lambda_{fd}|Q_i|_0\right]+s_d|\mathbf{d}|_0$$
(10)

in which \(\lambda_{sd}\) and \(\lambda_{fd}\) are two fixed parameters that control the weight of the simulated signal and the sparsity of the representation, respectively, and \(s_d\) controls the sparsity of the virtual confocal signal d.
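In a simplified matrix form (patches stacked as columns), \(\varXi\) can be evaluated directly, and the optimal orthogonal dictionary for fixed coefficients follows from an orthogonal Procrustes step, in the spirit of the data-driven tight frame construction40. The function names and the column-stacked layout are our own simplifications:

```python
import numpy as np

# Sketch of the Xi regularizer in Eq. (10). Patches of d and of A_d u are
# stacked as the columns of Pd and Pau; Q holds the transform coefficients
# and Psi is the orthogonal dictionary.
def xi_term(Q, Psi, Pd, Pau, d, lam_sd, lam_fd, s_d):
    return (np.sum((Q - Psi.T @ Pd) ** 2)
            + lam_sd * np.sum((Q - Psi.T @ Pau) ** 2)
            + lam_fd * np.count_nonzero(Q)
            + s_d * np.count_nonzero(d))

def update_dictionary(Q, Pd, Pau, lam_sd):
    # Orthogonal Procrustes step: for fixed Q, the orthogonal Psi minimizing
    # the two quadratic terms maximizes trace(Psi^T (Pd + lam_sd*Pau) Q^T),
    # solved by the SVD of (Pd + lam_sd*Pau) Q^T.
    P = Pd + lam_sd * Pau
    U, _, Vt = np.linalg.svd(P @ Q.T)
    return U @ Vt
```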

The CC-SOCR optimization problem

Substituting Eqs. (8), (9) and (10) into Eq. (7) and introducing weights, we obtain the concrete optimization problem of the proposed CC-SOCR framework:

$$\begin{array}{rl}\mathop{\min}\limits_{\mathbf{u},\mathbf{b},\mathbf{d},D_s,D_n,\mathbf{C},\mathbf{S},\varPsi,\mathbf{Q}}&|A_{\mathbf{b}}\mathbf{u}-\mathbf{b}|^2+s_u|\mathbf{L}|_1+s_b|\mathbf{b}|_0\\ &+\lambda_u\sum\limits_i\left[|B_i(\mathbf{L})-D_sC_iD_n^T|^2+\lambda_{pu}|C_i|_0\right]\\ &+\lambda_b|\mathbf{b}-\tilde{\mathbf{b}}|^2+\lambda_b\lambda_{pb}\sum\limits_i|P_i(\tilde{\mathbf{b}})-DS_i|^2\\ &+\lambda_b\lambda_{pb}\sum\limits_{i,j}\left[\frac{\sigma_{\mathbf{b}}}{d_j^TP_i(A_{\mathbf{b}}\mathbf{u})}S_i(j)\right]^2\\ &+\lambda_b\lambda_{pb}\lambda_{sb}\sum\limits_i|P_i(\mathbf{b})-DS_i|^2\\ &+\lambda_d|A_{\mathbf{d}}\mathbf{u}-\mathbf{d}|^2+s_d|\mathbf{d}|_0\\ &+\lambda_d\lambda_{pd}\sum\limits_i|Q_i-\varPsi^TP_i(\mathbf{d})|^2\\ &+\lambda_d\lambda_{pd}\lambda_{sd}\sum\limits_i|Q_i-\varPsi^TP_i(A_{\mathbf{d}}\mathbf{u})|^2\\ &+\lambda_d\lambda_{pd}\lambda_{fd}\sum\limits_i|Q_i|_0\\ &+\lambda_{bd}|R_{\mathbf{b}}(\mathbf{b},\mathbf{d})-R_{\mathbf{d}}(\mathbf{b},\mathbf{d})|^2\\ \mathrm{s.t.}\quad&\mathbf{L}=\mathrm{albedo}(\mathbf{u}),\\ &D_s^TD_s=I[p_xp_yp_z],\quad D_n^TD_n=I[r],\\ &\varPsi^T\varPsi=I[q_xq_yq_t]\end{array}$$
(11)

in which C, S and Q denote the collections of transform-domain coefficients \(\{C_i\}\), \(\{S_i\}\) and \(\{Q_i\}\), respectively, and \(I[n]\) is the identity matrix of order n. \(p_x\), \(p_y\) and \(p_z\) are the patch sizes of the albedo in the horizontal, vertical and depth directions; r is the number of neighboring blocks of each reference albedo block; \(q_x\), \(q_y\) and \(q_t\) are the patch sizes of the virtual confocal signal d in the horizontal, vertical and temporal directions. \(\sigma_{\mathbf{b}}\) is a parameter related to the noise level of the measured signal. The fixed parameters \(s_u\), \(s_b\), \(s_d\), \(\lambda_u\), \(\lambda_b\), \(\lambda_d\), \(\lambda_{pu}\), \(\lambda_{pb}\), \(\lambda_{pd}\), \(\lambda_{sb}\), \(\lambda_{sd}\), \(\lambda_{fd}\) and \(\lambda_{bd}\) balance the data-fitting terms and the regularization terms. The solution of the optimization problem is provided in Supplementary Note 2, and the supplementary software is attached to this article.