Robust real-time 3D imaging of moving scenes through atmospheric obscurant using single-photon LiDAR

Recently, time-of-flight LiDAR using the single-photon detection approach has emerged as a potential solution for three-dimensional imaging in challenging measurement scenarios, such as over distances of many kilometres. The high sensitivity and picosecond timing resolution afforded by single-photon detection offer high-resolution depth profiling of remote, complex scenes while maintaining low-power optical illumination. These properties are ideal for imaging in highly scattering environments, such as through atmospheric obscurants, for example fog and smoke. In this paper we present the reconstruction of depth profiles of moving objects through high levels of obscurant, equivalent to five attenuation lengths between transceiver and target, at stand-off distances of up to 150 m. We used a robust, statistically based processing algorithm designed for the real-time reconstruction of single-photon data obtained in the presence of atmospheric obscurant, which also provides uncertainty estimates for the depth reconstruction. This demonstration of real-time 3D reconstruction of moving scenes points a way forward for high-resolution imaging from mobile platforms in degraded visual environments.

This section introduces the proposed statistical algorithm for the reconstruction of 3D images of moving targets, acquired using single-photon LiDAR through high densities of atmospheric obscurants. The algorithm has three main objectives: (i) robust processing of LiDAR data obtained in the presence of a high level of noise, (ii) fast processing allowing online processing of successive 3D frames, and (iii) delivery of uncertainty quantification for the obtained estimates. To reach these goals, the proposed algorithm adopts a statistical Bayesian formulation with carefully chosen likelihood and prior distributions. Thanks to this formulation, the estimates and their uncertainty measures are obtained using an iterative algorithm with efficient updates allowing online data processing. Results on simulated and real data show greater robustness to noise and faster processing than state-of-the-art algorithms. The next section describes the problem formulation. The proposed Bayesian model is presented in Section II, followed by the description of its estimation algorithm in Section III. Sections IV and V finally compare the proposed approach to state-of-the-art algorithms on simulated and real data.

I. PROBLEM FORMULATION
This section introduces the statistical observation models that relate the measured time-tagged photons to the parameters of interest $\Theta = (\Theta_s, \Theta_n)$, which gathers the target parameters $\Theta_s$ and the noise parameters $\Theta_n$.

A. Observation model
The TCSPC system provides time-tagged photon event data that can be accumulated into a timing histogram $y_{n,t}$, denoting the number of detected photons at pixel location $n \in \{1, \cdots, N\}$ and time-of-flight (ToF) bin $t \in \{1, \cdots, T\}$. These histograms (often called data cubes) contain depth and reflectivity information about the observed target, allowing 3D imaging. This process can be repeated at different time instants to acquire 3D videos, in which case the data are denoted $y_{n,t,k}$, where $k \in \{1, \cdots, K\}$ is the cube index. Each photon count can be assumed to be drawn from the Poisson distribution $\mathcal{P}(\cdot)$ as follows [1], [2]:
$$y_{n,t,k} \sim \mathcal{P}\left(s_{n,t,k}\right). \qquad (1)$$

Fig. 1. Example of synthetic data representing a target signal (in red), signal and background (in blue), and an observed histogram (in black) using model (5). For this pixel, the sum of signal counts is 100 counts and the sum of background counts is 1750 counts, leading to an SBR = 0.057.
Assuming the presence of at most one surface per pixel, the average photon count can be modelled as a mixture of useful signal and background counts as follows
$$s_{n,t,k} = r_{n,k}\, f(t - d_{n,k}) + b_{n,t,k} \qquad (2)$$
where $f$ represents the system impulse response function (IRF), assumed known from a calibration step and normalized by its sum ($f$ is obtained using a long-integration-time, single-point measurement on a uniform flat surface in laboratory conditions), $r_{n,k} \geq 0$ is the reflectivity of the object surface for the $k$th cube, $d_{n,k} \geq 0$ denotes the distance of the object from the sensor for the $k$th cube (related to its depth), and $b_{n,t,k} \geq 0$ gathers the background counts and dark counts of the detector (in the following, we use the term "background" for all events that do not originate from reflections at the target plane). The background $b_{n,t,k}$ is assumed to depend on $t$, as explained in the next subsections.

B. Event based observation model
Model (1) has been used in many studies, and it assumes the availability of histograms of counts. An alternative is to directly model the detected list of photon ToFs $z_{n,p,k}$ for the $n$th pixel and $p \in \{1, \cdots, \bar{y}_{n,k}\}$, using a mixture of densities as in [1], [3]
$$P(z_{n,p,k} \mid u_{n,k}, d_{n,k}) = (1 - u_{n,k})\, g(z_{n,p,k}) + u_{n,k}\, f(z_{n,p,k} - d_{n,k}) \qquad (3)$$
where $g(z_{n,p,k})$ denotes the density of background counts, $u_{n,k}$ represents the probability that a detected photon belongs to the target rather than the background, and $\bar{y}_{n,k}$ is the total number of photons detected in the $n$th pixel and $k$th frame. Interestingly, model (3) can be equivalently represented using the following hierarchical model [4]
$$P(z_{n,p,k} \mid h_{n,p,k}, d_{n,k}) = g(z_{n,p,k})^{1 - h_{n,p,k}}\, f(z_{n,p,k} - d_{n,k})^{h_{n,p,k}} \qquad (4)$$
where $h_{n,p,k} \in \{0, 1\}$ is a latent variable following a Bernoulli distribution with parameter $u_{n,k}$, taking the value 1 if the $p$th photon is reflected from the target and 0 otherwise. Note that marginalizing $h_{n,p,k}$ in this model leads to the mixture model (3). Model (4) decouples the distributions of the photons associated with the target and the background, which is particularly interesting for Bayesian approaches, as shown later.
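As an illustration of the event-based model (3)-(4), the following Python sketch samples a list of photon ToFs through the latent Bernoulli variable $h_{n,p,k}$. It assumes a uniform background density and a Gaussian approximation of the IRF; the function name and all numerical values are illustrative choices for this example, not parameters of the real system.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_photon_tofs(n_photons, u, d, sigma=2.0, T=300):
    """Sample list-mode photon ToFs from the two-component mixture (3)/(4).

    h ~ Bernoulli(u) selects target vs background for each photon; target ToFs
    follow the (Gaussian-approximated) IRF centred on the depth d, while
    background ToFs follow g, taken here as uniform over the T bins."""
    h = rng.random(n_photons) < u                            # latent labels h_{n,p,k}
    tof = np.empty(n_photons)
    tof[h] = rng.normal(loc=d, scale=sigma, size=h.sum())    # target photons
    tof[~h] = rng.uniform(0, T, size=(~h).sum())             # background photons
    return np.clip(tof, 0, T - 1), h

tofs, labels = sample_photon_tofs(n_photons=200, u=0.1, d=120.0)
```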

C. Background model
In imaging through obscurants or a scattering medium, the background $b_{n,t,k}$ takes high values and might vary across the depth observation window, as already observed in [5], [6]. In [6], Satat et al. assumed a Gamma-shaped background based on the multiple scattering of photons in the obscuring medium. In this paper, we assume that the observation window, and hence the useful signal, is located in the decreasing tail of the background distribution, which can be approximated by a decreasing exponential with a minimum background level as follows
$$b_{n,t,k} = \max\left[a_{n,k}\exp(-c_{n,k} t),\ \tilde{e}_{n,k}\right]$$
where $a_{n,k}$ and $c_{n,k}$ respectively represent the amplitude and decay rate of the exponential, and $\tilde{e}_{n,k}$ is a constant per-pixel level. Note that the rate $c_{n,k}$ is pixel-wise to capture the non-homogeneous variations of the imaging environment. This results in the final model given by (see also Fig. 1)
$$s_{n,t,k} = r_{n,k}\, f(t - d_{n,k}) + \max\left[a_{n,k}\exp(-c_{n,k} t),\ \tilde{e}_{n,k}\right]. \qquad (5)$$
Our goal is then to adaptively estimate $\Theta = (\Theta_s, \Theta_n)$, which includes the target parameters $\Theta_s = (D, R)$ and the noise parameters $\Theta_n = (A, C, \tilde{E})$, using the observations $y_{n,t,k}, z_{n,p,k}, \forall n, t, p, k$, while exploiting their statistics in (1) and (4), and the multi-scale/multi-temporal information for robustness to noise.
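For concreteness, a minimal sketch of how a single-pixel histogram can be simulated according to model (5) is given below. It uses a Gaussian-shaped IRF and parameter values chosen only to roughly mimic the example of Fig. 1 (about 100 signal counts and about 1750 background counts); none of these values are those of the actual system, and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_pixel_histogram(T=340, d=150, r=100.0, a=8.0, c=0.004,
                             e_min=4.0, sigma=3.0):
    """Simulate one pixel of a histogram cube according to model (5):
    a Gaussian-shaped IRF (normalised by its sum) scaled by the reflectivity r,
    plus the background max(a*exp(-c*t), e_min), with Poisson-distributed counts."""
    t = np.arange(T)
    f = np.exp(-0.5 * ((t - d) / sigma) ** 2)
    f /= f.sum()                                    # IRF normalised by its sum
    background = np.maximum(a * np.exp(-c * t), e_min)
    s = r * f + background                          # mean counts s_{n,t,k} of (5)
    return rng.poisson(s)                           # y_{n,t,k} ~ P(s_{n,t,k}), cf. (1)

y = simulate_pixel_histogram()                      # one pixel, one frame
```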

D. Histogram corrections
The measured histograms exhibit consistent non-uniform timing bins due to issues in the timing read-out circuitry. This effect was also observed and discussed by Henriksson et al. in Section 3.2 of [7]. It leads to clear periodic fluctuations (with a period of 4 bins) in the measured histograms, as shown in Fig. 2 (blue curve), which represents the sum of all histograms of an oil-based vapour obscurant scene without a target. Assuming the fluctuation pattern is fixed between successive frames, this effect is corrected before applying the proposed strategy; the correction procedure is denoted "histogram corrections" in the main article. More precisely, using data containing only background counts and no target, we estimate 4 weights/proportions, one for each of the 4 successive bins, to approximate the fluctuations in the timing bin lengths. We then compensate the per-pixel histograms by redistributing the photon counts between bins to approximate equal weights, i.e., equal timing bin lengths. An example of the obtained result is shown in Fig. 2 (red curve), which highlights the reduced fluctuations after correction.
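A minimal sketch of one possible implementation of this correction is given below: it estimates the 4 periodic weights from background-only calibration histograms and rescales the per-pixel counts accordingly. The exact redistribution scheme used in the paper is not detailed here, so this rescaling variant and the function names should be read as assumptions of the example.

```python
import numpy as np

def estimate_bin_weights(calib_hist, period=4):
    """Estimate one weight per position in the repeating bin pattern from
    background-only calibration histograms (e.g. summed over pixels/frames)."""
    totals = np.array([calib_hist[..., j::period].sum() for j in range(period)])
    counts = np.array([calib_hist[..., j::period].shape[-1] for j in range(period)])
    weights = totals / counts             # average count per bin in each phase
    return weights / weights.mean()       # ~1 on average; >1 means an over-long bin

def correct_histogram(y, weights, period=4):
    """Rescale per-bin counts so that the periodic bin-length pattern is flattened."""
    T = y.shape[-1]
    w_full = weights[np.arange(T) % period]
    return y / w_full                     # corrected (non-integer) counts

# calib = histograms acquired with obscurant only (no target), e.g. shape (N, T)
# w = estimate_bin_weights(calib.sum(axis=0))
# y_corr = correct_histogram(y, w)
```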

II. HIERARCHICAL BAYESIAN MODEL
The estimation of Θ is an ill-posed problem due to the high level of noise corrupting the data. In this paper, we adopt a Bayesian approach to alleviate the indeterminacy resulting from ill-posed problems. This approach combines the observation information summarized in the likelihood term, with prior distributions on the unknown parameters to obtain a posterior distribution on the parameters. The latter could be exploited by deriving point estimators and additional measures of uncertainty about the estimates. The following sub-sections introduce the proposed Bayesian model.
A. Likelihood

1) Approximate Likelihood: Assuming independence between the observed photon ToFs leads to the following likelihood for the $n$th pixel of the $k$th frame
$$P(z_{n,k} \mid h_{n,k}, d_{n,k}) = \prod_{p=1}^{\bar{y}_{n,k}} g(z_{n,p,k})^{1 - h_{n,p,k}}\, f(z_{n,p,k} - d_{n,k})^{h_{n,p,k}} \qquad (6)$$
where $\bar{y}_{n,k}$ represents the number of photons in the $n$th pixel. It is common to approximate the system IRF with a Gaussian function [8], [9], in which case the likelihood in (6) becomes
$$P(z_{n,k} \mid h_{n,k}, d_{n,k}) = \prod_{p=1}^{\bar{y}_{n,k}} g(z_{n,p,k})^{1 - h_{n,p,k}}\, \mathcal{N}(z_{n,p,k};\ d_{n,k}, \sigma^2)^{h_{n,p,k}} \qquad (7)$$
where $\mathcal{N}(x; \mu, \sigma^2)$ denotes the Gaussian distribution of $x$ with mean $\mu$ and variance $\sigma^2$. Assuming the availability of an estimator $\hat{h}_{n,p,k}$ of $h_{n,p,k}$, which allows the separation of signal and background counts, leads to
$$P(z_{n,k} \mid d_{n,k}) \propto Q\left({}^{s}z_{n,k}\right)\ \mathcal{N}\!\left(\bar{d}_{n,k};\ d_{n,k},\ \sigma^2/\bar{s}_{n,k}\right) \qquad (8)$$
where $Q({}^{s}z_{n,k})$ is a function of the signal counts, $\bar{d}_{n,k} = \frac{1}{\bar{s}_{n,k}}\sum_{p=1}^{\bar{s}_{n,k}} {}^{s}z_{n,p,k}$, ${}^{s}z$ and ${}^{b}z$ represent the photons reflected from the target and the background counts, respectively, and $\bar{s}_{n,k}$ and $\bar{b}_{n,k}$ denote their numbers in the $n$th pixel. The resulting likelihood highlights a Gaussian distribution for the depth variable $d_{n,k}$. This approximate likelihood for $d_{n,k}$ can also be obtained by considering the Poisson model (1) for signal photons (no background), leading to
$$P\left({}^{s}y_{n,k} \mid r_{n,k}, d_{n,k}\right) \propto \mathcal{G}\left(r_{n,k};\ 1 + \bar{s}_{n,k},\ 1\right)\ Q\left({}^{s}y_{n,k}\right) \qquad (9)$$
where ${}^{s}y_{n,k}$ represents the histogram of signal counts, $\propto$ stands for "proportional to", and $\mathcal{G}(x; \cdot, \cdot)$ denotes the gamma distribution with shape and scale parameters. The formulation (9) is interesting for the estimation of the reflectivity $r_{n,k}, \forall n, k$.
Note that the simple depth maximum-likelihood estimator (MLE) reduces to a log-matched filtering of the histogram with the impulse response of the system when the background is assumed absent. The MLE approach is often used because of its simplicity; however, it yields poor results from data acquired in the presence of obscurants as it does not consider spatial correlations between pixels, hence the need for an advanced robust method.
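For reference, a sketch of this simple log-matched-filter MLE is given below, assuming a known IRF and neglecting the background. The circular-shift implementation, the small floor used to avoid log(0) and the function name are implementation choices of this example only.

```python
import numpy as np

def log_matched_filter_depth(y, irf, eps=1e-12):
    """Per-pixel ML depth (in bins) under the no-background assumption:
    the histogram is cross-correlated with the log of the (circularly shifted)
    IRF and the maximising shift, relative to the IRF peak position, is returned."""
    T = y.shape[-1]
    log_f = np.log(np.maximum(irf, eps))          # floor avoids log(0)
    # score(d) = sum_t y[t] * log f(t - d), evaluated for every candidate shift d
    scores = np.array([np.dot(y, np.roll(log_f, d)) for d in range(T)])
    return int(np.argmax(scores))

# Example with a toy Gaussian IRF peaked at bin 0 (values are illustrative):
# t = np.arange(340); irf = np.exp(-0.5 * (t / 3.0) ** 2); irf /= irf.sum()
# d_hat = log_matched_filter_depth(y, irf)        # y: background-free histogram
```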
2) Multiscale information: It is common to use the multiscale information available within the histograms to improve the performance of maximum-likelihood estimation for LiDAR data [1], [10]-[12]. The key observation is that spatially downsampled histograms, which are still Poisson distributed, lead to depth and reflectivity estimates with lower noise at the price of a reduced spatial resolution. In this paper, we adopt a similar strategy by considering $L$ downsampled versions of the histograms of counts. Considering $L$ predefined graphs of neighbours $\phi_1, \cdots, \phi_L$, spatially downsampled versions of the histograms $Y$ can be obtained for each time instant $k$, leading to $Y_k^{(\ell)}, \ell \in \{1, \cdots, L\}$. Assuming independence between these histograms leads to $L$ likelihood distributions for the photon events, $\forall \ell \in \{1, \cdots, L\}$, where $\ell = 1$ corresponds to the original cube and $\ell > 1$ to increasing downsampling, e.g., $3 \times 3$ for $\ell = 2$, $5 \times 5$ for $\ell = 3$, etc. Similarly, one can express the likelihood of the downsampled histograms, as in (11), by applying the Poisson model (1) to the aggregated counts $y^{(\ell)}_{n,t,k}$.
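The multiscale construction can be sketched as follows: each scale aggregates the per-pixel histograms over a q × q spatial neighbourhood (a simple choice for the graphs $\phi_\ell$), which keeps the counts Poisson distributed. The box-filter implementation, the window sizes and the function name are assumptions of this example.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def multiscale_histograms(y, window_sizes=(1, 3, 9)):
    """Build L spatially aggregated versions of a histogram cube y of shape (N1, N2, T).

    Each scale sums the histograms over a q x q spatial neighbourhood, so the
    aggregated counts remain Poisson distributed but with a higher signal level
    and a coarser effective spatial resolution."""
    scales = []
    for q in window_sizes:
        if q == 1:
            scales.append(y.astype(float))                       # original cube (l = 1)
        else:
            # box-filter mean over a q x q neighbourhood, rescaled to a sum
            scales.append(uniform_filter(y.astype(float), size=(q, q, 1)) * q * q)
    return scales

# y_cube : (N1, N2, T) histogram cube for one frame k
# Y = multiscale_histograms(y_cube, window_sizes=(1, 3, 9))
```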

B. Prior distribution for depth
The depth parameters are assigned priors enforcing spatial smoothness. This is achieved by introducing a latent variable $x$ that is related to all multi-scale depth parameters. More precisely, we consider a mixture of Gaussian conditional prior distributions for $x$, given in (12), where $\nu_n$ (resp. $\psi_k$) denotes the spatial (resp. temporal) neighbourhood of the $n$th pixel (resp. $k$th frame), $d_{n',k}$ denotes the mean, $\gamma_{n',n} \geq 0$ are weights for the variance, and $\varepsilon_n > 0$ is the variance of $x_n$. Model (12) shows that the latent variable $x$ is linked to the $L$ depth variables (as indicated in Fig. 3), providing a robust approximation of the true depth map by considering correlations between pixels. This model also introduces a temporal correlation between successive latent variables $x_k, x_{k-1}, \ldots$ through the variable $v$. It should be noted that (12) does not enforce positivity on the depth parameter $x$; however, positivity will be ensured as indicated in Section III-B.
Fig. 3. DAG for the observations, parameters and hyperparameters of the marginalized model. The user-fixed parameters appear in circles, the observations in a dashed box, and the rectangular box gathers the joint multiscale depth parameters.

C. Prior distribution for reflectivity
Similarly to depth, spatial smoothness could be enforced on the reflectivity by considering latent variables (for example using the gamma Markov random field prior [13]). In this work, the multi-scale estimates $r^{(\ell)}_{n,k}$ implicitly contain spatial correlations, allowing the definition of a simple prior. More precisely, we assume that $r^{(1)}_{n,k} = \cdots = r^{(L)}_{n,k} = r_{n,k}$, as this will lead to a fast estimation of the reflectivity, as detailed later.

D. Hyperparameter priors
While the distribution in (12) introduces spatial correlations between pixels through the latent variable $x$, it would lead to over-smoothed depth edges due to the implicit $\ell_2$-norm in the Gaussian distribution, i.e., it would join distinct depth surfaces. To alleviate this issue and preserve edges, a better prior distribution should consider sparsity-promoting norms such as $\ell_1$. To obtain such an effect, we consider an exponential prior distribution for the hyperparameters $\gamma^{(\ell)}_{n',n}$ and $\eta^{(k')}_{n',n}$, as given in (13), where $\mathcal{E}(\cdot;\cdot)$ denotes the exponential distribution, whose fixed parameters, related to the weights $w^{(\ell)}_{n',n}$ (and their temporal counterparts), should be carefully selected. The combination of (13) and (12) provides the desired edge-preserving effect, as indicated in the next section.
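The mechanism behind this edge-preserving effect can be illustrated with the standard Gaussian scale-mixture identity underlying the Bayesian lasso [14], [15]. The generic parametrization below (mean $\mu$, weight $w$, and a mixing variable $\gamma$ playing the role of the Gaussian variance, with $\mathcal{E}(\gamma;\lambda) = \lambda e^{-\lambda\gamma}$) is used for illustration only and may differ in detail from the exact parametrization of (12)-(13):
$$\int_0^{\infty} \mathcal{N}\!\left(x;\,\mu,\,\gamma\right)\,\mathcal{E}\!\left(\gamma;\,\tfrac{w^2}{2}\right)\mathrm{d}\gamma \;=\; \int_0^{\infty} \frac{1}{\sqrt{2\pi\gamma}}\, e^{-\frac{(x-\mu)^2}{2\gamma}}\,\frac{w^2}{2}\, e^{-\frac{w^2}{2}\gamma}\,\mathrm{d}\gamma \;=\; \frac{w}{2}\, e^{-w\,|x-\mu|},$$
i.e., a Laplace density whose negative logarithm is the weighted $\ell_1$ term $w\,|x-\mu|$ (up to a constant). This is the penalty that replaces the squared ($\ell_2$) deviation of (12) after marginalization and leads to the weighted-median updates of Section III.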
The parameter $\varepsilon_n, \forall n$, should be positive. Assuming prior independence between the parameters $\varepsilon_n, \forall n$, and in the absence of additional information, a non-informative scale prior is chosen for this parameter, i.e., $f(\varepsilon_n) \propto \varepsilon_n^{-1}\,\mathcal{I}_{\mathbb{R}^+}(\varepsilon_n)$ (14), where $\mathcal{I}_{\mathbb{R}^+}(\cdot)$ denotes the indicator function on $\mathbb{R}^+$.

E. Marginalized posterior distribution
The proposed Bayesian model is summarized in the directed acyclic graph (DAG) displayed in Fig. 3. The joint posterior distribution of this Bayesian model can be computed from the hierarchical structure given in (15) (indices are dropped for clarity). Equations (15) and (11) clearly highlight the independence between the depth-related parameters $D, X, \Gamma, W$ and the reflectivity-related ones $R, \eta, V$. Akin to the Bayesian lasso model [14], [15], the parameters $\Gamma, \eta$ can be marginalized, leading to Laplace distributions instead of the Gaussian distributions, as shown in (17). Note that the marginalization reduces the number of parameters to estimate while making the estimation procedure more robust. The marginalized posterior distribution is finally given by (16). This distribution contains complete information regarding the parameters of interest $X, D, R$ and their uncertainties. A common approach is to extract point estimators such as the minimum mean squared error (MMSE) estimator or the maximum a posteriori (MAP) estimator. In this paper, we consider the marginalized MAP estimator of all parameters, while the uncertainty regarding depth is summarized in the parameter $\varepsilon_n, \forall n$.

III. ESTIMATION ALGORITHM
This section describes a coordinate descent algorithm that approximates the maximizer of the marginalized posterior distribution in (16) with respect to the parameters $X, D, R, \varepsilon$ [16], [17]. The iterative algorithm sequentially updates one parameter of interest while holding the other parameters fixed, which is equivalent to sequentially maximizing the conditional distributions associated with each parameter. This process is repeated until the algorithm has converged to a local minimum of the negative log-posterior. The algorithm's main steps are presented in Algo. 1 and described in more detail in the following sections. It is worth noting that the resulting algorithm alternates between a robust non-linear parameter estimation step (line 10) and a filtering step (line 11), which are commonly observed steps in several state-of-the-art algorithms [18]-[20] and optimization algorithms [21].

Algo. 1 (main steps): initialize with the approximate estimates in (26) and compute the weights $W, V$ as in (27); then, while not converged, update $x_{n,k}, \forall n, k$, using the WMF in (18) (line 10); update $d^{(\ell)}, \forall \ell$, using the threshold operator in (19) (line 11); update $\varepsilon$ using the analytical mode in (21) (line 12); and set conv = 1 if the convergence criteria are satisfied. Optionally, the super-resolved maps $d^{HR}_k, r^{HR}_k$ are computed as in Algo. 2 before the estimates are output.

A. Updating X

The parameters of $X$ are independent, allowing the parallel updating of $x_{n,k}, \forall n$. The conditional distribution of each $x_{n,k}, \forall n, k$, involves Laplace distributions $\mathcal{L}(\cdot;\cdot,\cdot)$, and maximizing it is equivalent to solving (18), which is the equation of a weighted median filter (WMF) with several efficient implementations (e.g., [22]). Note that the solution of (18) will be positive provided that $x_{n',k} > 0$ and $d^{(\ell)}_{n',k} > 0$, which is ensured during initialization.

B. Updating depth parameters: D

The parameters $d^{(\ell)}_n$ are independent, and spatial correlation is introduced through the latent variable $x$. This is interesting as it allows the parallel implementation of the $d^{(\ell)}$ updates via (19), which is a generalization of the well-known soft-thresholding operator and can be evaluated efficiently. Note that the solution of (19) will be positive provided that $x_{n',k} > 0$ and $\bar{d}^{(\ell)}_{n} > 0$, which is ensured during initialization.
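Since the per-pixel update of $x$ in (18) (Section III-A) takes the form of a weighted median, the following sketch shows a direct, sort-based weighted-median computation. The candidate depths and weights are hypothetical values for one pixel; efficient implementations such as [22] would be used in practice.

```python
import numpy as np

def weighted_median(values, weights):
    """Weighted median: the minimiser of sum_i weights[i] * |x - values[i]|,
    which is the form taken by the per-pixel update of x in (18)."""
    order = np.argsort(values)
    v = np.asarray(values, float)[order]
    w = np.asarray(weights, float)[order]
    cdf = np.cumsum(w)
    return v[np.searchsorted(cdf, 0.5 * cdf[-1])]

# One pixel: candidate depths gathered from the multi-scale neighbours (weights W)
# and from the latent depths of adjacent frames (weights V); values are hypothetical.
cand_depths = np.array([1.52, 1.55, 1.49, 1.80, 1.54])    # metres
cand_weights = np.array([0.30, 0.25, 0.15, 0.05, 0.25])
x_nk = weighted_median(cand_depths, cand_weights)
```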

C. Updating depth variance: ε
The conditional distribution of $\varepsilon_n$ is an inverse-gamma distribution, whose mode is available in closed form in (21), where $\bar{K}$ (resp. $\bar{N}$) denotes the number of temporal (resp. spatial) neighbours.

D. Updating reflectivity parameters: R
The updates of the reflectivity parameters are independent of the depth parameters, allowing parallel computation. The maximum-likelihood estimator of $r_{n,k}$ is given by the mode of the gamma distribution in (9), i.e., $\hat{r}_{n,k} = \bar{s}_{n,k}$, the number of detected signal photons.

E. Background estimation
Our algorithm assumes known coefficients $h_{n,k}, \forall n, k$, which can be deduced from known noise parameters, as indicated in the input of Algo. 1. Model (5) approximates the noise using three parameters $\Theta_{n,k} = (\tilde{e}_{n,k}, a_{n,k}, c_{n,k})$, where $\tilde{e}_{n,k}$ represents a constant minimum noise level, $a_{n,k}$ its amplitude, and $c_{n,k}$ its decay rate. These parameters are approximately estimated as described below, where the focus is on obtaining analytical estimators suitable for fast computation.
1) Estimation of $\tilde{e}_{n,k}$: The maximum-likelihood estimate of $\tilde{e}_{n,k}$ in the absence of a target ($r_{n,k} = 0$) is given by $\hat{\tilde{e}}_{n,k} = \frac{1}{T}\sum_{t=1}^{T} y_{n,t,k}$, which is the average of the detected counts of each pixel. However, in the presence of a target component and an exponentially decaying background, the model assumes each histogram (i.e., pixel) has a constant level plus a narrow signal peak due to a narrow system impulse response. Under these considerations, the parameter of interest is the background level, while the signal peak is seen as an outlier. Therefore, our solution considers the median of the detected counts as an efficient and robust approximate estimate of $\tilde{e}_{n,k}$, i.e., $\hat{\tilde{e}}_{n,k} = \mathrm{median}(y_{n,:,k})$.
2) Estimation of $a_{n,k}$: An approximate estimator can be obtained by analysing the case of target absence ($r_{n,k} = 0$) and no minimum background level ($\tilde{e}_{n,k} = 0$). In this case, the maximum-likelihood estimator of the parameter is given by $\hat{a}_{n,k} = \frac{\sum_{t=1}^{T} y_{n,t,k}}{\sum_{t=1}^{T} \exp(-c_{n,k} t)}$, which is a normalized average of the pixel counts. To remove the dependence on the unknown parameter $c_{n,k}$ (and thus avoid iterative steps), one can approximate this expression using a point-wise estimator. For this, note that $y_{n,t,k} \approx a_{n,k}\exp(-c_{n,k} t)$ is an exponentially decreasing function of $t$ whose maximum approaches $a_{n,k}$. We therefore take $\hat{a}_{n,k} = \max_t(y_{n,t,k})$ as our approximate estimator, computed over only the first 20 histogram bins, after assuming that no target is located in these first time bins.
3) Estimation of $c_{n,k}$: Given the previously estimated value of $\tilde{e}_{n,k}$, one can easily select the observed counts in the interval $t \in [1, \min(20, T_1)]$, where $T_1$ represents the intersection of the decreasing observed counts with the constant background level $\tilde{e}_{n,k}$, after assuming the absence of a target for $t < 20$ (i.e., $d_{n,k} > 20$ bins). For this interval, the log-likelihood of $c_{n,k}$ (with the other parameters fixed to their point estimates) can be expressed and maximized analytically, yielding $\hat{c}_{n,k}$.

4) Estimation of $h_{n,k}$: At this stage, the noise parameters $\Theta_n = (A, C, \tilde{E})$ are estimated and will help unmix the signal from the background counts. More precisely, the signal counts are first located using a log-matched filter strategy, i.e., by cross-correlating each histogram with the logarithm of the IRF to obtain a crude depth estimate $\hat{d}_{n,k}$.
The signal counts are then taken to be the photons whose ToF falls within the window $[t_l, t_h]$, where $t_l = \max(1, \hat{d}_{n,k} - 3\sigma)$, $t_h = \min(T, \hat{d}_{n,k} + 3\sigma)$, $\sigma$ represents the standard deviation of the Gaussian approximation of the system impulse response, and $q^{(\ell)}$ represents the number of neighbours in $\phi_\ell$. Depth and reflectivity maximum-likelihood estimates can then be obtained using (11).
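The per-pixel background estimation and signal/background unmixing described above can be sketched as follows. The median and max-of-early-bins estimators follow the $\tilde{e}_{n,k}$ and $a_{n,k}$ estimators above, whereas the log-linear least-squares fit of the decay rate and the background-subtracted definition of the signal counts are simple stand-ins, introduced only for illustration; function names and numerical values are ours.

```python
import numpy as np

def estimate_background(y, first_bins=20):
    """Per-pixel estimates of the background parameters of model (5),
    assuming no target return falls within the first `first_bins` bins.

    e : robust constant level  -> median of the counts
    a : exponential amplitude  -> max of the early bins
    c : decay rate             -> least-squares slope of log-counts over the
        early bins (a simple stand-in for the analytical ML estimate)."""
    e = np.median(y)
    early = y[:first_bins].astype(float)
    a = early.max()
    t = np.arange(first_bins)
    slope, _ = np.polyfit(t, np.log(np.maximum(early, 1e-3)), 1)   # floor avoids log(0)
    c = max(-slope, 0.0)
    return e, a, c

def signal_window(d_hat, sigma, T):
    """Bins attributed to the target: a +/- 3 sigma window around the crude
    log-matched-filter depth estimate, as used to set the labels h_{n,p,k}."""
    t_low = max(0, int(d_hat - 3 * sigma))
    t_high = min(T - 1, int(d_hat + 3 * sigma))
    return t_low, t_high

# e, a, c = estimate_background(y)            # y: one pixel's histogram (length T)
# t_l, t_h = signal_window(d_hat=150, sigma=3.0, T=y.shape[-1])
# signal_counts = y[t_l:t_h + 1].sum() - (t_h - t_l + 1) * e   # one simple definition
```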

F. Weights selection
The choice of the weights is very important and has a direct impact on the algorithm's performance. Several strategies have been considered in the literature, where the choice can be based on the spatial distance between points, the similarity of their values, etc. [23]-[25]. In this paper, we define the weights based on multi-scale and multi-temporal information.
Inspired by [26], which adopted the rank-ordered mean approach to unmix signal from background counts, our selection of the multi-scale weights $W$ is based on the empirical observation that the depth map at a given scale, $\hat{d}^{(\ell)}$, should be close to its median-filtered image $\tilde{d}^{(\ell)}$ (computed with a graph of neighbours $\phi_0$, for example $3 \times 3$ neighbours on a uniform grid) and to the median image of the following scale, $\tilde{d}^{(\ell+1)}$. Armed with this observation, we assign low weights to pixels that differ significantly from their corresponding pixels in $\tilde{d}^{(\ell)}$ and $\tilde{d}^{(\ell+1)}$, for $\ell \in \{1, \cdots, L-1\}$, as expressed in (27), where $w_{\mathrm{norm}}$ is a normalization constant and the coefficient $\zeta_d$ is easily fixed based on physical considerations related to the impulse response width; it is weighted by the square root of $q^{(\ell)}$ to account for the multi-resolution effect. In (27), the product over $\ell$ promotes lower-scale data when their weights are high. In addition, we define a scale-quality coefficient $\alpha_\ell \in [0, 1]$, given in (28), which describes the overall quality of $\hat{d}^{(\ell)}_n$, where $\delta(\cdot)$ is the delta function. At scale $\ell$, a high (resp. low) $\alpha_\ell$ means that the image is of good (resp. bad) quality and the product in (28) is activated (resp. deactivated). Another useful feature of the proposed algorithm when dealing with high background levels is that it can also account for temporal information between frames, which is justified if the speed of the observed object is low relative to the frame rate of the system. Higher weights $V$ are associated with smaller spatial distances between pixels and/or temporal distances between frames, such that $\sum_{k',n'} v_{n',k'} = f_c \sum_{\ell,n'} w^{(\ell)}_{n',n}$, with $0 < f_c < 1$ a user-fixed parameter representing the confidence level in the current frame, and $w_{\mathrm{norm}}, h_{\mathrm{norm}}$ normalization constants ensuring $\sum_{k',n'} v_{n',k'} + \sum_{\ell,n'} w^{(\ell)}_{n',n} = 1$. We finally note that the weights could be updated across iterations, leading to a pseudo-Bayesian approach [27], but this is out of the scope of this paper and is left for future work.

G. Stopping criteria
Two criteria are considered to stop the iterative coordinate descent algorithm. The first is a maximum number of iterations. The second evaluates the new depth parameter values and stops the algorithm when their change between successive iterations $i$ falls below a threshold $\xi$.

H. Optional postprocessing: data super-resolution
The system acquires data at a frame rate of 150 kHz but at the relatively low spatial resolution of 32 × 32 pixels. It is therefore necessary to simulate an improved spatial resolution using super-resolution approaches to improve visualisation. This is a well-known problem and several methods have been proposed in the literature, ranging from simple interpolation algorithms (linear, bicubic) and fast linear-based algorithms (guided filter [28]) to sophisticated approaches requiring a higher computational cost, as in [29]-[31]. In this paper, we propose a fast algorithm which achieves real-time, good-quality reconstruction of 3D scenes. Our approach is based on three successive steps, as highlighted in Algo. 2. The first generates crude estimates of the high-resolution images using simple interpolation methods. The second step improves edge estimation to remove block effects by considering a self-guided weighted-median filtering with a window of size r × r, which can be efficiently implemented using the strategy in [22]. For each pixel, the weights are defined depending on the pixel's similarity to its r × r neighbours, which filters homogeneous regions while preserving edges. A final filtering step is applied to the reflectivity using a self-guided filter with parameters (radius = 5, reg = 10⁻⁴), while the depth is smoothed using a point-based strategy.

IV. RESULTS ON SIMULATED DATA

The proposed reconstruction algorithm, called M2R3D, is evaluated and compared to different state-of-the-art algorithms by considering simulated data. More precisely, our algorithm has been compared to the UA [1], Lindell [32] and NR3D [12] algorithms, which are robust to noise, and to the state-of-the-art MANIPOP algorithm [10], which is designed to address multi-peak scenarios. Both the UA and MANIPOP algorithms were used with the hyper-parameters provided by the authors, except at extreme noise conditions where larger spatial correlations are considered to improve their robustness. The algorithms are compared in terms of depth accuracy using the depth absolute error (DAE) measure, DAE $= \frac{1}{N}\sum_{n=1}^{N}\left|d_{\mathrm{ref},n} - x_n\right|$, where $d_{\mathrm{ref}}$ and $x$ are the reference and estimated depth maps, respectively. The true positive percentage (i.e., percentage of true detections), representing the percentage of pixels satisfying a given depth absolute error, is also considered (see the sketch after the data description below). To highlight the benefit of the proposed $\ell_1$-based hierarchical Bayesian model, we also compare to the $\ell_2$-based model obtained without marginalization, i.e., considering the prior in (12) while fixing $\gamma^{(\ell)}_{n',n} = 1/w^{(\ell)}_{n',n}$ and similarly for $\eta^{(k')}_{n',n}$. The latter model involves $\ell_2$-norms whose edge-blurring effect is well known [9], hence the use of the proposed $\ell_1$-based regularization. The simulated realistic scenes were obtained from the Middlebury dataset (available at http://vision.middlebury.edu/stereo/data/) [1], [33], where we considered the bowling scene represented in Fig. 4. We analysed two scenarios, corresponding to a photon-starved regime in the presence of a uniform background, and a high-photon-return regime in the presence of an exponentially decreasing background due to obscurant. For both scenarios, we analysed different signal-to-background ratio (SBR) levels, defined as the ratio of the signal and background counts averaged over all pixels [12]. The data of the first scenario were generated as follows:
• We generated histograms of size 123 × 139 pixels and T = 300 bins while considering a real measured impulse response g(·) (having a leading edge of 10 bins and a trailing edge of 70 bins).
• We considered six levels of target average photons per pixel (Sppp, for signal ppp, which represents the integrated number of counts in the signal peak after removing background counts), ranging from 0.2 ppp to 25 ppp.
• We considered two average photons-per-pixel background levels (denoted Bppp, for background ppp, which represents the per-pixel number of background counts), Bppp ∈ {1, 8}, leading to SBR values in the interval [0.025, 25].
This data allows a fair comparison with the UA, NR3D and MANIPOP algorithms as they assume a constant background level. Note, however, that the Lindell algorithm [32] is a learning-based algorithm that was trained on a dataset having different characteristics (e.g., a different impulse response width, 1024 time bins, etc.). Therefore, to ensure a fair comparison, we used the simulator provided by the authors to simulate data at different Bppp ∈ {2, 10, 50} and Sppp ∈ {2, 5, 10} levels, for the Middlebury Art scene of size 185 × 232 pixels. To highlight the benefit of the proposed algorithm, we also considered a second set of data having an exponentially decreasing background level, as observed on real data. The considered Sppp and Bppp values were inspired by those observed on real data (see Table III), as follows:
• We generated histograms of size 62 × 70 × 340 while considering a real measured impulse response g(·) (having a leading edge of 10 bins and a trailing edge of 70 bins).
• We considered four levels of target average photons per pixel (Sppp): 100, 250, 500, and 1000 ppp.
• We considered two average photons-per-pixel background levels, Bppp ∈ {1750, 3500}, leading to SBR values in the interval [0.02, 0.57], where a_{n,k} = 2ẽ_{n,k}, c_{n,k} = 0.004 and t ∈ [1, 340] bins.
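The two evaluation metrics used in this and the following section can be computed as in the short sketch below; the function names are illustrative, depth maps are assumed to be expressed in metres, and the 5.6 cm tolerance quoted for the real-data experiments is shown only as an example value.

```python
import numpy as np

def depth_absolute_error(d_est, d_ref):
    """Depth absolute error (DAE): mean absolute difference between the
    estimated and reference depth maps, averaged over all pixels."""
    return np.mean(np.abs(np.asarray(d_est) - np.asarray(d_ref)))

def true_positive_percentage(d_est, d_ref, tol=0.056):
    """Percentage of pixels whose absolute depth error is below the tolerance
    `tol` (e.g. 0.056 m, i.e. about 1.5 time bins in the real-data experiments)."""
    err = np.abs(np.asarray(d_est) - np.asarray(d_ref))
    return 100.0 * np.mean(err <= tol)
```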
In all that follows, the proposed algorithm has been run with the following parameters: L = 3 with $q^{(2)} = 3 \times 3$ and $q^{(3)} = 9 \times 9$, $\zeta_d = 3.75$ cm, and $f_c = 0.1$. Fig. 5 shows the DAE results obtained by the considered algorithms for the sparse photon regime scenario. The algorithms provided similar performance for both uniform background levels, with a clear advantage for M2R3D. As expected, marginalization improves performance, as highlighted by comparing M2R3D without marginalization to M2R3D. The proposed strategy showed a lower computational cost than UA, NR3D and MANIPOP, as indicated in Table I. It should also be noted that the complexity of both UA and MANIPOP is proportional to Sppp, and NR3D scales with the scene complexity (it is slower to converge for extreme cases), while the M2R3D computational time remains fixed for variable Sppp and is mainly related to the size of the image. Finally, Fig. 6 shows examples of the estimated depth images, indicating sharp and accurate results for the proposed strategy. This figure shows that UA over-smooths images while MANIPOP, which is designed for multi-peak scenarios, is more conservative, leading to a number of "empty" pixels where no depth value can be estimated, especially in the extremely noisy cases. Furthermore, the proposed algorithm provides uncertainty estimates regarding the depth estimates, as indicated by the maps in the bottom row. The highest variances (i.e., uncertainties) are located near the edges or in the presence of strong outliers. For completeness of this study, the figure also shows the results of the recent RT3D algorithm designed to deal with multi-peak scenarios [20], which is also conservative and results in a number of empty pixels where a depth value cannot be estimated. Note, however, that RT3D differs from the other algorithms used in this comparison as it processes a point cloud instead of an image, which makes a direct comparison more difficult. Note also that the obtained point cloud has been projected onto the X-Y plane to obtain the images in Fig. 6, where white regions indicate an absence of detection. Fig. 7 shows the estimated point clouds at Sppp = 0.4, indicating that UA over-smooths the surface and connects edges, NR3D shows noisy results, while M2R3D shows the best compromise for this extreme case, in addition to providing uncertainty maps that help select reliable estimates.
As previously indicated, we compared M2R3D with the Lindell denoising algorithm [32] on 185 × 232 × 1024 histogram cubes with a uniform background (generated using the simulator provided with [32]). Fig. 8 shows the obtained true positive percentage w.r.t. distance for different Sppp and Bppp levels using M2R3D and the Lindell algorithm. This figure shows that M2R3D reaches higher percentages faster, indicating a good generalizability of the proposed algorithm to other data formats. An example of the estimated depth maps for the extreme case of Sppp = 2 and Bppp = 5 is shown in Fig. 9. It is clear that M2R3D better preserves object edges, can estimate small objects (e.g., see the paint brushes), and is less sensitive to the intensity values (e.g., see the bottom-left region in the Lindell map). The second scenario is more challenging as it exhibits a background that is non-uniform with respect to time and contains more photon detections, which penalizes UA and MANIPOP, whose complexity is proportional to the number of detected photons. MANIPOP was not considered in this scenario due to its high computational cost and because it was specifically designed for the sparse photon regime. Fine-tuning the RT3D hyperparameters was not successful for this data, and thus it is not considered either. Fig. 10 shows the depth DAE obtained with the different algorithms, where the best results are obtained with the proposed strategy. Table II also indicates that the proposed algorithm is the fastest in this photon-dense regime. Finally, Fig. 11 shows examples of the estimated depth images, where UA and NR3D fail in the presence of a high non-uniform background, which leads to a large fraction of pixels failing to register a depth measurement, especially at low values of Sppp. Again, the proposed algorithm exhibits higher uncertainty around the edges, while showing a low standard deviation in the other regions, indicating good confidence in the provided estimates.

V. RESULTS ON REAL DATA

This section evaluates quantitatively the results of the proposed algorithm on real data. We first provide more details regarding the real data considered in the main paper. Table III indicates the average signal and background levels obtained by M2R3D for Figs. 5, 6, and 7 of the main paper for the pixels located on the planar board targets and satisfying a depth error lower than 5.6 cm (±1.5 time bins). The averaged SBR = Sppp/Bppp is also indicated, together with the minimum and maximum values, for completeness. These SBR values are averaged across all pixels on the four wooden panels, with considerable variability being observed between individual pixels, as outlined in Table III. Note that these results indicate a high uncertainty on the attenuation length (AL), especially for the measurements at high obscurant densities at 150 m, which show very low Sppp levels. These results suggest an AL that may be even higher than that stated in the main paper, where the AL was estimated using a different method. In order to retain statistical accuracy and avoid the deleterious effects of pulse pile-up [34]-[36], the time-correlated single-photon technique requires the photon count rate to be much lower than the excitation pulse rate. In the case of the measurements presented in this paper, none of the data sets were in the pulse pile-up regime. For example, in the results shown in Figure 5 of the main paper, the worst-case data set had an average per-pixel count rate that was 4% of the pulse rate; the maximum per-pixel value in this worst-case data set was 7%.
Second, we consider imaging the 3D wooden panel target (described in the main paper) located at a distance of 50 m from the LiDAR system. The criterion used is the true positive percentage, representing the percentage of pixels satisfying a given depth absolute error (DAE), where the reference depth is obtained by considering the low-resolution depth image produced by the proposed algorithm in clear conditions (i.e., a very low level of obscurant). Fig. 12 represents the true positive percentage with respect to (top) time in seconds, for a fixed DAE = 5.6 cm, and (bottom) the depth absolute error in cm, for the frame at time = 208 s. The dashed lines in Fig. 12 (top) represent the time instants of the four images shown in Fig. 4 of the main document, i.e., 201 s, 208 s, 217 s, and 323 s. This figure shows higher percentages for the proposed M2R3D, which quantitatively highlights the benefit of the proposed strategy on real data. Fig. 12 (top) shows that M2R3D reached a true positive percentage > 90% after 210 s, ahead of the other algorithms by ≈ 100 s. Fig. 12 (bottom) also shows that M2R3D presents a true positive percentage > 70%, which is only reached by the other algorithms when allowing an 80 cm absolute depth error (i.e., DAE = 0.8 m).

Supplementary material 2: System description
A diagram of the bi-static LiDAR imaging system is shown in Supplementary Fig. 13. The system was mounted on a breadboard set-up having pitch and yaw adjustments, thus facilitating the alignment of the LiDAR system to the scene of interest.
On the illumination channel, the output fibre from the laser module was connected to a reflective collimation package, and the collimated beam was passed through a 9 nm full-width at half-maximum (FWHM) bandpass filter (BP2) with a central wavelength matched to that of the pulsed laser source. This was used to remove any potential out-of-band light or amplified spontaneous emission from the laser source. A pair of lenses in an adjustable telescope arrangement was used to control the divergence of the outgoing laser beam and thus the extent of the flood illumination at the target scene. These achromatic lenses were optimised for use in the SWIR region of the spectrum.
Controlled adjustments to the pitch and yaw of the optical arrangement for the illumination channel enabled the beam to be accurately positioned with respect to the field-of-view (FoV) of the 32 x 32 SPAD detector array. A combination of a longpass filter with a cut-on wavelength of 1500 nm (LP1), and a 9 nm FWHM bandpass filter similar to that used in the illumination channel, was inserted between OBJ1 and the SPAD detector array to limit out-of-band ambient light detection.
Each SPAD detector in the array had a quoted single-photon detection efficiency of approximately 24% and a measured dark count rate averaging approximately 320 kcps (kilo-counts per second). A VIS-SWIR digital camera (Ninox 640, Raptor Photonics) based on a 640 × 512 array of InGaAs PIN photodiode detectors was also mounted on the system breadboard and used for checking the alignment of the illumination and SPAD detector array camera channels to the scene of interest, and for monitoring the scene during a measurement. A 1500 nm longpass spectral filter was used to limit the spectral range detected by this VIS-SWIR camera.

Fig. 13. A schematic diagram of the bi-static LiDAR imaging system configuration, which comprised a λ = 1550 nm pulsed fibre laser source and a 32 × 32 InGaAs/InP SPAD detector array. Optical components include: objective lenses for both the detection (OBJ1) and transmission (OBJ2) channels; a longpass filter (LP1); and bandpass filters (BP1, BP2). Details of the high-performance optical filters and the zoom mechanism are given in the text.