Long-range depth imaging using a single-photon detector array and non-local data fusion

The ability to measure and record high-resolution depth images at long stand-off distances is important for a wide range of applications, including connected and autonomous vehicles, defense and security, and agriculture and mining. In LIDAR (light detection and ranging) applications, single-photon sensitive detection is an emerging approach, offering high sensitivity to light and picosecond temporal resolution, and consequently excellent surface-to-surface resolution. The use of large-format CMOS (complementary metal-oxide semiconductor) single-photon detector arrays provides high spatial resolution and allows the timing information to be acquired simultaneously across many pixels. In this work, we combine state-of-the-art single-photon detector array technology with non-local data fusion to generate high-resolution three-dimensional depth information of long-range targets. The system is based on a pulsed visible illumination source at a wavelength of 670 nm and a 240 × 320 array sensor, achieving sub-centimeter precision in all three spatial dimensions at a distance of 150 meters. The non-local data fusion combines information from an optical image with a sparse sampling of the single-photon array data, providing accurate depth information in low-signature regions of the target.


I. SUPPLEMENTARY INFORMATION
FIG. 1. Illustration of the scan path taken by the laser illumination during the data acquisition. The laser is scanned along a path of 20 by 20 positions, and the illumination is set to cover around a 50 by 50 pixel area. The data from the 20 by 20 scan are combined to produce a single image, corresponding to a particular gate setting. The gate is then scanned in order to produce data for a three-dimensional image.

FIG. 2. Sample fit of the error function to the intensity data recorded by three pixels on the SPAD sensor array. Correction for temporal mismatch has not been applied. The blue line is the data after preprocessing and the magenta line is the fit. The position of the targets can be established to a precision within 1 cm.

A. Data preparation
The two-dimensional image for a particular depth is constructed from individual image frames captured by performing a 20 by 20 scan of the target scene, see Fig. 1. The first 30 frames of each 400-frame dataset are discarded, as they are captured before the SPAD sensor array is fully initialized. A further 20 frames at the end of each dataset are discarded since they do not contribute to the three-dimensional (3D) reconstruction of the scene. Removing these redundant frames reduces the computation time. The remaining 350 frames for each fixed depth are then divided into 175 pairs, each made up of a laser scan position in the upper half of the array and the corresponding scan position in the lower half: (1, 176), (2, 177), · · ·, (175, 350).
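For concreteness, this frame selection and pairing takes only a few lines of Matlab. The following is a minimal sketch, assuming the raw data for one gate setting is held in a 240 × 320 × 400 array named frames; the variable names are ours, not taken from the acquisition code.

    % Minimal sketch of the frame selection and pairing step.
    % 'frames' is assumed to be a 240 x 320 x 400 stack for one gate setting.
    frames = frames(:, :, 31:380);   % drop the first 30 and last 20 frames
    nPairs = 175;
    pairs  = cell(nPairs, 1);
    for i = 1:nPairs
        % pair scan position i with the corresponding position i + 175
        pairs{i} = cat(3, frames(:, :, i), frames(:, :, i + nPairs));
    end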
The initial step of our preprocessing is to remove the noise that originates from isolated hot pixels with very high dark count rates. For each frame in the pair, we apply hot pixel compensation (the inpaint_nans function for Matlab), wherein we replace hot pixel elements in the array with an 8-neighbor average. Here, we define a hot pixel as one with a dark count higher than 20% of the total number of bit planes added.
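The paper relies on the inpaint_nans routine from the Matlab File Exchange; the 8-neighbor replacement can also be written out directly. Below is a minimal sketch under the assumption that the hot pixel mask hot has been determined beforehand from the dark counts; the function name is ours.

    function f = compensate_hot_pixels(f, hot)
    % Replace each hot pixel by the average of its non-hot 8-neighbors.
    % 'hot' is a logical mask marking pixels whose dark count exceeds 20%
    % of the number of bit planes; it is assumed to be measured beforehand.
        f = double(f);
        k = [1 1 1; 1 0 1; 1 1 1];                   % 8-neighbor kernel
        tmp = f;  tmp(hot) = 0;                      % exclude hot pixels from sums
        sums   = conv2(tmp, k, 'same');              % sum over valid neighbors
        counts = conv2(double(~hot), k, 'same');     % number of valid neighbors
        f(hot) = sums(hot) ./ max(counts(hot), 1);   % 8-neighbor average
    end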
Next, we filter out the noise that comes from daylight illumination of the target scene by taking the absolute value of the difference between the two hot pixel-compensated frames in each pair,

$f_i^a = \left| \tilde{f}_i - \tilde{f}_{i+175} \right|$,  (1)

where $\tilde{f}_i$ denotes the $i$th hot pixel-compensated frame and $f_i^a$ indicates the result after step 2. Because the laser illuminates different scan positions in the two frames of a pair, the almost invariant sunlight background cancels in the difference while the target signal is preserved.
We further suppress the noise due to dark counts by applying the Matlab median filter across pixels in a 3-by-3 square neighborhood, so that pixels that are brighter or darker than those in their neighborhood are replaced by the median value,

$f_i^b = \mathrm{medfilt2}(f_i^a)$,  (2)

where $f_i^b$ indicates the result after step 3.
We also want to ignore all pixels whose intensity is below a threshold of 10. The laser illuminates only a small area in each frame, but the intensity of pixels in the rest of the frame is not exactly zero. To force the intensity of pixels that lie outside the illumination area to zero, we replace all intensities below the threshold with zero and reduce all other intensities by the threshold value,

$f_i^c = \max(f_i^b - 10,\, 0)$,  (3)

where $f_i^c$ indicates the result after step 4.
We carry out these steps for all 175 pairs, after which the noise is suitably filtered out. We then sum all 175 frames together to obtain a 2D image of the target scene at the given depth sample,

$f^d = \sum_{i=1}^{175} f_i^c$,  (4)

where $f^d$ indicates the result after this last step.
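A minimal Matlab sketch of steps 1-4 for one depth sample, continuing from the pairing and hot pixel sketches above (medfilt2 is part of the Image Processing Toolbox):

    % Minimal sketch of preprocessing steps 1-4 for one depth sample.
    thresh = 10;
    fd = zeros(240, 320);                                     % accumulator, Eq. (4)
    for i = 1:175
        p1 = compensate_hot_pixels(pairs{i}(:, :, 1), hot);   % step 1
        p2 = compensate_hot_pixels(pairs{i}(:, :, 2), hot);
        fa = abs(p1 - p2);                                    % Eq. (1): daylight removal
        fb = medfilt2(fa, [3 3]);                             % Eq. (2): median filter
        fc = max(fb - thresh, 0);                             % Eq. (3): thresholding
        fd = fd + fc;                                         % Eq. (4): sum over pairs
    end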
The process is repeated for the 240 by 320 by 400 by 51 four-dimensional data cube to obtain the 2D images for the full range of depths measured. Bright pixels in an image indicate that the illuminated objects are inside the corresponding time gate while dark pixels indicate that they are outside.
B. Non-local data fusion

1. Observation model
For each depth location, the proposed LIDAR system scans the scene using $N_b$ beams, leading to one photon count image per depth location. Denoting by $y_{n,k}$ the number of photon counts within the $k$th depth sample of the $n$th pixel, and given the acquisition process, the observed data $y_{n,k}$ can be assumed to be distributed according to a Poisson distribution $\mathcal{P}(.)$ as follows

$y_{n,k} \sim \mathcal{P}(s_{n,k})$,  (5)

where $s_{n,k}$ denotes the average photon count, whose shape is related to the system impulse response (IR) or gate. In this paper, we approximate it using the following parametric formulation

$s_{n,k} = \frac{r_n}{2} \left[1 + \mathrm{erf}\left(\frac{k - d_n}{h}\right)\right] + b_n$,  (6)

where $\mathrm{erf}(.)$ denotes the error function, $h \geq 0$ is a fixed IR intrinsic parameter which represents the width of the leading edge, and $d_n \geq 0$ and $r_n \geq 0$ are related to the target's depth and intensity, respectively. Here, we assume the absence of background, $b_n = 0, \forall n$, since the noise has been removed at the preprocessing stage. Figure 2 shows a sample fit to the experimental data.

The proposed method aims at estimating the target parameters $\Theta = (d, r)$ from the observed histograms $y_{n,k}, \forall n, k$, and their statistics in Eq. (5). A common approach to achieve this task is to maximize the likelihood $P(Y|\Theta)$ with respect to the parameters which, assuming independence between the observed pixels conditional on $\Theta$, factorizes as

$P(Y|\Theta) = \prod_{n=1}^{N} \prod_{k=1}^{K} P\left(y_{n,k} \,|\, s_{n,k}\right)$,  (7)

or, equivalently, to minimize the negative log-likelihood

$\mathcal{L}(Y|\Theta) = -\log P(Y|\Theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} \left[s_{n,k} - y_{n,k} \log(s_{n,k})\right] + \mathrm{const}$,  (8)

where $Y$ is a $K \times N$ matrix gathering the vectors $y_n = (y_{n,1}, \cdots, y_{n,K})^\top$ in its columns, $K$ is the number of depth samples, and $N$ is the number of pixels.
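As an illustration, the per-pixel maximum-likelihood fit of Eqs. (5), (6) and (8) can be sketched with fminsearch. The sketch below assumes the gate-edge width h is known from calibration and that y holds the K counts of a single pixel; the initialization is a crude guess of ours, and non-negativity is not enforced here.

    % Minimal sketch of the per-pixel ML fit of the erf gate model.
    % y is a K x 1 count vector for one pixel; t = [d; r].
    k   = (1:numel(y))';                                       % depth sample index
    s   = @(t) max(t(2)/2 * (1 + erf((k - t(1)) / h)), eps);   % Eq. (6), b_n = 0
    nll = @(t) sum(s(t) - y .* log(s(t)));                     % Poisson NLL, Eq. (8)
    t0  = [find(cumsum(y) > sum(y)/2, 1); max(y)];             % crude initialization
    t   = fminsearch(nll, t0);
    d_hat = t(1);  r_hat = t(2);                               % depth and intensity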

2. Fusion-based regularization
Estimating the parameters $\Theta$ in under-sampling conditions is an ill-posed inverse problem that requires the introduction of prior knowledge (or regularization terms) related to the target depths and intensities. In this paper, we propose to solve the following optimization problem

$\min_{\Theta} \; \mathcal{L}\left(Y M^\top | \Theta\right) + \tau_1 \phi_1^v(d) + \tau_2 \phi_2^w(r) + i_{\mathbb{R}^+}(\Theta)$,  (9)

where $\tau_1 > 0, \tau_2 > 0$ are two regularization parameters, $M$ is a $Q \times N$ binary matrix that contains a single non-zero value in each row to model the loss of some image pixels due to compressive sampling, $Q$ is the number of non-empty pixels, $i_{\mathbb{R}^+}(\Theta)$ is the indicator function imposing non-negativity on the elements of $\Theta$ ($i_{\mathbb{R}^+}(x) = 0$ if $x$ belongs to the non-negative orthant and $+\infty$ otherwise), and $\phi_1^v$, $\phi_2^w$ are two regularization functions that depend on two weighting vectors $v$ and $w$, respectively. These weights can be evaluated by considering the single-photon data or, more interestingly, by adopting a fusion approach. The latter can be performed by learning the weights on other imaging modalities of the same scene, which contain complementary information. The regularization terms and the weights are defined in the following sections.
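Schematically, the objective of Eq. (9) is the sum of the restricted data fidelity and the two weighted regularizers; a minimal sketch, assuming function handles nll, phi_v and phi_w implementing Eq. (8) on the Q sampled pixels and the regularizer of Eq. (10) sketched below (names are ours):

    % Schematic assembly of the objective in Eq. (9).
    tau1 = 1;  tau2 = 1;     % regularization parameters (chosen empirically)
    obj = @(d, r) nll(d, r) + tau1 * phi_v(d) + tau2 * phi_w(r);
    % The indicator i_{R+}(Theta) is enforced by the solver, e.g. by
    % projecting d and r onto the non-negative orthant at each iteration.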
Prior on parameters. Most image restoration algorithms exploit the spatial correlation of the image to reduce noise and reconstruct missing elements. This can be done by imposing sparsity constraints on image features such as their gradient or their projection onto a dictionary. In this paper, we consider a non-local regularization approach, as such approaches have shown clear advantages in reconstructing natural images (especially textured images) [1-3]. The proposed regularization term can be expressed as

$\phi^v(x) = \frac{1}{2} \left\| H^v x \right\|_2^2 = \frac{1}{2} \sum_{m=1}^{n_d} \left\| \mathrm{Diag}(v_m) H_m^{\mathrm{Diff}} x \right\|_2^2$,  (10)

where $H^v \in \mathbb{R}^{n_d N \times N}$ is a block-circulant-circulant-block (BCCB) matrix which computes weighted differences between each pixel and $n_d$ other pixels located in a predefined neighborhood, $v \in \mathbb{R}^{n_d \times N}$ is a matrix of weights associated with each pixel and each direction ($v_m$ denoting its $m$th row), and $H_m^{\mathrm{Diff}} \in \mathbb{R}^{N \times N}$ computes the difference between each pixel and the one located in the $m$th direction. This regularization enforces small weighted quadratic differences between similar pixels in the image, i.e., it promotes spatial correlation between pixels. The same regularization term, Eq. (10), is applied to both depth and intensity, with different weight vectors, as described in the following section.
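A minimal sketch of Eq. (10), in which circshift plays the role of the circulant difference operators $H_m^{\mathrm{Diff}}$; the inputs offsets (an n_d × 2 list of neighborhood directions) and the weight matrix v are assumptions of ours:

    function p = phi_nonlocal(x, v, offsets)
    % Non-local quadratic regularizer of Eq. (10) for an image x.
        p = 0;
        for m = 1:size(offsets, 1)
            xm = circshift(x, offsets(m, :));             % pixel in mth direction
            dm = x(:) - xm(:);                            % H_m^Diff * x
            p  = p + 0.5 * sum((v(m, :).' .* dm).^2);     % weighted quadratic term
        end
    end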
Weight selection. The weights control the level of regularization imposed on each pair of pixels and should be chosen carefully to improve performance. As previously mentioned, these coefficients can be estimated from the data themselves or, preferably, by considering a complementary modality of acquisition. Therefore, in addition to the three-dimensional (3D) LIDAR data, we also acquire a co-registered passive RGB optical image of the same scene, denoted by $z$ (of size $3 \times N$). The intensity weights $w$ will then be learned on this optical image as follows

$w_{m,n} = \max\left[\exp\left(-\frac{\|z_n - z_m\|_1}{\sigma_w}\right),\, 0.5\right]$,  (11)

where we consider an $\ell_1$ norm to compute pixel differences in order to preserve sharp edges, and truncate weight values smaller than 0.5 to avoid strongly downweighting pronounced edges, as in [4] ($\sigma_w$ is chosen to be 0.1). The coefficients $v$ should reflect our knowledge about the target's depth. In this paper, we assume that close regions share similar depths, as in local correlation approaches [5, 6]. Under these considerations, the weights can be evaluated as

$v_{m,n} = \frac{1}{x_v \left\| P_n - P_m \right\|_2}$,  (12)

where $P_n = (r_n, c_n)$ gathers the row and column coordinates of the $n$th pixel and $x_v$ is a normalization constant given by the maximum value of $\frac{1}{\|P_n - P_m\|_2}, \forall n, m$. All weighting functions and coefficients in this paper are chosen empirically.
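A minimal sketch of both weight maps, under the same assumptions as the regularizer sketch above; zimg is the co-registered RGB image (nr × nc × 3, values in [0, 1]), and the exponential form of w reflects our reading of the truncated weighting described in the text:

    % Minimal sketch of the weight selection for w and v.
    sigma_w = 0.1;
    [nr, nc, ~] = size(zimg);
    n_d = size(offsets, 1);
    w = zeros(n_d, nr * nc);
    v = zeros(n_d, nr * nc);
    for m = 1:n_d
        zm = circshift(zimg, [offsets(m, :) 0]);     % RGB values in mth direction
        d1 = sum(abs(zimg - zm), 3);                 % l1 pixel difference
        w(m, :) = max(exp(-d1(:).' / sigma_w), 0.5); % assumed form, cf. Eq. (11)
        v(m, :) = 1 / norm(offsets(m, :));           % inverse pixel distance, Eq. (12)
    end
    v = v / max(v(:));                               % normalize by x_v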

3. Results
This section evaluates the performance of the proposed non-local fusion-based algorithm in a compressive sensing scenario, i.e., with a reduced number of scan positions. The evaluation is carried out on the data presented in the main text. The proposed algorithm is compared to the following:
• Classical algorithm: the depth and intensity parameters are estimated by fitting the IR in Eq. (6) to the data;
• TV algorithm: a total variation prior is applied in Eq. (9) to restore the depth and intensity parameters.
Figure 3 shows that the classical algorithm provides a corrupted depth map when considering 25% of the scanned positions, while the TV algorithm and the proposed one both provide similarly good results owing to their ability to account for spatial correlations (see Figs. 4 and 5, respectively). Further reducing the number of scanned positions leads to a challenging reconstruction problem, and the TV results are affected when only 10% of the scanned positions are considered. In contrast, the proposed algorithm remains robust since it uses non-local information and benefits from the complementary information provided by the passive optical camera. This is further highlighted when considering 5% of the scanned positions: the proposed algorithm is able to reconstruct part of the mannequin's left arm, which is missing from the heavily corrupted depth images provided by the other algorithms. Scanning 25%, 10%, and 5% of the positions corresponds to sampling 83.7%, 61.8%, and 35.8% of the pixels, respectively. Correction for temporal mismatch has been applied. These results show the clear advantage of the proposed algorithm for data reconstruction under an extremely reduced number of scan positions, i.e., a reduced acquisition time.