Non-Line-of-Sight Tracking and Mapping with an Active Corner Camera

The ability to form non-line-of-sight (NLOS) images of changing scenes could be transformative in a variety of fields, including search and rescue, autonomous vehicle navigation, and reconnaissance. Most existing active NLOS methods illuminate the hidden scene using a pulsed laser directed at a relay surface and collect time-resolved measurements of returning light. The prevailing approaches include raster scanning of a rectangular grid on a vertical wall opposite the volume of interest to generate a collection of confocal measurements. These are inherently limited by the need for laser scanning. Methods that avoid laser scanning track the moving parts of the hidden scene as one or two point targets. In this work, based on more complete optical response modeling yet still without multiple illumination positions, we demonstrate accurate reconstructions of objects in motion and a 'map' of the stationary scenery behind them. The ability to count, localize, and characterize the sizes of hidden objects in motion, combined with mapping of the stationary hidden scene, could greatly improve indoor situational awareness in a variety of applications.


The Active Corner Camera
The ability to form non-line-of-sight (NLOS) images of changing scenes could be transformative in a variety of fields, including search and rescue, autonomous vehicle navigation, and reconnaissance. Most existing active NLOS methods illuminate the hidden scene using a pulsed laser directed at a relay surface and collect time-resolved measurements of returning light [1]. The prevailing approaches include raster scanning of a rectangular grid on a vertical wall opposite the volume of interest to generate a collection of confocal measurements [2][3][4][5]. These and a recent method that uses a horizontal relay surface [6] are inherently limited by the need for laser scanning. Methods that avoid laser scanning track the moving parts of the hidden scene as one or two point targets [7,8]. In this work, based on more complete optical response modeling yet still without multiple illumination positions, we demonstrate accurate reconstructions of objects in motion and a 'map' of the stationary scenery behind them. The ability to count, localize, and characterize the sizes of hidden objects in motion, combined with mapping of the stationary hidden scene, could greatly improve indoor situational awareness in a variety of applications.
The challenge of both active and passive NLOS imaging techniques is that measured light returns to the sensor after multiple diffuse bounces. With each bounce, light is scattered in all directions, eliminating directional information, and attenuating light by a factor proportional to the inverse-square of the path length. Particularly in the passive setting, where no illumination is introduced, occluding structures that limit possible light paths have been used to help separate light originating from different directions in the hidden scene [9][10][11][12][13][14][15][16][17]. Useful structures include the aperture formed by an open window [9] or the inverse pinhole [18] created when a once-present object moves between measurement frames. Unlike other occluding structures, whose shapes must be estimated or somehow known [11,12,14,15,19], vertical wall edges have a known shape and are often present when NLOS vision is desired. An edge occluder blocks light as a function of its azimuthal incident angle around the corner and, as a result, enables computational recovery of azimuthal information about the hidden scene. This was first demonstrated in the passive setting [10,20], where 1D (in azimuthal angle) reconstructions of the hidden scene were formed from photographs of the floor adjacent to the occluding edge; 2D reconstruction was demonstrated in a controlled static environment, although the longitudinal information present in the passive measurement was found to be weak [21].
In the active setting, most of the approaches proposed to date scan a pulsed laser over a set of points on a planar Lambertian relay wall and perform time-resolved sensing with a single-photon detector to collect transient information [2][3][4][5][22][23][24][25][26][27]. To reconstruct large-scale scenes, these approaches generally require scanning a large area of the relay wall and thus a large opening into the hidden volume. To partially alleviate these weaknesses, edge-resolved transient imaging (ERTI) [6] combines the use of an edge occluder from passive NLOS imaging with the transient measurement abilities of active systems. ERTI scans a laser on the floor along an arc around a vertical edge, incrementally illuminating more of the hidden scene with each scan position. Differences between measurements at consecutive scan positions are processed together to reconstruct a large-scale stationary hidden scene. However, the laser scanning requirement is still a limiting constraint. An earlier work using the floor as a relay surface shortens acquisition time by using a 32 × 32 pixel SPAD array in conjunction with a stationary laser [8]. Simultaneous measurements from the 1024 pixels and background subtraction are used to track the horizontal position of a hidden object in motion, modeled as a point reflector.
In this work, we use similar hardware as in [8] and also use a floor as a relay surface. As illustrated in Figure 1A, our Figure 1: An active corner camera use scenario is shown in (A). A pulsed laser pointed at the floor illuminates the hidden scene while a SPAD camera adjacent to the occluding wall measures the temporal response of returning light. An initial reference measurement is acquired to characterize the response of the stationary scene. When the moving object enters, the new measurement includes added photon counts due to the object and reduced photon counts at more distant ranges due to the occluded background region behind it. Using these changes, we reconstruct objects in motion as well as the occluded background regions behind them (B). By accumulating frames as an object moves through the hidden scene, we form a map of the stationary hidden scene.
desire for NLOS vision is caused by an occluding wall; unlike in [8], the edge of the wall is explicitly modeled and exploited to enable reconstruction of moving objects in the hidden scene. Like the passive corner-camera systems in [10,20,21,28], we position the SPAD field of view (FOV) adjacent to the wall edge, as shown in Figure 1A, to derive azimuthal resolution from the occluding edge. As in [6], we derive longitudinal resolution from the temporal response to the pulsed laser. However, unlike [6], our proposed system acquires data for each frame in a single snapshot without scanning, allowing us to track hidden objects in motion. Consider Figure 1A and note that a moving target not only adds reflected photons to the measurement, but also reduces photons due to the shadow it casts on the stationary scene behind it. Through additional modeling of occlusion within the hidden scene itself, we use these changes to reconstruct occluded background regions for each frame ( Figure 1B). As an object moves through the hidden scene, reconstructions of occluded background regions may be accumulated to form a map of the hidden scene ( Figure 1C). In contrast to [8], where x and y coordinates are estimated for a hidden target in motion at an assumed height, our algorithm counts hidden objects in motion and reconstructs their shape (i.e., height and width), location, and reflectivity while simultaneously mapping the stationary hidden scenery occluded by them. In our setup, the measurement rate at the nth spatial pixel in the kth time bin is Poisson distributed where b ∈ R N ×K is the rates due to stationary scenery, s fg ∈ R N ×K is the response of the foreground object, and s oc ∈ R N ×K is the response of the occluded background region, before the object enters. We assume that b is approximately known through a reference measurement acquired before moving objects enter the hidden scene or through other means. Vectors ψ fg and ψ oc contain parameters that describe the foreground objects and corresponding occluded background regions. We seek to recover the parameters ψ fg and ψ oc from the measurement x, for each measurement frame. As shown in Figure 2A for a single moving object, we model moving objects and their occluded background regions each as a single vertical, planar, rectangular facet resting on the ground. We assume there are M moving objects with parameters ψ fg = {(θ m , a m fg , r m fg , h m ), m = 1, . . . , M }. Marked in Figure 2A, a m fg is the albedo, r m fg is range, and h m is height of the mth object. Angles θ m = (θ m min , θ m max ) are the minimum and maximum polar angles of the foreground facet, measured around the occluding edge in the plane of the floor. The mth occluded region is described by range r m oc and albedo a m oc parameters ψ oc = {(a m oc , r m oc ), m = 1, . . . , M }. The height of the occluded region is not a separate parameter; it depends upon its range r oc and the corresponding moving object's range r fg and height h. When parameters ψ fg and ψ oc have been estimated for a sequence of measurement frames, the vertical lines running through the centers of estimated occluded background regions (light red) are joined by planar facets (blue) to form a contiguous map of the background, as shown in Figure 2B.
A method to quickly compute the rates due to a planar rectangular facet resting on the ground (i.e., s fg (ψ fg ) and s oc (ψ fg , ψ oc )) is a key part of our inversion algorithm. Take p l to be the position of the laser illumination and p f to be a point on the floor in the area of the nth camera pixel P n . The flux during the kth time bin at the nth camera pixel due to hidden surface S is where a(p s ) is the surface albedo at point p s , w(·) is the pulsed illumination waveform, ∆ t is the duration of a time bin, t 0 is the time the pulse hits the laser spot, and c is the speed of light. The factor G(·, ·, ·) is the Lambertian bidirectional reflectance distribution function (BRDF) and is the product of foreshortening terms (i.e., the cosine of the angle between the direction of incident light and the surface normal), as described in Appendix A. The factor v(p s , p f ) is the 'visibility function' that describes the occlusion provided by the occluding edge between hidden scene point p s and SPAD FOV point p f . As shown in the bird's eye view of Figure 3, point p f is located at angle γ measured from the occluding wall in the plane of the floor. Point p s is at azimuthal angle α, in the plane of the floor, measured around the corner from the the boundary between hidden and visible sides of the wall. Thus, point The yellow region in the SPAD FOV is the collection of all points p f not occluded from point p s by the wall, where v(p s , p f ) = 1.
In the green region, light from p s is blocked by the wall and v(p s , p f ) = 0. This fan-like pattern is the 'penumbra' exploited by the passive corner camera in [10,20,21,28]. In some previous works, computation time is reduced by making a confocal approximation [2,6] (i.e., assuming the laser and detector are co-located). Under this assumption, the set of points p s in the scene that correspond to equal round-trip travel time from p l , to p s , and back to p f , lie on a sphere. In contrast, as in [8], we seek to exploit the spatial diversity of our sensor array and thus require a more general ellipsoidal model that arises when p l and p f are not co-located. When S is a vertical, rectangular, planar facet, the intersection of a given round trip travel time (the ellipsoid) and the plane containing our facet is an ellipse. Using [29], we write that ellipse in translational form, enabling us to perform the integration in (2) in polar coordinates. This method, described further in Appendix A, allows us to compute s fg (ψ fg ) and s oc (ψ fg , ψ oc ) quickly enough to implement our inversion algorithm.
Before estimating parameters ψ fg and ψ oc for a given frame, we estimate the number of moving objects M . The passive corner camera processing of [20] is applied to the temporally integrated difference measurement (e.g., Figure 4B) to produce a  Figure 3: A bird's eye view of the vertical edge occluder. The edge blocks light from scene point p s as a function of its azimuthal angle α, measured around the corner. A point p f in the SPAD FOV at angle γ is illuminated by p s if γ ≥ α.
1D reconstruction of change in the hidden scene as a function of azimuthal angle α. The intervals where this 1D reconstruction is above some threshold are counted to determine M . Parameters ψ fg and ψ oc are then estimated from time-resolved measurement x using a maximum likelihood estimate (MLE) constrained over broad, realistic ranges of ψ fg and ψ oc . To approximate the constrained MLE, the Metropolis-Hastings algorithm is applied in two stages: first to estimate foreground parameters ψ fg , assuming no occlusion of the background, and second to estimate the parameters of the occluded background region ψ oc , assuming ψ fg = ψ fg . Further details about our procedure for estimating N , ψ fg , and ψ oc are included in Appendix B.
In Figure 4, we show reconstruction results for eight measurement frames acquired as two hidden objects move along arcs toward and then past each other as shown in Figure 4A. In this demonstration, the integration time (i.e., the total time over which the camera collects meaningful data) used for each new frame was 0.4 s. Integration time for the reference measurement was 30 s. The acquisition time (i.e., the total time required to collect, accumulate, and transfer data) was longer; see Appendix C. Measurements averaged spatially over all pixels are shown in Figure 4C. The top plot shows the stationary scene measurement (red) with the measurement acquired after objects have moved into the scene in Frame 1 (blue); their difference is shown on the axis below. A peak in the difference around 3 meters is due to the additional photon counts introduced by the two moving objects; a dip at 6 meters is due to their occluded background regions. Although it is impossible to separate the contributions from each target in this spatially integrated view of the data, the vertical edge occluder casts two distinct shadows in the temporally integrated measurement shown in Figure 4B. Our processing exploits spatiotemporal structure of the data that is not apparent from the projections in Figure 4B and C.
Single-frame reconstruction results are shown for three different frames in Figure 4D. In Frames 1 and 7, two targets are resolved with accurate heights, widths, and ranges. The reconstructed occluded background regions are placed accurately in range. In Frame 6, the closer target passes in front of the more distant one, and the single reconstructed target is placed at the range of the front-most object. Two views of the reconstructed maps (blue), accumulated over all eight measurement frames, are shown in Figure 4E to closely match the true wall locations (green).
In Figure 5, we demonstrate that our reconstruction algorithm works with dimmer moving objects as well as with objects that do not match our rectangular, planar facet model. Single-frame reconstruction results are shown for the white facet, a darker gray facet, a mannequin, and a staircase shaped object. In all four cases, the reconstructed foreground object is correctly placed in range. Although our model does not allow us to reconstruct the varying height profile of the stairs, we correctly reconstruct it to be wider and more to the right than the other hidden objects. In Appendix D, we demonstrate that our algorithm works under a wide range of conditions, including different hidden object locations, frame lengths, and lighting conditions.
In this work, we present an active NLOS method to accurately reconstruct both objects in motion and a map of stationary hidden scenery behind them. This innovation is made possible through careful modeling of occlusion due to the vertical edge and within the hidden scene itself. The algorithm presented in [8] attempts only to identify a single occupied point in the hidden scene, making detailed modeling of the scene response unnecessary. In this work, we also make no assumptions about light returning from the visible scene, allowing arbitrary visible scenery to be placed at the same ranges as the hidden objects of interest. This is true in [6] as well, however in their setup, with the single-element SPAD fixed in position and a very small laser scan radius, the contribution to the measurement from the visible side may be assumed constant across all measurements. In our configuration, the SPAD array has a non-negligible spatial extent resulting in a visible-side contribution that varies across the measurements. Our use of a stationary scene measurement allows us to effectively remove the contribution due to unknown visible-side scenery; our modeling of occlusion within the hidden scene itself allows us to perform this 'background subtraction' without losing all information about the stationary hidden scenery.
Although we have successfully demonstrated our acquisition method, various aspects of our system and algorithm could be improved upon. Our current algorithm processes each frame independently, using only broad constraints on the unknown parameters. An improved system could jointly process frames and benefit from inter-frame priors. Such priors could incorporate Temporally integrated measurements in (B) show the penumbra pattern, with a distinct shadow due to each of the two hidden objects. Spatially averaged measurements for the stationary scene (red) and one motion frames (blue) are shown on the top axis of (C), with their difference shown below. The peak near 3 m is due to the moving objects; the dip near 6 m is due to the background occlusion. Selected single-frame reconstructions are shown in (D). Two views of the map reconstructions, accumulated over 8 frames, are shown in (E). continuity of motion, the fact that object height, width, and albedo are unlikely to change between frames, and the fact that that walls in the hidden scene are typically smooth and continuous. In our demonstration, we use a thin occluding wall and do not model wall thickness. The thin-wall assumption is illustrated in Figure 3, where the angle α is measured around the same point regardless of the location of p s . When the the wall has appreciable thickness, cases α ∈ [0, π/2) and α ∈ [π/2, π] require different modeling. One could incorporate wall thickness into the model or estimate wall thickness as an additional unknown parameter. A method might also be designed to produce higher resolution reconstructions of each moving target. Each target could be divided horizontally into several vertical segments, each with an unknown albedo and height to be estimated. This type of algorithm might better resolve the staircase object in Figure 5. Through further analysis, it might also be possible to optimize certain parameters in our setup. For example, certain FOV sizes and positions or laser locations might produce a better balance between the different sources of information in the data.
The demonstrations in this work employed a sensor with 32 × 32 SPAD pixels, 390 ps timing resolution, 3.14% fill factor, and ∼17 kHz frame rate, limited by the USB 2.0 link [30]. A frame length of 10 µs and a gate-on period of 800 ns yielded a duty cycle of 8%. Particularly, the spatial and temporal resolution limit the precision of the estimated facet parameters, whereas the fill factor and frame rate limit the signal-to-noise ratio for a given acquisition time and, thus, the ability to track faster or farther objects. We expect the results reported in this manuscript will improve by orders of magnitude with new SPAD technology, as reviewed in [31,32], where some works have demonstrated up to 1 megapixel SPAD arrays [33], greater than 100 kHz frame rates [34], fill factors greater than 50% [34,35], and time resolution finer than 100 ps [34,36]. Figure 5: Single-frame reconstruction results for four different hidden objects, including a less reflective gray target, a non-planar mannequin, and a non-rectangular staircase. In all cases, our model allows us to accurately locate both the object and the stationary scene in the background.

Setup
Illumination is provided using a 120 mW master oscillator fiber amplifier picosecond laser (PicoQuant VisUV-532) at 532 nm operating wavelength. The laser has an ∼80 ps FWHM pulse width and is triggered by the SPAD with a repetition frequency of 50 MHz. The SPAD array consists of 32 × 32 pixels with a fill factor of 3.14%, with fully independent electronic circuitry, including a time-to-digital converter per pixel [30]. At the 532 nm laser wavelength and room temperature, the average photon detection probability is ∼30% and the average dark count rate is 100 Hz. The array has a 390 ps time resolution set by its internal clock rate of 160.3 MHz. Attached to the SPAD is a lens with focal length of 50 mm, which yields a 25 × 25 cm field of view when placed at around 1.20 m above the floor. We set each acquisition frame length to 10 µs, with a gate-on time of 800 ns, thus yielding an 8% duty cycle. During the 800 ns gate-on time of each frame, 40 pulses (800 ns * 50 MHz) illuminate the scene. The SPAD array has a theoretical frame rate of 100 kHz, set by the 10 µs readout per frame, but experimentally we observed just ∼17 kHz, which was mainly limited by the USB 2.0 connection to the computer.

Data acquisition
For our demonstrations, we set up a hidden room 2.2 m wide, 2.2 m deep and 3 m high, as shown in Figure 4A. Assuming the coordinate system origin is at the bottom of the occluding edge, the left wall is at x = −1.20 m, the right wall is at x = 1 m, the back wall is at y = 2.2 m, and the ceiling is at z = 3 m. The walls are made of white foam board and the ceiling is black cloth. The SPAD array is positioned on the side of the wall, looking down at the occluding edge origin, allowing half of the array to be occluded. The laser is positioned so that it shines close to the origin. To reject the strong ballistic contribution (first bounce) of light reflected from the origin, we punched a hole in the occluding wall and shined the laser through the hole. The true location of the laser spot on the floor is slightly off the origin, by 6 cm to the right side. The latter was found by cross-checking and minimizing the number of ballistic photons measured by the SPAD array. More recent SPAD arrays incorporate a fast hard gate to rapidly enable and disable the detector with few hundreds picoseconds width, which can be tuned to censor the ballistic photons [34,36].
Two test scenarios were analyzed. For the first, we used two rectangular white foam board facets of size 20 × 110 cm as our moving objects. For the second, we used four different targets: a white foam board facet (of size 20 × 110 cm), a gray foam board facet (white foam board painted with a gray diffuse spray paint), a fabric mannequin of size 30 × 80 cm, and a stair-like facet of size 75 × 75 cm. These objects were used to test our method on targets of different shape, height and albedo. All tests were conducted with the objects facing the occluding edge. Before moving objects enter the hidden room, a 30 s acquisition was collected to form an estimate of b, the response of the stationary scene. Then, new measurement frames were collected with moving objects fixed at discrete points along their trajectories during 0.4 s. In Appendix D, we demonstrate that these measurements can be acquired over a much shorter period of time with little effect on the reconstruction quality.

A Fast Forward Model Approximation
A.1 Transient light transport modeling In our system, a pulsed laser and SPAD array are both pointed at the floor adjacent to an occluding edge. Take p l to be the position of the laser spot, p f to be a point in the area of the nth camera pixel P n , and p s to be a point on the hidden surface S. The camera measurement rate integrated over the kth time bin at the nth spatial pixel is where v(p s , p f ) is the visibility function defined in the main document, a(p s ) is the surface albedo at point p s , w(·) is the pulsed waveform, ∆ t is the duration of a time bin, t 0 is the time the pulse hits the laser spot, and c is the speed of light. The Lambertian bidirectional reflectance distribution function (BRDF) factor G(·, ·, ·) is given by where n l , n s , and n f are the surface normal vectors at points p l , p s , and p f , respectively. When the camera pixel size is sufficiently small, (4) may be approximated by where ∆ P is the area of the camera pixel andp f,n is the center of pixel n. When the pulse duration is short relative to the time bin length, we may replace w(·) by a Dirac impulse function δ(·) scaled by a pulse intensity I to write A.2 Fast computation of the response from a vertical planar facet Our inversion algorithm is based on a planar facet model of the hidden scene. Thus, we are interested in quickly computing the response of a vertical planar facet resting on the floor, shown in grey in Figure 6a. The method we develop here achieves a speed gain factor of about 150 over a direct numerical integration with similar accuracy. We define a coordinate system (different than in the main document) so that the laser spot p l and pixel centerp f,n are placed a distance m apart along the y-axis equidistant from the origin, as marked in Figure 6a. For any d > m, points p l andp f,n form the foci of an ellipsoid that contains all points p s in the scene with round-trip travel distance d = p l − p s + p f,n − p s . The intersection of that ellipsoid with the planar facet is an ellipse segment, as shown in red in Figure 6a. Now consider the integration in (7). As t increases, the d value matching the shift of the Dirac grows and the corresponding ellipse intersection expands. Over the duration of the time bin, the integral represents the light returning from the dark grey annulus, between the red and blue segments, shown in Figure 6a. In [6], p l and p f are treated as being at the origin, so that the ellipsoid reduces to a sphere and the planar intersection reduces from an ellipse to a circle; the double integral in (7) is converted to polar coordinates, resulting in an approximate, closed-form solution to (7). In this work, since the spatial diversity of our sensor array is essential, we cannot use this approximation. Instead, we develop a quick computation of the rates due to a vertical facet under the more general elliptical model. Consider the coordinate system in Figure 6a with laser position p l and pixel centerp f,n . The ellipsoid corresponding to round-trip travel distance d may be written as where a e = c e = The planar facet is contained in a plane that may be described by normal vector n s and point q. The intersection of this plane and the ellipsoid in (8) is an ellipse that may be written in translational form [29] in a new [r, s, u] coordinate system: If q is chosen to be interior to the ellipsoid, the procedure in [29] may be used to find the new [r, s, u] coordinate system and provides formulae for A and B given n s , q, and the ellipsoid parameters (a e , b e , c e ). Figure 6b shows ellipses corresponding to   (7) sums the integrand over the dark grey annulus. This double integral can be formulated as integration in polar coordinates over a region in the rs-plane as shown in Figure 6c. For computational speed, we approximate this integral as where ∆ n,k arc is the annulus area. The pointp s is on the ellipse corresponding to the middle of the time bin (i.e., with parameters A mid and B mid ) at polar angle θ mid = (θ min + θ max )/2. Note that we have replaced a(p s ) with a under the assumption that the facet has uniform albedo. The area ∆ n,k arc is approximated as a fraction of a circular annulus: Ellipse radii at polar angle θ = θ mid : R start and R stop , marked in Figure 6c, may be computed using In practice, when a facet is wide, the extent of the annulus may not be well described by a single central pointp s . When the distance d annulus between the edges of the annulus, marked with yellow dots in Figure 6c, is greater than parameter d max , the annulus is divided into smaller annulus segments that each satisfy d annulus < d max . The full procedure for computing the facet response at each of the N camera pixels and K time bins is outlined in Algorithm 1. Figure 7 compares facet responses computed using Algorithm 1 (solid line) with those computed using a slower high-fidelity simulation tool based on numerical integration (dashed line with markers) for a person-sized facet at four different azimuthal rotation angles. Different colored curves denote rates computed at different array pixels. Note that in all four cases, our fast forward model computation closely matches the higher-fidelity simulator. In Figure 8, we show results for a shorter facet, similar in height to the child-sized mannequin we use in our experiments. The shorter facet also exhibits close match between Algorithm 1 and the higher-fidelity simulation. Figure 9 shows s exact − s fast 1 / s exact 1 at different points in space for a vertical facet facing the origin. Here, s fast is the vector of rates computed using Algorithm 1 and s exact is generated using the higher-fidelity simulation. Although the shorter facet (a) has more error (i.e., larger s exact − s fast 1 / s exact 1 ) than the taller facet (b), both have relatively small error. In our experimental demonstrations, we use a small scaled-down room with objects similar in size to the one tested in Figure 9a. Although our experimental results in Section D are achieved using a scaled down setup, Figure 9b suggests that the forward model computation in Algorithm 1 is even more accurate in a larger, more life-like setting.

A.3 Facet time bounds
Take t min and t max to be the smallest and largest arrival times due to the facet, corresponding to round-trip travel distances d min and d max . We seek d min and d max in order to compute rates due to the facet at all affected time bins. To simplify the geometry, we assume we are only interested in vertical rectangular facets resting on the floor. Under this assumption, the part of the facet with the shortest round-trip travel distance is somewhere along its bottom edge. The two bottom vertices of the planar facet v 1 and v 2 are contained in the xy-plane (i.e., the ground plane), as shown in Figure 10. When v 2 (1) = v 1 (1), the line that runs along the bottom of the facet, connecting the two lowest vertices (v 1 and v 2 ), is given by where The intersection of the ground plane with the ellipsoid in (8) is an ellipse ("Ellipse Ground") (not to be confused with the ellipse in the rs-plane of (9)), with foci at [0, m/2] and [0, −m/2]. The equation for this ellipse in the xy-plane is   find θ min and θ max by finding intersections of ellipse (A mid , B mid ) with facet edges 9: compute θ mid = 1 2 (θ min + θ max ) 10: compute R start and R stop using (12)  compute approximate facet response in kth time bin at ith pixel using (10) and (11) 12: next time bin starts with end of this one 13: end for each 14: end for each is met, Ellipse Ground (15) and the line in (13) intersect at a single point, rather than two, corresponding to round-trip travel distance d int . Substituting (16a) and (16b) into (17) and solving for d int yields We use d int in (16a) and (16b) to solve for A ground and B ground and find the corresponding single point of intersection by solving (13) and (15) for (x int , y int ): If the point (x int , y int ) is on the line segment between v 1 and v 2 , then the shortest round-trip travel time from p l to the planar facet and back to p f is d min = d int . If the point (x int , y int ) is not on the line segment between v 1 and v 2 , then the point on the line segment that is closest to the point (x int , y int ) is the point with the shortest round-trip travel time from p l to the facet and back to p f . This is one of the bottom facet vertices v 1 or v 2 : The longest round-trip travel time d max corresponds to the furthest of the top two vertices (v 3 and v 4 ): In the special case where v 2 (1) = v 1 (1), we recall that p l and p f are equally spaced from the origin along the y-axis. In this case, if sign(v 1 (2)) = sign(v 2 (2)), we know that the point on the line segment from v 1 to v 2 with the shortest round-trip travel time is at [v 2 (1), 0, v 2 (3)] and When sign(v 1 (2)) = sign(v 2 (2)), the shortest round-trip travel time is to the vertex v 1 or v 2 with the smallest magnitude y-coordinate: The indices of the first and last affected time bins are k min = floor(d min /(c∆ t )) and k max = floor(d max /(c∆ t )) respectively.

B Inversion Algorithm
In this work, we seek a reconstruction of change in the hidden scene from frame to frame. Until motion occurs in the hidden scene, measurements include light returning from hidden-and visible-side stationary scene content. When an object enters the foreground of the hidden scene, the new measurement changes to include added rates due to the foreground object as well as a rate reduction due to the object's occlusion of the background. At each new frame, we use these changes to reconstruct the moving object as well as the stationary background behind it. As an object traverses the hidden scene, these background segments accumulate to form a reconstruction of the stationary hidden scene.

B.1 Model and likelihood
When objects move into the hidden scene, the camera measurement at the nth spatial pixel and the kth time bin is Poisson distributed, x n,k ∼ Poisson(b n,k + s n,k fg (ψ fg ) − s n,k oc (ψ fg , ψ oc )), where b ∈ R N ×K is the rates due to stationary scenery, s fg ∈ R N ×K is the response of the objects, and s oc ∈ R N ×K is the response of the occluded background region, before the objects enter. We assume that we have observed the floor for long enough that b is approximately known. 1 Vectors ψ fg and ψ oc contain parameters that describe the foreground objects and corresponding occluded background regions, all of which are modeled as vertical, planar, rectangular facets that face the occluding edge. The foreground facets have parameters where occluded background region m has albedo a m oc and range r m oc . As described in Appendix B.4, the height of the occluded region depends upon its range r oc and on the range r fg and height h of the foreground facet. Given parameters ψ fg and ψ oc , the procedure outlined in Algorithm 1 may be used to quickly compute s fg and s oc .
Because all the Poisson random variables in (24) are independent, the likelihood of the measurement vector x given parameters ψ fg and ψ oc is the product f (x | ψ fg , ψ oc ) = N n=1 K k=1 b n,k + s n,k fg (ψ fg ) − s n,k oc (ψ fg , ψ oc ) x n,j exp − b n,k + s n,k fg (ψ fg ) − s n,k oc (ψ fg , ψ oc ) x n,k ! .

B.2 Estimation
Our inversion algorithm, outlined in Figure 11 and Figure 12, estimates the parameters ψ fg and ψ oc for each measurement frame where motion has occurred.
Parameter initialization. Figure 11 covers the parameter initialization procedure. Rates due to stationary scenery on visible and hidden sides of the occluding wall (red) are estimated from an initial reference measurement In the pre-processing stage, a difference frame (blue) y is computed by subtracting the estimated background rates (red) b from the new measurement frame (green) x t : y = x t − b. For each time index k, this difference frame is spatially integrated (across all spatial pixels in the SPAD array) to create the collapsed profile of photon counts versus range, y SI (orange): The light travel distances corresponding to the maximum and minimum values of y SI are halved and used to initialize parameters r fg and r oc . The time bin indices K fg of the foreground object are selected according to where p is the index of the minimum entry of y SI and β time < 1 is a tuning parameter used to set the threshold. For each pixel n, temporal integration (yellow) of y t is performed over the foreground time bin indices to form the equivalent of a passive An example of y passive for a single-object scenario, with a clear penumbra pattern, is pictured in the yellow box in Figure 11. This 'passive' measurement y passive is fed into the one-dimensional passive corner camera algorithm of [20] 2 to produce a onedimensional profile of the hidden scene s θ ∈ R Q as a function of azimuthal angle θ around the corner, where Q is the number of angular bins in our discrete representation of the 1D hidden scene. In all results presented here, Q = 64. An example of s θ is shown in the black box in Figure 11, where a sharp peak is visible at the object location around θ = π/2. The azimuthal resolving power of the vertical edge makes this representation of our data well suited for counting the number of objects M moving in our hidden scene [21]. We compare s θ to a threshold (β θ /Q) Q q=1 s q θ , where β θ is a tuning parameter. Each interval where s θ is above the threshold is considered to be a single object. The angles of the first and last threshold crossing for each object are used to initialize parameters θ min and θ max .
Parameter estimation. Figure 12 outlines the parameter estimation and post-processing steps. Parameter estimation is performed using the Metropolis-Hastings (MH) algorithm, a type of Markov chain Monte Carlo (MCMC) method, and the likelihood in (27). Note that although all parameters in ψ fg and ψ oc could be estimated simultaneously in a single run of the MH algorithm, we choose to separate the parameter estimation into two stages for speed. Foreground parameters ψ fg are estimated first, assuming there is no background occlusion. Next, background parameters ψ oc are estimated, keeping the foreground parameters ψ fg fixed. In this way, each stage only requires us to evaluate the forward model for a single facet per moving object, per iteration of the MH algorithm. When estimating foreground parameters, we only compute the response s n,k fg (ψ fg ) for each proposal; when estimating background parameters, s n,k fg ( ψ fg ) is fixed using already estimated foreground parameters ψ fg and we only evaluate s n,k oc ( ψ fg , ψ oc ) at each algorithm iteration. Separating parameter estimation into two problems is justified by the fact that foreground objects and occluded background regions generally affect very different swaths of time bins. Additionally, because foreground objects are generally closer, their measured responses are much larger. Thus, foreground parameters ψ fg may be accurately estimated without incorporating background occlusion into the model. Although inter-frame parameter priors could be incorporated into our algorithmic framework, we demonstrate good performance with simple uniform priors on the parameters in ψ fg and ψ oc , choosing wide bounds to exclude extreme and unrealistic scenarios.
In the foreground parameter estimation stage, we use the MH algorithm to draw samples from the posterior distribution: where g fg (ψ fg ) is the prior on ψ fg , with S fg defined to be the set of all possible values of ψ fg that arises from uniform priors on each of the parameters in ψ fg . The likelihood f fg (x | ψ fg ) of measurement x given parameter ψ fg may be approximated as where the effects of background occlusion have been excluded from the model for speed. Candidate parameter values ψ fg are drawn according to ψ fg ∼ N (ψ t fg , Σ fg ), where ψ t fg is a vector containing the current state of each unknown parameter and Σ fg is a diagonal matrix with an entry corresponding to each parameter's proposal variance. This matrix is scaled every 100 iterations to achieve an acceptance rate near 23% [37].
Following the MH algorithm, proposal ψ fg is accepted with probability ω: where (a) arises from the fact that for ψ t fg to have been accepted, g fg (ψ t oc ) > 0, so that For each proposal ψ oc , we evaluate (33) with where to prevent a computing overflow, we compute log f fg (x n,k | ψ fg ) as where Γ(·) is the Gamma function. An approximate constrained maximum likelihood (ML) estimate ψ fg is formed by drawing samples from f fg (ψ fg | x) using the procedure outlined above, binning samples of each parameter in ψ fg into histograms, and taking the center of the most commonly occurring bin as the estimate of that parameter. In the background estimation step, we fix the foreground parameter estimate ψ fg and estimate the parameters ψ oc that describe the occluded regions behind them. The posterior distribution of ψ oc given measurement x is where g bg (ψ fg ) is the prior on ψ oc : and S bg is the set of possible parameter values that arises from uniform priors on the parameters in ψ oc and from a constraint that that any parameter ψ oc must yield positive rates: b n,k + s n,k fg ( ψ fg ) − s n,k oc ( ψ fg , ψ oc ) > 0. With the estimated foreground rates ψ fg fixed, the likelihood f bg (x | ψ oc ), including the effects of occlusion within the hidden scene, is approximately Using a procedure similar to the foreground parameter estimation step, we draw samples from the posterior distribution in (39) using the MH algorithm and form an approximate constrained ML estimate ψ bg .
Laser power correction. In our experiments, we observed the laser power to fluctuate over the duration of our acquisitions. Before processing the data, we estimate a multiplicative scaling factor to adjust for any laser power fluctuation that occurred between the reference measurement and the motion frame. Take b to be the estimated background rates, before correcting for variable laser power. The motion frame measurement at time t is given by x t . Assuming that reference and motion frames should have the same rates at close range (because only more distant parts of the scene are changing), we compute the scale factor κ using all N camera pixels summed over the first 10 time bins: The corrected background measurement, used in all previous parts of this section, becomes b = κb .

B.3 Combining estimates from subsequent frames
As frames accumulate, the sequence of estimates ψ fg and ψ bg may be processed together to form a reconstruction of the hidden scenery behind the moving objects. This idea is illustrated in the POST PROCESSING box in Figure 12. Here, the estimated occluded regions for three subsequent frames are shown in blue. In our post-processing step, we connect the vertical lines running through the center of these facets to form the combined reconstruction shown in red. The height of each red facet is the same as the blue facet from the previous frame. Combined reconstruction results shown in Section D were produced in this way.

B.4 Computing the vertices of occluded regions
An occluded background region is described by range r oc measured from the occluding edge at angle θ mid = (θ max + θ min )/2. We describe the occluded background region as a subset of the vertical plane facing the occluding edge at range r oc , determined by the position and size of the foreground object as well as the location of the laser spot p l . 3 We seek the vertices of the occluded background region so that its response s oc may be computed using Algorithm 1. Although in certain geometries the occluded background region may not be exactly rectangular, we make a rectangular approximation for speed. Assume we have chosen our coordinate system so that laser spot p l is at the origin and the ground is in the xy-plane. The vertical plane that contains the occluded background region may be written in terms of a surface normal vector n plane = [n x plane , n y plane , 0] and a point on the plane surface p plane = [p x plane , p y plane , p z plane ] : 4 A vertex of the background facet v bg may be written in spherical coordinates: where α is an angle measured down from the positive z-axis, δ is an angle measured from the positive x-axis towards the positive y-axis, and r is the range from the origin. Because the occluded region is projected from the foreground facet onto the back plane, α and δ may be computed using the corresponding vertex v fg of the foreground facet: The unknown range r of the background vertex v bg is found by substituting the coordinate expressions of (44) into (43) and solving for r: r = n plane , p plane n x plane cos α sin δ + n y plane sin α sin δ .

C Acquisition Time Analysis
In this work, we use integration time to refer to the total time over which the camera collects meaningful data. We use acquisition time to refer to the total time it takes the camera to collect, accumulate, and transfer data. Results have been reported in terms of integration time because it reflects the true capability of our proposed imaging system and not certain limitations of our hardware. Newer SPAD arrays are capable of achieving acquisition time roughly equal to the integration time, exploiting multigates approaches, such as the one implemented in the 32 × 32 pixel array presented in [38].
For the older camera [30] used in our experiments, the USB 2.0 standard is used to transfer data from the camera to our lab computer, resulting in the loss of 83% of the measured frames; with the newer USB 3.0 standard, no measured frames would be lost. The limited gating capabilities of our older camera only allow us to observe 800 ns of each frame of duration 10 µs, resulting in the loss of 92% of the illumination periods. Combined, these two acquisition and data-transfer limitations cause our acquisition times to be about 1 (0.08)(0.17) ≈ 75 times longer than they would be on a newer camera. Although we demonstrate integration times short enough to track objects in motion, the required integration time could be further reduced by increasing the laser power or the SPADs' fill factor and photon detection probability, for instance exploiting 3D stacking technologies or microlens arrays. In a representative measurement in our system, after removing 39 hot pixels, we found that the remaining pixels had detections in approximately 0.0025% of illumination periods (see Table 1 in Appendix C). Thus, we could increase our system detection rates (through some combination of increases of laser power, fill factors, and detection probability) by a factor of ∼ 2000 (decreasing our required integration time by the same factor) before reaching the 5% threshold at which dead time distortions are conventionally regarded to become significant [39]. By this calculation, the results in the paper produced using an integration time of 0.4 s could be achieved with 200 µs integration time, enabling tracking at 5000 updates per second. Furthermore, methods to mitigate dead time effects would facilitate interpretation of data at higher count rates [40][41][42]. Note, however, that we are not considering the computational demands of very high-speed tracking.

D Additional Experimental Results
Our inversion algorithm has been tested in a variety of conditions, including different hidden scenes, lighting conditions, and frame lengths. In Section D.1, we demonstrate our reconstruction algorithm on a hidden scene containing a single, moving planar object and explore the effects of frame length and reference measurement integration time. In Section D.2, we show more detailed reconstruction results for the scene containing two moving planar objects that was presented in the main document. In Section D.3, we demonstrate that our algorithm continues to work well in challenging conditions. We show results for scenes with non-planar moving objects, more complicated stationary hidden scenery, and with extreme amounts of ambient light.

D.1 Single-object demonstrations
In Figure 13 plotted on top of each other in red. The SPAD FOV is shaded light gray, the thick dark line on the floor marks the footprint of the occluding wall, and the dotted arcs mark points on the floor that are 0.5 m, 1 m, 1.5 m, and 2 m from the occluding edge. The third figure row shows sample measured histograms at three different pixels in the camera FOV. Note that in the measured histograms, the peak at about 1 m corresponds to the first bounce. Due to the location of the laser spot, the first bounce is occluded from view for some SPAD pixels and is thus most pronounced in the histograms for other pixels (e.g., in Pixel 1024 of Figure 13). The much smaller peak near 3.5 m in range is due to the foreground object, and the counts observed between 3.5 m and 8 m are due to the back walls and ceiling. For all three frame lengths, the foreground facets are correctly placed on a 1.25 m arc around the occluding edge. Foreground facet height estimates are also very close to the true height of 1.1 m. This is easily observed in first figure row, where the tops of reconstructed foreground facets clearly align with each other with very little variability. For all three frame lengths, we observe that the background accuracy is greatest at azimuthal angles closer to the hidden side, with increasing error deeper in angle into the hidden scene. As expected, scenery deeper into the hidden scene illuminates fewer camera pixels and thus results in lower measurement SNR. We note that although foreground estimates are quite accurate for all three integration times, error in the background reconstruction increases as the integration time decreases.
In Figure 14, we use the same single-object scenario as in Figure 13 to explore the effect of reference measurement integration time on reconstruction accuracy. Here, each new frame has an integration time of 0.8 s while testing reference measurement integration times of 8.1 s, 3.3 s, and 1.6 s seconds. Figure columns show results for different integration times; figure rows show two different reconstruction views. As in Figure 13, foreground facets are correctly placed in range at 1.25 m. However, when stationary scene integration time is decreased to 1.6 s, we observe more error in foreground facet height estimates. Similarly, background reconstructions for 8.1 s and 3.3 s are comparable to the results in Figure 13, where an integration time of 33 s was used for the reference measurement. Increased error is seen in the background reconstruction when background integration time is reduced to 1.6 s. Although some of the results in this paper use the longest available background integration times, the results in Figure 14 suggest that there is little cost to using background integration times as short as 3.3 s.
The results in Figures 13 and 14 were for a single object fixed in range, and moving in angle around the occluding edge. In Figure 15, we fix a object in angle at π/2 and move it in range away from the occluding edge to demonstrate our algorithm on a variety of object positions within the hidden scene. Each figure column corresponds to a different object range (labeled above). The first row shows LOS photographs of the hidden scene in each case, and the second row shows our corresponding reconstructions. Because the object is not moving in angle, we do not attempt to process multiple frames into a combined background reconstruction. Instead, we plot the single-frame reconstruction results for each object position. Reconstruction results for the first five object positions closely match the ground truth. When the object reaches 1.75 m in range from from the edge, it is extremely close to the back wall and the reconstruction quality starts to degrade slightly, although the ranges of both foreground and background reconstructions are still very accurate. Because parts of the planar facet at 1.75 m share round-trip travel times with the back wall, separately estimating foreground and background parameters is less justified and there is less total change in the measurement due to the object.

D.2 Two-object demonstration
In Figures 16 and 17, we show more detailed reconstruction results for the two-object scenario in the main document. Columns of Figure 16 correspond to seven measurement frames as the objects move towards and then past each other. The object that starts on the left in Frame 1 is fixed at a range of 1 m; the object that starts on the right side is slightly further away at 1.25 m. LOS ground truth photographs are shown in the first row, with single-frame reconstruction results shown in the second row. For increased legibility, the z-axis in the reconstructed frames is cropped at 2.5 m, although the true ceiling height is 3 m. In all but Frame 6, two objects are correctly resolved and placed with very little error in range. There is some variability in object height estimates, most notably in Frames 4 and 5. In Frame 6, when the left object begins to cross in front of the right object, our algorithm resolves a single object. In all result frames, the background estimates closely match the ground truth. In Figure 17, we show two views of the combined background reconstruction, formed by accumulating a total of 13 background estimates (two each from Frames 1-5 and 7, and one from Frame 6). The estimate (blue) is extremely close to the measured ground truth (green).

D.3 Robustness demonstrations
In Figure 18, we explore the effects of model mismatch on our inversion algorithm. Each figure column corresponds to a different hidden object type while rows show ground truth LOS photographs and single-frame reconstruction results. The white and gray facet objects fit the rectangular, planar facet model, although the gray facet reflects fewer photons resulting in lower SNR. The mannequin is not a facet at all, and it is meant to test the common scenario where the hidden object is a person moving through the hidden scene. The stairs object is planar, but it is also wide and not rectangular, similar perhaps to a piece of furniture that has been relocated within the hidden scene or a moving car. The white facet reconstruction is the most accurate of the four examples. We note that the gray facet reconstruction is slightly wider than the ground truth (and the white facet reconstruction), with error likely due to the SNR reduction. In all four cases, the hidden object was fit by a rectangular planar facet placed correctly in range, and the range of the occluded background region is accurately recovered. We note that although our model does not allow us to reconstruct the varying height profile of the stairs, we correctly reconstruct it to be wider and more to the right than the other hidden objects. These results indicate that our rectangular, planar-facet model does not prevent our inversion algorithm from giving useful results under significant model mismatch. Instead, we expect to correctly locate and roughly describe (in width and height) a variety of hidden objects while also reconstructing the hidden scene behind them. Our reconstruction approach does not require any knowledge of the stationary hidden scene, as long as its response can be well characterized by a reference measurement. As a result, our inversion algorithm is not constrained to work only on simple hidden scenes containing just a few walls. In Figure 19, we demonstrate our inversion algorithm on a stationary hidden scene that contains a large stationary object in the foreground. In the first figure row, we show LOS photographs of the hidden scene. The first photograph, taken of the stationary scene before the moving object enters, shows a large white foreground object in addition to side and back walls. After the object enters the scene, it remains at an azimuthal angle of π/2 around the corner and moves to 1 m, 1.25 m, and 1.5 m ranges. At these positions, the moving object shares some round-trip travel times with the large stationary facet. The second figure row shows single-frame reconstruction results for the three moving object positions. In all cases, an accurate reconstruction of both the moving object and the occluded background region is formed.
In Figure 20, we evaluate our algorithm with high background counts created by turning the light on in the lab. Figure 20a Figure 18: A demonstration of algorithm robustness to different object types at a distance of 1.25 m from the occluding edge. The four columns correspond to different object types: a white facet, a dark gray facet, a mannequin, and a stair-shaped facet. The first row shows LOS photographs of the hidden scene; the second row shows our reconstructions. The true location of the back wall is shown in green with foreground and background reconstructions shown in red. Although the first two objects perfectly match the rectangular, planar-facet model, the dark gray facet returns far fewer photons, and the mannequin and stairs are not rectangular facets at all. In all four cases, the foreground object is well fit by our facet model and the occluded background region is placed correctly in range. Each measurement frame had a 0.4 s integration period, with the reference measurement integrated over 33 s.
ground truth stationary scene 1 meter 1.25 meter 1.5 meter reconstruction Figure 19: Reconstruction results for a challenging hidden scene with a large stationary foreground object. The first row shows LOS photographs of the hidden scene; the second row shows the corresponding reconstructions. The true location of the back wall is shown in green with foreground and background reconstructions shown in red. The first column shows a photograph of the stationary scene, acquired before the moving object enters the hidden scene. Subsequent columns correspond to three different positions of the moving object, which is fixed in azimuthal angle at π/2 (measured around the edge and into the hidden scene) and moved in range to 1 m, 1.25 m, and 1.5 m from the occluding edge. Note that even though the stationary foreground object and the moving object occupy similar range bins, reconstructions closely match the ground truth in all test cases. These results were produced using a 0.4 s integration time per frame with an 24.5 s reference measurement.
shows a photograph of the entire lab with the overhead lights on, Figure 20b shows just the hidden scene, and Figure 20c shows the hidden scene once the moving object has entered. Spatially averaged measurements for the reference and motion frames are shown in Figure 20d, with their difference shown in Figure 20e. Although the reference and motion frames and essentially indistinguishable with the naked eye, their difference, while noisy, has a discernible peak at the range of the moving target and a dip at the range of the occluded background wall; compare with Fig. 4C of the main paper, in which the difference is much less noisy. Single-frame reconstruction results are shown in Figure 20f. Despite the unmodeled effects of detector dead time and the large noise variance due to high background counts, the reconstructed object and background closely match the ground truth. Table 1 provides data from representative two-minute experiments with and without overhead lights. The number of integrated frames is not determined entirely by the acquisition time because of the random loss of frames over the USB 2.0 interface. Of the 1024 SPAD pixels, 39 are designated hot. After removal of these pixels, the remaining 985 pixels have about 100 times more photon detections with overhead lights on. The table outlines the computations of counts per pixel per illumination pulse under each of the two conditions. Since dead time effects are negligible, we may approximate the signal-to-ambient ratio for the results in Figure 20 as  , and a photo of the hidden scene after an object (at a range of 1.25 m and azimuthal angle of π/2) has entered the hidden scene is shown in (c). Spatially averaged measurements for the stationary scene (red) and one motion frame (blue) are shown in (d), with their difference shown in (e). Despite the high background counts, the reconstruction in (f) closely matches the ground truth. The results were produced using an integration time of 3.3 s with a reference frame integrated over 24.5 s.