Phasor field diffraction based reconstruction for fast non-line-of-sight imaging systems

Non-line-of-sight (NLOS) imaging recovers hidden objects from diffusely reflected indirect light captured with transient illumination devices in combination with a computational inverse method. While capture systems capable of collecting light from the entire NLOS relay surface can be much more light efficient than single-pixel point-scanning detection, current reconstruction algorithms for such systems have computational and memory requirements that prevent real-time NLOS imaging. Existing real-time demonstrations also use retroreflective targets and reconstruct at resolutions far below the hardware limits. The method presented here enables the reconstruction of room-sized scenes from non-confocal, parallel multi-pixel measurements in seconds with lower memory usage. We anticipate that our method will enable real-time NLOS imaging when used with emerging single-photon avalanche diode array detectors, with resolution limited only by the temporal resolution of the sensor.

One of the primary arguments made throughout the manuscript is that the proposed method has much lower memory requirements than other real-time NLOS algorithms. This is false, as the required 3D Fast Fourier Transforms in those methods can be applied sequentially along the three dimensions x, y, t to reduce their memory footprint by three orders of magnitude at the cost of increased compute times, albeit with the same computational complexity of O(N^3 log N).
The proposed method is an approximation: Fourier-based free-space propagation followed by a simple max filter. Other methods, such as LCT and FK, are exact solutions to the inverse problem. Therefore, I don't see why the proposed method could be better for confocally scanned data (some of the results seem questionable, see comments below).
A discussion of available photon counts is missing. It is argued that real time capture is important and that SPAD arrays could achieve that. Yet it is neither shown nor likely as these arrays have significantly worse fill factors and photon detection probabilities. The photon counts of the proposed non-confocal system should actually be similar or worse than confocally scanned data.
Finally, there seems to be some pre-processing required for Phasor Fields and this should be included in the compute time calculations / performance measurements.
Examples of claims that are not demonstrated:
-The abstract motivates real-time capture with SPAD arrays, yet there is no discussion of actually capturing data with a SPAD array. Was any of the data captured in real time? Was a SPAD array used? It seems unlikely, because available SPAD arrays have neither the time resolution nor the photon detection probability (i.e., "sensitivity") to actually capture a sufficient number of third-bounce photons of diffuse room-sized scenes as claimed in the paper. No discussion of SPAD array hardware or optical setup used to capture any of the presented data is included, and all of the data shown was captured, in previous work, with a single scanned SPAD.
-"SPADs can potentially be manufactured at low cost and in large arrays enabling fast parallel NLOS capture." -this statement (among others) indicates that SPAD arrays, and thus real-time capture, were not actually implemented, but that seemed like a major claim in the abstract and early parts of the introduction
-there is no quantitative evaluation of any of the results; it is claimed that the proposed method is "better" and "less blurry" than other methods, but that is not quantified anywhere

Examples of false or misleading statements:
-"current reconstruction algorithms have computational and memory requirements that prevent real-time application on a desktop or embedded computer" (abstract) -as discussed below, LCT and FK demonstrated NLOS imaging in real time already
-"The fastest existing algorithms also require a point-scanning or confocal acquisition of the data making capture hardware inherently noisy, slow, and complex." -single SPADs have significantly better noise performance and fill factors than SPAD arrays; the complexity of scanning confocal systems is the same as commercial LiDAR systems
-"Existing real-time demonstrations, therefore, use retroreflecting targets and reconstruct at resolutions far below the hardware limits." -this is false, as O'Toole et al. 2018 demonstrated that their LCT algorithm achieves exactly the theoretical resolution limit
-"Running on a desktop computer, our method recovers a room-sized volume in seconds and has a computational complexity equal to the fastest existing algorithms and less memory usage." -this is not new and has been demonstrated previously
-"An algorithm suitable for fast NLOS imaging must fulfill three separate requirements: The ability to use data that can be captured in real time, a computational complexity allowing for execution in a fraction of a second..."
-the proposed work does not show real-time capture and relies only on non-real-time data captured in previous work; real-time algorithms have been demonstrated in the past
-"Real time reconstruction of low resolution retro-reflective scenes has also been demonstrated in a confocal scanning scenario with both LCT and FK Migration methods. However, reconstruction of higher resolution scenes with diffuse surfaces is hindered by the large memory requirements and the slow confocal capture process requiring sequential point scanning capture with a single SPAD pixel." -these previous methods require a 3D FFT; this can simply be implemented by successively Fourier-transforming each of the dimensions x, y, t to reduce the memory footprint by three orders of magnitude at the cost of slightly increased computing times (although the same order of compute time, O(N^3 log N))
-"This new method performs at speed similar to LCT and FK Migration when used on confocal data, while requiring significantly less memory" -false, see comment above; it is also not clear why non-confocal measurements would be better than confocal measurements
-The results in Figs. 11, 12 are questionable. The reconstruction quality of LCT and FK seems a lot worse than what was presented for that same data in the FK paper.
Other issues:
-non-confocal measurements are discussed as being significantly better than confocal measurements, but it is largely unclear why, because the scanning and system calibration complexity seems a lot higher for non-confocal measurements.

Reviewer #2 (Remarks to the Author):
This submission presents an exciting and timely contribution to the field of non-line-of-sight image reconstruction algorithms. Because of its convincing results, accessible technical description, and the promised dataset/source code, this work will have high impact and inspire other work with eventual applications across domains.
The authors present a memory- and compute-efficient approach based on Rayleigh-Sommerfeld Diffraction for non-confocal non-line-of-sight measurements. While the best results of previous confocal approaches are obtained by arguably "cheating" with retroreflective materials [27], that is, by engineering the BRDF of the unknown hidden scene to substantially increase the SNR of the confocal measurement setup, this work considers generic scenes with a non-confocal measurement arrangement. The presented reconstruction algorithm is well motivated and explained in detail, even providing pseudo-code of the method itself. Short of releasing the measurements and algorithm source code, the authors could hardly have provided more at this stage. The method is well evaluated against recent reconstruction methods, with only a few open points (listed below), and provides both a solid theoretical framework for further algorithm development in this space, as well as a solid baseline for additional non-line-of-sight methods.
This work should be published in a timely manner, especially to keep pace with this fast-moving field. I would like the authors to consider the following requests in the final manuscript: 1) Non-confocal acquisition: The differences between the non-confocal and confocal measurements should be highlighted early in the manuscript. I would suggest a separate paragraph/section for this comparison. Confocal measurements discard a substantial amount of measurements, and this should be explicitly highlighted. I also don't quite buy the claim of prior work [27,28] that confocal detection is substantially easier to implement, as gating is absolutely necessary to not suffer from massive pileup due to the direct reflection. It would significantly add to the paper if the authors briefly described and analyzed the additional transient light transport information and briefly compared their setup to these prior works.
2) Memory requirements and runtime: The claims regarding memory and runtime are validated on a broad set of reconstructions. The only open item here is the comparison against matrix-free backprojection and linear inverse solvers. While these assume isotropic reflectance only, it may be better to report compute and memory complexity here instead of absolute runtime. Note that a single third-bounce transport at comparable resolution can be executed in under a second using plain CUDA on high-end GPU hardware. In the light of recent raytracing hardware acceleration, matrix-free solvers may also offer very high runtime performance, as speculated by the authors for a GPU version of their algorithm. However, as both points are a bit speculative (without actually having done the GPU implementation), I would feel more comfortable softening this claim a bit in the final manuscript.
3) Comparisons: It would have been nice for reviewers to try and compare the reconstructions against other baselines, such as an isotropic solver with a quadratic program. Unfortunately, the measurements and reconstruction source code were not provided. That said, I applaud the authors for giving a very detailed description of their method in the supplemental material. To make the results fully reproducible, and to facilitate future work building upon the ideas presented here, I'd encourage the authors to release their measurements and source code along with the publication of this manuscript. 4) Algorithm comparison overview: To parse the algorithm results a bit better, it would be great if the authors could provide a table comparing recent algorithms both in terms of compute and memory complexity. The Matlab runtimes are a bit of a strawman, as these depend on how Matlab schedules compute, organizes memory (not well), and implements array ops.
Reviewer #3 (Remarks to the Author): The authors present a detailed account of their recent work aimed at further developing the "virtual phasor field" approach to non-line-of-sight imaging. The phasor field approach is not itself new and has been reported before in several papers, co-authored by the same authors of the present manuscript. The main novelty presented here is the adaptation of this technique with the goal of improving reconstruction quality whilst reducing the required computational effort and resources, in particular memory. I also appreciated the clear exposition of the numerical technique and underlying maths, although I do have some comments about this, as detailed below.
Given that the phasor approach to NLOS is not itself new, there is a somewhat weaker case for publishing in Nature Commun. However, I believe that the quality of the results together with the improvements and developments proposed here, will make this an important reference point for researchers not only in this field but potentially also in related fields, e.g. imaging in scattering media where similar approaches could find applications. I would therefore suggest acceptance after some suggested changes.
Comments: 1) page 2: "To the best of our knowledge none of them have been applied to larger and more complex scenes" -I am not sure I agree with this comment. Ref. [28] in the manuscript for example, presents 3D reconstruction based on the f-K transform (i.e. different from the phasor approach) of a large, complex scene with very high quality.
2) Presentation of the virtual phasor field approach: the authors carefully present their equations together with text explaining the steps. However, I feel there are still some points that could be further clarified. Looking at Eqs. 5 and 6, it seems that the phasor fields are nothing other than the frequency components of the light pulse (either outgoing or the return signal), frequency shifted away from zero frequency by an arbitrary quantity Omega_c.
-How is this Omega_c chosen, and why not just take Omega_c=0, i.e. use the actual FT of the laser pulse envelope?
-The authors say, after Eq. 14, that they typically deal with "dozens" of Omega values. It is not clear where this number comes from. Given that the Omega spectrum is the FT of the direct-space pulse envelope, there could/should potentially be many thousands of Omega points in the spectrum, depending only on the extent of the time scale over which the temporal measurements are recorded.
-It might be useful to provide a graph showing an actual example of a typical pulse shape used for Eq. 5, how this transforms to Eq. 6, and then how Eq. 14 is applied (i.e. how the Omega spectrum is discretised before summing elements).
-I am not sure what to make of Figure 6. Maybe this was indeed an attempt to illustrate the various steps, including those mentioned in my comment. But I could not follow the logic or extract any useful information. Maybe adding the suggested graphs, thus making this a bit more quantitative, will help.
3) All of the figures lack any form of axis labels or indication of length scales, making it very hard to appreciate the results. Ideally, an additional picture of the scene should also be included, as there is no ground truth image to compare with and thus judge the actual complexity of the scene and the quality of the retrieval.

4) Acquisition times are indicated throughout the paper (also in figures and tables) in ms. But it is not clear whether this is the acquisition time for each pixel or acquisition point, or the total acquisition time. In the former case, total acquisition times should also be indicated. Also, the total size of the camera aperture on the observation screen (relay wall) and the number of scan points should be clearly indicated, for example in figure 1 or maybe in a dedicated figure showing the details of the experiment layout (and maybe also at the beginning of the Results section where the setup is described, e.g. where the detector and laser features are described).

5) What is the laser wavelength and illumination power required? Clearly, this will affect SNR, so acquisition times alone do not provide a valid indication of the speed of a system that one may attempt to build in their own lab.

**********************************
Reviewer 1:
**********************************

The recent Nature paper on phasor fields was a valuable contribution and very interesting. Phasor fields turn time-resolved NLOS measurements into a virtual wave and then propagate that wave to the hidden scene using a free-space propagation operator, such as the Rayleigh-Sommerfeld Diffraction Integral (Eqs. 1-4). These concepts were thoroughly explored in the recent Nature paper. Unfortunately, this manuscript adds little new. To be precise, the primary contribution of this manuscript is to implement the free-space propagation operator in the Fourier domain (Eq. 8).
This is common practice in computational optics and outlined in standard textbooks, such as Goodman's Fourier Optics book, and also taught in introductory Fourier optics classes.

Authors:
We thank the reviewer for their thorough review and many helpful suggestions that have helped improve our manuscript. R1 raises many concerns regarding statements made in our paper, which we address below. First, we would like to mention that our main contribution is to provide a fast, non-approximative reconstruction method for a general non-confocal measurement. This has never been shown before, and this core contribution does not appear to be contested by R1. We added a section (line 627 -693) to the manuscript to better explain the inherent advantages of non-confocal NLOS capture.
We believe there is a misunderstanding of some important details of our manuscript. R1 states that the convolutional solution presented in Eq. 8 is approximate, was introduced in the recent Nature paper (Ref. [29]), and is described in Goodman's book on Fourier optics. R1 may be thinking of the Fresnel approximation, which is indeed an approximate convolutional solution to the diffraction problem with all the asserted properties: it is used in the recent Nature paper and covered in Goodman's book. What we derive here, however, is a convolutional solution to the RSD that is not an approximation. The reviewer is correct that Goodman's Fourier Optics provides many fundamental insights into diffraction imaging, but the convolutional solution to the RSD integral does not appear to be part of it and is instead introduced as a novel contribution in recent publications. As shown in our Nature paper, the Fresnel approximation has large phase artifacts and does not provide a good reconstruction even for very simple scenes; it does not work as part of the approach presented here. We are happy to add more data to show this if requested. To emphasize again: what we present here is an exact propagation operator directly solving the RSD, not an approximation. We apologize for this misunderstanding and have modified the text to clarify.
What is more concerning is that the authors make contradictory and false statements throughout the manuscript, possibly trying to mislead the reader and misrepresent their contributions; they also do not actually demonstrate many of their claims or quantitatively evaluate them. Therefore, I cannot support this submission.

Authors:
We apologize for the misconceptions arising from our first manuscript. Diffraction, FK Migration, and the spherical inverse Radon transform (LCT) are very mature fields with large bodies of research that are often unfamiliar to NLOS researchers. It is therefore difficult to summarize the important information from the related work concisely. We have tried our best to explain the matter better in the paper and address the concerns of R1 below.
One of the primary arguments made throughout the manuscript is that the proposed method has much lower memory requirements than other real-time NLOS algorithms. This is false, as the required 3D Fast Fourier Transforms in those methods can be applied sequentially along the three dimensions x, y, t to reduce their memory footprint by three orders of magnitude at the cost of increased compute times, albeit with the same computational complexity of O(N^3 log N).

Authors:
We appreciate the reviewer's concern about memory complexity. The key point is that both LCT and FK require oversampling and interpolation of the data, which is the actual bottleneck for memory, speed, and accuracy. The FFT itself does not contribute significantly to the memory requirements, except that it requires the entire input data and the reconstructed sparse 3D volume to be stored. This alone is small compared to the re-sampling and oversampling requirements, but it is already more than what our algorithm needs.
The interpolation issue is stated directly in the FK Migration literature; see for example page 151 of (Numerical methods of exploration seismology: with algorithms in MATLAB®. Cambridge University Press, 2019.): a naive interpolation algorithm would require computing nearest neighbors, which would have O(N^6) computational complexity, or would require extra memory to store a map of nearest neighbors computed with a more efficient algorithm. More efficient methods may be possible, but require further research. We do not consider it likely that they can reduce memory needs to something comparable to our approach.
Even if we assume that re-sampling and 3D FFT can be done with no memory requirements, the methods would still require significantly more memory than what we propose due to the need to build time domain histograms and store the complete sparse 3D volumetric result of the FFT.
We will explain this in more detail in the responses below and quote some relevant statements from the associated literature. We also added a section (line 303 -368 ) to the paper to clarify memory requirements further.
"The memory complexity of our algorithm is defined by the need to store the FDH and the resulting 2D image. … ..." (line 303 -368)

In the existing published algorithms, the 3D Fourier transforms do not dominate the resource needs in either compute time or memory. We provide our code, along with the FK Migration and LCT codes, for the reviewers to test with this revision. To illustrate the computational and memory needs, we additionally provide a MATLAB script below that can be executed without any real data. The script is an example implementation of the f-k migration method, the same method used in Ref. [28], and is intended to illustrate the actual memory use and run time of Stolt's method. You can run the code by simply copying the text below; it will output the actual run time for each of the three steps in the f-k method. The memory profile is quite heavy, reaching almost 25 GB during the interpolation step.
Program output of the actual execution of each step (performed on the same PC as our submitted results): the raw dataset is captured at 512 by 512 scanning points across a 2 by 2 meter relay wall. The LCT and FK re-sampling steps need extensive memory to satisfy the mathematical formulation, so the data is spatially downsampled from 512×512 to 128×128.
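As a complementary back-of-the-envelope sketch (ours, with assumed array sizes, oversampling factor, and sample widths; these are not the measured numbers from the MATLAB script), the peak memory of the Stolt interpolation step can be estimated from the arrays it must hold simultaneously:

```python
# Rough peak-memory estimate for the Stolt (f-k) interpolation step.
# The 2x temporal oversampling, 2048 time bins, 8-byte complex samples,
# and 4-byte interpolation indices are illustrative assumptions.
def stolt_interp_bytes(nx, ny, nt, oversample=2, itemsize=8):
    n_t = nt * oversample
    spectrum = nx * ny * n_t * itemsize   # full 3D spectrum held at once
    remapped = nx * ny * n_t * itemsize   # resampled f -> kz copy
    index_map = nx * ny * n_t * 4         # per-sample interpolation indices
    return spectrum + remapped + index_map

full = stolt_interp_bytes(512, 512, 2048) / 2**30   # 20.0 GiB at full resolution
down = stolt_interp_bytes(128, 128, 2048) / 2**30   # 1.25 GiB after downsampling
```

Even this conservative accounting (it ignores the time-domain histogram itself and any temporary buffers) lands in the tens of gigabytes at 512×512, which is why downsampling to 128×128 becomes necessary.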
The proposed method is an approximation: Fourier-based free-space propagation followed by a simple max filter. Other methods, such as LCT and FK, are exact solutions to the inverse problem. Therefore, I don't see why the proposed method could be better for confocally scanned data (some of the results seem questionable, see comments below).

Authors: We would like to point out again that our main contribution is to provide a fast solution for non-confocal detection scenarios where LCT and FK are no longer valid. The Rayleigh-Sommerfeld Diffraction (RSD) phasor field model is not an approximation for the time-resolved non-line-of-sight application. The only approximation made is the omission of the obliquity factor, which accounts for the angle-dependent reflectance of the involved surfaces ("Lambertian shading") and is also ignored in LCT and FK Migration.
Besides the need to approximate missing Fourier coefficients by interpolation, both FK and LCT (Ref. [27,28]) are exact only for infinite-size relay walls, infinite temporal bandwidth, and infinite spatial resolution on the relay wall. For a finite-size relay wall and finite time sampling, windowed Fourier transforms would have to be used, and their kernel may look quite similar to the RSD kernel that our method derives as the result of the virtual illumination pulse.
FK and LCT also do not account for noise in the signal. Our convolution kernel acts as a band pass filter only passing frequency components that actually contribute to the reconstruction. LCT and FK pass both high frequencies that are beyond the capabilities of the capture hardware and low frequencies that are not useful for the reconstruction beyond contributing noise. As a result we find that our method is more robust to noise in the photon counts and noise in the positions on the relay wall.
In addition to our main contribution, which is to demonstrate an algorithm working with non-confocal data, we also show that our model can be adapted to the special case of confocal measurements, where it can be compared to LCT and FK Migration.
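The band-pass behavior described above can be sketched numerically. Assuming, for illustration only, a Gaussian-envelope virtual pulse modulated at a central virtual frequency (none of these parameter values are from the manuscript), the pulse spectrum shows that only a narrow band of frequency components carries significant weight:

```python
import numpy as np

# Spectrum of an assumed virtual illumination pulse: Gaussian envelope
# modulated at a central frequency f_c. All parameters are illustrative.
fs = 256e9                       # sampling rate (~4 ps time bins, assumed)
n = 4096
t = (np.arange(n) - n // 2) / fs
sigma = 50e-12                   # envelope width, 50 ps (assumed)
f_c = 5e9                        # central virtual frequency, 5 GHz (assumed)

pulse = np.exp(-t**2 / (2 * sigma**2)) * np.exp(2j * np.pi * f_c * t)
spectrum = np.abs(np.fft.fft(pulse))
freqs = np.fft.fftfreq(n, d=1 / fs)

# Bins carrying more than 1% of the peak weight form a narrow band at f_c;
# components outside it would only add noise to a reconstruction.
band = spectrum > 0.01 * spectrum.max()
```

Reconstructing only from the bins inside `band` is what makes the kernel act as a band-pass filter: frequencies beyond the hardware's temporal resolution and low frequencies that carry no scene information are simply never propagated.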
A discussion of available photon counts is missing. It is argued that real time capture is important and that SPAD arrays could achieve that. Yet it is neither shown nor likely as these arrays have significantly worse fill factors and photon detection probabilities. The photon counts of the proposed non-confocal system should actually be similar or worse than confocally scanned data.

Authors:
Note that the fill factor is not necessarily a problem for the total achievable photon count. The small pixel sizes available in first-generation SPAD arrays (e.g. MPD and Princeton Lightwave) indeed make them worse than large-area single-pixel SPADs when they are positioned very close to the relay wall. At large distances (>100 meters), the small pixel sizes can be compensated by the magnification of the objective optics. However, we do not propose that these first-generation SPAD arrays be used. Upcoming second-generation arrays have much better characteristics thanks to 3D stacking, which allows processing and memory to be placed in separate layers behind the SPAD pixels. All available commercial arrays, even second-generation ones, are however designed for LiDAR, which has very different requirements. We are therefore designing a dedicated SPAD array for NLOS imaging. It is understandable that someone with experience using first-generation SPAD arrays, and without deeper knowledge of the rationale behind the design choices, might come to a negative view of SPAD arrays. With the general availability of 3D stacking, however, neither fill factor, nor time resolution, nor maximum count rate presents an insurmountable engineering challenge.
Note also that the quantum efficiencies of SPADs are not lower than those of other sensors. Our single-pixel SPAD has a quantum efficiency between 40% and 50% in the green. Typical photography cameras have similar efficiencies (<40% for a Canon 5D). ICCD cameras, photomultiplier tubes, streak cameras, and photocathode-based night-vision cameras also have around 40% quantum efficiency. The only systems we are aware of with consistently higher quantum efficiencies are back-illuminated scientific CMOS and EMCCD cameras, which typically exceed 90% but are very slow.
The photon counts obtained in our NLOS measurements are given in detail in (Ref.[29]). It is straightforward to see that a system collecting light from more patches on the relay wall using more SPAD pixels would collect proportionally more light and that the amount of photons collected per time unit is proportional to the number of available pixels. Achievable photon counts from an array in the same setup can be estimated given the pixel area, the number of pixels, the focal length of the objective lens, and the distance between the objective and the relay wall.
We believe that the crucial need for SPAD arrays is an important factor motivating our work, and we think it is important for readers of our manuscript to understand that motivation. We therefore included a new section motivating the use of SPAD arrays in the manuscript (line 627 -693). We thank R1 for bringing up these concerns and believe addressing them makes our manuscript much stronger.
Finally, there seems to be some pre-processing required for Phasor Fields and this should be included in the compute time calculations / performance measurements.

Authors:
Pre-processing is required for Phasor Fields, as well as for FK Migration and LCT. In all cases, the histograms H need to be generated from raw photon time stamps. The computational cost of this process depends on the number of photons, not the size of the reconstruction, which is why it is omitted in all methods. It is true that the generation of Fourier domain histograms is more time intensive than the generation of time domain histograms, because our current hardware measures the information in the time domain. The main reason to generate our Fourier domain histograms directly, as described in the manuscript, is that the time domain histograms used in other methods require a lot of memory. If we stored the raw data as a time domain histogram, it would be the largest structure used in our method and would define our memory complexity.
If we are not concerned with optimizing memory and would like to use time domain histograms for a better runtime comparison, we can also compute the Fourier domain histograms from the time domain histograms using a set of 1D FFTs (one for each captured time response). The time required for this is small compared to the rest of the algorithm; in the provided example code it takes about 1.5 seconds for the largest scene.
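This conversion can be sketched in a few lines of NumPy (ours, for illustration; the histogram sizes, synthetic Poisson data, and the particular pass band are assumptions, not values from the manuscript):

```python
import numpy as np

# Sketch: convert time-domain histograms H[pixel, t] into Fourier-domain
# histograms (FDH) with one 1D FFT per captured response, keeping only
# the frequency bins inside an assumed virtual-pulse pass band.
rng = np.random.default_rng(0)
n_pixels, n_bins = 1024, 4096
H = rng.poisson(2.0, size=(n_pixels, n_bins)).astype(np.float32)

spectra = np.fft.rfft(H, axis=1)   # one 1D FFT per time response
band = slice(60, 100)              # assumed pass band: a few dozen bins
fdh = spectra[:, band]             # Fourier-domain histogram

ratio = fdh.nbytes / H.nbytes      # FDH is a small fraction of H's size
```

This also illustrates the memory argument: once the pass band is fixed, only a few dozen complex coefficients per pixel need to be kept, while the full time-domain histograms can be discarded.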

Examples of claims that are not demonstrated:
-The abstract motivates real-time capture with SPAD arrays, yet there is no discussion of actually capturing data with a SPAD array. Was any of the data captured in real time? Was a SPAD array used? It seems unlikely, because available SPAD arrays neither have the time resolution nor the photon detection probability (i.e., "sensitivity") to actually capture a sufficient number of third-bounce photons of diffuse room-sized scenes as claimed in the paper. No discussion of SPAD array hardware or optical setup used to capture any of the presented data is included and all of the data shown is captured, in previous work, with a single scanned SPAD.

Authors:
We apologize that R1 got the impression that we captured with an array. In our submission, we do not claim that we captured with a SPAD array; the last sentence in the abstract was intended to make this clear: "... that are currently under development" (line 21). We do think it is important to realize the inherent advantages of array capture, and therefore the need for algorithms capable of utilizing array data. One way to assess the performance of an array is to invert the capture geometry of the system from Ref. [29] that was used to generate our data. This system sequentially scans 24000 laser positions with a single stationary SPAD; at each laser position the SPAD is exposed for 5 milliseconds. As is pointed out in Ref. [29], the capture path is reversible: if we used a 24000-pixel SPAD array and a single stationary laser, we could collect the same data, but it would take only 5 milliseconds, as all SPAD pixels capture in parallel. We do not consider this a very controversial insight and believe that any reader with the necessary background will come to a similar conclusion. In fact, existing rendering algorithms for NLOS all render this inverted light path rather than following the actual direction of the light. We added a section (line 627 -693) to our paper to explain our reasoning. There are existing SPAD arrays with a sufficient number of pixels, good fill factors, and gated detection that can achieve this (Voxtel, currently a pre-release prototype, has 256 by 256 pixels with 10% fill factor, a 100 ps gate, 200 ps time resolution, and 16 million counts per second for all pixels combined). Since they are designed for LiDAR, they are however still far from optimal for NLOS imaging. We are currently developing a SPAD array with the lab that designed our single-pixel SPAD (Prof. Tosi, Polimi). Given the fast-paced nature of this field, we decided not to delay publication of our algorithm until an array is available.
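The arithmetic behind this reversal argument, using only the numbers quoted above from Ref. [29]:

```python
# Capture time for the same data, scanned vs. parallel, using the numbers
# from Ref. [29] quoted above (24000 positions, 5 ms exposure each).
n_positions = 24000
exposure_ms = 5

sequential_s = n_positions * exposure_ms / 1000   # single scanned SPAD: 120.0 s
parallel_s = exposure_ms / 1000                   # 24000-pixel array: 0.005 s
speedup = sequential_s / parallel_s               # = number of pixels
```

The speedup equals the pixel count exactly because every pixel integrates for the full exposure simultaneously; total collected photons are unchanged.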
-"SPADs can potentially be manufactured at low cost and in large arrays enabling fast parallel NLOS capture." -this statement (among others) indicates that SPAD arrays and thus real-time capture was not actually implemented, but that seemed like a major claim in the abstract and early parts of the introduction

Authors:
We present the first fast NLOS algorithm that can be used with array detectors. In addition, our algorithm is more memory efficient than other algorithms. The SPADs and arrays we are in the process of building are indeed manufactured in the same CMOS facilities and using the same methods as regular CMOS camera chips (apart from the relatively new 3D stacking). One would therefore expect that they can be offered at a similar price.
-there is no quantitative evaluation of any of the results; it is claimed that the proposed method is "better" and "less blurry" than other methods, but that is not quantified anywhere

Authors:
In the paper we only compare against existing methods that can solve the non-confocal scenario with complexity O(n^3 log(n)). Since our method shares similar roots with phasor fields, we compare the fast reconstruction with the analytical integral in Figure 7.

-"Real time reconstruction of low resolution retro-reflective scenes has also been demonstrated in a confocal scanning scenario with both LCT and FK Migration methods. However, reconstruction of higher resolution scenes with diffuse surfaces is hindered by the large memory requirements and the slow confocal capture process requiring sequential point scanning capture with a single SPAD pixel." -these previous methods require a 3D FFT; this can simply be implemented by successively Fourier-transforming each of the dimensions x,y,t to reduce the memory footprint by three orders of magnitude at the cost of slightly increased computing times (although the same order of compute time O(N^3 logN))
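For reference, the separable 3D FFT the reviewer describes can be sketched as follows (a minimal NumPy illustration with assumed shapes; the memory saving would come from streaming one axis at a time from disk, which this sketch does not implement):

```python
import numpy as np

def fft3_separable(data):
    """3D FFT computed as three sequential 1D FFT passes.

    Mathematically identical to np.fft.fftn(data); transforming one
    axis at a time allows chunked/out-of-core processing so the peak
    working set can be much smaller than the full complex volume.
    """
    out = data.astype(np.complex128)
    for axis in range(3):          # x, y, then t
        out = np.fft.fft(out, axis=axis)
    return out

vol = np.random.rand(16, 16, 16)
print(np.allclose(fft3_separable(vol), np.fft.fftn(vol)))  # True
```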

Authors:
Unfortunately, the 3D FFT is not the dominant memory requirement in LCT or FK. Even if the FFT, storage of the result, and resampling steps are assumed to require no memory at all, simply storing the raw data in a time domain histogram would require more memory than all memory needs of our method combined. The real memory and complexity bottleneck in both LCT and FK Migration, however, is the need for oversampling and interpolation of the data. We added a section discussing memory requirements with examples to the manuscript to clarify this point.
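Our point about raw-data storage can be made concrete with a back-of-envelope comparison (illustrative, assumed dimensions; not the exact figures from the paper):

```python
# Assumed dimensions: 150x150 relay-wall samples, 2048 time bins for a raw
# time-domain histogram vs. ~139 retained Fourier components in the FDH.
nx, ny = 150, 150
n_bins = 2048          # raw histogram stored as float32 counts
n_freq = 139           # FDH stored as complex64 Fourier components

raw_mb = nx * ny * n_bins * 4 / 1e6    # 4 bytes per float32
fdh_mb = nx * ny * n_freq * 8 / 1e6    # 8 bytes per complex64

print(f"raw histogram: {raw_mb:.0f} MB, FDH: {fdh_mb:.0f} MB")
```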
-"This new method performs at speed similar to LCT and FK Migration when used on confocal data, while requiring significantly less memory" -false, see comment above; it is also not clear why non-confocal measurements would be better than confocal measurements

Authors:
We added a section (line 627 -693) explaining the signal advantages of array vs. single-pixel capture. It is important to note that this problem is very different from LiDAR in this respect. See our responses above regarding the memory requirements.
-The results in Figs. 11,12 are questionable. The reconstruction quality of LCT and FK seem a lot worse than what was presented for that same data in the FK paper.

Authors:
We use the open-source published code and data from "Wave-based non-line-of-sight imaging" (Ref. [28] in the paper). The code for generating the results for LCT and FK is from here: ( https://github.com/computational-imaging/nlos-fk ). In the paper, we only use data from two exposure settings (minimum 10 min, maximum 180 min) for the comparisons. The results we used in the paper match those generated by the authors' published code (Ref. [28]). Details are provided below.
As for the differences in visual appearance, there is a missing piece on the visualization side. Lindell et al. (Ref. [28]) use 3D volume renderings in their main paper (as well as in their supplementary document) without axes denoted in either dimension. Their renderings may therefore cover a different (smaller or larger) volume, which can make it difficult to reproduce the results shown in their paper (Ref. [28]) using their published code ( https://github.com/computational-imaging/nlos-fk ). We also believe that the results in their paper were rendered at a higher resolution than what is provided in the official published code. Running the reconstructions at full resolution requires over 200 GB of RAM and would not be possible on most desktop computers.
Let us use the first rows of Figures 11 and 12 from our paper as examples. The 1st row of Figure 11 is from the bike 10 min dataset (folder: './bike/meas_10min.mat'); the 1st row of Figure 12 is from the bike 180 min dataset (folder: './bike/meas_180min.mat'). We put the results used for our submission and the results generated from the published code side by side in the following. Notice that we use the same code for 2D visualization, which is provided by the FK source code ( https://github.com/computational-imaging/nlos-fk ).

Other issues: -non-confocal measurements are being discussed as being significantly better than confocal measurements, but it seems largely unclear why that is, because the scanning and system calibration complexity seems a lot higher for non-confocal measurements.

This submission presents an exciting and timely contribution to the field of non-line-of-sight image reconstruction algorithms. Because of its convincing results, accessible technical description, and the promised dataset/source code, this work will have a high impact and inspire other work with eventual applications across domains. The authors present a memory- and compute-efficient approach based on Rayleigh-Sommerfeld Diffraction for non-confocal non-line-of-sight measurements. While the best results of previous confocal approaches are obtained by arguably "cheating" with retroreflective materials [27], that is, by engineering the BRDF of the unknown hidden scene to substantially increase the SNR of the confocal measurement setup, this work considers generic scenes with a non-confocal measurement arrangement. The presented reconstruction algorithm is well-motivated and explained in detail, even providing pseudo-code of the method itself. The only way the authors could have gone beyond what is provided at this stage would have been to include the measurement and algorithm source code. The method is well evaluated against recent reconstruction methods, with only a few open points (see these listed below), and provides both a solid theoretical framework for further algorithm development in this space, as well as a solid baseline for additional non-line-of-sight methods.
This work should be published in a timely manner, especially to keep pace with this fast-moving field. I would like the authors to consider the following requests in the final manuscript:

Authors:
We appreciate your recommendation of timely publication. We are also happy that our work could set up a baseline for future research in this area. We also agree with the open points, and the changes are made as discussed below.

Authors:
We agree with the reviewer. We added a section describing in detail the signal properties in NLOS measurements and current optical limitations to highlight the potential of SPAD arrays.

2) Memory requirements and runtime: The claims regarding memory and runtime are validated on a broad set of reconstructions. The only open item here is the comparison against matrix-free backprojection and linear inverse solvers.
While these assume isotropic reflectance only, it may be better to report compute and runtime complexity here instead of absolute runtime. Note that a single third-bounce transport on comparable resolution can be executed in under a second using plain CUDA on high-end GPU hardware. In the light of recent raytracing hardware acceleration, matrix-free solvers may also offer very high runtime performance, as speculated by the authors for a GPU version of their algorithm. However, as both are a bit speculative (without actually having done the GPU implementation), I would feel more comfortable softening this claim a bit in the final manuscript.

Authors:
We appreciate the reviewer(s)' concern. The current GPU implementation relies on a backprojection solver, which has a higher algorithmic complexity, and the published one we are aware of runs much longer (Ref. [8]). We agree that, similar to the situation in computed tomography (CT), the backprojection approach with higher complexity can be made fast with a proper CUDA implementation on high-end GPU hardware.
We softened this claim and removed the sentence below in our revision: "... The low memory use of our method makes it a good candidate for GPU parallel implementation which may reduce reconstruction times to fractions of seconds. ..."

4) Algorithm comparison overview: To parse the algorithm results a bit better, it would be great if the authors could provide a table comparing recent algorithms both in terms of compute and memory complexity. The Matlab runtimes are a bit of a strawman, as these depend on how Matlab schedules compute, organizes memory (not well), and implements array ops.

Authors:
We appreciate the reviewer(s)' concern.
We added more details about the algorithm complexity and memory requirements in the Section "Algorithm complexity". Our method, FK Migration, and LCT all have the same memory complexity. In practice, LCT and FK Migration need about 100 to 200 times more memory than our method due to the need for oversampling and grid interpolation. We added an example to explain the reason behind this large memory requirement.

"The memory complexity of our algorithm is defined by the need to store the FDH and the resulting 2D image. … ..." (line 303 -368)
The reason for us to choose Matlab is that the published code we compare against is in Matlab.
We have a C++ implementation based on the OpenCV library with roughly 20 MB memory usage and 6 seconds processing time. We can also make our method faster by pre-computing all the kernels used during reconstruction. This allows us to reconstruct in 0.2 seconds on a desktop computer without using a GPU, but memory use increases to about 5 GB in this scenario.

**********************************
Reviewer 3:
**********************************

The authors present a detailed account of their recent work aimed at further developing the "virtual phasor field" approach to non-line-of-sight imaging. The phasor field approach is not itself new and has been reported before in several papers, co-authored also by the same authors of the present manuscript. The main novelty presented here refers to the adaptation of this technique with the goal of improving reconstruction quality whilst reducing the required computational effort and resources, in particular memory. I also appreciated the clear exposition of the numerical technique and underlying maths, although I do have some comments about this, as detailed below. Given that the phasor approach to NLOS is not itself new, there is a somewhat weaker case for publishing in Nature Commun. However, I believe that the quality of the results, together with the improvements and developments proposed here, will make this an important reference point for researchers not only in this field but potentially also in related fields, e.g. imaging in scattering media where similar approaches could find applications. I would therefore suggest acceptance after some suggested changes.

Authors:
Thank you for this recommendation. We agree that the scope of our manuscript is less broad than the recent Nature publication. We believe however that it demonstrates a crucial capability in NLOS imaging by showing for the first time that data from array detectors can be used with a fast convolutional reconstruction algorithm. We added a section (line 627 -693) on the importance of using array detectors in capture to highlight the relevance of this. We also believe that it will further validate the phasor approach in general which has also been used by other research groups in recent publications.

Comments:
1) page 2: "To the best of our knowledge none of them have been applied to larger and more complex scenes" -I am not sure I agree with this comment. Ref.
[28] in the manuscript for example, presents 3D reconstruction based on the f-K transform (i.e. different from the phasor approach) of a large, complex scene with very high quality.

Authors:
We are sorry for the ambiguity in the first submission. We do not mean to imply that FK/LCT (Ref. [28]) cannot be applied to complex scenes. The paragraph containing the sentence "To the best of our knowledge ..." only refers to the work discussed in the section above it. FK and LCT are discussed after. While the scenes they can reconstruct are still significantly smaller in volume than our office scene, we agree that they are similar in complexity. We changed the text to avoid this misunderstanding.
To clarify the description, we adjusted the text as below: "... To the best of our knowledge, none of the works above have been applied successfully to larger and more complex scenes with the exception of the back-projection based methods. Methods that can process scenes of moderate and high volume and complexity include FK Migration, the Light Cone Transform, and Phasor-Field virtual waves, which are discussed below." (line 53 -57) To clarify the description of the confocal FK method, we also added a new sentence to the paragraph discussing current fast methods (LCT and FK): "... Both algorithms rely on 3D convolutions allowing for fast reconstruction and demonstrate the ability to recover complex scenes from confocal measurements\cite{Ref FK}.
As we will see below, both methods require interpolation over irregular 3D grids in order to approximate the data points needed for the convolutions. This requires oversampling the reconstructions and computing nearest neighbors, which is associated with significant added memory requirements. The crucial limitation of these methods that we explore in more detail below is, however, that they can only utilize the light returning from the confocal location on the relay wall and thus cannot utilize the vast majority of light available in an NLOS measurement. This is illustrated in the appendix." (line 73 -80)

2) Presentation of the virtual phasor field approach: the authors carefully present their equations together with text explaining the steps. However I feel that there are still some points that could be further clarified. Looking at Eqs. 5 and 6, it seems that the phasor fields are nothing other than the frequency components of the light pulse (either outgoing or the return signal), frequency shifted away from zero frequency by an arbitrary quantity Omega_c.
-How is this Omega_c chosen and why not just take Omega_c=0, i.e. use the actual FT of the laser pulse envelope?

Authors:
The reviewer(s) are right; we use the frequency components of each collected temporal measurement.
The temporal measurements are collected with a picosecond device with a temporal resolution of around 60-70 picoseconds. Because of the time-invariant system behavior, we can use each frequency component to calculate the outgoing field from an input illumination wavefront (field).
Overall, the reasons to choose this illumination pulse are shown and discussed in the previous Nature publication, Ref. [29], and the way we choose $\Omega_c$ follows Ref. [29]. The $\Omega_c$ in Eqs. 5 and 6 stands for the central frequency of the virtual illumination function (which can be transformed into a central wavelength). The central wavelength corresponding to $\Omega_c$ has to be larger than twice the spatial laser or SPAD point spacing on the relay wall and larger than the temporal resolution of the system. In the Fourier domain, the central frequency $\Omega_c$ of the virtual illumination becomes the offset from zero frequency.
We added the additional text below: "... The center frequency $\Omega_c$ has to be chosen according to the spatial relay wall sampling. The smallest achievable wavelength should be larger than twice the largest distance between neighboring points $\vec{x}_\mathrm{p}$ and $\vec{x}_\mathrm{c}$ and larger than the temporal resolution of the imaging hardware. For example, given a spatial sampling of 1 cm, the smallest possible modulation wavelength is larger than 2 cm." (line 140 -144) "... the central frequency $\Omega_c$ as it is shown in Fig. 1" (line 150)
Moreover, the frequency bandwidth (transformed into wavelength) used for the reconstruction is provided in Table 4, and the relation between finite discrete sampling and wavelength is discussed in the section "Discrete RSD". We appreciate the reviewer(s)' suggestions; further improvements on the capture side can be made to explore the limits of the existing hardware system.
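The two constraints on the virtual wavelength can be sketched in a few lines (illustrative, assumed values: 1 cm relay-wall sampling, ~70 ps temporal resolution):

```python
import math

# The smallest usable virtual wavelength must exceed both twice the
# relay-wall point spacing and the distance light travels within the
# temporal resolution of the system.
c = 3e8                      # speed of light, m/s
spacing = 0.01               # 1 cm spatial sampling on the relay wall
time_res = 70e-12            # ~70 ps system temporal resolution (assumed)

lambda_min = max(2 * spacing, c * time_res)   # smallest usable wavelength
omega_max = 2 * math.pi * c / lambda_min      # largest usable Omega_c

print(lambda_min)   # 0.021 m: limited here by the temporal resolution
```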
-The authors say, after eq. 14, that they typically deal with "dozens" of Omega values. It is not clear where this number comes from. Given that the Omega spectrum is the FT of the direct space pulse envelope, there could/should potentially be many thousands of Omega points in the spectrum, depending only on the extent of the time scale over which the temporal measurements are recorded.

Authors:
Similar to the question above, the frequency range depends on the particular virtual illumination function. The frequency interval used during the reconstruction also needs to take the discrete spatial sampling on the wall into account. For a typical illumination function used for our results (such as the office scene), the number of frequency components is up to 139 (Table 4, Number of Fourier components). This is because the spectrum of this type of illumination with a Gaussian envelope is mostly concentrated around the central wavelength and otherwise almost zero. Thus, based on the spacing of the scanning pattern, we can disregard most of the Fourier components in the measurements.
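This thresholding of weak components can be illustrated with a small sketch (assumed, illustrative pulse parameters, not the exact values from the paper):

```python
import numpy as np

# A Gaussian-enveloped virtual illumination pulse concentrates its spectrum
# around the center frequency; components below 1% of the peak are dropped.
bin_width = 8e-12            # histogram bin width, s (assumed)
n_bins = 4096
t = (np.arange(n_bins) - n_bins / 2) * bin_width

sigma = 2e-10                # pulse envelope width, s (assumed)
f_c = 10e9                   # virtual center frequency, Hz (assumed)
pulse = np.exp(-t**2 / (2 * sigma**2)) * np.cos(2 * np.pi * f_c * t)

spectrum = np.fft.rfft(pulse)
gamma = 0.01                 # keep components above 1% of the maximum
keep = np.abs(spectrum) > gamma * np.abs(spectrum).max()
print(keep.sum(), "of", spectrum.size, "components retained")
```

Only a small band around the center frequency survives, which is why the reconstruction needs on the order of a hundred components rather than thousands.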
We agree with the reviewer(s) that the number of components also depends on the scene depth, which determines the length of the recorded transients in time. We addressed this in one section of the text: "... In large scenes, it would also increase with scene depth which would increase computational and memory complexity. To avoid this, large scenes would have to be reconstructed in multiple depth sections. In this work, we reconstruct scenes with depth up to 3 meters representing the largest complex scenes for which data exist. For these scenes, a depth sectioning step is not necessary." (line 258 -262)
In addition, we added the following text for further clarification and quantification: "... Throughout the paper, we set γ to 0.01, meaning that all frequency components with magnitude smaller than 1% of the maximum magnitude are ignored. The discrete spacing Ω_res of the considered frequency components is given by the FFT frequency resolution: Ω_res = 2πf_sampling/N_bins, where f_sampling is the sampling frequency of the histograms (i.e., 1/bin width) and N_bins the number of time bins." (line 251 -257)

-It might be useful to provide a graph showing an actual example of a typical pulse shape used for eq. 5, how this transforms to eq. 6 and then how eq. 14 is applied (i.e. how the Omega spectrum is discretised before summing elements). -I am not sure what to make of Figure 6. Maybe this was indeed an attempt to illustrate the various steps, including those mentioned in my comment. But I could not follow the logic or extract any useful information. Maybe adding the suggested graphs, thus making this a bit more quantitative, will help.

Authors:
Since those two questions are related to each other, we answer them together. We are sorry for the unclear illustration of the Fourier domain histogram in our first submission figure. We appreciate and incorporate the reviewer(s)' suggestions. The new figure with additional explanatory text is shown below.
We add the virtual pulse shape used in the paper, in the time (Eq. 5) and frequency (Eq. 6) domains, as an example in Figs. 1 and 6, as shown below. For the Fourier domain histogram (FDH), we revise our figure (Fig. 6). We also add new text for better context flow:
"... We call this new capturing method a Fourier Domain Histogram (FDH) and it is shown in Fig. 6. It can be written as ..." (line 277) "... Fig. 6 illustrates the generation of the FDH. Similar to time domain histogram binning, the FDH performs binning for each captured photon." (line 284 -286)

3) All of the figures lack any form of axis labels or indication of length scales, making it very hard to appreciate the results. Ideally, an additional picture of the scene should also be included as there is no ground truth image to compare with and thus judge the actual complexity of the scene and quality of the retrieval.
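The per-photon FDH binning quoted above can be sketched as follows (a minimal NumPy illustration with assumed toy values: each detected photon contributes exp(-iΩt) at each retained frequency, instead of incrementing a time bin):

```python
import numpy as np

def fdh_bin(timestamps, omegas):
    """Accumulate a Fourier Domain Histogram directly from photon
    timestamps: FDH[k] = sum_j exp(-1j * omegas[k] * t_j).

    Equivalent to Fourier-transforming a time-domain histogram, but it
    never stores the (much larger) time-binned representation.
    """
    # outer product of retained frequencies and photon arrival times
    return np.exp(-1j * np.outer(omegas, timestamps)).sum(axis=1)

# toy example (assumed values)
t_photons = np.array([1.0e-9, 1.5e-9, 3.2e-9])    # arrival times, s
omegas = 2 * np.pi * np.array([1e9, 2e9, 3e9])    # retained frequencies
fdh = fdh_bin(t_photons, omegas)
```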

Authors:
We added ground truth images of the scene and a new table describing the actual target sizes and materials. We are sorry for missing these images and descriptions in our first submission.
To address lengths and scales, we provide a new table illustrating the target scene depth, complexity, and target descriptions. We appreciate the reviewer(s)' notice.
The new result figures used for our revised submission: (Figure 7). We also add a new table for the target scene descriptions: (Table 3)

4) Acquisition times are indicated throughout the paper (also figures and tables) in ms. But it is not clear if this is the acquisition time for each pixel or acquisition point or if it is the total acquisition time.
In the former case, total acquisition times should be also indicated. Also, total size of camera aperture on the observation screen (relay wall) and number of scan points should be clearly indicated, for example in figure 1 or maybe in a dedicated figure showing the details of the experiment layout (and maybe also at the beginning of the Results section where the setup is described, e.g. where the detector and laser features are described).

Authors:
We thank the reviewer(s) for mentioning the missing parameters of our captured dataset; we incorporate them into our revised submission as follows.
We add the total acquisition time into the description: "... Methods comparison on office scene: Exposure time per pixel measurement from first row to last row is 1 ms, 5 ms, 10 ms, 20 ms, 1000 ms (note that the 1000 ms Office Scene dataset was acquired with slight differences in the object location). The total acquisition time from first row to last row is 23 s, 117 s, 4 min, 8 min, 390 min." (Figure 7)

5) What is the laser wavelength and illumination power required?
Clearly, this will affect SNR so acquisition times alone do not provide a valid indication of the speed of a system that one may attempt to build in their own lab.

Authors:
We are sorry for the missing laser parameters in the "Results" section. Notice that our capture hardware is the same as that of current NLOS work in FK (Ref. [28]) and Ref. [29]. We added the additional text: "... to measure the time response as well as a pico-second laser (Onefive Katana HP amplified diode laser with 1 W at 532 nm, and a pulse width of about 35 ps used at a repetition rate of 10 MHz) as light source. ..." (line 390 -391) Note that 1 W of power, when scanned at high speeds <40 FPS over a 1 meter relay wall, would lead to an average power of about 0.1 mW/cm^2, which is eye safe and similar to the brightness of an iPhone flash. In the future, we believe it will be easier and more cost effective to reduce laser power and instead utilize larger SPAD arrays.
**********************************
Reviewer 1:
**********************************

I appreciate the authors' responses to issues raised in the previous review round. After reading the revised paper, the other reviewer responses, and the point-by-point discussion, I understand that the paper presents a new, efficient computational algorithm for processing non-confocal measurements for NLOS imaging, such as would be acquired by a SPAD array. This is an improvement over existing methods which are either more computationally expensive, like in the recent Phasor Fields work, or efficient but approximate, such as was shown by Lindell et al. (2019). However, even after the revision there are inaccuracies and inconsistencies as well as missing comparisons in the manuscript which prevent me from supporting publication at this stage. I summarize each issue and discuss below.
==== "Real time" referred to in the title, abstract, and elsewhere ==== The authors state that "real-time full-resolution capture of NLOS data is feasible with emerging SPAD array detectors, but current algorithms have computational and memory requirements that prevent real-time application on a desktop or embedded computer." They further assert that their method "will enable real-time full-resolution reconstructions when used with emerging SPAD array detectors". I disagree and would challenge these assertions on two points.
(1) The authors do not demonstrate that full-resolution real-time capture of NLOS data would indeed be feasible with SPAD arrays. I still believe that even with a high-resolution SPAD array the challenge of light efficiency will preclude real-time capture as the authors describe. For example, NLOS imaging diffuse objects would still require long exposure times with SPAD arrays due to the rapid signal decay and would thus not be real time; if the authors wish to claim the opposite, they must demonstrate it with results in the paper. In my remarks below I provide further comments on some of the authors' points addressing this which they have added in Section 6.3 of the revised manuscript.
(2) While claiming that the algorithm could enable "real-time full-resolution reconstructions", the authors demonstrate their method on scenes with 150x150 spatial resolution and require several seconds (e.g. 15-20 s per Table 1) of processing per frame. So even discarding the light-efficiency argument, the claim of enabling real-time imaging at full resolution, which I read as "high-resolution", e.g. 1 megapixel, does not seem well-founded. Presumably high-resolution reconstructions with the proposed method would take several minutes and would not be real time. Again, I see the main contribution of this paper as introducing a new computation- and memory-efficient algorithm for processing non-confocal measurements as would be captured by SPAD arrays. The authors do not show that it could enable general real-time NLOS capture at the moderate resolution they demonstrate, much less high resolution (e.g. 1 megapixel); however, it would enable somewhat faster reconstruction times and certainly more accurate reconstructions in non-confocal imaging scenarios.

==== Lack of quantitative analysis ====
While the authors state in the point-by-point discussion that they are unsure of what a meaningful quantitative analysis would be, they could certainly use the readily available public datasets of non-confocal simulated data to evaluate the reconstruction fidelity. One example is the dataset of Galindo et al. (2019). In these datasets, ground truth geometry is available and so can be used to benchmark reconstruction fidelity. Quantitative results should be used to augment and explain the qualitative trends displayed in the paper.

==== Complexity analysis ====
The computational complexity of the method should be reported as N^3*M*log(N) rather than N^3*log(N), where M is the number of Fourier components, and N is the number of pixels along each of the three spatial dimensions in the reconstructed volume. The authors state that, "calculating the RSD reconstruction requires a 2D FFT at each of the N depth planes for each Fourier component." By my calculation, each 2D FFT requires N^2*log(N) operations, multiplied by N depth planes, multiplied by M Fourier components.
While the authors state that the number of Fourier components is "just a constant", they acknowledge in the point-by-point discussion that "the number of components also depends on the scene depth", and in the revised paper they state that, "The number of Fourier components depends on the choice of the virtual illumination pulse. In large scenes, it would increase with scene depth...".
Moreover, the number of Fourier components is comparable to the spatial resolution of the scene, and cannot be neglected. For example, the office scene is 150x150 spatial resolution with up to 139 Fourier components.
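The reviewer's operation count can be checked with a small sketch (illustrative numbers taken from the discussion above):

```python
import math

# Count: one 2D FFT (~N^2 log N ops) per depth plane (N planes)
# per Fourier component (M)  =>  O(N^3 * M * log N) overall.
N = 150      # spatial resolution per dimension (office scene)
M = 139      # number of Fourier components (Table 4)

ops_per_fft2 = N * N * math.log2(N)     # one 2D FFT on an NxN plane
total_ops = ops_per_fft2 * N * M        # N depth planes, M frequencies

ratio = total_ops / (N**3 * math.log2(N))
print(ratio)   # equals M: the Fourier-component count is not negligible
```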

==== Memory comparison ====
I agree that the authors' method provides improvements in memory efficiency compared to FK and LCT. This seems primarily due to the capability of reconstructing a single plane at a time, whereas other methods can only reconstruct the full 3D volume at once. The authors' description of a FDH binning method that works on timestamps is also compelling.
However, the following should be addressed in the memory analysis. The authors note that the resampling step of FK contributes to its high memory consumption, but this comparison is somewhat skewed by the use of 8 ps time bins, which seems unnecessary given the system resolution of ~70 ps. I also note that the code included by the authors in the point-by-point discussion to run the memory analysis has inconsistencies with what is reported in the paper: the code sets the variable "bin_resolution" to 32 ps instead of 8 ps with 2048 bins, effectively calculating reconstruction times for distances in a 20 m round-trip light path, rather than the 5 m reported in the paper (L329).

==== Further Comments on Section 6.3 on SPAD array vs confocal architecture ====
The authors state that, "NLOS imaging cannot be performed efficiently with a single pixel." This is too strong of a statement and seems inaccurate. While using a SPAD array certainly eliminates the scanning requirement in measurement acquisition, the tradeoff in efficiency between SPAD arrays vs single-pixel SPAD detectors as it pertains to NLOS imaging is all about light efficiency, and both single-pixel scanned architectures and 2D sensors can be used efficiently.
Consider current state-of-the-art high-resolution SPAD arrays have pixels that are a few microns in size (e.g. the SwissSPAD2 from Ulku et al., 2019, a 512x512 sensor) compared to single-pixel detectors that have pixel sizes >100 microns (e.g. MPD sensors). Neglect fill-factor and consider that we select the optics to achieve an equal f-number where a 1 cm spot on the wall is focused onto each pixel. In this case the larger single-pixel detector would collect significantly more light than each of the smaller pixels of the SPAD array. So the single-pixel could be scanned with short exposures at each point and the large 2D array would require a single long exposure for similar total exposure times.
However, in practice high-resolution SPAD arrays come with an additional tradeoff that increases exposure time. This is because not all SPAD pixels in a 2D array contain their own time-to-digital converter. Even 3D stacking technology won't overcome this limitation because of the prohibitive bandwidth requirements for streaming out photon timestamps at megasample rates from each pixel in a high-resolution array. Instead, high-resolution SPAD arrays capture a full transient over many separate gated acquisitions. They capture photons within a few-nanoseconds-long gate, then sweep this gate in picosecond intervals, requiring many sequential acquisitions to build up the transient.
In summary, non-confocal acquisition with 2D sensors and confocal acquisition with a single-pixel SPAD each have their tradeoffs. To say unilaterally that NLOS imaging cannot be performed efficiently with a single pixel is inaccurate and fails to consider the nuances of each approach.

==== Presentation of results ====
Similar to other recent methods, the proposed approach estimates a 3D volume. Yet, the results in Figures 7, 8, 9, 11, and 12 are all shown as 2D images. From the description in the text it's not entirely clear how these 2D images are rendered. Are these maximum intensity projections, or do they show the integrals along the z dimension, or something else? In either case, it's actually difficult to judge the quality of these results objectively from just a single 2D perspective. The authors should show not only xy slices (these are currently included) but also xz and yz slices of all volumes, and ideally also 3D perspective renderings. Otherwise, it's impossible to say whether some reconstruction artifact from some z distance contributes to degradations of the 2D features observed on other z planes.
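The two rendering options in question can be sketched as follows (a minimal NumPy illustration with an assumed (z, y, x) volume layout and a toy volume):

```python
import numpy as np

def project(volume, mode="mip", axis=0):
    """Render a reconstructed 3D intensity volume as a 2D image.

    mode="mip": maximum intensity projection along `axis`
    mode="sum": integral along `axis`
    """
    if mode == "mip":
        return volume.max(axis=axis)
    return volume.sum(axis=axis)

vol = np.zeros((4, 3, 3))            # toy (z, y, x) volume
vol[2, 1, 1] = 5.0                   # one bright voxel at depth index 2
front = project(vol, "mip")          # xy view (project along z)
side = project(vol, "mip", axis=2)   # zy view (project along x)
```

Showing both projection modes (and slices along all three axes) makes it visible whether an artifact at one depth leaks into features seen at another.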

==== Detailed discussion and comparisons to Ahn et al. ====
After a thorough literature review, I actually found another paper that is more closely related to this submission than any other we have been discussing so far: Ahn et al. "Convolutional Approximations to the General Non-Line-of-Sight Imaging Operator", ICCV 2019. This paper is not currently cited, but a detailed discussion and direct comparisons seem absolutely necessary.
This paper describes a geometric optics approach to non-confocal NLOS imaging. In the non-confocal setting, the current baselines used for comparison in the manuscript are the approximate LCT and approximate FK methods, but neither of these methods is actually derived for non-confocal configurations, so these comparisons seem unfair. Ahn's method, on the other hand, is specifically developed for non-confocal imaging; source code is available on the project website.

==== Other ====
- Typo in line 402: "non-confcal" is missing an o.

Reviewer #2 (Remarks to the Author):
The authors have addressed all of my remaining concerns. I do support accepting this paper. Together with the source code and experimental validation, the proposed approach is solidly evaluated. I also do not agree with the concerns from Reviewer 1. The authors present an exact convolutional model for propagation and no approximation. While the claims regarding runtime remain a bit murky, I'd argue that the LCT method was also pitched with this contribution while missing comparisons against parallelized implementations of optimization-based methods. Overall, the authors present a new propagation model that can be efficiently implemented with (relatively) low memory consumption and that provides non-approximative reconstruction results. Given the high-impact field, I'd argue that this submission should be accepted.
Reviewer #3 (Remarks to the Author): The authors have replied to all comments in a very clear and satisfactory manner. As far as I can see, this applies to both my comments and also to those of the other referees.
I stand by my initial impression that this is a very high quality piece of work that will become a reference point for the community. I therefore suggest acceptance and publication without any further revisions.

**********************************
Reviewer 1:
**********************************
I appreciate the authors' responses to issues raised in the previous review round. After reading the revised paper, the other reviewer responses, and the point-by-point discussion, I understand that the paper presents a new, efficient computational algorithm for processing non-confocal measurements for NLOS imaging, such as would be acquired by a SPAD array. This is an improvement over existing methods, which are either more computationally expensive, like the recent Phasor Fields work, or efficient but approximate, as shown by Lindell et al. (2019). However, even after the revision there are inaccuracies and inconsistencies, as well as missing comparisons, in the manuscript which prevent me from supporting publication at this stage. I summarize and discuss each issue below.

Authors:
We want to thank the reviewer for their thorough review and many helpful suggestions that help improve our manuscript. We do our best to address each remaining concern below.

==== "Real time" referred to in the title, abstract, and elsewhere ====
The authors state that "real-time full-resolution capture of NLOS data is feasible with emerging SPAD array detectors, but current algorithms have computational and memory requirements that prevent real-time application on a desktop or embedded computer." They further assert that their method "will enable real-time full-resolution reconstructions when used with emerging SPAD array detectors". I disagree and would challenge these assertions on two points.
(1) The authors do not demonstrate that full-resolution real-time capture of NLOS data would indeed be feasible with SPAD arrays. I still believe that even with a high-resolution SPAD array, the challenge of light efficiency will preclude real-time capture as the authors describe. For example, NLOS imaging of diffuse objects would still require long exposure times with SPAD arrays due to the rapid signal decay and would thus not be real time; if the authors wish to claim the opposite, they must demonstrate it with results in the paper. In my remarks below I provide further comments on some of the authors' points addressing this, which they have added in Section 6.3 of the revised manuscript.

Authors:
We will address the comments on SPAD arrays in detail in our reply to the comments about Section 6.3, see below. The contribution of this work is indeed the development of a fast reconstruction algorithm that can work with non-confocal data. We provide a review of the signal characteristics of NLOS imaging and multi-pixel focal plane array imaging in general in the supplement to motivate our algorithm. This section is intended to provide motivation for our work and to put it in the context of larger scale NLOS imaging efforts. We did not intend to claim it as a contribution and do not provide experimental verification of the signal dynamics of multi-pixel sensors. While we consider this demonstration an essential component of the general NLOS imaging effort, it is not the subject of this paper.
We have changed the manuscript to make this clearer: we replaced the expressions pointed out by the reviewer in the previous round, "... real-time full-resolution capture of NLOS data is feasible with emerging SPAD array detectors, but current algorithms have computational and memory requirements that prevent real-time application on a desktop or embedded computer. ..." (text from the previous round), with new text in the revised manuscript: "... We anticipate that our method will enable real time full resolution reconstructions (i.e., only limited by the temporal resolution of the SPAD) when used with emerging SPAD array detectors that are currently under development." (text from the revised manuscript, lines 20-21). We added an introductory sentence to Section 6.3 to clarify that it serves as a theoretical analysis to provide motivation for our work rather than an experimental demonstration of the linearity and parallelism of a focal plane array optical imaging system such as a camera or SPAD array. As we point out further below, a high resolution SPAD array, or one with closely spaced pixels (i.e., a high fill factor), is not required for NLOS imaging.
"In this section we review the fundamental constraints on NLOS capture and provide an outlook for future NLOS SPAD array sensors. This section is intended to motivate the development of the non-confocal NLOS reconstruction algorithm that is the main contribution of this paper. The section is not in itself intended as a contribution, and experimental demonstration of the signal behavior described here is subject of future work." (lines 646-650)

(2) While claiming that the algorithm could enable "real-time full-resolution reconstructions", the authors demonstrate their method on scenes with 150x150 spatial resolution and require several seconds (e.g., 15-20 s per Table 1) of processing per frame. So even discarding the light-efficiency argument, the claim of enabling real-time imaging at full resolution, which I read as "high-resolution", e.g. 1 megapixel, does not seem well founded. Presumably high-resolution reconstructions with the proposed method would take several minutes and would not be real time.

Authors:
Thank you for pointing out this weak point in our manuscript. We should emphasize that by full resolution we mean a reconstruction resolution that is limited by the temporal resolution of the hardware rather than by computational or signal limitations. Given the ~50 ps time resolution of current SPADs, this means that a 1-2 cm voxel grid resolution is sufficient. This maximum spatial resolution for a given time resolution is derived, for example, in O'Toole et al. For our phasor field method, this also means that SPAD pixels on the relay wall should have about 1 cm side length. For a 2 m by 2 m relay wall, this means the largest SPAD array that could be used would be 200 by 200 pixels. In practice much less than that is needed, as we describe in more detail below. Future advances in detector development (not necessarily SPAD-based) might of course lead to higher available time resolution. We have therefore added a clarifying description of the full resolution reconstructions to the abstract of our revised manuscript: "... We anticipate that our method will enable real time full resolution reconstructions (i.e., only limited by the temporal resolution of the SPAD) when used with emerging SPAD array detectors that are currently under development."
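The claimed resolution limit follows from the distance light travels within one time bin. A quick back-of-the-envelope check, assuming the 50 ps bin stated above:

```python
# Distance light travels in one SPAD time bin bounds the voxel size.
C = 299_792_458.0   # speed of light in m/s
dt = 50e-12         # assumed SPAD temporal resolution (50 ps)

path_per_bin_cm = C * dt * 100
print(f"{path_per_bin_cm:.2f} cm per time bin")  # ~1.50 cm
```

This ~1.5 cm of light path per bin is consistent with the 1-2 cm voxel grid and ~1 cm relay-wall pixel spacing discussed in the response.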

Authors:
The LCT/FK Migration algorithm by Lindell et al. achieves similar reconstruction speeds to our method, so reconstructions of confocal data measurements are possible in real time with either. LCT/FK, however, does not work with non-confocal measurements, and the challenge is the capture time of many minutes (see above) when imaging diffuse objects. Real-time imaging at the mentioned rates can only be achieved with retroreflective targets that provide about a 10,000-fold signal boost. Such targets are rare in reality and limit the generality of scenes that could be imaged around corners. Therefore, we added this clarification to the introduction: "... Real time reconstruction of low resolution retro-reflective scenes has also been demonstrated in a confocal scanning scenario with both LCT and FK Migration methods. However, the presented confocal real time captures require retroreflective targets that return most reflected light to the moving laser/detection point, while arbitrary non-retroreflective objects require scan times of at least 10 minutes~\cite{Lindell et. al}. In this case, the bottleneck of these methods is not the computation, but the acquisition. Furthermore, ..."

Again, I see the main contribution of this paper as introducing a new computation and memory efficient algorithm for processing non-confocal measurements as would be captured by SPAD arrays. The authors do not show that it could enable general real-time NLOS capture at the moderate resolution they demonstrate, much less at high resolution (e.g., 1 megapixel); however, it would enable somewhat faster reconstruction times and certainly more accurate reconstructions in non-confocal imaging scenarios.

Authors:
We agree that the contribution of our paper is to provide a fast reconstruction algorithm for non-confocal data. We include sections about SPAD array capture as a motivation for this algorithm and not as part of the contribution of this work. We have added sentences in section 6.3 to make this more clear.
"In this section we review the fundamental constraints on NLOS capture and provide an outlook for future NLOS SPAD array sensors. This section is intended to motivate the development of the non-confocal NLOS reconstruction algorithm that is the main contribution of this paper. The section is not in itself intended as a contribution, and experimental demonstration of the signal behavior described here is subject of future work." (lines 646-650)

==== Lack of quantitative analysis ====
While the authors state in the point-by-point discussion that they are unsure of what a meaningful quantitative analysis would be, they could certainly use the readily available public datasets of non-confocal simulated data to evaluate the reconstruction fidelity. One example is the dataset of Galindo et al. (2019). In these datasets, ground truth geometry is available and so can be used to benchmark reconstruction fidelity. Quantitative results should be used to augment and explain the qualitative trends displayed in the paper.

Authors:
Unfortunately, the public dataset provided by Galindo et al. only provides non-confocal data with a very small number of grid points and is not suitable for our reconstructions. To provide quantitative analysis, we contacted Galindo et al. and obtained a newly rendered dataset with higher grid resolution. We included the reconstructions and quantitative analysis as a supplement in the paper, as shown below.
The new text and results for the simulated dataset and quantitative analysis are provided in our revised manuscript (lines 451-457, Figure 14): "... Reconstructions using a rendered dataset with known ground truth are shown in Supplementary Figure 14. Our proposed method reconstructs an image of the hidden scene that resembles the image that would be captured with a camera located at the relay wall. In our reconstructions, we recover the phasor field irradiance of the hidden object. It is expected that the reconstruction shows spatial distortions similar to the ones seen by a real camera, as shown in our supplementary materials. If an exact depth measurement is desired, these biases would have to be calibrated. This is an interesting subject for future work. ..." (lines 451-457)

(Figure 14) Additional results from simulated non-confocal datasets: three simulated targets at 0.5 m distance from a 1 m by 1 m relay wall with a single SPAD position located at the center. For each target, we display the results as a 3D volume, a 2D front view image obtained by taking the maximum intensity along the depth direction, and the corresponding depth error in meters. From the front view image, a 2D irradiance map of the hidden target is reconstructed. The virtual camera exhibits distortions similar to the ones seen in real cameras. Since the resulting depth error is preserved across different scenes, it can likely be calibrated if more accurate depth is desired. The error appears to be consistent across the different scenes with a variation of less than one voxel.
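The voxel-level depth-error check described in this response can be sketched as follows. The data here is synthetic and the error magnitude and grid size are placeholder assumptions, not the paper's measured values:

```python
import numpy as np

# Synthetic stand-in for reconstructed vs ground-truth depth maps.
rng = np.random.default_rng(1)
gt_depth = np.full((150, 150), 0.5)                   # target at 0.5 m
recon_depth = gt_depth + rng.normal(0.0, 0.004, gt_depth.shape)

mean_abs_error = np.abs(recon_depth - gt_depth).mean()
voxel_size = 0.01                                     # assumed 1 cm grid
print(mean_abs_error < voxel_size)  # True: mean error under one voxel
```

Comparing the mean absolute depth error against the voxel pitch is one simple way to substantiate the "less than one voxel" consistency claim quantitatively.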

==== Complexity analysis ====
The computational complexity of the method should be reported as N^3*M*log(N) rather than N^3*log(N), where M is the number of Fourier components, and N is the number of pixels along each of the three spatial dimensions in the reconstructed volume. The authors state that, "calculating the RSD reconstruction requires a 2D FFT at each of the N depth planes for each Fourier component." By my calculation, each 2D FFT