Learning to synthesize: robust phase retrieval at low photon counts

The quality of inverse problem solutions obtained through deep learning is limited by the nature of the priors learned from examples presented during the training phase. Particularly in the case of quantitative phase retrieval, spatial frequencies that are underrepresented in the training database, most often at the high band, tend to be suppressed in the reconstruction. Ad hoc solutions have been proposed, such as pre-amplifying the high spatial frequencies in the examples; however, while that strategy improves the resolution, it also leads to high-frequency artefacts, as well as low-frequency distortions in the reconstructions. Here, we present a new approach that learns separately how to handle the two frequency bands, low and high, and learns how to synthesize these two bands into full-band reconstructions. We show that this “learning to synthesize” (LS) method yields phase reconstructions of high spatial resolution and without artefacts and that it is resilient to high-noise conditions, e.g., in the case of very low photon flux. In addition to the problem of quantitative phase retrieval, the LS method is applicable, in principle, to any inverse problem where the forward operator treats different frequency bands unevenly, i.e., is ill-posed.


INTRODUCTION
The retrieval of phase from intensity is one of the most important and most challenging problems in Classical Optics.The utility of phase retrieval stems from the fact that it allows the shape of transparent objects, biological cells in particular, to be quantified in two and three spatial dimensions using visible light [1,2].In the x-ray band, quantitative phase imaging is also useful because phase contrast in tissue is orders of magnitude higher than attenuation contrast [3,4].The same argument can be made for identification of liquids [5] and semiconductor materials for integrated circuit characterization and inspection [6].
There are two well-known challenges in phase retrieval.
Firstly, for the phase of the optical field to be well-defined, the illumination needs to be temporally and spatially coherent to a fairly good approximation; this is especially difficult with x-rays.One way to relax this requirement is to acknowledge that it is usually the phase delay through the object that is of interest, and seldom the phase of the optical field itself.The former can be obtained even from partially coherent light [7] but requires the correlation function (mutual intensity) to be measured, which is often problematic because of low contrast.The second challenge is that, since only the intensity of a light beam is observable, the phase may only be inferred indirectly from intensity measurements.Computational approaches to this operation may be classified as interferometric/holographic [8,9], where a reference beam is provided; and non-interferometric, or referenceless such as direct/iterative [10,11] and ptychographic [12,13], which are both nonlinear, and transport-based [14,15], where the problem is linearized through a hydrodynamic approximation.Direct methods attempt to retrieve the phase from a single raw intensity image, whereas the transport and ptychographic methods implement axial and lateral scanning, respectively.What reference-less methods have in common is the need to obtain intensity measurements at some distance away from the conjugate plane of the object, i.e. with a small defocus.Direct measurement with defocus is the approach we take here.All computational phase retrieval approaches, interferometric and non-interferometric, involve the solution of a nonlinear and highly ill-posed inverse problem.For direct phase imaging, which is a nonlinear problem-see Section 2.A-the classical Gerchberg-Saxton-Fienup (GSF) algorithm [10,11,16] and its variants [17] are widely used.The main idea is to start with a random estimate for the unknown phase distribution and then to iteratively update it until the modulus-squared of its Fourier (or Fresnel) transform matches the observed intensity.For well-behaved phase fields, the iteration usually converges to the correct phase [18,19].Alternatively, the Wiener-Tikhonov functional minimization approach, described in Section 2.B, exploits prior knowledge about the class of phase objects being imaged to combat noise artifacts.A modern and popular implementation is compressive imaging [20][21][22][23][24] that utilizes sparsity priors, valid when the object is expressed in the appropriate set of basis functions.If the sparsifying set is unknown, it can be learned from examples according to the k-SVD method or other dictionary approaches [25][26][27][28].
In 2010, [29] proposed a deep neural network in a recurrent scheme to learn the prior from examples, as an alternative to the dictionary approach.Subsequently, generator-discriminator competition [30] was adopted as a more secure means of learning the statistical relationship between valid objects and their forward models; and the recursion was unfolded into a cascade for better numerical stability [31].In these schemes, the physical model of the measurement is taken explicitly into account as a projection operator applied to the reconstruction estimate repeatedly at each recursion or cascade stage.This generalization of dictionaries to deep learning has been successful in a number of linear inverse problems, most notably superresolution [32][33][34] and tomography [35,36].
Recently, deep learning regression has been investigated for application to nonlinear inverse problems, in particular phase retrieval: direct [37][38][39], holographic [40,41], and ptychographic [42,43].The idea, described briefly in Section 2.B, is to train a deep neural network (DNN) in supervised mode from examples of phase objects and their intensity images so that, after training, given an intensity image as input, the DNN outputs an estimate of the phase object.In this case, the physical model is either learned implicitly together with the prior from the examples [37,38]; or incorporated as a pre-processor ("Approximant") [39][40][41].An interesting alternative method for the inverse problem, also nonlinear, of reconstructing the three-dimensional (3D) refractive index distribution from intensity projections, is to define the DNN architecture according to the strong scattering model and store the refractive index values as weights of the DNN after training [44].This "index-storing" DNN itself was subsequently used as Approximant to a traditional DNN for improving the estimates in 3D distributions with exceptionally small and highcontrast features or when the range of available angles of projec-tion is severely limited [45].Extensive reviews of deep learning use for inverse problems can be found in [36,46,47].
Here, we propose a new DNN-based computational architecture for phase retrieval with the unique feature of processing low and high spatial frequency bands as separate channels with two corresponding DNNs trained from an original object database and a high-pass filtered version of the database, respectively.Subsequently, the outputs of the two channels are recombined using a third DNN also specifically trained for this task.The motivation for this new approach is an earlier observation [38] that nonlinearities in DNN training and execution algorithms tend to amplify imbalances in the spatial frequency content of the training database and in the way different spatial frequencies are treated as they propagate through the physical optical system; this amplified imbalance typically results in the lower spatial frequencies becoming dominant and ultimately limiting resolution of fine spatial features in the reconstructions.A more detailed overview of this phenomenon can be found in Section 2.C.Because the essential feature of our new proposed technique is the synthesis of the two spatial bands through a trained DNN, we refer to it as "learning to synthesize" (LS).
Splitting the spatial frequency content into several bands and processing the bands separately has a long history in signal processing [48][49][50][51][52][53][54][55].For image reconstruction, dual band processing has been used in fluorescence microscopy [56][57][58] and phase retrieval [59].However, these cases, unlike ours, required structured illumination.In the context of learning-based inversion, a dual channel method has been tried for superresolution [60] (to be understood as upsampling) albeit the two processed channels were combined as a simple convex sum to form the final image.By contrast, the LS method presented here uses a learned nonlinear filter, implemented as a third DNN trained to optimally recombine the two channels according to the spectral properties of the class of objects that the training database represents.
In addition to requiring a single raw image to retrieve the phase through a learned recombination of the spectral channels, the LS method presented here has the desirable property of resilience to noise, especially in the case of weak photon flux down to a single photon per pixel.We achieved that by using an Approximant filter [39] to pre-process the raw image before submitting it to the two spectral channels.The Approximant produces an inverse estimate that expressly uses the physical model (a single iteration of the GSF algorithm in [39] and here).For very noisy inputs, the Approximant is of very poor quality; yet, if the subsequent learning architecture is trained with this low-quality estimate as input, the final reconstruction results are significantly improved.The LS method with Approximant, as presented here, drastically improves over [39], especially in the reconstruction of fine detail, as [39] did not use separate spectral channels to rebalance frequency content.
The detailed implementation of the LS method is described in Section 3, where we also show how to optionally include the Approximant as initial estimate for input to the LS scheme.Results from experiments are presented in Section 4 with detailed characterization of the LS method's behavior under different noise conditions.Conclusions and suggestions for future work are in Section 5.

Let
ψ obj (x, y) = t(x, y)e i f (x,y) denote the complex transmittance of an optically thin object of modulus response t(x, y) and phase response f (x, y), and let ψ inc (x, y) denote the coherent incident field of wavelength λ on the object plane.The noiseless intensity measurement g 0 (x, y) (also referred to as noiseless raw image) is carried out on the detector plane located at a distance z away from the object plane, and can be written as where F z [•] denotes the Fresnel (paraxial) propagation operator for distance z, i.e. the convolution and H 0 is the (nonlinear) noiseless forward operator.Alternatively, F z may be expressed in the spatial frequency domain (ν x , ν y ) as where F denotes the 2D (spatial) Fourier transform operator and F −1 its inverse.We are interested in weakly absorbing objects, i.e. we assume t(x, y) ≈ 1.In all the experiments described here, the illumination is also a normally incident plane wave ψ inc (x, y) = 1.Therefore, to a good approximation, we may write This is what we refer to as the direct phase retrieval problem, which Gerchberg-Saxton and related algorithms solve iteratively [10,16].
In practice, the measurement is subject to Poisson statistics due to the quantum nature of light; and to Gaussian thermal noise added by the photoelectric conversion process.We express the noisy measurement as where P{θ} denotes a Poisson random variable with mean θ and N a Gaussian random variable with zero mean and variance σ 2 .The photon flux in photons per pixel per frame is denoted as p and the spatial average H 0 f = g 0 of the noiseless raw image in the denominator is necessary as normalization factor.The noisy forward operator is H and the purpose of phase retrieval is to invert H so as to recover f as accurately as possible, despite the nonlinearity and randomness present in the measurements.

B. Solution of the inverse problem
The Wiener-Tikhonov approach to solving inverse problems of the form g = H f is to obtain the estimate f of the inverse as Here, D(H 0 f , g) is the fitness (or data-fidelity) term and D is a distance operator.Under the assumption of solely additive Gaussian noise, it is appropriate to use the L 2 norm Somewhat less rigorously, but conveniently for mathematical manipulations, the same L 2 metric is used for the fitness term even for measurements strongly subject to non-Gaussian statistics, as we consider here.The topic of appropriate fitness metrics is beyond the scope of this paper; in any case, when machine learning is used to approximate (6), the dilemma of choosing a metric shifts to the loss function for training a deep neural network.We will address this latter problem in some detail in Section 3.A.
Here, as in earlier works on direct phase retrieval [37][38][39][40][41][42][43], and due to the nonlinearity of the forward model, we adopt the End-to-End and Approximant methods.These we denote as End-to-End: f = DNN(g); and (8) where DNN(•) is the output of a deep neural network.In the End-to-End approach, the burden is on the DNN to learn from examples both the forward operator H and the prior Φ so as to execute, in one shot, an approximation to the ideal solution (6).
Training takes place in supervised mode, with known pairs of phase objects f and their raw intensity images g generated on a phase spatial light modulator (SLM) and measured on a digital camera, respectively.Note that training is generally very slow, taking several hours or days if a few thousand examples are used.However, after training is complete, the execution of ( 8) or ( 9) is very fast as it only requires forward (non-iterative) computations.This is one significant advantage over the standard way of minimizing the Wiener-Tikhonov functional (6) iteratively for each image.
When the inverse problem becomes severely ill-posed or the noise is extremely strong, the learning burden on the DNN becomes too high; then, generally, better results are obtained by training the DNN to receive as input the Approximant f * instead of the raw measurement g directly.The Approximant is obtained through an approximate inversion of the forward operator; for example, in [40] it was implemented as a digital holographic backpropagation algorithm, whereas in [39] it was the outcome of a single iteration of the Gerchberg-Saxton algorithm [10].While these Approximants f * generally do not look very good, especially in highly noisy situations [39], through training the DNN is able to learn a better association of f * with its corresponding true object f than it can learn with the noisy raw measurement g.

C. Spectral properties of training
The design of deep neural networks is an active field of research and a comprehensive review of methods and caveats is well beyond the scope of this paper.We refer the reader to [36,46,47] for more extensive background and references.Here, we discuss the influence on the quality of training of the spatial power spectral density of the database where example are drawn from.
In both End-to-End and Approximant methods (8-9) the examples determine the object class prior to be learned by the DNN.For instance, if we train on examples drawn from the ImageNet database [63] of natural objects, the prior learned is weaker than if we train on the MNIST [64] database of handwritten digits or a database of integrated circuit (IC) segments because both latter shapes are more restricted than natural images and with stronger spatial correlations across.Recent work [39,65] has shown that, unsurprisingly, training with stronger priors results in more robust reconstructions as long as the test objects are drawn from the same class.
In [38], we addressed the influence of the spatial power spectral density (PSD) S(ν x , ν y ) of the example database on the quality of training.It is well known [66][67][68][69][70][71] that two dimensional (2D) images of natural objects, such as those contained in ImageNet [63], follow the inverse quadratic PSD law Other types of object classes of practical interest exhibit similar power-law decay, perhaps with slightly different exponents.This means that, if a neural network is trained on such an object class, higher spatial frequencies are presented less frequently to the DNN during the training stage.At face value, this is as it should be, since the relative popularity of different spatial frequencies in the database is precisely one of the priors that the DNN ideally should learn.This understanding needs to be modified in the context of inverse problems because the representation of high spatial frequencies in the raw images is also uneven-typically, to the high spatial frequencies' disadvantage.In the specific case of phase retrieval, higher spatial frequencies within the spatial bandwidth (as determined by the numerical aperture NA) have uniform transmission modulus but are more severely scrambled by the chirped oscillations of the transfer function (3).Thus, higher spatial frequencies suffer a double penalty [38]: their recovery becomes more sensitive to noise due to the scrambling; and they are less popular due to the inverse-square (or similar) PSD law so they are presented less frequently to the DNN's training process.Moreover, since the DNN itself and its training routine are both highly nonlinear, there is an acute risk that any unevenness in the treatment of different spatial frequency bands may be amplified in the final result, eventually causing the lower frequencies to dominate.Ref. [38] attributed the inability of the Phase Extraction Neural Network (PhENN) [37] to resolve spatial features well within its admitted spatial bandwidth to this unequal treatment of spatial frequencies.They showed that PhENN's resolution is approximately doubled by pre-filtering the training examples as to flatten their PSD.That is, during the training, each example f (x, y) from the database was replaced with its filtered version The transfer function was defined as the high-pass filter exactly compensating for the inverse-quadratic dependence (10) and flattening the spectrum.Raw images for training were correspondingly filtered as whereas, during the test, the un-filtered measurements (i.e., as received from the camera) were used to obtain the reconstructions.
Unfortunately, with this implementation amplification of high spatial frequency features, especially of artifacts caused even by weak noise, was also evident in the reconstructions.This is not surprising, since technically (11) trades off violating the prior in return for finer spatial resolution.The LS approach that we describe in the next section is meant to fix this problem.

METHODOLOGY OF LEARNING TO SYNTHESIZE
A. Description of the LS-DNN system In [72], we proposed a two-step approach to tackle the difficulty of restoring high frequency components without introducing significant artifacts and distortions.Here we describe the two steps for training and execution in detail, as well as the design of the DNNs involved in the LS system.For unity in notation, we denote the input to the entire LS system as ξ(x, y), to be understood as ) in the End-to-End scheme, and We will discuss the Approximant implementation in more detail in section B below.
The two training steps are shown in block-diagram form in Figure 1.The first step consists of training two separate DNNs in parallel, as follows: • DNN-L is trained to match unfiltered patterns ξ (n) (x, y) at its input with the corresponding unfiltered example phase patterns f (n) (x, y) as ground truth at its output (the superscript n enumerates the examples).
• DNN-H is trained to match unfiltered patterns ξ (n) (x, y) at its input to corresponding spectrally filtered versions p (x, y) of the ground truth examples f (n) (x, y) at its output.The filter's transfer function is chosen more generally than (12) as We have investigated implementations with q spanning a broad range, as we discuss in more detail below.
The output of DNN-L for a general test input ξ(x, y) is denoted as f LF (x, y).Assuming similar training conditions, f LF matches the output of PhENN as presented in [37] in the Endto-End scheme or [39] in the Approximant scheme; that is, f LF is expected to be fairly accurate at low spatial frequencies but missing fine detail.
The output of DNN-H is denoted as f HF (x, y).Note that [38] required spatial pre-filtering the raw inputs g; here, we do not spatially pre-filter the input ξ (i.e., g or f * according to whether the End-to-End or Approximant scheme is used).We instead train DNN-H to produce the filtered output based on an unfiltered input.This leads to better generalization, because DNN-H is trained on the broadest set of possible images (whereas the training in [38] was on high-frequency containing images only).Moreover, using unfiltered inputs to DNN-H allows for the training process to be parallelized for better efficiency.
Depending on the value of q, the PSD of the patterns training DNN-H will be flat or almost flat.The output of DNN-H f HF (x, y) is expected to have better fidelity at fine spatial features of the phase objects.However, spectral flattening may also generate artifacts due to over-learning the high spatial frequencies.Therefore, f HF looks rather like a high-pass filtered version of the true object f , which we found to be more beneficial for subsequent use in the LS scheme.
The second training step consists of combining the two partially accurate reconstructions f LF and f HF into a final estimate f (x, y) with uniform quality at all spatial frequencies, low and high up to the passband.To this end, we train the synthesizer DNN-S to receive f LF and f HF as inputs, and use the un-filtered examples f as the output.To avoid any further damage to the high-spatial frequency content in f HF , we bypass f HF and present it intact to the last layer of DNN-S.By operating on f LF alone, DNN-S learns how to treat the low frequency reconstruction so as to compensate for artifacts at all bands.The use of the synthesizer DNN-S also makes our results less sensitive to the choice of power q in the transfer function (14).We found that q ∈ [0.3, 0.7] can produce reconstructions of approximately equal quality, as will be presented in section 4.There is a wide variety of DNN structures one may choose to implement DNN-L, H and S. In this work, we use for DNN-L the same architecture in [39], a residual U-net architecture with skip connections [73].For simplicity, DNN-H and DNN-S, are also chosen to be structures similar to DNN-L.The details of the implementations are in the Supplementary Material.We made these choices to enable specific and fair comparisons with the earlier works; alternative architectures are certainly possible within the LS scheme, though we judged a full exploration to be outside the scope of the present paper.
Training a neural network is typically implemented as a stochastic optimization procedure [74,75] where the neural network's internal coefficients (weights) are adjusted so as to minimize a metric of the distance between the actual output of the neural network and its desired output to a given input (training example).This distance is called training loss function (TLF).In the context of training to solve an inverse problem, the TLF is defined as where the superscript n is again used to enumerate the examples, and the dilemma of choosing the appropriate metric operator D emerges.
It is generally accepted [37,[76][77][78][79][80] that the L 2 metric (also referred to as mean square error, MSE) is a poor choice that does not generalize well; i.e., deep neural networks trained with MSE do not perform well when presented with examples outside their training set.For image classification tasks, and in some early work on phase retrieval [37], the L 1 metric (mean absolute error, MAE) was used instead.In direct analogy with compressive sensing, the L 1 metric promotes sparsity in the internal connectivity of the neural network, and that leads to better generalization.However, [65] found that in highly ill-posed problems this benefit is eclipsed by the inability of MAE and pixel-wise metrics more generally to learn spatial structure priors about the object class that are crucial for regularization.
In this paper, we train DNN-L, H and S using the Negative Pearson Correlation Coefficient (NPCC) [65,81,82] as the TLF.This is defined as The NPCC has been shown [83] to be more effective in recovering fine features than conventional loss functions such as the Mean Square Error (MSE), Mean Absolute Error (MAE) and the Structural Similarity (SSIM) index [84,85].However, the NPCC has the disadvantage that it is invariant to affine transformations to its arguments, i.e.
for arbitrary real numbers α 1 , α 2 , β 1 , β 2 .For quantitative phase retrieval, where the scale of phase difference matters, the affine ambiguity is resolved with a histogram equalization step after inversion [38].

B. Computation of the Approximant
It has been shown that, even under extreme noise conditions, just a single iteration of the Gerchberg-Saxton (GS) algorithm suffices as Approximant in scheme (9) for phase retrieval [39].We elected to use the same approach here for the LS-DNN architecture.More recently, a comparative study [86] showed that higher iterates or regularized versions of GS do improve the appearance of the Approximant result f * but do not yield significant improvement in the end output f of the DNN.Similar conclusions hold for alternatives to GS, e.g.Gradient Descent.While these alternative schemes are interesting for the LS-DNN method, we chose to not pursue them here.
The general form of the (k + 1)-th GS iterate from the k-th iterate is where we have taken into account that ψ inc = 1.Accordingly, our Approximant is where 1 denotes the function that is uniformly equal to one within the frame [86].Figure 2 compares the 2D (log-scale) Fourier spectrum magnitude of a ground truth image (from ImageNet [63]), Approximant ( 19) computed without noise, and Approximant (19) computed from an input subject to Poisson statistics corresponding to average flux of one photon per pixel.We can see that although the single photon Approximant (which we will later use as the input to the LS-DNN) has a large support in its spectrum, it is the noise that dominates the mid-to-high frequency range.Therefore, the learning process still bears the burden of restoring the correct high frequency contents and relying heavily on high frequency priors, as our DNN-H does, is justified.

A. Experimental Apparatus and Data Acquisition
In each experiment carried out to train and test different LS-DNN schemes, 10,000 image objects from ImageNet [63] were successively projected on a phase SLM as phase objects (i.e., with phase value at each pixel proportional to the intensity of the corresponding pixel in the original from the database) and their raw images were recorded by the EM-CCD camera at an out-offocus plane.These 10,000 ground-truth phase images and their corresponding raw intensity images were split into a training set of 9,500 images, a validation set of 450 images and a test set of 50 images.The choice of ImageNet [63] is reasonable, since the lowfrequency dominance in its spatial PSD is representative of the broader classes of objects of interest, and therefore, we anticipate our results will generalize well in practical applications.
The experiments were carried out using the apparatus described in Figure 3.The light source was a continuous wave Helium-Neon gas laser at 632.8nm.The laser beam first passed through a variable neutral density filter (VND) that served the purpose of adjusting the photon flux.The beam was then spatially filtered and expanded into a 18mm diameter collimated pencil and sent onto a transmission spatial light modulator (SLM) with 256 × 256 pixels, each of size 36 × 36µm.Phase objects were projected on the SLM and imaged by a telescope (4F system) consisting of lenses L1 (focal length 230mm) and L2 (100mm).The 2.3× reduction factor in the 4F system was designed to reduce the spatial extent of the defocused raw image to approximately fit the size of the camera.An aperture was placed in the Fourier plane to suppress higher diffraction orders due to the periodicity of the SLM pixels.The raw intensity images were captured by a Q-Imaging EM-CCD camera with 1004 × 1002 square-shaped pixels of size 8×8µm placed at a distance z = 400mm from the image plane of the 4F system.Additional details about the implementation of the optical apparatus and its numerical simulation with digital Fresnel transforms are in Supplementary Material.
The photon flux is quantified as the number of photons p received by each pixel on average for an unmodulated beam, i.e. with no phase modulation driving the SLM.During an initial calibration procedure, for different positions of the VND filter, the photon level is measured using a calibrated silicon photodetector placed at the position of the camera.The quoted photon count p is also corrected for the quantum efficiency of the CCD (60% at λ = 632.8nm),meaning that we refer to the number of photons actually detected and not the incident number of photons.
Here, we report results for two levels of photon flux p = 9.8 ± 5% and p = 1.1 ± 5%, respectively quoted in the text as "10" and "1" photons.The data acquisition, training and testing procedures of the entire LS-DNN architecture were repeated separately for each value of p.

B. Reconstructions and Analysis
Figure 4 shows the reconstructions obtained by the LS-DNN method (q = 0.5) and its components at fluxes p = 1 photon and 10 photons per pixel, as defined immediately above.As expected, the reconstructions f LF by DNN-L have good fidelity at low spatial frequencies but lose fine details, as in [39]; whereas the reconstructions f HF by DNN-H look like high-pass filtered versions of the true objects with some additional high-frequency artifacts due to the noise.The reconstructions f by DNN-S preserve detail at both low and high frequencies, while significantly attenuating the artifacts.The improvement of f over f LF is more pronounced under severe noise, i.e. in the p = 1 photon/pixel case.More examples of reconstructions (obtained with q = 0.5) for the noisier case (p = 1) are given in the Supplementary Material.
In Figures 5 and 6, we compare reconstructions by LS-DNN with different values of the pre-filtering parameter q for p = 1 photon and p = 10 photons per pixel, respectively.The most detail at high frequencies in the DNN-S output is preserved in the range 0.3 q 0.7.At lower values of q, the quality of reconstructions by DNN-S does not noticeably exceed that of DNN-L.This is expected, since in the limit q = 0, training DNN-H becomes identical to DNN-L.On the other hand, for values q 0.7, the DNN-H output is dominated by high-frequency artifacts and again the quality of DNN-S reconstructions regresses to that of DNN-L, since the high-frequency channel is no longer contributing.These observations are valid for both values of p, and even stronger for the most severely noise-limited case p = 1.
Similar trends are evident according to various quantitative metrics averaged over the entire set of test examples compared to the true phase signals f , summarized in Table 1.For comparisons we used the PCC, defined as in (16) but without the minus sign; peak signal-to-noise ratio (PSNR) [87]; and structural simi-larity index metric (SSIM) [84,85].As we noted in Section 3.A, DNN-H is trained with spectrally pre-filtered version of the true object f so a quantitative comparison of its output with the ground truth does not make sense.
It is noteworthy that in both visualization and quantitative comparisons of Figures Figures 5-6 and Table 1, respectively, the performance of DNN-S remains approximately the same within the range 0.3 q 0.7.This is desirable as it suggests that one need not prefilter exactly with the inverse of the PSD power law.This further suggests that for datasets that do not represent natural images and may obey power laws different than (10), not knowing the exact value of q may not be catastrophic.We have not tested this hypothesis exhaustively, as it is beyond the scope of this paper.
In Table 2 and in Supplementary Material, we also analyze the case of a bigger DNN (denoted DNN-L-3) with computational capacity equal to the sum of DNN-L, H and S together, but trained with un-filtered examples, and show that DNN-L-3 cannot achieve reconstructions of equal quality.Therefore, the improvements over [39] resulted from the training procedure followed in the LS-DNN method and not simply by brute force due to the use of larger computational capacity.
To further study the behavior of the LS components in the low and high spatial frequency bands, we studied the reconstructions in the Fourier domain.Figure 7 shows the spectra (2D Fourier transforms) of two randomly selected test examples.Figure 8 and Figure 5 in Supplementary Material show normalized diagonal cross-sections of the PSD averaged over all the 50 test images, for p = 1 and 10 photons per pixel, respectively.These plots illustrate that DNN-L and DNN-H's outputs are depleted at the high and low frequencies, respectively, with the losses being more severe in the noisy case p = 1; whereas DNN-S's output mostly recovers the frequency content at both bands, albeit still with some minor loss at high frequencies.
Lastly, we experimentally characterized the spatial resolution of the LS-DNN reconstructions, i.e. the ability of DNN-S to resolve two pixels at nearby locations having phase delay higher than the rest of the signal.Similar analyses were carried out in [38,65], where the methodology was also described in detail.In the work presented here, we carried out the analysis under ample illumination, i.e. not under strong Poisson statistics.We

CONCLUDING REMARKS
The LS-DNN reconstruction scheme [72] for quantitative phase retrieval has been shown to be resilient to highly noisy raw intensity inputs while preserving high spatial frequency detail better than [39].The reconstructions are also quite robust to variations of the pre-filtering power law q around the value ≈ 1/2 following from natural image statistics.Beyond the scope of the work reported here, further improvements may be obtained through modifying the architecture of the DNNs used to process and recombine the two spatial frequency bands.It is also possible that for highly restricted datasets, q needs to be investigated further.However, our observations for natural images suggests that the LS-DNN approach is relatively insensitive to q under fairly wide conditions.
Another obvious alternative strategy is to split the signals into more than two bands, then process and recombine these multiple bands with a synthesizer DNN, according to the LS scheme.While we did not investigate this approach in detail here, we expect it to present a trade-off between the improvements and the complexity of having to train multiple neural networks implying the need for more examples and the danger of poor generalization.2. Quantitative comparison of reconstructions by Approximant, DNN-L-3 and DNN-S (for q = 0.5).Each entry takes the form of 'average ± standard deviation'.
where a and b are the spatial averages of the generic functions a(x, y), b(x, y).If a and b are uncorrelated, the expected value of D NPCC (a, b) is zero, whereas if they are identical then D NPCC (a, b) = −1.Thus, training the neural network minimizes the TLF toward L ≈ −N, where N is the number of training examples.

Fig. 2 .
Fig. 2. From top to bottom: ground truth image and its 2D Fourier spectrum; noiseless Approximant and its 2D Fourier spectrum; Approximant for 1 photon/pixel illumination and its 2D Fourier spectrum.

Fig. 5 .
Fig. 5. Comparisons of LS-DNN reconstructions under different q s for p = 1 photon/ pixel.Rows from top to bottom: DNN-L output; DNN-H output under different q's; DNN-S under different q's; 1D cross-section (along the dashed line in the row below) of DNN-S output under different q s; ground truth.

Fig. 6 .
Fig.6.Comparisons of LS-DNN reconstructions under different q s for p = 10 photon/ pixel.Rows from top to bottom: DNN-L output; DNN-H output under different q's; DNN-S under different q's; 1D cross-section (along the dashed line in the row below) of DNN-S output under different q s; ground truth.

Fig. 8 .
Fig. 8. 1D diagonal cross-sections of the average 2D power spectral density (PSD) of 50 test images for p = 1 photon per pixel.The case of p = 10 photons per pixel is in the Supplementary Material.