On the use of deep learning for phase recovery

Phase recovery (PR) refers to calculating the phase of the light field from its intensity measurements. In applications ranging from quantitative phase imaging and coherent diffraction imaging to adaptive optics, PR is essential for reconstructing the refractive index distribution or topography of an object and for correcting the aberrations of an imaging system. In recent years, deep learning (DL), often implemented through deep neural networks, has provided unprecedented support for computational imaging, leading to more efficient solutions for various PR problems. In this review, we first briefly introduce conventional methods for PR. Then, we review how DL supports PR at three stages, namely pre-processing, in-processing, and post-processing. We also review how DL is used in phase image processing. Finally, we summarize the work in DL for PR and provide an outlook on how to better use DL to improve the reliability and efficiency of PR. Furthermore, we present a live-updating resource (https://github.com/kqwang/phase-recovery) for readers to learn more about PR.


Introduction
Light, as a complex electromagnetic field, has two essential components: amplitude and phase 1 . Optical detectors, usually relying on photon-to-electron conversion (such as charge-coupled device sensors and the human eye), measure the intensity, which is proportional to the square of the amplitude of the light field and in turn relates to the transmittance or reflectance distribution of the sample (Fig. 1a and Fig. 1b). However, they cannot capture the phase of the light field because of their limited sampling frequency 2 .
In fact, in many application scenarios, it is the phase rather than the amplitude of the light field that carries the primary information about the samples [3][4][5][6] . For quantitative structural determination of transparent and weakly scattering samples 3 (Fig. 1c), the phase delay is proportional to the sample's thickness or refractive index (RI) distribution, which is critically important for bioimaging because most living cells are transparent. For quantitative characterization of an aberrated wavefront 5 (Fig. 1d and Fig. 1e), the phase aberration is caused by atmospheric turbulence with an inhomogeneous RI distribution in the light path; this is mainly used in adaptive aberration correction. Also, for quantitative measurement of the surface profile 6 (Fig. 1f), the phase delay is proportional to the surface height of the sample, which is very useful in material inspection. Since the phase delay across the wavefront is necessary for the above applications, but optical detection devices can only perceive and record the amplitude of the light field, how can we recover the desired phase? Fortunately, as the light field propagates, the phase delay also causes changes in the amplitude distribution; therefore, we can record the amplitude of the propagated light field and then calculate the corresponding phase. This operation goes by different names according to the application domain: it is called quantitative phase imaging (QPI) in biomedicine 3 ; phase retrieval in coherent diffraction imaging (CDI) 4 , the most common term in x-ray optics and non-optical analogues such as electrons and other particles; and wavefront sensing in adaptive optics (AO) 5 for astronomy and optical communications. Here, we collectively refer to calculating the phase of a light field from its intensity measurements as phase recovery (PR).
As is common in inverse problems, calculating the phase directly from an intensity measurement after propagation is usually ill-posed 7 . Suppose the complex field at the sensor plane is known; we can then directly calculate the complex field at the sample plane using numerical propagation 8 (Fig. 2a). In reality, however, the sensor only records the intensity but loses the phase, and, moreover, the field is necessarily sampled by pixels of finite area; because of these complications, the complex field distribution at the sample plane generally cannot be calculated in a straightforward manner (Fig. 2b). We can transform phase recovery into a well-posed, deterministic problem by introducing extra information, such as holography or interferometry at the expense of introducing a reference wave 8,9 , the transport of intensity equation requiring multiple through-focus amplitudes 10,11 , and Shack-Hartmann wavefront sensing, which introduces a micro-lens array at the conjugate plane 12,13 . Alternatively, we can solve this ill-posed phase recovery problem iteratively by optimization, i.e., so-called phase retrieval, such as the Gerchberg-Saxton-Fienup algorithms [14][15][16] , the ptychographic iterative engine 17,18 , and Fourier ptychography 19 . Next, we introduce these classical phase recovery methods in more detail.
Holography/interferometry. By interfering the unknown wavefront with a known reference wave, the phase difference between the object wave and the reference wave is converted into the intensity of the resulting hologram/interferogram due to alternating constructive and destructive interference of the two waves across their fronts. This enables direct calculation of the phase from the hologram 8 .
In in-line holography, where the object beam and the reference beam travel along the same optical axis, the four-step phase-shifting algorithm is commonly used for phase recovery (Fig. 3) 20 . First, the complex field of the object wave at the sensor plane is calculated from the four phase-shifting holograms. Next, the complex field at the sample plane is obtained through numerical propagation. Then, by applying the arctangent function to the resulting complex field, a phase map in the range of (-π, π] is obtained, i.e., the so-called wrapped phase. The final sample phase is obtained after phase unwrapping. Other multiple-step phase-shifting algorithms are also possible for phase recovery 21 . Spatial light interference microscopy (SLIM), a well-known QPI method, combines the phase-shifting algorithm with phase-contrast microscopy for phase recovery of transparent samples 22 .

In off-axis holography, where the reference beam is slightly tilted from the optical axis, the phase is modulated onto a carrier frequency and can be recovered through spatial spectral filtering with only one holographic measurement (Fig. 4) 23 . By appropriately designing the carrier frequency, one can well separate the baseband, which contains the reference beam, from the object beam. After transforming the measured hologram into the spatial frequency domain through a Fourier transform (FT), we can select the +1st or -1st order term and move it to the baseband. By applying an inverse FT, the complex sample beam can be retrieved.
One has to be careful, however, not to exceed the Nyquist limit of the camera as the angle between the reference and object beams increases. Moreover, as only a small part of the spatial spectrum is used for phase recovery, off-axis holography typically wastes much of the spatial bandwidth product of the system. To enhance the utilization of the spatial bandwidth product, the Kramers-Kronig relation and other iterative algorithms have recently been applied to off-axis holography [24][25][26] .
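To make the off-axis pipeline concrete, the following minimal NumPy sketch demodulates a single off-axis hologram by selecting the +1st order and re-centering it. The carrier position (cx, cy) and the filter radius are assumptions that depend on the actual setup, not values from any cited work.

```python
import numpy as np

def demodulate_off_axis(hologram, cx, cy, radius):
    """Recover a complex field from one off-axis hologram.

    hologram : 2D real array, the measured intensity.
    cx, cy   : pixel coordinates of the +1st-order carrier peak
               in the centered spectrum (setup-dependent assumption).
    radius   : radius of the circular band-pass filter in pixels.
    """
    ny, nx = hologram.shape
    # Centered spatial spectrum of the hologram
    spectrum = np.fft.fftshift(np.fft.fft2(hologram))
    # Circular mask around the +1st-order term
    y, x = np.ogrid[:ny, :nx]
    mask = (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
    selected = spectrum * mask
    # Move the +1st order to the baseband (remove the carrier)
    selected = np.roll(selected, (ny // 2 - cy, nx // 2 - cx), axis=(0, 1))
    # Back to the spatial domain: complex object field at the sensor plane
    field = np.fft.ifft2(np.fft.ifftshift(selected))
    return field  # np.angle(field) gives the wrapped phase
```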
Both the in-line and off-axis holographic schemes discussed above are lensless, where the sensor and sample planes are not mutually conjugate; therefore, backward numerical propagation from the former to the latter is necessary. Numerical propagation can be omitted if additional imaging components are added to conjugate the sensor plane with the sample plane, as in digital holographic microscopy 27 .

Transport of Intensity Equation.
For a light field, the wavefront determines the axial variation of the intensity along the direction of propagation. Specifically, there is a quantitative relationship between the gradient and curvature of the phase and the axial derivative of the intensity, the so-called transport of intensity equation (TIE) 10 . This relationship has an elegant analogy to fluid mechanics, approximating the light intensity as the density of a compressible fluid and the phase gradient as the lateral pressure field. The TIE may be derived from the Fresnel-Schrödinger equation 10 , and it is subject to the scalar, paraxial, and weak-defocusing approximations 28,29 . The gradient and curvature of the phase together determine the shape of the wavefront, whose normal vector is parallel to the wavevector at each point of the wavefront, and consequently to the direction of energy propagation. In turn, variations in the lateral energy flux result in axial variations of the intensity. Convergence of light by a convex lens is an intuitive example (Fig. 5): the wavefront in front of the convex lens is a plane whose wavevector is parallel to the direction of propagation. As such, the intensity distribution on different planes is constant; that is, the axial variation of the intensity is zero. The convex lens then reshapes the wavefront so that all wavevectors are directed toward the focal point, and therefore, as the light propagates, the intensity distribution becomes denser and denser, meaning that the intensity varies in the axial direction (equivalently, its axial derivative is not zero). Because of this quantitative relationship between the phase and the axial derivative of the intensity, we can exploit it for phase recovery (Fig. 6). By shifting the sensor axially, intensity maps at different defocus distances are recorded; these approximate the axial derivative by finite differences, from which the phase is calculated through the TIE. Owing to the added imaging optics, the sensor plane and the sample plane are conjugate. It is worth noting that the TIE is applicable to both fully and partially coherent light sources, and the resulting phase is continuous and does not require phase unwrapping, although it is only valid under the paraxial and weak-defocus approximations 11 .
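For reference, the TIE referred to above is commonly written as follows (a standard form in our own notation, not reproduced from this paper):

$$ -k\,\frac{\partial I(\mathbf{r})}{\partial z} \;=\; \nabla_{\perp}\!\cdot\!\left[\,I(\mathbf{r})\,\nabla_{\perp}\varphi(\mathbf{r})\,\right], $$

where $k$ is the wavenumber, $I$ the intensity, $\varphi$ the phase, and $\nabla_{\perp}$ the gradient in the transverse plane. The left-hand side is what the through-focus measurements approximate by finite differences; the right-hand side contains the phase gradient (wavefront tilt) and, through the divergence, its curvature.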

Shack-Hartmann wavefront sensing.
If we can obtain the horizontal and vertical phase gradients of a wavefront in some way, then the phase can be recovered by integrating the phase gradients along these orthogonal directions. The Shack-Hartmann wavefront sensor 12,13 is a classic way to do so from the perspective of geometric optics. It usually consists of a microlens array and an image sensor located at its focal plane (Fig. 7). The phase gradient of the wavefront at the surface of each microlens is calculated linearly from the displacement of the focal spot on the focal plane, in both the horizontal and vertical (x-axis and y-axis) directions.
The phase can then be computed by integrating the gradient at each point, with a resolution that depends on the density of the microlens array. In addition, quantitative differential interference contrast microscopy 30 , quantitative differential phase contrast microscopy 31 , and quadriwave lateral shearing interferometry 32 also recover the phase from its gradients, and may achieve higher resolution than the Shack-Hartmann wavefront sensor.

Phase retrieval. If no extra information is to be introduced, then calculating the phase directly from a propagated intensity measurement is an ill-posed problem. We can overcome this difficulty by incorporating prior knowledge, also known as regularization. In the Gerchberg-Saxton (GS) algorithm 14 , the intensities recorded by the sensor at the sample plane and the far-field sensor plane are used as constraints. A complex field is projected forward and backward between these two planes using the Fourier transform and iteratively constrained by the intensities; the resulting complex field gradually approaches a solution (Fig. 8a; a minimal sketch follows this paragraph). Fienup replaced the intensity constraint at the sample plane with an aperture (support region) constraint, so that the sensor only needs to record one intensity map, resulting in the error reduction (ER) algorithm and the hybrid input-output (HIO) algorithm (Fig. 8b) 15,16 . Naturally, if more intensity maps are recorded by the sensor, there is more prior knowledge for regularization, further reducing the ill-posedness of the problem. By moving the sensor axially, intensity maps at different defocus distances are recorded as intensity constraints, and the complex field is then computed iteratively as in the GS algorithm (Fig. 9a) [33][34][35] . In this axial multi-intensity alternating projection method, the distance between the sample plane and the sensor plane is usually kept as small as possible, so that numerical propagation is used for the projections instead of the Fourier transform. Alternatively, with the sensor position fixed, multiple intensity maps can be recorded by radially moving an aperture near the sample, and the complex field is then recovered iteratively as in the ER and HIO algorithms (Fig. 9b); this is the so-called ptychographic iterative engine (PIE) 17,18 . In this radial multi-intensity alternating projection method, adjoining aperture constraints overlap one another. Furthermore, angular multi-intensity alternating projection is also possible. By switching the aperture constraint from the spatial domain to the frequency domain, multiple intensity maps carrying different frequency information are recorded by changing the angle of the incident light (Fig. 9c); this is the so-called Fourier ptychography (FP) 19 . In addition to alternating projections, the two most representative non-convex optimization methods are the Wirtinger flow 36 and truncated amplitude flow 37 algorithms. Such problems can also be transformed into convex optimization problems through semidefinite programming, as in the PhaseLift algorithm 38 .
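As a concrete illustration, here is a minimal NumPy sketch of the GS iteration between the sample plane and the far-field (Fourier) plane; the measured amplitudes `amp_sample` and `amp_far` and the iteration count are placeholders, and fftshift conventions are omitted for brevity.

```python
import numpy as np

def gerchberg_saxton(amp_sample, amp_far, n_iter=200):
    """GS algorithm: recover the phase from two intensity measurements.

    amp_sample : sqrt of the intensity measured at the sample plane.
    amp_far    : sqrt of the intensity measured at the far-field plane.
    """
    # Start from the known sample-plane amplitude with a random phase guess
    rng = np.random.default_rng(0)
    field = amp_sample * np.exp(1j * rng.uniform(0, 2 * np.pi, amp_sample.shape))
    for _ in range(n_iter):
        # Forward propagation to the far field (Fraunhofer ~ Fourier transform)
        far = np.fft.fft2(field)
        # Replace the amplitude with the measurement, keep the phase
        far = amp_far * np.exp(1j * np.angle(far))
        # Backward propagation to the sample plane
        field = np.fft.ifft2(far)
        # Enforce the measured sample-plane amplitude, keep the phase
        field = amp_sample * np.exp(1j * np.angle(field))
    return np.angle(field)  # recovered (wrapped) sample-plane phase
```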
Deep learning (DL) for phase recovery. In recent years, as an important step towards true artificial intelligence (AI), deep learning 39 has achieved unprecedented performance in many computer vision tasks with the support of graphics processing units (GPUs) and large datasets. Similarly, since it was first used to solve an inverse problem in imaging in 2016 40 , deep learning has demonstrated great potential in the field of computational imaging 41 . In the meantime, there has been rapidly growing interest in using deep learning for phase recovery (Fig. 10; the literature statistics are based on a Web of Science search with the query TS=(("phase recovery" OR "phase retrieval" OR "phase imaging" OR "holography" OR "phase unwrapping" OR "holographic reconstruction" OR "hologram" OR "fringe pattern") AND ("deep learning" OR "network" OR "deep-learning"))).
For the vast majority of "DL for PR", the implementation of deep learning is based on the training and inference of artificial neural networks (ANNs) 42 with input-label paired datasets, known as supervised learning (Fig. 11). In view of its natural advantages in image processing, the convolutional neural network (CNN) 43 is the most widely used ANN for phase recovery. Specifically, in order for the neural network to learn the mapping from physical quantity A to B, a large number of paired examples are collected to form a training dataset that implicitly contains this mapping relationship (Fig. 11a). Then, the gradient of the loss function is propagated backward through the neural network, and the network parameters are updated iteratively, thus internalizing the mapping relationship (Fig. 11b). After training, the neural network is used to infer the corresponding B from an unseen A (Fig. 11c).
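The following PyTorch sketch shows this supervised pipeline in its simplest form; the small CNN, the tensors `intensity` and `phase_gt`, and all hyper-parameters are placeholders rather than any specific published configuration.

```python
import torch
import torch.nn as nn

# Placeholder network: any image-to-image CNN (e.g., a U-Net) fits here.
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()  # the l2-type loss widely used in "DL for PR"

# Placeholder paired batch: intensity measurements A and ground-truth phase B.
intensity = torch.randn(8, 1, 256, 256)   # input A
phase_gt = torch.randn(8, 1, 256, 256)    # label B

for step in range(100):                    # training (Fig. 11b)
    phase_pred = model(intensity)          # forward pass
    loss = loss_fn(phase_pred, phase_gt)   # compare with the label
    optimizer.zero_grad()
    loss.backward()                        # back-propagate the gradient
    optimizer.step()                       # update the parameters

# Inference on an unseen measurement (Fig. 11c)
with torch.no_grad():
    new_phase = model(torch.randn(1, 1, 256, 256))
```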
In this way, deep learning has been used in all stages of phase recovery and phase processing. Indeed, the rapid progress of deep-learning-based phase recovery has been documented in several excellent review papers. For example, Barbastathis et al. 44 and Rivenson et al. 45 reviewed how supervised deep learning powers the processes of phase retrieval and holographic reconstruction. Zeng et al. 46 and Situ et al. 47 mainly focused on the use of deep learning in digital holography and its applications. Wang et al. 48 reviewed and compared different strategies for using deep learning in phase unwrapping. Dong et al. 49 introduced a unifying framework for various algorithms and applications from the perspective of phase retrieval and presented its advances in machine learning. In contrast, depending on where the neural network is used, we review the various methods from the following four perspectives:
• In DL-pre-processing for PR (Section 2), the neural network performs pre-processing on the intensity measurement before phase recovery, such as pixel super-resolution (Fig. 12a), noise reduction, hologram generation, and autofocusing.
• In DL-in-processing for PR (Section 3), the neural network directly performs phase recovery (Fig. 12b) or participates in the process of phase recovery together with the physical model or physics-based algorithm.
• In DL-post-processing for PR (Section 4), the neural network performs post-processing after phase recovery, such as noise reduction (Fig. 12c), resolution enhancement, aberration correction, and phase unwrapping.
• In DL for phase processing (Section 5), the neural network uses the recovered phase for specific applications, such as segmentation (Fig. 12d), classification, and imaging modal transformation.
Finally, we summarize how to effectively use deep learning in phase recovery and look forward to potential development directions (Section 6). To let readers learn more about phase recovery, we present a live-updating resource (https://github.com/kqwang/phase-recovery).

DL-pre-processing for phase recovery
A summary of "DL-pre-processing for phase recovery" is presented in Table 1 and is described below, including pixel super-resolution (Section 2.1), noise reduction (Section 2.2), hologram generation (Section 2.3), and autofocusing (Section 2.4). "---" indicates not available. "LR" is short for low-resolution. "HR" is short for high-resolution. "Expt." is short for experiment. "Sim." is short for simulation. "GAN loss" means training the network in an adversarial generative way. "MLP" is short for multi-layer perceptron.

Pixel super-resolution
A high-resolution image generally reveals more detailed information about the object of interest. Therefore, it is desirable to recover a high-resolution image from one or multiple low-resolution measurements of the same field of view, a process known as pixel super-resolution. Likewise, a high-resolution hologram can be recovered from multiple sub-pixel-shifted low-resolution holograms by pixel super-resolution algorithms 84 . Luo et al. 50 proposed to use the U-Net for this purpose. Compared with iterative pixel super-resolution algorithms, this deep learning method has an advantage in inference time while ensuring the same level of resolution improvement, and maintains high performance even with a reduced number of input low-resolution holograms.
After the super-resolution CNN (SRCNN) was proposed for single-image super-resolution in the field of image processing 85 , this type of deep learning method was also applied to other optical super-resolution problems, such as brightfield microscopy 86 and fluorescence microscopy 87 . Similarly, this approach of inferring high-resolution images from their low-resolution versions via deep neural networks can be used for pixel super-resolution of holograms before phase recovery by conventional methods (Fig. 13). Ren et al. 53 proposed a CNN incorporating the residual network (ResNet) and sub-pixel network (SubPixelNet) for pixel super-resolution of a single off-axis hologram.
They found that the neural network trained using the l2-norm as the loss function performed better than those trained with the l1-norm or the structural similarity index (SSIM) 88 . Moreover, this deep learning method reconstructs high-resolution off-axis holograms with better quality than conventional image interpolation methods, such as bicubic, bilinear, and nearest-neighbor interpolation.
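For readers unfamiliar with sub-pixel networks, the sketch below shows the core idea in PyTorch: convolutions produce r² feature channels that `PixelShuffle` rearranges into an r-times larger image. The layer sizes are illustrative assumptions, not those of Ren et al.

```python
import torch
import torch.nn as nn

class SubPixelUpsampler(nn.Module):
    """Minimal sub-pixel super-resolution block (illustrative sizes)."""

    def __init__(self, upscale: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            # r*r output channels, one group per sub-pixel position
            nn.Conv2d(64, upscale ** 2, 3, padding=1),
        )
        # Rearranges (B, r*r, H, W) into (B, 1, r*H, r*W)
        self.shuffle = nn.PixelShuffle(upscale)

    def forward(self, lr_hologram):
        return self.shuffle(self.features(lr_hologram))

# A low-resolution hologram patch -> a 2x super-resolved patch
sr = SubPixelUpsampler(2)(torch.randn(1, 1, 128, 128))  # -> (1, 1, 256, 256)
```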

Noise reduction
Most phase recovery methods, especially holographic ones, rely on a coherent light source; therefore, coherent noise is an unavoidable issue. In addition, noise can be introduced by environmental disturbances and by the recording process of the image sensor. It is therefore very important to reduce the noise in the hologram before phase recovery. Filter-based methods, such as the windowed Fourier transform (WFT) 89 , have been widely used for hologram noise reduction, but most of them face a trade-off between filtering performance and time cost.
In 2017, Zhang et al. 90 opened the door to image denoising with a deep CNN called DnCNN. Subsequently, this type of deep CNN was introduced to the field of fringe analysis for fringe-pattern denoising (Fig. 14). Yan et al. 54 first applied the DnCNN to fringe-pattern denoising, achieving higher precision around image boundaries and requiring less inference time than the WFT. Similar conclusions can be found in the work of Lin et al. 55
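The DnCNN idea is residual: the network predicts the noise map and subtracts it from the input. A minimal sketch, with illustrative depth and width rather than the original configuration:

```python
import torch
import torch.nn as nn

class ResidualDenoiser(nn.Module):
    """DnCNN-style denoiser: learn the noise, subtract it (illustrative sizes)."""

    def __init__(self, depth: int = 5, width: int = 32):
        super().__init__()
        layers = [nn.Conv2d(1, width, 3, padding=1), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(width, width, 3, padding=1),
                       nn.BatchNorm2d(width), nn.ReLU()]
        layers += [nn.Conv2d(width, 1, 3, padding=1)]  # predicted noise map
        self.body = nn.Sequential(*layers)

    def forward(self, noisy):
        return noisy - self.body(noisy)  # clean estimate = input - noise

denoised = ResidualDenoiser()(torch.randn(1, 1, 256, 256))
```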

Hologram generation
As mentioned in the Introduction, multiple intensity maps are needed to recover the phase in many cases, such as phase-shifting holography and axial multi-intensity alternating projection. Given its excellent mapping capability, a neural network can be used to generate the other required holograms from known ones, thus enabling phase recovery methods that require multiple holograms (Fig. 15). In this approach, the input and output usually belong to the same imaging modality with high feature similarity, so the mapping is easier for the neural network to learn. Moreover, the dataset can be collected by experimental recording or simulation alone, without the need to compute ground-truth phase in advance by conventional methods.
Zhang et al. 61,62 first proposed the idea of generating holograms from holograms before phase recovery with a conventional method (Fig. 15a). From a single hologram, the other three holograms with π/2, π, and 3π/2 phase shifts were simultaneously generated by the Y-Net 92 , and phase recovery was then implemented by the four-step phase-shifting method. The motivation for inferring holograms instead of phase via a network is that, for different types of samples, the spatial differences between their holograms are significantly smaller than those between their phases. Accordingly, phase recovery based on hologram generation has better generalization ability than recovering the phase from holograms directly with a neural network, especially when the spatial characteristics of the phase differ substantially between the training and testing datasets 62 . Since the phase shifts between the generated holograms are equal, Yan et al. 63 proposed to generate noise-free phase-shifting holograms using a simple end-to-end generative adversarial network (GAN) by sequential concatenation. Subsequently, for a better balance between spatial details and high-level semantic information, Zhao et al. 64 applied the multi-stage progressive image restoration network (MPRNet) 93 to phase-shifting hologram generation. Huang et al. 65 and Wu et al. 66 then extended this approach from the four-step to the three-step and two-step phase-shifting methods, respectively.
Luo et al. 67 proposed to generate holograms at different defocus distances from one hologram via a neural network, and then achieve phase recovery with alternating projection (Fig. 15b). Similar to the work of Zhang et al. 62 , they showed that using neural networks with less difference between the source and target domains enhances the generalization ability. As for multi-wavelength holography, Li et al. 68,69 harnessed a neural network to generate a hologram at another wavelength from one or two holograms at known wavelengths, thereby realizing two-wavelength and three-wavelength holography. Meanwhile, Xu et al. 70 realized one-shot two-wavelength and three-wavelength holography by generating the corresponding single-wavelength holograms from a two-wavelength or three-wavelength hologram with information crosstalk.

Autofocusing
In lensless holography, the phase at the sample plane can only be recovered if the distance between the sensor plane and the sample plane is known. Estimating this defocus distance is thus a fundamental problem in holography, also known as autofocusing.
Deep learning methods for autofocusing essentially use a neural network to estimate the defocus distance from the hologram (Fig. 16), which can be regarded either as a classification problem [71][72][73][74] or as a regression problem 75-78,80-83 .

From the perspective of classification, Pitkäaho et al. 71 first proposed to estimate the defocus distance from the hologram with a CNN. In their scheme, the zero-order and twin-image terms need to be removed before the trained neural network classifies the holograms into different discrete defocus distances. Meanwhile, Ren et al. 72 advocated directly using raw holograms collected at different defocus distances as the input of the neural network.
Furthermore, they revealed the advantages of neural networks over other machine learning algorithms for the task of autofocusing. Immediately afterward, Son et al. 73 also verified the feasibility of classification-based autofocusing through numerical simulations. Subsequently, Couturier et al. 74 improved the accuracy of defocus distance estimation by using a deeper CNN to categorize the defocus distance into a greater number of classes.
Nevertheless, no matter how many classes there are, the defocus distance estimated by these classification-based methods is discrete, which is still not precise enough in practice. Thus, Ren et al. 75 further developed an approach that treats defocus distance estimation as a regression problem, where the output of the neural network is continuous.
They verified the superiority of this deep-learning-based regression method on amplitude samples and phase samples, respectively, and tested its adaptability under different exposure times and incident angles. Later, Pitkäaho et al. 76 also extended their previous classification-based work 71 to this regression-based approach. While these methods estimate the defocus distance of the entire hologram, Jaferzadeh et al. 77 and Moon et al. 78 proposed to take a region of interest from the whole hologram as the input for estimating the defocus distance. To remove the constraint of needing known defocus distances as labels for the training dataset, Tang et al. 79 proposed to iteratively infer the defocus distance with an untrained network from a defocused hologram and its in-focus phase. Later on, Cuenat et al. 81 demonstrated the superiority of the vision Transformer (ViT) over typical CNNs in defocus distance estimation.
Because the spatial spectrum information is also helpful for the defocus distance estimation 94 , Lee et al. 82 and Shimobaba et al. 83 proposed to use the spatial spectrum or power spectrum of holograms as the network input to estimate the defocus distance.
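The difference between the two formulations comes down to the network head and loss function; below is a minimal, illustrative PyTorch comparison (the backbone, the number of distance classes, and the dummy data are placeholders).

```python
import torch
import torch.nn as nn

# Shared convolutional backbone (placeholder)
backbone = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# Classification: defocus distance quantized into, e.g., 20 discrete bins
clf_head = nn.Linear(16, 20)
clf_loss = nn.CrossEntropyLoss()

# Regression: a single continuous defocus distance
reg_head = nn.Linear(16, 1)
reg_loss = nn.MSELoss()

hologram = torch.randn(4, 1, 128, 128)
features = backbone(hologram)
loss_c = clf_loss(clf_head(features), torch.randint(0, 20, (4,)))
loss_r = reg_loss(reg_head(features), torch.rand(4, 1))  # distances in a.u.
```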

DL-in-processing for phase recovery
In "DL-in-processing for phase recovery", the neural network directly performs the inference process from the measured intensity image to the phase (network-only strategy in Section 3.1), or together with the physical model or physics-based algorithm to achieve the inference (network-with-physics strategy in Section 3.2).

Network-only strategy
The network-only strategy uses a neural network to perform phase recovery, where the network input is the measured intensity image and the output is the phase. A summary of the various methods is presented in Table 2 and described below, where we classify them into dataset-driven (DD) approaches and physics-driven (PD) approaches.

Dataset-driven approach. As one of the most commonly adopted strategies, dataset-driven deep learning phase recovery methods presuppose a large number of paired input-label examples. Usually, it is necessary to experimentally collect a significant number of intensity images (such as diffraction images or holograms) as input, and to use conventional methods to calculate the corresponding phase as ground truth (Fig. 17a). The key point is that this paired dataset implicitly contains the mapping relationship from intensity to phase. Then, an untrained/initialized neural network is iteratively trained with the paired dataset as an implicit prior, where the gradient of the loss function propagates into the neural network to update its parameters (Fig. 17b). After training, the network is used as an end-to-end mapping to infer the phase from intensity (Fig. 17c). The DD approach thus guides/drives the training of the neural network with this implicit mapping, which is internalized into the neural network as the parameters are iteratively updated. Sinha et al. 95 were among the first to demonstrate this end-to-end deep learning strategy for phase recovery, in which the phase of objects is inferred from corresponding diffraction images via a trained deep neural network. For dataset collection, they used a phase-only spatial light modulator (SLM) to load different public image datasets and generate the phase as ground truth, and placed the image sensor at a certain distance to record the diffraction image as input. The advantage is that both the diffraction image and the phase are known and easily collected in large quantities. Through comparative tests, they verified the adaptability of the deep neural network to unseen types of datasets and different defocus distances. Although this scheme cannot be used in practical applications due to the use of the phase-only spatial light modulator, their pioneering work opened the door to deep-learning-inference phase recovery. For instance, Li et al. 96 introduced the negative Pearson correlation coefficient (NPCC) 135 as a loss function to train the neural network, and enhanced the spatial resolution by a factor of two by flattening the power spectral density of the training dataset.
Deng et al. 97 found that the higher the Shannon entropy of the training dataset, the stronger the generalization ability of the trained neural network. Goy et al. 98 extended the work to phase recovery under weak-light illumination.
Meanwhile, Wang et al. 99 extended the diffraction setup of Sinha et al. 95 to an in-line holographic setup by adding a coaxial reference beam, and used the in-line hologram instead of the diffraction image as the input to a neural network for phase recovery. Nguyen et al. 100 applied this end-to-end strategy to FP, inferring the high-resolution phase from a series of low-resolution intensity images via a U-Net, and Cheng et al. 101 further used a single low-resolution intensity image under optimized illumination as the neural network input.
Cherukara et al. 102 extended this end-to-end deep learning strategy to CDI, training two neural networks with simulated datasets to infer the amplitude and phase of objects from far-field diffraction intensity maps, respectively. Ren et al. 103 , among others, further broadened the application scenarios of this end-to-end strategy. In addition to expanding the application scenarios, some researchers focused on the performance and advantages of different neural networks in phase recovery. Xue et al. 109 and later works investigated various network architectures, including the MCN 118 , for phase recovery. Comparing them in a one-sample-learning scheme, the latter authors found that the MCN is more accurate and compact than the conventional U-Net. Ding et al. 119 added a ViT into the U-Net and trained it with low-resolution intensity as input and high-resolution phase as ground truth using a cycle-GAN.
The trained neural network can perform phase recovery while enhancing the resolution, and has higher accuracy than the conventional U-Net. In CDI, Ye et al. 120 pursued a similar end-to-end approach. As a similar deep learning phase recovery strategy in adaptive optics, researchers demonstrated that neural networks can infer the phase of a turbulence-induced aberration wavefront, or its Zernike coefficients, from the distorted intensity of target objects 137 . In these applications, only the wavefront that is subsequently used for aberration correction is of interest, not the RI distribution of the turbulence itself that produces this aberration wavefront.
Physics-driven approach. Different from the dataset-driven approach, which uses an input-label paired dataset as an implicit prior for neural network training, physical models, such as numerical propagation, can be used as an explicit prior to guide/drive the inference or training of neural networks, termed the physics-driven (PD) approach. On the one hand, this explicit prior can be used to iteratively optimize an untrained neural network to infer the corresponding phase and amplitude from the measured intensity image as input, referred to as the untrained PD (uPD) scheme (Fig. 18a). On the other hand, this explicit prior can be used to train an untrained neural network with a large number of intensity images as input, which can then infer the corresponding phase from unseen intensity images, an approach called the trained PD (tPD) scheme (Fig. 18b). To understand more intuitively the difference and connection between the DD and PD approaches, let us compare the loss functions in Fig. 17 and Fig. 18:

$$\hat{\theta} = \arg\min_{\theta} \sum_{n=1}^{N} \left\| f_{\theta}(I_n) - \phi_n \right\|_2^2, \quad (1)$$

$$\hat{\theta} = \arg\min_{\theta} \left\| H\!\left(f_{\theta}(I)\right) - I \right\|_2^2, \quad (2)$$

$$\hat{\theta} = \arg\min_{\theta} \sum_{n=1}^{N} \left\| H\!\left(f_{\theta}(I_n)\right) - I_n \right\|_2^2, \quad (3)$$

where $\|\cdot\|_2^2$ denotes the square of the l2-norm (or another distance function), $f_{\theta}(\cdot)$ is a neural network with trainable parameters $\theta$, $H(\cdot)$ is a physical model (such as numerical propagation, the Fourier transform, or the FP measurement model), $I_n$ is the n-th measured intensity image in the training dataset, $\phi_n$ is the corresponding phase in the training dataset, $I$ is the measured intensity image of a test sample, and $N$ is the number of samples in the training dataset. In Eq. (1) for the DD approach, the priors used for network training are the measured intensity images and the corresponding ground-truth phases. Meanwhile, in Eqs. (2) and (3) for the PD approaches, the priors used for network inference or training are the measured intensity images and the physical model, instead of the phase.
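As an illustration of the uPD scheme of Eq. (2), the sketch below optimizes an untrained network so that the field it outputs, numerically propagated to the sensor plane, matches the measured intensity. The angular spectrum propagator and all parameters (wavelength, pixel pitch, distance, network size) are generic assumptions, not any specific published configuration.

```python
import torch
import torch.nn as nn

def angular_spectrum(field, wavelength, pitch, z):
    """Propagate a complex field by distance z (angular spectrum method)."""
    ny, nx = field.shape[-2:]
    fx = torch.fft.fftfreq(nx, d=pitch)
    fy = torch.fft.fftfreq(ny, d=pitch)
    fyy, fxx = torch.meshgrid(fy, fx, indexing="ij")
    arg = 1 / wavelength**2 - fxx**2 - fyy**2
    kz = 2 * torch.pi * torch.sqrt(torch.clamp(arg, min=0.0))  # evanescent cut
    transfer = torch.exp(1j * kz * z)
    return torch.fft.ifft2(torch.fft.fft2(field) * transfer)

net = nn.Sequential(  # untrained network mapping intensity -> phase
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

measured = torch.rand(1, 1, 256, 256)           # measured intensity I
for step in range(200):                          # iterative uPD optimization
    phase = net(measured)                        # current phase estimate
    field = torch.exp(1j * phase[0, 0])          # phase-only sample field
    sensor = angular_spectrum(field, 532e-9, 2e-6, 5e-3)
    loss = ((sensor.abs() ** 2 - measured[0, 0]) ** 2).mean()  # Eq. (2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```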
This PD approach was first implemented in work on Fourier ptychography by Boominathan et al. 124 . For the high-overlap case, they proposed both the scheme of directly using an untrained neural network for inference (uPD) and the scheme of training first and then inferring (tPD), and demonstrated the former by simulation.
For the uPD scheme, Wang et al. 125 used a U-Net-based scheme to iteratively infer the phase of an object from a measured diffraction image whose defocus distance is known.
Their method demonstrates higher accuracy than conventional algorithms (such as GS and TIE) and the DD scheme, at the expense of a longer inference time (about 10 minutes for an input with 256 × 256 pixels). Zhang et al. 126 extended this work to the case where the defocus distance is unknown by including it in the loss function as another unknown parameter alongside the phase. Yang et al. 127,128 further generalized this to complex field inference by introducing an aperture constraint into the loss function, and pointed out that it would cost as much as 600 hours to infer 3,600 diffraction images with this uPD scheme. Meanwhile, Bai et al. 129 extended the scheme from the single-wavelength to the dual-wavelength case. Galande et al. 130 found that optimizing a neural network against a single measured intensity lacks information diversity and can easily lead to overfitting of the noise, which can be mitigated by introducing an explicit denoiser. Using the object-related intensity image as the neural network input makes it possible to internalize the mapping relationship between intensity and phase into the neural network through pre-training. It is worth mentioning that some researchers proposed adjustments to the uPD scheme, using the initial phase and amplitude recovered by backward numerical propagation as the neural network input [138][139][140] , which reduces the burden on the neural network and yields higher inference accuracy.
Because it iteratively optimizes an untrained neural network without any ground truth, the uPD scheme inevitably requires a large number of iterations, which excludes its use in many dynamic applications. Therefore, to adapt the PD scheme to dynamic inference, Yang et al. 127,128 adjusted their previously proposed uPD scheme into the tPD scheme by pre-training the neural network on a small part of the measured diffraction images, and then using the pre-trained neural network to infer the remaining ones. Yao et al. 131 trained a 3D version of the Y-Net 92 with simulated diffraction images as input, and then used the pre-trained neural network for direct inference or iterative refinement, which is 100 and 10 times faster than conventional iterative algorithms, respectively. Li et al. 132 proposed a two-to-one neural network to reconstruct the complex field from two axially displaced diffraction images. They used 500 simulated diffraction images to pre-train the neural network, and then inferred an unseen diffraction image by refining the pre-trained neural network for 100 iterations. Bouchama et al. 133 further extended the tPD scheme to Fourier ptychography in low-overlap cases with simulated datasets. Different from the above ways of generating training datasets from natural images or real experiments, Huang et al. 134 proposed to generate training datasets from randomly synthesized artificial images with no connection or resemblance to real-world samples. They trained a neural network with this generated dataset and a physics-consistency loss, which showed superior external generalization to holograms of real tissue at arbitrary defocus distances.

Network-with-physics strategy
Different from the network-only strategy, in the network-with-physics strategy, either the physical model and the neural network are connected in series for phase recovery (physics-connect-network, PcN), or the neural network is integrated into a physics-based algorithm for phase recovery (network-in-physics, NiP), or the physical model or physics-based algorithm is integrated into the neural network for phase recovery (physics-in-network, PiN). A summary of the network-with-physics strategy is presented in Table 3 and described below.

Physics-connect-network (PcN). In this scheme, the role of the neural network is to extract and separate the pure phase from an initial estimate that may suffer from spatial artifacts or low resolution, which allows the neural network to perform a simpler task than in the network-only strategy; typically, the initial phase is calculated using a physical model (Fig. 19). In early work along this line, a suitably chosen loss function was found to infer finer details than the NPCC loss function. The same researchers also improved the spatial resolution and noise robustness by learning the low-frequency and high-frequency bands separately with two neural networks and synthesizing the two bands into full-band reconstructions with a third neural network 144 . By introducing random phase modulation, Kang et al. 145 further improved the phase recovery ability of the PcN scheme under weak-light illumination. Zhang et al. 146 extended the PcN scheme to FP, inferring the high-resolution phase and amplitude with the initial phase and amplitude synthesized from the intensity images as the input to a neural network. Moon et al. 147 extended the PcN scheme to off-axis holography, using numerical propagation to obtain the initial phase from the Gabor hologram as the input to the neural network.

Network-in-physics (NiP).
Regarding phase recovery as a general optimization problem, this approach can be expressed as

$$\hat{\phi} = \arg\min_{\phi} \left\| H(\phi) - I \right\|_2^2 + R(\phi), \quad (4)$$

where $H(\cdot)$ is the physical model (such as numerical propagation, the Fourier transform, or the FP measurement model), $\phi$ is the phase, $I$ is the measured intensity image of a test sample, and $R(\phi)$ is a regularization constraint. According to the Regularization-by-Denoising (RED) 178 framework, a pre-trained denoising neural network can be used as the regularization constraint:

$$R(\phi) = \frac{\lambda}{2}\, \phi^{T}\!\left(\phi - D(\phi)\right), \quad (5)$$

where $D(\cdot)$ is a pre-trained neural network for denoising, and $\lambda$ is a weight factor that controls the strength of the regularization. Metzler et al. 148 first brought this RED framework into phase recovery. In addition, according to the deep image prior (DIP) 180,181 , even an untrained neural network itself can be used as a structural prior for regularization (Fig. 20):

$$\hat{\theta} = \arg\min_{\theta} \left\| H\!\left(f_{\theta}(z)\right) - I \right\|_2^2, \qquad \hat{\phi} = f_{\hat{\theta}}(z), \quad (6)$$

where $f_{\theta}(\cdot)$ is an untrained neural network with trainable parameters $\theta$ that usually takes a generative decoder architecture, $I$ is the measured intensity image of a test sample, and $z$ is a fixed random vector used as the latent code. This DIP-based approach was first introduced to phase recovery by Jagatap et al. 155 .
They solved Eq. (6) with gradient descent and projected gradient descent schemes. Similarly, a pre-trained generative neural network can also be used as a generative prior, assuming that the target phase lies within the range of this trained neural network (Fig. 21):

$$\hat{z} = \arg\min_{z} \left\| H\!\left(G(z)\right) - I \right\|_2^2, \qquad \hat{\phi} = G(\hat{z}), \quad (7)$$

where $G(\cdot)$ is a pre-trained, fixed neural network that usually takes a generative decoder architecture, $I$ is the measured intensity image of a test sample, and $z$ is the latent code tensor to be searched. Owing to the use of the generative neural network, the multi-dimensional phase that originally needed to be searched iteratively is converted into a low-dimensional tensor, and the solution space is limited to the range of the trained generative neural network. Hand et al. 164 used a generative prior for phase recovery with rigorous theoretical guarantees for random Gaussian measurement matrices, showing better performance than SPARTA at low subsampling ratios. Later on, Shamshad et al. 165 experimentally verified the robustness of the generative-prior-based algorithm to low subsampling ratios and strong noise in the coded diffraction setup. Then, Shamshad et al. 166 extended this generative-prior-based algorithm to subsampled FP. Hyder et al. 168 improved on this by combining gradient descent and projected gradient descent with AltMin-based non-convex optimization methods. As a general defect, the trained generative neural network limits the solution space to a specific range related to the training dataset, so that the iterative algorithm cannot search beyond this range. Therefore, Shamshad et al. 167 set both the input and the previously fixed parameters of the trained generative neural network to be trainable. As another solution, Uelwer et al. 169
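To make Eq. (7) concrete, the following sketch searches the latent code of a frozen, pre-trained generator; `generator` and the physics operator `forward_model` are stand-ins (any decoder and any differentiable forward model would play these roles), and in practice the generator's weights would come from prior training rather than random initialization.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained generative decoder G(z) -> phase image.
generator = nn.Sequential(
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 64 * 64), nn.Unflatten(1, (1, 64, 64)),
)
for p in generator.parameters():
    p.requires_grad_(False)  # G stays fixed; only z is searched

def forward_model(phase):
    """Stand-in physics operator H: phase -> measured intensity."""
    field = torch.exp(1j * phase)
    return torch.fft.fft2(field).abs() ** 2  # far-field intensity

measured = torch.rand(1, 1, 64, 64)          # measurement I
z = torch.randn(1, 64, requires_grad=True)   # latent code to optimize
optimizer = torch.optim.Adam([z], lr=1e-2)

for step in range(1000):                     # search over z only (Eq. 7)
    phase = generator(z)
    loss = ((forward_model(phase) - measured) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

recovered_phase = generator(z).detach()
```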

DL-post-processing for phase recovery
A summary of "DL-post-processing for phase recovery" is presented in Table 4 and is described below, including noise reduction (Section 4.1), resolution enhancement (Section 4.2), aberration correction (Section 4.3), and phase unwrapping (Section 4.4).  238 Wrapped count Wrap count gradient U-Net Sim.: >70,000 pairs Cross entropy, Jaccard distance, and l 1 -norm Li et al. 239 Wrapped count Wrap count gradient U-Net and ResNet

Noise reduction
In addition to being part of pre-processing as in Section 2.2, noise reduction can also be performed after phase recovery (Fig. 23). Jeon et al. 183 and subsequent works applied neural networks, including the GAN, to speckle noise reduction of the recovered phase. Murdaca et al. 192 applied this deep-learning-based phase noise reduction to interferometric synthetic aperture radar (InSAR) 243 ; the difference is that, in addition to the sine and cosine images of the phase, the neural network also denoises the amplitude images at the same time. Tang et al. 193 proposed to iteratively reduce the coherent noise in the phase with an untrained U-Net.

Resolution enhancement
Similar to Section 2.1, resolution enhancement can also be performed after phase recovery as post-processing (Fig. 24). Liu et al. 194,195 first used a neural network to infer the corresponding high-resolution phase from a low-resolution phase. They trained two GANs, one with a pixel super-resolution system and one with a diffraction-limited super-resolution system, and demonstrated the approach on biological thin tissue slices with an analysis of the spatial frequency spectrum. Moreover, they pointed out that this idea can be extended to other resolution-limited imaging systems, such as using a neural network to build a passageway from off-axis holography to in-line holography. Later, Jiao et al. 196 proposed to infer the high-resolution noise-free phase from an off-axis-system-acquired low-resolution version with a trained U-Net. To collect the paired dataset, they developed a combined system with diffraction phase microscopy (DPM) 244 and spatial light interference microscopy (SLIM) 22 to generate both holograms from the same field of view. After training, the U-Net retains the advantages of both the high acquisition speed of DPM and the high transverse resolution of SLIM. Subsequently, Butola et al. 197 extended this idea to partially spatially coherent off-axis holography, where the phase recovered with low-numerical-aperture objectives was used as input, and the phase recovered with high-numerical-aperture objectives was used as ground truth. Since low-numerical-aperture objectives have a larger field of view, the aim is to obtain higher resolution over a larger field of view, i.e., a higher spatial bandwidth product. Meng et al. 198 used structured-illumination digital holographic microscopy (SI-DHM) 245 to collect the high-resolution phase as ground truth. To supplement more high-frequency information, they used two cascaded neural networks, feeding the low-resolution phase along with the high-resolution amplitude inferred by the first neural network into the second neural network. Subsequently, Li et al. 199 extended this resolution-enhancing post-processing method to quantitative differential phase-contrast (qDPC) 246 imaging for high-resolution phase recovery from the fewest possible experimental measurements. To solve the out-of-memory problem caused by the large input size, they divided the full-size input into sub-patches. Moreover, they found that a U-Net trained on the paired dataset has a smaller error than both the paired GAN and the unpaired GAN; the GANs produced more spurious information in the inferred phase that is absent from the ground truth. Gupta et al. 200 took advantage of the high spatial bandwidth product of this method to achieve a classification throughput of 78,000 cells per second with an accuracy of 76.2%.
For optical diffraction tomography (ODT), due to the limited projection angles imposed by the numerical aperture of the objective lens, certain spatial frequency components cannot be measured, which is called the missing cone problem. To address this problem with a neural network, Lim et al. 201 and Ryu et al. 202 built a 3D RI tomogram dataset for 3D U-Net training, in which raw RI tomograms with poor axial resolution were used as input, and resolution-enhanced RI tomograms from an iterative total variation algorithm were used as ground truth. The trained 3D U-Net can infer the high-resolution version directly from raw RI tomograms.
They demonstrated the feasibility and generalizability of the method on bacterial cells and a human leukemic cell line. Their deep-learning-based resolution-enhancement method outperforms conventional iterative methods by more than an order of magnitude in regularization performance.

Aberration correction
For holography, especially in the off-axis case, the lenses and the unstable environment of the sample introduce phase aberrations superimposed on the phase of the sample. To recover the pure phase of the sample, the unwanted phase aberrations should be eliminated physically or numerically. Physical approaches compensate for the phase aberrations by recovering the background phase without the sample from another hologram, which requires additional setups and adjustments 247,248 .
As for numerical approaches, the compensation of phase aberrations can be achieved directly by Zernike polynomial fitting (ZPF) 249 or principal-component analysis (PCA) 250 . Yet, in these numerical methods, the aberration is estimated from the whole phase map, although the object area should not be treated as aberration. Thus, before applying Zernike polynomial fitting, a neural network can be used to separate the object area from the background area, avoiding the influence of the object area and improving the compensation (Fig. 25). This segmentation-based idea, namely CNN+ZPF, was first proposed by Nguyen et al.

Phase unwrapping
In interferometric and optimization-based phase recovery methods, the recovered light field is in the form of a complex exponential; hence the calculated phase is limited to the range of (-π, π] on account of the arctangent function. Therefore, the information of the sample cannot be obtained unless the absolute phase is first estimated from the wrapped phase, a step called phase unwrapping. In addition to phase recovery, the phase unwrapping problem also arises in magnetic resonance imaging 251 , fringe projection profilometry 252 , and InSAR. Most conventional methods are based on the phase-continuity assumption, and cases such as noise, breakpoints, and aliasing all violate the Itoh condition and degrade the performance of conventional methods 253 . The advent of deep learning has made it possible to perform phase unwrapping in these cases. According to how the neural network is used, deep-learning-based phase unwrapping methods can be divided into the following three categories (Fig. 26) 48 . The deep-learning-performed regression method (dRG) estimates the absolute phase directly from the wrapped phase with a neural network (Fig. 26a) [209][210][211][212][213][214][215][216][217][218][219][220][221][222] . The deep-learning-performed wrap count method (dWC) first estimates the wrap count from the wrapped phase with a neural network, and then calculates the absolute phase from the wrapped phase and the estimated wrap count (Fig. 26b) 185,[223][224][225][226][227][228][229][230][231][232][233] . The deep-learning-assisted method (dAS) first estimates the wrap count gradient or discontinuity from the wrapped phase with a neural network; then it either reconstructs the wrap count from the wrap count gradient and calculates the absolute phase as in dWC 238,239 , or directly uses optimization-based or branch-cut algorithms to obtain the absolute phase from the wrap count gradient or the discontinuity (Fig. 26c) 236,237,[240][241][242] .

In an early dWC implementation, the predicted wrap count was refined by post-processing; in addition, the authors proposed to generate a phase dataset by the weighted addition of Zernike polynomials of different orders. Immediately after, Zhang and Yan et al. 227 verified the performance of the DeepLab-V3+ network, but the resulting wrap count still contained a small number of wrong pixels, which propagate errors through the whole phase map in the conventional phase unwrapping process; they therefore proposed a refinement step to correct the wrong pixels. To further improve the unwrapped phase, Zhu et al. 228 proposed to use a median filter as a second post-processing step to correct wrong pixels in the wrap count predictions. Wu et al. 229 enhanced the simulated phase dataset by adding noise from real data. They also combined the full-resolution residual network (FRRNet) with the U-Net to further optimize the performance of the U-Net in Doppler optical coherence tomography. In comparisons on real data, their proposed network achieves higher accuracy than the Phase-Net and DeepLab-V3+. Applying dWC to a point diffraction interferometer, Zhao et al. 230 proposed an image-analysis-based post-processing method to alleviate the classification imbalance of the task and adopted an iterative-closest-point stitching method to realize dynamic resolution. Vengala et al. introduced the Y-Net 92 into the branch-cut algorithm to predict the branch-cut map from the residue image, which reduced the computational cost of the branch-cut algorithm.
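As a reminder of what unwrapping does in the simplest continuous case, the 1-D Itoh approach adds the appropriate multiple of 2π wherever neighboring samples jump by more than π; `numpy.unwrap` implements exactly this, and `skimage.restoration.unwrap_phase` offers a 2-D counterpart. The signal below is synthetic.

```python
import numpy as np

# Synthetic absolute phase and its wrapped version in (-pi, pi]
z = np.linspace(0, 6 * np.pi, 200)          # true phase ramp
wrapped = np.angle(np.exp(1j * z))          # wrapping via complex exponential

# 1-D Itoh unwrapping: add +/- 2*pi wherever the jump exceeds pi
unwrapped = np.unwrap(wrapped)

assert np.allclose(unwrapped, z)            # exact for noise-free data

# For 2-D phase maps, an alternative is available in scikit-image:
# skimage.restoration.unwrap_phase(wrapped_2d)
```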

Deep learning for phase processing
A summary of "Deep learning for phase processing" is presented in Table 5 and is described below, including segmentation (Section 5.1), classification (Section 5.2), and imaging modal transformation (Section 5.3).  262 Phase of HeLa cells Segmentation map U-Net and EfficientNet Expt.: 2,046 pairs focal loss and dice loss Zhang et al. 263 Phase of tissue slices Segmentation map mask R-CNN Expt.: 196 pairs Cross entropy Jiang et al. 264 Phase and amplitude Segmentation map DeepLab-V3+

Segmentation
Image segmentation, which aims to divide all pixels into different regions of interest, is widely used in biomedical analysis and diagnosis. For unlabeled cells or tissues, the contrast of bright-field intensity images is low and thus inefficient for image segmentation. Therefore, segmentation according to the phase distribution of cells or tissues is a potentially more efficient route. Given the great success of CNNs in semantic segmentation 302 , they can readily be transplanted to phase segmentation, that is, performing segmentation with the phase as the input of the neural network (Fig. 27).

Classification
Different from conventional machine learning strategies that require manual feature extraction, deep learning usually takes the phase, or a further processed version of it, directly as input, and the deep CNN performs feature extraction automatically (Fig. 28). This automatic feature extraction strategy tends to achieve higher accuracy, but usually requires a larger number of paired input-label examples as support. The use of phase as input to deep CNNs for classification was first reported in the work of Jo et al. 267 . They revealed that, for cells like anthrax spores, the accuracy of a neural network using the phase as input is higher than that of a neural network using binary morphology images obtained by conventional microscopy as input. Subsequently, this deep-learning-based phase classification method has been used in multiple applications, including assessment of T cell activation state 268 , cancer screening 269 , classification of sperm cells under different stress conditions 270 , prediction of living cell mitosis 271 , and classification of different white blood cells 272 . Accuracy in these applications is generally higher than 95% for binary classification, but comparable accuracy has not been achieved in multi-type classification.

More phase in the temporal dimension (Fig. 29b). Wang et al. 278,280 proposed to use the phase at a specific moment and the corresponding spatiotemporal fluctuation map as the inputs of a neural network to improve the accuracy of cancer cell classification.
More phase in the wavelength dimension (Fig. 29c). Amplitude together with the phase (Fig. 29d). Lam et al. 284,285 used the amplitude and phase as the inputs of a neural network to classify occluded and/or deformable objects, achieving accuracy over 95%. With the same strategy, they performed a ten-type classification of biological tissues with an accuracy of 99.6% 286 .
Further, Terbe et al. 287 proposed a volumetric network input that supplements additional amplitude and phase maps at different defocus distances. They built a more challenging seven-class dataset of algae in different counts, small particles, and debris. The network with volumetric input outperformed the network with single amplitude and phase inputs in all cases by approximately 4% in accuracy. Similarly, Wu et al. 288 used the real and imaginary parts of the complex field as network input for a six-type classification of bioaerosols, achieving an accuracy of over 94%. In pursuit of extreme speed for real-time classification, some researchers choose to directly use the raw hologram recorded by the sensor as the input of the neural network for classification tasks [325][326][327][328][329] . Since the amplitude and phase information is encoded within a hologram, a hologram-trained neural network can achieve satisfactory accuracy given sufficient feature extraction capability, as has been proven in practice for molecular diagnostics 325 , microplastic pollution assessment [326][327][328] , and neuroblastoma cell classification 329 .

Imaging modal transformation
Let us start this subsection with image style transfer, which aims to transfer a given image to another specified style while retaining the content of the image as much as possible 330,331 . Similarly, for a biological sample, different parts usually have different RI, different chemical staining properties, or different fluorescent labeling properties, which makes it possible to achieve "image style transfer" from phase recovery/imaging to other imaging modalities (Fig. 30). The bright-field images of some colored biological samples have sufficient contrast due to their strong absorption of visible light, so for such samples, bright-field imaging can be used as the target imaging modality, in which a neural network is used to transfer the complex image of the sample into its virtual bright-field image. In 2019, Wu et al. 289 presented the first implementation of this idea, called bright-field holography, in which a neural network was trained to transfer the back-propagated complex images from a single hologram to their corresponding speckle- and artifact-free bright-field images (Fig. 31a). This "bright-field holography" is able to infer a whole 3D volumetric image of a colored sample, such as pollen, from a single-snapshot hologram. Further, Terbe et al. 290 implemented "bright-field holography" with a cycle-GAN for the case of unpaired datasets.
For most transparent/colorless biological samples, chemical staining enables clear observation or imaging under bright-field microscopy. This allows the above "bright-field holography" to be applied to transparent biological samples as well, which is called virtual staining. Rivenson et al. 291 applied this virtual staining technique to the inspection of histologically stained tissue slices and named it PhaseStain, in which a well-trained neural network directly transfers the phase of tissue slices to their bright-field images of virtual staining (Fig. 31b). Using label-free slices of human skin, kidney, and liver tissue imaged with a holographic microscope, they experimentally demonstrated the efficacy of "PhaseStain": the resulting images were compared to bright-field microscopy images of the same tissue slices stained with H&E, Jones' stain, and Masson's trichrome stain, respectively. "PhaseStain" thus greatly saves the time and costs associated with the staining process. Similarly, Wang et al. 292 applied "PhaseStain" to Fourier ptychographic microscopy and adapted it to unpaired datasets with a cycle-GAN. Liu et al. 293 used six images, the amplitude and phase at three wavelengths, as network input to infer the corresponding virtually stained version. In addition to tissue slices, Nygate et al. 294 demonstrated the advantages and potential of this deep-learning virtual staining approach on single biological cells such as sperm (Fig. 31c). To improve the effectiveness of virtual staining, they used the phase gradients as an additional hand-engineered feature along with the phase as the input of the neural network. To assess effectiveness, they used virtual staining images, phase, phase gradients, and stain-free bright-field images as input data for a five-class classification of sperm, and found that the recall values and F1 scores obtained with virtual staining images were two or even four times higher than those of the other inputs. This type of single-cell staining provides ideal conditions for real-time analysis, such as rapid stain-free imaging flow cytometry. Guo et al. 295 proposed the concept of "transferring the physical-specific information to the molecular-specific information via a trained neural network" (Fig. 32a). Specifically, they used the phase and polarization of cell samples as multi-channel inputs to infer the corresponding fluorescence image, and further demonstrated its performance by imaging the architecture of brain tissue and predicting myelination in slices of a developing human brain.
Almost simultaneously, Kandel et al. 296 used a neural network to infer fluorescence-related subcellular specificity from a single phase image, which they called phase imaging with computational specificity (Fig. 32b). With these label-free methods, they monitored the growth of both nuclei and cytoplasm in live cells and the arborization process in neural cultures over many days without loss of viability 297 . Guo et al. 298

Conclusion and outlook
The introduction of deep learning provides a data-driven approach to various stages of phase recovery. Organized by where in the pipeline they are used, we have provided a comprehensive review of how neural networks contribute to phase recovery: deep learning can pre-process measurements before phase recovery is performed, can directly perform phase recovery itself, can post-process the initial phase obtained after phase recovery, or can take the recovered phase as input to implement specific applications. Although deep learning provides unprecedented efficiency and convenience for phase recovery, there are some general points to keep in mind when using this learning-based tool.
Datasets. In the supervised-learning mode, a good paired dataset provides rich and high-quality prior knowledge to guide neural network training. As one of the most common approaches, some researchers collect the intensity image of a real sample through the experimental setup as the input and calculate the corresponding phase through conventional model-based methods as the ground truth. Numerical simulation can be a convenient and efficient way to generate datasets in some cases, such as hologram resolution enhancement 51 and phase unwrapping 48,103 , and for adapting to more types of samples 107 . Shannon entropy can be used to quantify the richness of the information contained in a dataset, which directly affects the generalization ability of the trained neural network 97 . In addition, the spatial frequency content of the training samples limits the ability of the trained neural network to resolve fine spatial features, which can be improved to some extent by pre-processing the power spectral density of the training samples 96 .
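As an illustration of dataset generation by numerical simulation (a minimal sketch: the pure-phase-object assumption, function names, and the wavelength, pixel pitch, and propagation distance below are all illustrative placeholders), one can synthesize paired training data by numerically propagating a phase object to the detector plane and recording its intensity:

```python
import numpy as np

def angular_spectrum_propagate(field, wavelength, dx, z):
    """Propagate a complex field by distance z (meters) with the
    angular spectrum method; dx is the pixel pitch in meters."""
    n, m = field.shape
    fx = np.fft.fftfreq(m, d=dx)
    fy = np.fft.fftfreq(n, d=dx)
    FX, FY = np.meshgrid(fx, fy)
    arg = 1.0 / wavelength**2 - FX**2 - FY**2
    kz = 2 * np.pi * np.sqrt(np.maximum(arg, 0.0))  # drop evanescent waves
    H = np.exp(1j * kz * z)                         # transfer function
    return np.fft.ifft2(np.fft.fft2(field) * H)

def make_pair(phase, wavelength=532e-9, dx=2e-6, z=5e-3):
    """One simulated training pair: diffracted intensity (network input)
    and the ground-truth phase (label), for a unit-amplitude phase object."""
    u0 = np.exp(1j * phase)
    uz = angular_spectrum_propagate(u0, wavelength, dx, z)
    return np.abs(uz) ** 2, phase

intensity, gt_phase = make_pair(np.random.rand(256, 256) * 2 * np.pi)
```

Replacing the random phase with natural images or simulated cell morphologies is one common way to tune the richness (and hence the generalization behavior) of such a simulated dataset.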
Networks and loss functions. Guided/driven by the dataset, the neural network is trained to learn the mapping from the input domain to the target domain by minimizing the difference between its actual output and the ground truth, as quantified by the loss function. Therefore, the fitting ability of the neural network itself and the perception ability of the loss function determine whether the mapping relationship implicit in the dataset can be well internalized into the neural network. Conventional encoder-decoder-based neural networks have sufficient receptive fields and strong fitting capability, but down-sampling operations such as max-pooling lose some high-frequency information. Dilated convolutions can enlarge the receptive field while retaining more high-frequency information 118 . In addition, convolution in the Fourier frequency domain guarantees a global receptive field, since each pixel in the frequency domain contains contributions from all pixels in the spatial domain 121,122 . To make the neural network focus on different spatial-frequency content, one can also use two neural networks to learn the high- and low-frequency bands, respectively, and a third neural network to merge them into a full-spatial-frequency version 144 . Neural architecture search is another promising technology, which automatically searches for the optimal network structure in a large structure space 123 . The most commonly used loss functions, the l2-norm and l1-norm, are more responsive to low-frequency information and less sensitive to high-frequency information; that is, the low-frequency information in the network output contributes more to these losses than the high-frequency information. Therefore, some researchers have sought more efficient loss functions, such as the NPCC 96 , GAN loss 109,116,117 , and perceptual losses based on VGG features 143 . Which neural network and loss function are the best choices for phase recovery remains an open question.
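For concreteness, the negative Pearson correlation coefficient (NPCC) can be written as a loss in a few lines; this is a minimal sketch of the idea, not necessarily the exact formulation of ref. 96:

```python
import torch

def npcc_loss(output: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Negative Pearson correlation coefficient, averaged over the batch.
    Perfectly correlated output and target give -1, so minimizing this
    loss maximizes structural correlation independent of scale/offset."""
    o = output.flatten(1) - output.flatten(1).mean(dim=1, keepdim=True)
    t = target.flatten(1) - target.flatten(1).mean(dim=1, keepdim=True)
    num = (o * t).sum(dim=1)
    den = o.norm(dim=1) * t.norm(dim=1) + 1e-8  # guard against division by zero
    return (-num / den).mean()
```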
Network-only or physics-connect-network (PcN). The network-only strategy infers the final phase from the raw measured intensity image in an end-to-end fashion: a one-shot approach that lets the neural network do everything in one go. The neural network must not only perform regularization to remove the twin-image and self-interference-related spatial artifacts but also undertake the task of free-space light propagation. Consequently, the inference results of the network-only strategy are unsatisfactory in some severely ill-posed cases, including weak-light illumination 98 and dense samples 114 . Since free-space light propagation is a well-characterized physical model that can be reproduced and enforced numerically, placing numerical propagation in front of the network can relieve its burden and let it focus on learning the regularization. Indeed, PcN infers better results than network-only in the above ill-posed cases 98,114 . In another similar scheme, the neural network only performs hologram generation before a phase-shifting algorithm, thus achieving better generalization than network-only 62 . In addition, applying speckle-correlation processing before the neural network makes the trained network applicable to unknown scattering media and target objects 332 .
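A minimal sketch of the PcN idea follows (the class name and wiring are illustrative assumptions): a fixed, parameter-free numerical back-propagation step, such as the angular-spectrum propagator sketched earlier, feeds a trainable refinement network, such as the encoder-decoder sketched above, which only has to learn the regularization.

```python
import torch
import torch.nn as nn

class PcN(nn.Module):
    """Physics-connect-network sketch: known physics first, learning second.
    `backprop` is a fixed callable that numerically back-propagates the
    hologram to a complex field estimate; `refiner` is any image-to-image
    network that removes twin-image and other artifacts."""
    def __init__(self, backprop, refiner: nn.Module):
        super().__init__()
        self.backprop = backprop   # known physics, nothing to learn
        self.refiner = refiner     # learns only the regularization

    def forward(self, hologram: torch.Tensor) -> torch.Tensor:
        u = self.backprop(hologram)                 # complex field estimate
        x = torch.cat([u.abs(), u.angle()], dim=1)  # 2-channel features
        return self.refiner(x)                      # artifact-free phase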
Interpretability. In phase recovery, deep learning techniques usually attempt to learn a specific mapping relationship automatically by optimizing/training neural network parameters on real-world paired datasets. Deep neural networks usually adopt a multi-layer architecture and contain a large number of trainable parameters (often millions or more), and are thus capable of learning complicated mapping relationships from datasets.
Unlike physics-based algorithms, such task-general network architectures often lack interpretability: it is difficult to discover what the neural network has learned internally, or what role a particular parameter plays, by examining the trained parameters. This leaves practitioners helpless when a neural network fails at inference in practical applications; they can neither analyze why the network failed for that sample nor make targeted improvements to avoid such failures in subsequent use. The algorithm unrolling/unfolding technique proposed by Gregor and LeCun offers hope for the interpretability of neural networks 182 , in which each iteration of a physics-based iterative algorithm is represented as one layer of the neural network. One inference through such a neural network is thus equivalent to performing a fixed number of iterations of the physics-based iterative algorithm. Usually, the physics-based parameters and regularization coefficients are transferred into the unrolled network as trainable parameters.
In this way, the trained unrolled network can be interpreted as a physics-based iterative algorithm with a fixed number of iterations. Moreover, the unrolled network naturally inherits the prior structure and domain knowledge of the physics-based iterative algorithm, and its parameters can therefore be trained efficiently with a small dataset.
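The following is a generic unrolling sketch under simplifying assumptions (a linearized data-fidelity gradient with fixed forward model `A` and adjoint `At`, a learnable per-layer step size, and a tiny learned regularizer; real phase-recovery forward models are nonlinear, and the cited works unroll their specific algorithms):

```python
import torch
import torch.nn as nn

class UnrolledPR(nn.Module):
    """Algorithm-unrolling sketch: K gradient steps on a data-fidelity
    term, each layer carrying a trainable step size and a small learned
    regularizer. One forward pass = K iterations of the base algorithm."""
    def __init__(self, A, At, K: int = 5):
        super().__init__()
        self.A, self.At, self.K = A, At, K
        self.steps = nn.Parameter(torch.full((K,), 0.1))   # per-layer step sizes
        self.reg = nn.ModuleList(
            nn.Conv2d(1, 1, 3, padding=1) for _ in range(K)  # learned prior
        )

    def forward(self, y, x0):
        x = x0
        for k in range(self.K):
            grad = self.At(self.A(x) - y)    # data-fidelity gradient
            x = x - self.steps[k] * grad     # gradient step (physics)
            x = x + self.reg[k](x)           # learned regularization
        return x
```

Because every trainable quantity corresponds to a step size or regularizer of the base algorithm, the trained network can be read back as that algorithm with learned hyperparameters, which is precisely the interpretability benefit described above.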
Uncertainty. When a trained neural network is actually used to infer the phase of a tested sample, the ground truth is usually unknown, which makes it impossible to determine the reliability of the inferred result. To address this, Bayesian CNNs perform phase inference while producing uncertainty maps that give a per-pixel confidence measure for the inferred result 109,[333][334][335] . This uncertainty comes from both the model itself and the data, called epistemic uncertainty and aleatoric uncertainty, respectively. The network-generated uncertainty maps have been experimentally verified to be highly consistent with the real error maps, making it possible to assess the reliability of inferred results in practical applications without any ground truth 109,335 . In addition to Bayesian neural networks, there are three other uncertainty-estimation techniques: single deterministic methods, ensemble methods, and test-time augmentation methods 336 .
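One common, lightweight approximation to Bayesian inference is Monte Carlo dropout; the sketch below illustrates the general recipe of stochastic repeated forward passes yielding a mean prediction and a per-pixel uncertainty map, and is not the exact Bayesian CNN formulation of the cited works:

```python
import torch

def mc_dropout_inference(model, x, n_samples: int = 20):
    """Monte Carlo dropout sketch: keep dropout stochastic at test time,
    run repeated forward passes, and return the mean prediction plus a
    per-pixel standard deviation as an (epistemic) uncertainty map."""
    model.train()  # shortcut to keep dropout active; in practice one would
                   # switch only the dropout modules, so batch-norm
                   # statistics are not perturbed
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)
```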
From electronic neural networks to optical neural networks. So far, the artificial neural networks discussed in this review mostly run on hardware with electrons as the physical carrier, such as graphics processing units, which is approaching its physical limits. Replacing electrons with photons is a potential route to high-speed, parallel, and low-power artificial-intelligence computing, especially through optical neural networks 337,338 .
Among them, spatial-structure-based optical neural networks, represented by the all-optical diffractive deep neural network 339 , are particularly suitable for image processing. Initial examples have already demonstrated the potential of optical neural networks for phase recovery 340,341 .
Learning-based deep neural networks offer enormous potential and efficiency, while conventional physics-based methods are more reliable. We thus encourage incorporating physical models with deep neural networks, especially those that model the real world well, rather than letting the deep neural network perform all tasks as a black box. One possible way is to account for the dataset, network structure, and loss function as thoroughly as possible during the training stage to obtain a good pre-trained neural network; in actual use, the pre-trained network performs one-shot inference for situations with high real-time requirements, while the physical model is used to iteratively fine-tune the pre-trained network when more accurate results are needed.
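As a closing illustration of this hybrid strategy (a minimal sketch under assumed names: `forward_model` stands for the known physics, e.g. numerical propagation to intensity, and `net` for a pre-trained phase-recovery network), the physical model can drive a self-supervised, ground-truth-free refinement of the network for a specific measurement:

```python
import torch

def physics_finetune(net, forward_model, y, steps: int = 100, lr: float = 1e-5):
    """Test-time refinement sketch: starting from a pre-trained network,
    minimize the discrepancy between the re-simulated measurement of the
    network's current phase estimate and the recorded intensity y."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        phase = net(y)                                       # current estimate
        loss = torch.mean((forward_model(phase) - y) ** 2)   # data consistency
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net(y).detach()
```

The one-shot inference `net(y)` remains available whenever speed matters, while the fine-tuning loop trades time for accuracy and reliability on difficult samples.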