Introduction

Most physical theories allow us to make predictions: given a complete description of the state of a physical system, we can predict or simulate the outcome of some measurements. Recovering the parameters that describe the physical state of a system from measurements requires solving an inverse problem, which often cannot be solved deterministically and non-iteratively. The forward simulation is explicit, in many cases much faster than any inversion, and often accounts for more physically significant contributions to the observed signal than most inverse algorithms. However, in many situations, physically meaningful information can only be extracted from experimental data by solving an inverse problem, either by least-squares minimization or by probabilistic approaches, which tend to be slow and are often under-determined owing to a lack of data and their inherent non-linearity. Inverse problems are among the most important mathematical problems in science. In most cases, however, their solutions are designed for particular problems under particular conditions, making good use of domain-specific knowledge but lacking transferability to other inverse problems. The non-linear nature of many of these problems has so far prevented the development of a generic solver framework.

Motivated by recent advances in computing hardware and driven by large datasets1,2, deep learning3 has recently shown great potential, particularly in image(s)-to-image(s) translation tasks such as super-resolution4, image denoising5 and image generation6. Many inverse applications7,8 have been proposed and have achieved promising results, since the universal approximation theorem guarantees that neural networks can approximate arbitrary functions well9,10,11. However, there is no guarantee that suitable network weights can be found by optimization, and because of the high non-linearity of such deep architectures, their performance crucially depends on proper hyper-parameter configurations. Recent large models even require years to develop and tens of thousands of dollars to train12,13. Deep learning has been widely used in recent computational imaging applications14,15, and although many tricks exist to tune the hyper-parameters of a neural network, such as the proper setting of the learning rate, the regularization, the design of the hidden units, the convolutional kernels and the network architecture16,17, none of these measures achieves both fast convergence and high quality of the resulting solution18.

Aware of the fact that a deep convolutional neural network (DCNN) often first quickly recovers the dominant low-frequency components and only afterwards, rather slowly, the high-frequency ones19,20, and inspired by the idea of multi-grid methods21, we propose a novel multi-resolution deep convolutional neural network (MCNN). This architecture extends the functionality of the hidden layers in the decoder of a U-Net22,23 by connecting these hidden layers to additional convolution layers that produce coarse outputs, in an attempt to match the low-frequency components, as demonstrated in Fig. 1. A further modification, attaching different coarse inputs to the layers in the encoder, has also been tested, but no apparent improvement in convergence was observed. Another possible variation is to mutate the architecture progressively, starting by fitting only the low-frequency components in the initial phase and then inserting new layers to match increasingly high-frequency features during training, as has been demonstrated in some recent applications24,25. This architecture speeds up network convergence and dramatically stabilizes the training process. As shown in the lower left of Fig. 1, with identical setups, when solving an inverse Laplacian problem from a second-order gradient approximation, MCNN quickly reaches a mean-absolute-error (MAE) loss of around 0.07, while the conventional U-Net is challenging to train and remains trapped around 0.2. Additional normalization layers can also accelerate convergence26,27, but will very likely introduce undesired distortions (extended data Fig. 2a,b) and should be used carefully. The topological setup of the coarse MCNN output branches is visualized in the lower right part of Fig. 1, in which the outputs matching the low-frequency components at different resolutions are shown, while the U-Net shown in the upper part directly matches the input layer to the desired output of the same resolution. More details of this comparison, including the implementation, the training settings and the prediction results, can be found in the notebook included in the released code.
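To make the coarse-output idea concrete, the following is a minimal PyTorch sketch of such a multi-resolution decoder: each decoder stage feeds an extra 1 × 1 convolution that produces a coarse output, and every coarse output is matched against a down-sampled copy of the target. Layer counts, channel widths and the helper names (`MCNN`, `multires_l1`) are illustrative only; the released code contains the actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCNN(nn.Module):
    """Minimal sketch of a multi-resolution U-Net variant (MCNN).

    Each decoder stage is tapped by an extra 1x1 convolution producing a
    coarse output, so low-frequency components can be fitted first.
    Sizes are illustrative, not those of the released model.
    """
    def __init__(self, in_ch=1, base=32, levels=4):
        super().__init__()
        chs = [base * 2 ** i for i in range(levels)]          # e.g. 32, 64, 128, 256
        self.encoders = nn.ModuleList()
        prev = in_ch
        for c in chs:
            self.encoders.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=2, padding=1), nn.ReLU(inplace=True)))
            prev = c
        self.decoders = nn.ModuleList()
        self.heads = nn.ModuleList()                          # coarse output branches
        dec_in = chs[::-1]                                    # 256, 128, 64, 32
        dec_out = chs[::-1][1:] + [base]                      # 128, 64, 32, 32
        for c_in, c_out in zip(dec_in, dec_out):
            self.decoders.append(nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                nn.ReLU(inplace=True)))
            self.heads.append(nn.Conv2d(c_out, 1, 1))         # one coarse image per scale

    def forward(self, x):
        for enc in self.encoders:                             # skip connections omitted for brevity
            x = enc(x)
        outputs = []
        for dec, head in zip(self.decoders, self.heads):
            x = dec(x)
            outputs.append(head(x))                           # coarse-to-fine, last is full size
        return outputs

def multires_l1(outputs, target):
    """Sum of MAE losses between each coarse output and a matching
    down-sampled target (area pooling keeps the low-frequency content)."""
    loss = 0.0
    for out in outputs:
        tgt = F.interpolate(target, size=out.shape[-2:], mode='area')
        loss = loss + F.l1_loss(out, tgt)
    return loss
```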

Figure 1

By matching outputs at all frequencies, MCNN achieves better stability and faster convergence than U-Net. A simplified architecture of an MCNN is demonstrated. This application is designed to predict phases from a defocused image. The network is composed of a classic U-Net (the upper part) with 7 additional branches for multi-resolution reconstruction (from Output-2 to Output-8). With this topology, the input image is first encoded into a high-dimensional tensor, then decoded into 8 images of different sizes to match different frequency components of the desired phases. The convergence curves on the test set show the significant advantage of an MCNN over a classic U-Net: the MAE of the MCNN drops quickly within 100 iterations, while the U-Net converges slowly and can get stuck in local minima.

Figure 2

Extended Figure: Distortions introduced by normalization layers. Inserting normalization layers before or after the activation layers improves the convergence of the networks, but further research is needed to deal with the distortions they introduce. This can be observed directly in the intermediate results when training to predict the phases and amplitudes from 8 defocused HeLa cell images. (a) Stripe distortion resulting from group normalization. (b) Speckle defects from batch normalization.

In contrast to conventional problem-oriented inverse applications, MCNN aims to solve every inverse problem falling into the category of image(s)-to-image(s) conversion, without being limited to specific applications, relying instead on massive datasets from fast numerical simulation or direct measurement. This generalization capability is demonstrated by solving three different inverse problems.

The phase problem

The first problem is to retrieve the phases of a propagated complex wave from direct intensity measurements. This is a fundamental inverse problem in optics, astronomy and microscopy with neutrons, X-rays or electrons. It has attracted a great deal of research effort and led to the invention of numerous methods to reconstruct the missing phases from intensity measurements. Following Dennis Gabor’s proposal of the holographic principle28, numerous methods have been invented to extract phases by post-processing images29,30,31,32. These methods oftentimes only work under deliberate approximations or elaborate experimental configurations.

In conventional phase retrieval schemes, a trade-off between simple invertibility and accuracy has to be made. When approximating the imaging process by the transport of intensity equation (TIE)29, the relationship between the wavefront and a gradient measurement in the intensity domain can be approximated by a second-order partial differential equation, the solution of which can be found by assuming either periodic33 or other, often more appropriate, boundary conditions34. To obtain a reliable gradient estimate, multiple intensities have to be measured at focal planes below and above the plane of focus, in pairs placed symmetrically about the in-focus plane31,32. One recent application based on deep learning takes the complex wave of the back-propagated hologram intensities as input35; another variant retrieves the complex phase by also making use of prior information from Fresnel propagation36, rather than directly mapping the measured intensities to the desired phases. The proposed MCNN does not suffer from such restrictions. In our application, a direct solution is demonstrated by mapping measurements to phases in an end-to-end manner. Phases can be predicted from intensities recorded at arbitrary plane(s) of focus. Since predicting the phase from its second-order gradient is a straightforward application, this problem has been included as a tutorial in our open-source codebase.
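For comparison with the learned inversion, the conventional TIE baseline for a pure phase object can be sketched in a few lines of NumPy: the axial intensity derivative is estimated from two symmetrically defocused intensities and the resulting Poisson equation is inverted with FFTs, assuming periodic boundary conditions. The function below is an illustrative sketch, not the implementation used in the referenced works.

```python
import numpy as np

def tie_phase_fft(i_minus, i_plus, dz, wavelength, pixel_size, i0=1.0):
    """Illustrative TIE baseline: for a pure phase object of uniform intensity
    i0, the TIE reduces to a Poisson equation for the phase, inverted here with
    FFTs under periodic boundary conditions."""
    didz = (i_plus - i_minus) / (2.0 * dz)                 # axial intensity derivative
    k = 2.0 * np.pi / wavelength
    rhs = -k / i0 * didz                                   # Laplacian(phase) = rhs
    ny, nx = rhs.shape
    fy = np.fft.fftfreq(ny, d=pixel_size)[:, None]
    fx = np.fft.fftfreq(nx, d=pixel_size)[None, :]
    lap = -(2.0 * np.pi) ** 2 * (fx ** 2 + fy ** 2)        # Fourier symbol of the Laplacian
    lap[0, 0] = 1.0                                        # avoid division by zero at DC
    phi_hat = np.fft.fft2(rhs) / lap
    phi_hat[0, 0] = 0.0                                    # the mean phase is undetermined
    return np.real(np.fft.ifft2(phi_hat))
```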

The results are presented in Fig. 3, in which three models have been trained to predict the amplitudes and the phases of HeLa cells directly from recorded intensities. Compared with the reference result shown in Fig. 3a, which is produced by multi-focus TIE (MFTIE)32 using all 51 images, these applications converge very quickly within 2 epochs on 2048 training samples and yield, albeit predicted from different images, very similar phases, which are shown in Fig. 3b,c. The predicted amplitudes are quite blurred compared to the MFTIE result, and no apparent improvement has been observed by including more intensities or training longer. This behavior is expected, since the neural networks are trained by minimizing the MAE, for which the optimizer minimizes the averaged error, giving equal weight to all pixels even though some pixels have particularly high errors. Other applications dealing with the pure-phase case are also presented for reference. In these applications, the prediction results from one and two measured intensities are given in extended Fig. 4b–d, which are quite similar to the result obtained by Gaussian process TIE (GPTIE) given in Fig. 4a for reference. It is worth mentioning that the conventional U-Net produces a worse result with settings identical to those of the MCNN, as shown in Fig. 3d, although it tends to produce similar results in the long run. Moreover, the coherent diffraction imaging neural network37 (CDINN) gives only a rough contour, reconstructing just the low-frequency components of the HeLa cells, as shown in Fig. 3e.

Figure 3

Amplitudes and phases predicted from defocused images. First row: 51 defocused HeLa cell images recorded at focal planes exponentially spaced from −500 μm to 500 μm (only 11 are shown). (a) Phases (top) and amplitudes (bottom) reconstructed by MFTIE from all 51 images32. A region of interest of 470 × 520 pixels is selected; (b) MCNN prediction using 2 images taken at −1 μm and 1 μm; (c) MCNN prediction using 4 images taken from −1.30 μm to 1.30 μm; (d) U-Net prediction using 4 images taken from −1.30 μm to 1.30 μm; (e) CDINN prediction using 4 images taken from −1.30 μm to 1.30 μm.

Figure 4

Extended Figure: Phase predicted from defocused image(s). From one or more defocused images of a pure phase object made up of human cheek cells, acquired equally spaced by dz = 4 μm from −256 μm to 256 μm (only 11 shown), MCNNs are capable of phase prediction. (a) GPTIE reconstruction from gradients estimated using 129 images from different focal planes, with a region of interest of 945 × 888 pixels selected31. (b) Phase prediction from 1 image at a distance of −108 μm, with 1024 × 1024 pixels. (c) Prediction from 1 image at a distance of 108 μm. (d) Prediction from 2 images at −52 μm and 52 μm, showing performance on par with the state-of-the-art reconstruction algorithm demonstrated in (a).

To further explore the prediction quality in the frequency domain, the Fourier ring correlations (FRC) of several numerical experiments are calculated. Figure 5a shows that MCNN recovers the low-frequency components well from a single image. The high-frequency features can be compensated by introducing fine details into the inputs, as shown in Fig. 5b. Including more measurements improves the performance, but the improvement is less apparent at high frequencies, as shown in Fig. 5c,d.

Figure 5

Extended Figure: MCNN gives good results in low-frequency domains. The simulated intensities are presented in the first row, and their predictions are presented in the second row (with their MAE shown in the upper-right corner). The ground truths are presented in the third row for visual comparison. The last row shows the Fourier ring correlations between the predictions and the ground truths. In (a), the inputs are defocused images, in which low-frequency details are prevalent; the low-frequency features are recovered well in this case. In (b), the inputs are intensity gradients, in which high-frequency details dominate; the high-frequency features are largely recovered in this experiment. In (c,d), the inputs are astigmatic images, simulated by rotating a cylindrical lens by different angles. Multiple rotations improve the quality of the output phases, as shown in the last row of (c,d), but not so much at high frequencies.
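The FRC used above is the ring-wise normalized cross-correlation of the Fourier transforms of a prediction and its ground truth. A minimal NumPy sketch, with an assumed binning scheme, reads:

```python
import numpy as np

def fourier_ring_correlation(img_a, img_b, n_bins=64):
    """Illustrative Fourier ring correlation (FRC) between a prediction and its
    ground truth, averaged over rings of constant spatial frequency."""
    fa = np.fft.fftshift(np.fft.fft2(img_a))
    fb = np.fft.fftshift(np.fft.fft2(img_b))
    ny, nx = img_a.shape
    y, x = np.indices((ny, nx))
    r = np.hypot(y - ny // 2, x - nx // 2)              # radius of each Fourier pixel
    r_max = min(ny, nx) // 2
    bins = np.linspace(0.0, r_max, n_bins + 1)
    frc = np.empty(n_bins)
    for i in range(n_bins):
        ring = (r >= bins[i]) & (r < bins[i + 1])
        num = np.sum(fa[ring] * np.conj(fb[ring]))
        den = np.sqrt(np.sum(np.abs(fa[ring]) ** 2) * np.sum(np.abs(fb[ring]) ** 2))
        frc[i] = np.real(num) / (den + 1e-12)
    freqs = 0.5 * (bins[:-1] + bins[1:]) / r_max        # normalized spatial frequency
    return freqs, frc
```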

Imaging objects from diffuse reflection

The second problem is to image objects that are hidden from direct view using their indirect diffuse light reflections. Observing objects located in inaccessible regions is particularly useful in the fields of remote sensing, computer vision, and autonomous driving. This problem has drawn significant attention in recent publications38,39,40, but most of these applications require controlled or time-varying illumination, high-speed sensing or complicated inversion algorithms employing ray optics.

Although deep learning has been employed for object classification in non-line-of-sight (NLOS) imaging41, our application demonstrates experimentally, to the best of our knowledge for the first time, that a series of colorful two-dimensional objects can be reconstructed by a DNN from diffuse reflections.

The training of our neural networks relies on a massive dataset. The dataset can be simulated from existing theories if the domain-specific knowledge is well established. In cases where simulation is not feasible because of incomplete theory or unknown parameters, the training dataset can also be collected from direct measurements. Using an ordinary laptop, with 768 images captured by its camera and 768 screenshots recorded from its screen, our MCNN could be trained to reconstruct additional screenshots from the diffusely reflected rays, i.e., the camera-captured images. The data collection procedure is shown in Fig. 6b: a laptop faces a door while a program records images from the screen and the camera at two frames per second (fps). One pair of captured images is shown in Fig. 6a,c. The predicted images from the test set match the ground truth well for images dominated by low-frequency features. The quality of the prediction can be verified from the visualized absolute differences shown in the last row of Fig. 6d–h, where most differences are close to zero (black), especially in the regions corresponding to the sky, the cloud and the green grass. Nevertheless, the high-frequency details are missing in the predicted images, as the optimizer is minimizing the MAE. For example, in Fig. 6d, the ear of the rabbit and the red apple are lost; in (e) and (g), the details of the eyes are lost; in (f), all details of the animals are blurry. More results on the test set are included in the extended data video 1.

Figure 6

Imaging objects from diffuse reflections. This is done by training an MCNN to match the camera-captured diffuse reflection images to the screenshot images. (a) A screenshot of the video Big Buck Bunny42 being displayed. This is one of the target images matching the MCNN output. (b) Experimental setup. Everything is done with a laptop facing a door; no additional devices are required. (c) A camera-captured image corresponding to the screenshot shown in (a). The cropped zone marked with a dotted rectangle is one of the input images of the MCNN. (d–h) Predictions randomly sampled from the test set. The first row shows the selected range of the camera-captured images (cropped and flipped), the second row shows the predictions, the third row shows the screenshots and the last row shows the absolute difference between the predictions and the screenshots, with the MAE value presented in the lower-left corner.

Denoising STEM images

The third problem is to denoise heavily noisy scanning transmission electron microscopy (STEM) images. Modern STEM can provide sub-ångström imaging resolution43,44, but this is limited by, e.g., the beam sensitivity of the specimen. Lowering the electron dose results in noisy images with a poor signal-to-noise ratio (SNR) of less than 0 dB and complicates the extraction of relevant specimen information. Moreover, STEMs can produce millions of images in a few hours at a speed of 100 fps, and this amount of data can require months of processing with a conventional algorithm. It is, therefore, essential to predict realistic images from noisy observations very quickly and without loss of information45. Owing to the complex, unknown environmental variables of specific STEM setups, real-world STEM image denoising is too complicated for a single monolithic denoising algorithm. Many conventional algorithms exist to address this problem, but they depend heavily on domain-specific knowledge or a priori information46,47, and their application to real-time denoising is difficult. Recent variations48,49,50,51,52,53 based on DCNNs work well by directly matching a noisy input image to a clean output image, or by using unclean or unpaired images at the price of a small performance penalty54,55,56, but they are constrained to high-SNR images, mostly above 10.0 dB, with known noise sources, and most of them only consider a single signal-independent noise of known level.

Here we train our MCNN to recover clean results from heavily noisy STEM images recorded with fewer than ten counts per pixel, corresponding to an SNR of less than 0 dB, which means that our experimental data contain more noise than signal. If all the noise parameters are known, a training set can be simulated according to these parameters. With such a training set, a neural network can yield a good prediction, as shown in Fig. 7d. However, since noise sources of unknown levels exist in the recorded STEM images, including but not limited to Poisson noise, Gaussian noise and clipping noise, it is nearly impossible to simulate a proper training set. To address this problem, the first layer of our neural network is manually designed to mimic the behavior of four low-pass filters (LPFs). This strategy can be understood as converting the denoising problem into a deblurring problem. With a much deeper architecture and many more trainable parameters, this model gives clearer results than the recent noise2void model56 and the well-known residual Gaussian denoising convolutional neural network5 (DnCNN), as demonstrated in the lower rows of Fig. 7d. To further study the behavior of our model, the frequency responses of two randomly selected denoised high-angle annular dark-field (HAADF) images acquired under different conditions are presented in extended Fig. 8a,b. While identifying most of the high-frequency components of the experimental images as noise, our model effectively modulates the low-frequency components as well, predicting much more Gaussian-like images than conventional denoising methods using a low-pass filter. Our model gives a Gaussian-like shape for atomic peaks, which is to be expected, as the experimental images are formed by a Gaussian-like electron beam being scanned (convolved) over sub-pixel-sized atomic nuclei. Our network achieves an excellent throughput of up to 440 fps when working with images taken at 150 fps with 128 × 128 pixels, more than three orders of magnitude faster than conventional methods, taking significant advantage of modern hardware acceleration like other deep learning applications (extended data Fig. 9).

Figure 7

Validation of the denoising application on heavily noised aberration-corrected high-angle annular dark-field (HAADF) STEM images of sub-nanometre-sized platinum clusters. MCNN performs well on heavily noised datasets, as demonstrated in (a). The numbers in the upper-left corner are the SNRs and those in the upper-right are the MAEs. MCNN also gives clear and consistent results on consecutive experimental image frames, shown in the left columns, which were recorded at 150 fps with 128 × 128 pixels (b) and at 15 fps with 512 × 512 pixels (c), under an electron dose in the range 10⁵–10⁶ e Å⁻² s⁻¹ with an FEI Titan Themis. The upper images are the first frames, the lower images the second frames. Similar denoising results produced by PGURE-SVT are shown in the middle columns for comparison; they are not as clear as the MCNN results in the right columns. The neural network without the LPF layer can still predict a clear result if the input images have the same noise features, as shown in the upper rows of (d); but when applied to experimental images, the neural network equipped with the LPF layer gives a much better result than conventional neural networks such as the recent noise2void model and the DnCNN model. After fine-tuning the model with the LPF layer by connecting a conditional generative adversarial network (GAN), the clusters in the predicted image are more atomic-like, as shown in the lower rows of (d).

Figure 8

Extended Figure: Denoising behavior in the frequency domain. Our method gives much more Gaussian-like results than a conventional denoising method using a low-pass filter. It identifies most of the high-frequency components as noise and also modifies the low-frequency components in pursuit of clear atomic images.

Figure 9

Extended Figure: MCNN outperforms conventional denoising algorithms by more than three orders of magnitude. This figure shows the average denoising time for the MCNN, BM3D and PGURE-SVT methods on experimental images from 128 × 128 pixels to 1024 × 1024 pixels.

The effectiveness of this MCNN application has been validated in two ways. First, it was tested using simulated noisy atomic images with Poisson noise, Gaussian noise and white noise at an SNR of less than −10 dB. In this task, the MCNN achieves an average SNR as high as 20 dB. Two examples are shown in Fig. 7a, in which all atoms are successfully restored and the MAEs are less than 0.01. Second, it was cross-validated against the state-of-the-art Poisson-Gaussian Unbiased Risk Estimator for Singular Value Thresholding (PGURE-SVT) algorithm47. Two randomly selected consecutive frames from two experimental datasets are shown in Fig. 7b,c. Visually comparing the denoised results, the neural network gives similar but much clearer atomic images on consecutive frames, demonstrating the effectiveness of MCNN, even though the predictions are made from single frames only. This result allows the actual dynamics of the atoms to be studied in real time. More results on the experimental datasets can be found in the extended data videos 2 and 3.

Additionally, our model gives clear and consistent Gaussian-like results on consecutive frames suffering from coma defects, as shown in extended Fig. 10. This is probably due to the hand-crafted LPFs in the first layer, which effectively remove this kind of error, even though our model has never been trained on such data.

Figure 10

Extended Figure: Denoising HAADF images with coma distortions. MCNN gives clear and consistent results on consecutive frames containing coma distortions, even though our model has never seen this kind of data before.

Conclusion

An application-neutral framework for solving inverse problems that involve image-to-image mappings has been proposed and demonstrated. Its generic capability has been shown with three applications in very different domains. The difficulties of challenging applications, complex experimental setups and complicated inverse-algorithm implementations are alleviated with this framework. Quick and smooth convergence is achieved by matching additional output layers to the corresponding low-frequency features, reducing the frustration of DNN hyper-parameter tuning, provided sufficient datasets are available either from numerical simulation or from direct measurement. We expect that this robust, general-purpose architecture will inspire the emergence of a new branch of schemes dealing with different kinds of inverse problems, broadening the scope of inverse problem applications.

Method

Dataset

For the phase retrieval applications, we simulated the training sets using Fourier optics57, including slight Poisson noise and white noise. We randomly sampled the phases and amplitudes from the Open Images Dataset58 (OID) and then simulated millions of defocused images. However, for each MCNN, only 2,048 random samples were selected for training, as we did not find an apparent performance improvement when expanding the training set even to 51,200 samples. The exponentially spaced defocused images shown in Fig. 4 come from an open-access GPTIE dataset31. Each image has 1024 × 1024 pixels with an effective size of 0.31 × 0.31 μm² per pixel. The sample is unstained cheek cells from a human mouth, placed on a microscope slide and sealed with a cover slide. This sample represents a nearly pure phase object, though there are some specks of amplitude variation. The defocused images shown in Fig. 3 come from the MFTIE dataset32. Each image has 2560 × 2160 pixels with an effective size of 0.1625 × 0.1625 μm² per pixel. The sample is HeLa cancer cells, which are relatively transparent.
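A minimal sketch of such a Fourier-optics forward simulation, assuming a paraxial (Fresnel) transfer function and an illustrative noise scaling, is given below; the released simulation code differs in detail.

```python
import numpy as np

def defocused_intensity(phase, amplitude, dz, wavelength, pixel_size):
    """Illustrative angular-spectrum (Fresnel) simulation of a defocused
    intensity from a phase/amplitude pair; normalisation and noise level
    are simplified placeholders."""
    ny, nx = phase.shape
    wave = amplitude * np.exp(1j * phase)                  # complex wave in the sample plane
    fx = np.fft.fftfreq(nx, d=pixel_size)[None, :]
    fy = np.fft.fftfreq(ny, d=pixel_size)[:, None]
    f2 = fx ** 2 + fy ** 2
    H = np.exp(-1j * np.pi * wavelength * dz * f2)         # paraxial propagation by dz
    wave_z = np.fft.ifft2(np.fft.fft2(wave) * H)
    intensity = np.abs(wave_z) ** 2
    counts = 1e4                                           # assumed dose per pixel
    return np.random.poisson(intensity * counts) / counts  # slight shot noise
```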

For the diffuse reflection reconstruction application, we collected the dataset using an HP OMEN 17 laptop. With the film Big Buck Bunny42 (4K resolution, 60 Hz frame rate) being played full-screen at half speed on its 17.3-inch LCD screen (16:9, 43.9 cm diagonal), the laptop, mounted on a laptop stand, was positioned facing a door at a distance of about 30 cm, as shown in Fig. 6b. A Python script ran simultaneously, acquiring pictures (RGB) from the camera with a resolution of 1280 × 720 pixels and capturing screenshot images (RGB) with a resolution of 1920 × 1280 pixels at 2 fps. Multiple sources of noise exist in the acquired image pairs, coming from three main contributors. Firstly, the sensitivity of the camera was adjusted automatically and adaptively, resulting in slightly varying brightness and contrast between frames; this is very apparent in the first few frames. Secondly, the laptop vibrated considerably during recording: a powerful gaming GPU is fitted in this laptop, with a loud fan attached for thermal control; the GPU was under heavy load while the 4K film was displayed on a screen with a 120 Hz refresh rate, and the fan switched its working mode frequently during the acquisition. Lastly, there was a small but random time interval between the screen capture and the camera capture. The acquisition program was coded carefully so that both the screen images and the camera images are cached in random access memory (RAM) before being saved to the hard disk, but the Python script is inherently slow and its garbage collection behavior is uncontrollable, so some uncertainty in the timing difference is inevitable. After 1024 image pairs were collected, we selected the first 256 pairs for testing and the remaining 768 pairs for training by matching the images from the camera to the screenshots. We cropped an area of 640 × 360 pixels from each input image to fit our model into the GPU memory, and the output images were scaled from 1920 × 1280 pixels down to 1280 × 720 pixels. In addition, as a mirror reflects an image with a left/right reversal, the cropped input images were horizontally flipped. All the predicted results from the first 256 camera-captured images are presented in the extended data video 1.
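An illustrative sketch of such an acquisition loop is given below. It assumes OpenCV for the webcam and the mss package for screenshots, neither of which is confirmed by the paper, and all parameter values are placeholders; the released code contains the actual acquisition script.

```python
import time
import numpy as np
import cv2                       # assumed webcam binding (OpenCV)
from mss import mss              # assumed screenshot library

def acquire_pairs(n_pairs=1024, fps=2.0):
    """Grab a screenshot and a camera frame as close together in time as
    possible, keep both in RAM, and return the collected pairs."""
    cam = cv2.VideoCapture(0)
    cam.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
    cam.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
    grabber = mss()
    monitor = grabber.monitors[1]                          # the primary screen
    cam_frames, screen_frames = [], []
    for _ in range(n_pairs):
        t0 = time.time()
        screen = np.array(grabber.grab(monitor))[..., :3]  # drop the alpha channel
        ok, frame = cam.read()
        if ok:
            cam_frames.append(frame)
            screen_frames.append(screen)
        time.sleep(max(0.0, 1.0 / fps - (time.time() - t0)))
    cam.release()
    return cam_frames, screen_frames
```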

For the denoising application, we included Poisson noise, Gaussian noise, white noise and clipping noise in the training set generation. We generated millions of images from clean images randomly sampled from the OID and from simulated annular dark-field (ADF) STEM images. These ADF images were simulated using convolutions between random pulse signals and 2D Gaussian atomic peaks, with 512 × 512 pixels. For the first training stage, we trained the neural network with half of the images from the OID and the other half from the ADF images. It is crucial to include random images sampled from the OID: this strategy prevents the network from predicting everything to be zero at the very beginning, because the simulated ADF images have small average intensities. For the second stage, we fine-tuned this neural network on millions of simulated ADF images with a minimal learning rate. Later we re-tuned the network with millions of atomic images simulated from molecular dynamics using up to 14 atoms with 128 × 128 pixels, but did not observe an apparent difference in the trained denoising model. We tested the denoising network on various experimental STEM images recorded using a probe-corrected FEI Titan Themis at 300 kV, with an electron dose ranging from 10⁵ e Å⁻² to 10⁶ e Å⁻². The samples were made by plasma-sputtering Pt onto Protochips Fusion thermal chips. The typical pixel sizes of the recorded images are 6.3–12.5 pm, acquired at 15–150 fps with resolutions from 128 × 128 pixels to 512 × 512 pixels.
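A minimal sketch of this ADF simulation step, with illustrative parameter values, could look as follows; the pulse-and-Gaussian convolution and the low-dose Poisson sampling follow the description above, while the exact noise mixture of the published training set differs.

```python
import numpy as np
from scipy.signal import fftconvolve   # SciPy assumed available

def simulate_adf_pair(size=512, n_atoms=40, sigma=3.0, counts=10, rng=None):
    """Illustrative generator for one training pair: random pulse positions
    convolved with a 2-D Gaussian atomic peak give the clean image; Poisson
    sampling at roughly `counts` per pixel plus Gaussian read-out noise gives
    the noisy input."""
    rng = np.random.default_rng() if rng is None else rng
    pulses = np.zeros((size, size))
    ys = rng.integers(0, size, n_atoms)
    xs = rng.integers(0, size, n_atoms)
    pulses[ys, xs] = rng.uniform(0.5, 1.0, n_atoms)        # random peak heights
    r = np.arange(-4 * int(sigma), 4 * int(sigma) + 1)
    g = np.exp(-(r[:, None] ** 2 + r[None, :] ** 2) / (2 * sigma ** 2))
    clean = fftconvolve(pulses, g, mode='same')
    clean /= clean.max() + 1e-12
    noisy = rng.poisson(clean * counts) / counts           # shot noise at low dose
    noisy = noisy + rng.normal(0.0, 0.05, noisy.shape)     # additive Gaussian noise
    return np.clip(noisy, 0.0, None), clean
```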

Network architectures

The convolution layers inside each cell shown in Fig. 1 and the depth of the network are flexible in design but are generally restricted by the hardware and the dataset. In our applications, we tested two types of architectures commonly used in recent deep learning applications. In the first type, there is a single convolutional layer with stride 2 in each encoder cell and a single deconvolutional layer with stride 2 in each decoder cell. This is the choice for the phase-only retrieval shown in Fig. 4 and the denoising application shown in Fig. 7, with about 50 million trainable parameters. For the phase-and-amplitude retrieval application shown in Fig. 3 and the diffuse reconstruction application shown in Fig. 6, a bottleneck structure is selected in each of the unit cells, expanding the network to hundreds of layers but reducing the number of trainable parameters to 20 million, to accelerate the training process and fit the model into the GPU memory. This is inspired by the recent ResNeXt59 and Xception60 architectures. Also, the number of low-frequency output branches is reduced from 7 to 3 to fit everything into GPU memory. For the denoising network, an extra four-channel layer is inserted right after the input layer. The filters of this layer are specially designed to mimic the functionality of 4 LPFs:

$$\begin{array}{ll}f_{1}(r,c)=1, & r,c\in [1,5],\\ f_{2}(r,c)=1, & r,c\in [1,7],\\ f_{3}(r,c)=e^{-[(r-8)^{2}+(c-8)^{2}]/\sqrt{20}}, & r,c\in [1,15],\\ f_{4}(r,c)=e^{-[(r-8)^{2}+(c-8)^{2}]/\sqrt{30}}, & r,c\in [1,15],\end{array}$$
(1)

in which r and c are the row and column indices of the filters. This hand-crafted layer significantly stabilizes the performance of the neural network, enabling it to provide robust predictions when the noise level varies from image to image as the experimental setup changes with time. Furthermore, to prevent the network from predicting every pixel to be zero during training and to make the high-intensity pixel clusters more atomic-like, a GAN in the conditional setting has been employed to adjust the back-propagated errors61. The different prediction behaviours of MCNN, MCNN with LPFs, and MCNN with LPFs and GAN are demonstrated in Fig. 7d.
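A sketch of how this fixed four-channel layer can be constructed from Eq. (1) is given below (PyTorch; the kernel normalization is an assumption not stated in the text, and the helper names are illustrative):

```python
import numpy as np
import torch
import torch.nn.functional as F

def build_lpf_bank():
    """Build the four fixed low-pass filters of Eq. (1): 5x5 and 7x7 box
    filters and two 15x15 Gaussian-like filters centred at (8, 8) in
    1-based indexing. Normalising each kernel to unit sum is an assumption."""
    f1 = np.ones((5, 5))
    f2 = np.ones((7, 7))
    r, c = np.mgrid[1:16, 1:16].astype(float)          # 1-based indices, 15x15 support
    f3 = np.exp(-((r - 8) ** 2 + (c - 8) ** 2) / np.sqrt(20.0))
    f4 = np.exp(-((r - 8) ** 2 + (c - 8) ** 2) / np.sqrt(30.0))
    return [f / f.sum() for f in (f1, f2, f3, f4)]

def lpf_layer(x, kernels):
    """Apply each fixed filter to the single-channel input (N, 1, H, W) and
    stack the results into a four-channel tensor, i.e. the non-trainable
    first layer of the denoising network."""
    outs = []
    for k in kernels:
        w = torch.tensor(k, dtype=x.dtype, device=x.device)[None, None]
        pad = k.shape[0] // 2                           # keep spatial size unchanged
        outs.append(F.conv2d(x, w, padding=pad))
    return torch.cat(outs, dim=1)                       # (N, 4, H, W)
```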

Training settings

We trained all the networks on 2 Nvidia GTX 1080 Ti GPUs using the Adam optimizer62. The phase retrieval applications converge very quickly, and we trained each of these networks for two epochs with a batch size of 4 in a few hours. We trained the diffuse reflection reconstruction application for 256 epochs with a batch size of 2 in a few days. For the denoising model, we pre-trained it for eight epochs without attaching a GAN, and then constructed a U-Net using the weights extracted from the corresponding MCNN layers. We fine-tuned this U-Net model by connecting an additional GAN with a tiny learning rate for 4096 epochs with a batch size of 8. The total training time was about one week. More detailed training settings are available within the released source code.