Dynamical machine learning volumetric reconstruction of objects’ interiors from limited angular views

Limited-angle tomography of an interior volume is a challenging, highly ill-posed problem with practical implications in medical and biological imaging, manufacturing, automation, and environmental and food security. Regularizing priors are necessary to reduce artifacts by improving the condition of such problems. Recently, it was shown that one effective way to learn the priors for strongly scattering yet highly structured 3D objects, e.g. layered and Manhattan, is by a static neural network [Goy et al. Proc. Natl. Acad. Sci. 116, 19848–19856 (2019)]. Here, we present a radically different approach where the collection of raw images from multiple angles is viewed analogously to a dynamical system driven by the object-dependent forward scattering operator. The sequence index in the angle of illumination plays the role of discrete time in the dynamical system analogy. Thus, the imaging problem turns into a problem of nonlinear system identification, which also suggests dynamical learning as a better fit to regularize the reconstructions. We devised a Recurrent Neural Network (RNN) architecture with a novel Separable-Convolution Gated Recurrent Unit (SC-GRU) as the fundamental building block. Through a comprehensive comparison of several quantitative metrics, we show that the dynamic method is suitable for a generic interior-volumetric reconstruction under a limited-angle scheme. We show that this approach accurately reconstructs volume interiors under two conditions: weak scattering, when the Radon transform approximation is applicable and the forward operator well defined; and strong scattering, which is nonlinear with respect to the 3D refractive index distribution and includes uncertainty in the forward operator.


Introduction
Optical tomography reconstructs the three-dimensional (3D) internal refractive index profile by illuminating the sample at several angles and processing the respective raw intensity images. The reconstruction scheme depends on the scattering model that is appropriate for a given situation. If the rays through the sample can be well approximated as straight lines, then the accumulation of absorption and phase delay along the rays is an adequate forward model, i.e. the projection or Radon transform approximation applies. This is often the case with hard x-rays through most materials including biological tissue; for that reason, Radon transform inversion has been widely studied [1][2][3][4][5][6][7][8][9][10] . The problem becomes even more acute when the range of accessible angles around the object is restricted, a situation that we refer to as "limitedangle tomography," due to the missing cone problem [11][12][13] .
The next level of complexity arises when diffraction and multiple scattering must be taken into account in the forward model; then, the Born or Rytov expansions and the Lippmann-Schwinger integral equation [14][15][16][17][18] are more appropriate. These follow from the scalar Helmholtz equation using different forms of expansion for the scattered field 19 . In all these approaches, weak scattering is obtained from the first order in the series expansion. Holographic approaches to volumetric reconstruction generally rely on this first expansion term [20][21][22][23][24][25][26][27][28][29][30][31] . Often, solving the Lippmann-Schwinger equation is the most robust approach to account for multiple scattering, but even then, the solution is iterative and requires excessive amount of computation especially for complex 3D geometries. The inversion of these forward models to obtain the refractive index in 3D is referred to as inverse scattering, also a well-studied topic [32][33][34][35][36][37][38][39] .
An alternative to the integral methods is the Beam Propagation Method (BPM), which sections the sample along the propagation distance z into slices, each slice scattering according to the thin transparency model, and propagates the field from one slice to the next through the object 40 . Despite some compromise in accuracy, BPM offers comparatively light load of computation and has been used as forward model for 3D reconstructions 18 . The analogy of the BPM computational structure with a neural network was exploited, in conjunction with gradient descent optimization, to obtain the 3D refractive index as the "weights" of the analogous neural network in the learning tomography approach [41][42][43] . BPM has also been used with more traditional sparsity-based inverse methods 33,44 . Later, a machine learning approach with a Convolutional Neural Network (CNN) replacing the iterative gradient descent algorithm exhibited even better robustness to strong scattering for layered objects, which match well with the BPM assumptions 45 . Despite great progress reported by these prior works, the problem of reconstruction through multiple scattering remains difficult due to the extreme ill-posedness and uncertainty in the forward operator; residual distortion and artifacts are not uncommon in experimental reconstructions.
Inverse scattering, as inverse problems in general, may be approached in a number of different ways to regularize the ill-posedness and thus provide some immunity to noise 46,47 . Recently, thanks to a ground-breaking observation from 2010 that sparsity can be learnt by a deep neural network 48 , the idea of using machine learning to approximate solutions to inverse problems also caught on ref. 49 . In the context of tomography, in particular, deep neural networks have been used to invert the Radon transform 50 and recursive Born model 32 , and were also the basis of some of the papers we cited earlier on holographic 3D reconstruction [28][29][30] , learning tomography [41][42][43] , and multi-layered strongly scattering objects 45 . In prior work on tomography using machine learning, generally, the intensity projections are all fed simultaneously as inputs to a computational architecture that includes a neural network, and the output is the 3D reconstruction of the refractive index. The role of the neural network is to learn (1) the priors that apply to the particular class of objects being considered and (2) the relationship of these priors to the forward operator (Born, BPM, etc.) so as to produce a reasonable estimate of the inverse.
Here we propose a rather distinct approach to exploit machine learning for a generic 3D refractive index reconstruction independent of the type of scattering. Our motivation is that, as the angle of illumination is changed, the light goes through the same scattering volume, but the scattering events, weak or strong, follow a different sequence. At the same time, the raw image obtained from a new angle of illumination adds information to the tomographic problem, but that information is constrained by (i.e. is not orthogonal to) the previously obtained patterns. We interpret this as similar to a dynamical system, where the output is constrained by the history of earlier inputs as time evolves and new inputs arrive. (The convolution integral is the simplest and best-known expression of this relationship between the output of a system and the history of the system's input.) An alternative interpretation is as dynamic programming 51 , where the system at each step reacts so as to incrementally improve an optimality criterion-in our case, the reconstruction error metric.
The analogy between tomography and a dynamical system suggests the RNN architecture as a strong candidate to process raw images in sequence, as they are obtained one after the other; and process them recurrently so that each raw image from a new angle improves over the reconstructions obtained from the previous angles. Thus, we treat multiple raw images under different illumination angles as a temporal sequence, as shown in Fig. 1. The angle index replaces what is a dynamical system would have been the time t. This idea is intuitively appealing; it also leads to considerable improvement in the reconstructions, removing certain artifacts that were visible in the strong scattering case of ref. 45 .
The dynamic reconstruction methodology applies, for example, too weak scattering where the raw images are the sinograms; and too strong scattering, where the raw images are better interpreted as intensity diffraction patterns. The purpose of the learning scheme is to augment this relationship with regularization priors applicable to a certain class of objects of interest.
The way we propose to use RNNs in this problem is quite distinct from the recurrent architecture proposed first in ref. 48 and subsequently implemented, replacing the recurrence by a cascade of distinct neural networks, in refs. 50,52,53 , among others. In these prior works, the input to the recurrence can be thought of as clamped to the raw measurement, as in the proximal gradient 54 and related methods; whereas, in our case, the input to the recurrence is itself dynamic, with the raw images from different angles forming the input sequence. Moreover, by utilizing a modified gated recurrent unit (more on this below) rather than a standard neural network, we do not need to break the recurrence up into a cascade.
Typical applications of RNNs 55,56 are in temporal sequence learning and identification. In imaging and computer vision, RNN is applied in 2D and 3D: video frame prediction 57-60 , depth map prediction 61 , shape inpainting 62 ; and stereo reconstruction 63,64 or segmentation 65,66 from multi-view images, respectively. Stereo, in particular, bears certain similarities to our tomographic problem here, as sequential multiple views can be treated as a temporal sequence. To establish the surface shape, the RNNs in these prior works learn to enforce consistency in the raw 2D images from each view and resolve the redundancy between adjacent views in recursive fashion through the time sequence (i.e. the sequence of view angles). Non-RNN learning approaches have also been used in stereo, e.g. Gaussian mixture models 67 . In computed tomography, in particular, an alternate dynamical neural network of the Hopfield type has been used successfully 68 .
In this work, we replaced the standard Long Short-Term Memory (LSTM) 56 implementation of RNNs with a modified version of the newer Gated Recurrent Unit (GRU) 69 . The GRU has the advantage of fewer parameters but generalizes comparably with the LSTM. Our GRU employs a separable-convolution scheme to explicitly account for the asymmetry between the lateral and axial axes of propagation. We also utilize an angular attention mechanism whose purpose is to learn how to reward specific angles in proportion to their contribution to reconstruction quality 70 . We found that for the strongly anisotropic samples or scanning schemes the angular attention mechanism is effective.
The results of our simulation and experimental study on a generic interior volumetric reconstruction are in Results. We first show numerically that the dynamical machine learning approach is suitable to tomographic reconstruction under more restrictive and commonly used Radon transform assumption, i.e. weak scattering. Then, we demonstrate the applicability of the dynamical approach to strong scattering tomography. We show significant improvement over static neural network-based reconstructions of the same experimental data under the strong scattering assumption. The improvement is shown both visually and in terms of several quantitative metrics. Results from an ablation study indicate the relative significance of the new components we introduced to the quality of the reconstructions.

Results
Our first investigation of the recurrent reconstruction scheme is for weak scattering, i.e. when the Radon transform approximation applies and with a limited range of available angles. For simulation, each sample consists of random number, between 1 and 5 with equal probability, of ellipsoids at random locations with arbitrarily chosen sizes, amplitudes, and angles, thus spatially isotropic in average. Rotation is applied along the x-axis, from −10°to +10°with 1°increment, thus 21 projections per sample under a parallel-beam geometry. The Filtered Backprojection (FBP) algorithm 3 is used to generate crude estimates from the projections. nth FBP Approximant (n = 1,2,…,21) is the reconstruction by the FBP algorithm using n projections of n angles starting from −10°.
The reconstructions by the RNN are compared in Fig. 2 with FBP and Total Variation (TV)-regularized reconstructions using TwIST 71 for qualitative and visual comparison. Here a TV-regularization parameter is set to be 0.01, and the algorithm is run up to 200 iterations until its objective function saturates. Figure 3 shows the quantitative comparison on test performance using three different quantitative metrics, where FBP and FBP + TV yielded much lower values than the recurrent scheme. Figure 4 shows the evolution of test reconstructions as new projections or FBP Approximants are presented to the dynamical scheme. When the recurrence starts with n = 1, the volumetric reconstruction is quite poor; as more projections are included, the reconstruction improves as expected. It is also interesting to see that not all the angles are needed to achieve reasonable quality of reconstructions as the graphs and reconstructions in Fig. 4a, b saturate around n = 19.
Details of the recurrent architecture for the weak scattering assumption are presented in Fig. 16b. To quantify the relative contributions to reconstruction quality of each element in the architecture, the elements one by one are ablated (-) or substituted ($) with  alternatives. This ablation study is performed on a strategy to giving weights on hidden features, separable convolution, and ReLU (rectified linear unit) activation function 72 inside the recurrence cell. Specifically, in the study, (1) an angular attention mechanism is ablated, or only last hidden feature h n is taken into account before the decoder instead of the angular attention mechanism; (2) a separable convolution scheme is ablated, thus the standard 3D convolution; and (3) ReLU activation function is ablated and then substituted with the tanh activation function, which is more usual 69 . The ablated architectures are trained under the same training scheme (for more details, see Training the recurrent neural network in Materials and methods) and tested with the same simulated data.
Visually, in Fig. 5, test performance largely degrades as the ablation happens on the ReLU activation function and separable convolution, which is also found quantitatively in Fig. 6. Therefore, for the Radon case, we find that (1) ReLU activation function is highly desirable instead of the native tanh function; (2) the separable convolution is helpful when designing a recurrent unit and encoder/decoder for tomographic reconstructions under the weak scattering assumption. However, for the ablation of the angular attention mechanism, it makes no large difference from the performance of the proposed model, which is because the objects used for training and testing are isotropic in average and do not show any preferential direction.
Here, we perform an additional study on the role of the angular attention mechanism by granting a preferential direction or spatial anisotropy to the objects of interest. In this study, the mean major axis of the objects is assumed to be parallel to the z-axis, i.e. the objects are elongated along the axis. Instead of the previously examined symmetric angular scanning range (−10°, +10°), we now consider the asymmetric case (−15°, +5°).
In Fig. 7a, the angular attention with the asymmetry now gives attention on n differently, i.e. its peak translated to the left by 5. It is because the projections from higher angles may contain more useful information on the objects due to the directional preference of the objects, thus the distribution of the attention probabilities is now attracted to the lower indices. In Fig. 7b, we quantitatively show that the angular attention improves performance in the case of the and methods" for details). Networks trained with examples with small and dense features tend to generalize better and with less artifacts than large and sparse features, in agreement with ref. 73 . Lastly, and not surprisingly, overall reconstruction quality is better when feature size is large and features are sparse.  Fig. 7 Attention probabilities and angular attention mechanism for different ranges of scanning. a Attention probabilities according to different ranges of scanning. Here, for the scanning range of (−10°, +10°), n = 1 and 21 correspond to −10°and +10°, respectively; and for the scanning range of (−15°, +5°), they correspond to −15°and +5°, respectively. b Test performance for our examined scanning ranges with and without the angular attention mechanism. The graphs show the means and 95% confidence intervals  Table S1 in Supplementary Information (SI) Section S1 for more details) Next, we investigate the case when the Radon transform is not applicable, i.e. tomography under strong scattering conditions and under a similarly limited-angle scheme. The RNN is first trained with the single-pass, gradient descent-based Approximants Eq. (4) of simulated diffraction patterns (see Training and testing procedures in Materials and methods), and then tested with the simulated ones and additionally with the TV-based Approximants Eq. ; 12Þ. See "Materials and methods" for more details in the DWMA process. The evolution of the RNN output as more DWMA Approximants are presented is shown in Fig. 10 and shows a similar improvement with recurrence m as in the Radon case of Fig. 4. Also, like the Radon case, it is interesting to see that not all the Approximants are needed to acquire reasonable quality of reconstructions: the graphs in Fig. 10a saturate around m = 10 and the visual quality of the reconstructions at m = 10-12 in Fig. 10b does not largely differ. For comparison, the 3D-DenseNet architecture with skip connections in ref. 45 Table S1 in SI Section S1 parameters are shown in Fig. 11. The RNN results show substantial visual improvement, with fewer artifacts and distortions compared to the static approaches of ref. 45 , even when the number of parameters in the latter matches ours. PCC, SSIM, Wasserstein distance, and PE are used to quantify test performance using simulated and experimental data in Fig. 12.
We also conducted an ablation study of the learning architecture of Fig. 16d. Similar to the Radon case, each component in the architecture was ablated or substituted with its alternative, one at a time: (1) ReLU was ablated and then substituted with the native tanh activation function, (2) the separable convolution was ablated, thus the standard 3D convolution, and (3) the angular attention mechanism was ablated, or only the last hidden feature was given attention. The ablated architectures are also trained under the same training scheme (see Training the recurrent neural network in "Materials and methods" for more details) and tested with both the simulated Eq. (4) and experimental Approximants Eq. (5).
Visually in Fig. 13, unlike the Radon case, paying attention only to the last hidden feature affects and degrades the testing performance worst. Also, it is important to note that the ablation of the separable convolution scheme brings degradation in test performance according to Fig. 13. The decrease in test performance by the substitution of ReLU with the more common tanh is comparatively marginal. These findings are supported quantitatively as well in Fig. 14.
Thus, under the strong scattering condition, we find that (1) hidden features from all angular steps need to be taken into consideration with the angular attention mechanism for reconstructions to get a better test performance although the last hidden feature is assumed to be informed of the history of the previous angular steps; (2) replacing the standard 3D convolution with the separable convolution helps when designing a recurrent unit and a convolutional encoder/decoder for tomographic reconstructions; and (3) the substitution of tanh with ReLU is still useful but may be application dependent. Here, DWMA Approximantsf  Fig. 12b can be found in Table S2 in SI Section S2

Discussion
We have proposed a new recurrent neural network scheme for a generic interior-volumetric reconstruction by processing raw inputs from different angles of illumination dynamically, i.e. as a sequence, with each new angle improving the 3D reconstruction. We found this approach to work well under two types of scattering assumptions: weak (Radon transform) and strong. In the second case, in particular, we observed significant qualitative and quantitative improvement over the static machine learning scheme of ref. 45 , where the raw inputs from all angles are processed at once by a neural network.
Through the ablation studies, we found that sandwiching the recurrent structure with some key elements between a convolutional encoder/decoder helps improve the reconstructions. We found that the angular attention mechanism takes an important role especially when the objects of interest are spatially anisotropic and performs better than placing all the attention on only the last hidden feature. Even though the last hidden feature is a nonlinear multivariate function of all the previous hidden features, as it has a propensity to reward the latter representations but the former ones 74 , the last hidden feature may not sufficiently represent all angular views. Hence, the angular attention mechanism adaptively merges information from all angles. This is particularly important for our strong scattering case as each DWMA Approximant involves a diffraction pattern of a certain illumination angle; whereas an FBP Approximant under

( ) tanh
Δn (x 10 -2 ) Fig. 13 Visual quality assessment from the ablation study on elements (see "Computational architecture in Materials and methods" for details) using experimental data. Rows 3-5 show reconstructions based on experimental data for each layer upon the ablation and substitution of ReLU activation in Eq. (10) with the more common tanh activation function instead (row 3); ablating the separable convolution scheme, thus the standard 3D convolution (row 4); ablating the angular attention mechanism and putting attention to only last hidden feature (row 5). The rows are ordered by increasing severity of the ablation effect according to Fig. 14b the weak scattering case is computed from several projections in a cumulative fashion. In addition, interestingly, the relative contributions of other elements, e.g. the separable convolution scheme and ReLU activation, differ in weak and strong scattering assumptions. The substitution of the ReLU with a more common tanh activation brings forth more severe degradation of performance under the weak scattering assumption. Thus, we suggested different guidelines for each scattering assumption.
Lastly, alternative implementations of the RNN could be considered. Examples are LSTMs, Reservoir Computing [75][76][77] , separable convolution or DenseNet variants for the encoder/decoder and dynamical units. We leave these investigations to future work.

Experiment
For the experimental study under the strong scattering assumption, the experimental data are the same as in ref. 45 , whose experimental apparatus is summarized in Fig. 15. We repeat the description here for the readers' convenience. The He-Ne laser (Thorlabs HNL210L, power: 20 mW, λ = 632.8 nm) illuminated the sample after spatial filtering and beam expansion. The illumination beam was then de-magnified by the telescope (f L 3 : f L 4 ¼ 2 : 1), and the EM-CCD (Rolera EM-C2, pixel pitch: 8 μm, acquisition window dimension: 1002 1004) captured the experimental intensity diffraction patterns. The integration time for each frame was 2 ms, and the EM gain was set to × 1. The optical power of the laser was strong enough for the captured intensities to be comfortably outside the shot-noise limited regime.
Each layer of the sample was made of fused silica slabs (n = 1.457 at 632.8 nm and at 20°C). Slab thickness was 0.5 mm, and patterns were carefully etched to the depth of 575 ± 5 nm on the top surface of each of the four slabs. To reduce the difference between refractive indices, gaps between adjacent layers were filled with oil (n = 1.4005 ± 0.0002 at 632.8 nm and at 20°C), yielding binary-phase depth of −0.323 ± 0.006 rad. The diffraction patterns used for training were prepared with simulation precisely matched to the apparatus of Fig. 15. For testing, we used a set of diffraction patterns that was acquired both through simulation (see Approximant computations in Materials and methods for details) and experiment.
For the strong scattering case, objects used for both simulation and experiment are dense-layered, transparent, i.e. of negligible amplitude modulation, and of binary refractive index. They were drawn from a database of IC layout segments 45 . The feature depth of 575 ± 5 nm and refractive index contrast 0.0565 ± 0.0002 at 632.8 nm and at 20°C were such that weak scattering assumptions are invalid and strong scattering has to be necessarily taken into account. The Fresnel number ranged from 0.7 to 5.5 for the given defocus amount Δz = 58.2 mm for the range of object feature sizes.
To implement the raw image acquisition scheme, the sample was rotated from −10°to +10°with a 1°increment along both the x and y axes, while the illumination beam and detector remained still. This resulted in N = 42 angles and intensity diffraction patterns in total. Note that ref. 45 Table S3 in SI Section S2 utilized 22 patterns out of with a 2-degree increment along both x and y axes. The comparisons we show later are still fair because we retrained all the algorithms of ref. 45 for the 42 angles and 1°increment.
Computational architecture Figure 16 shows the proposed RNN architectures for both scattering assumptions in detail. Details of the forward model and Approximant (pre-processing) algorithm, the separable-convolution GRU, convolutional encoder and decoder, and the angular attention mechanism are described in Materials and methods. The total number of parameters in both computational architectures is ∼ 21 M (more on this topic in Training the recurrent neural network in Materials and methods.).

Approximant computations
Under the weak scattering condition, amplitude phantoms with the random number between 1 and 5 of ellipsoids with arbitrarily chosen dimensions and amplitude values at random locations are illuminated within a limited-angle range along one axis, thus spatially isotropic in average. The angle is scanned from −10°to +10°with a 1°increment. Intensity patterns on a detector are simple projections of the objects along certain angles according to the Radon transform as a forward model. Filtered Backprojection (FBP) 3 is chosen to perform backward operation. Here a crude estimate of n projections, i.e. g 1 ; ; g n , using to the FBP algorithm without any regularization is the nth FBP Approximant or f 0 n . Thus, the quality of the FBP Approximant is improved as n increases. As n spans from 1 to N(=21), a sequence of the FBP Approximants f 0 1 ; f 0 2 ; ; f 0 N À Á becomes the input to an encoder and a recurrence cell as shown in Fig. 1b. The FBP Approximant sequences for training, validation, and testing are generated with these procedures and based on threedimensional simulated phantoms.
However, under the strong scattering condition, the denselayered, binary-phase object is illuminated at a sequence of angles, and the corresponding diffraction intensity patterns are captured by a detector. At the nth step of the sequence, the object is illuminated by a plane wave at angles θ nx ; θ ny À Á with respect to the propagation axis z on the xz and yz planes, respectively. Beyond the object, the scattered field propagates in free space by distance Δz to the digital camera (the numerical value is Δz = 58.2 mm. Let the forward model under the nth illumination angle be denoted as H n ; n ¼ 1; 2; ; N; that is, the nth measurement at the detector plane produced by the phase object f is g n .
The forward operators H n are obtained from the nonparaxial BPM 33,40,45 , which is less usual so we describe it in some additional detail here. Let the jth cross-section of the computational window perpendicular to z-axis be where J is the number of slices the we divide the object into, each of axial extent δz. The optical field at the (j + 1)th slice is expressed as where δz is is equal to the slab thickness, i.e. 0.5 mm; F and F À1 are the Fourier and inverse Fourier transforms, respectively; and χ 1 χ 2 denotes the Hadamard (elementwise) product of the functions χ 1 and χ 2 :The Hadamard product is the numerical implementation of the thin transparency approximation, which is inherent in the BPM. To obtain the intensity at the detector, we define the (J + 1) th slice displaced by Δz from the Jth slice (the latter is the exit  ; g n create an Approximant f ′ n by a FBP operation, and a sequence of FBP Approximants f ′ n ; n ¼ 1; ; N ¼ 21 ð Þ; is followed by an encoder and a recurrent unit. There is an angular attention block before a decoder for the 3D reconstructionf , (c-d) Strong-scattering case: the raw intensity diffraction pattern g n ; n ¼ 1; ; N ð¼ 42Þ; of the nth angular sequence step is followed by gradient ; 12, and the angular attention mechanism them into a single representation a, which is finally decoded to produce the 3D reconstructionf . For both cases, training adapts the weights of the learned operators in this architecture to minimize the training loss function E f ;f betweenf and the ground truth object f surface of the object) to yield The purpose of the Approximant, in general, is to produce a crude estimate of the volumetric reconstruction using the forward operator alone. This has been well established as a helpful form of pre-processing for subsequent treatment by machine learning algorithms 45,78 . Previous works constructed the Approximant as a singlepass gradient descent algorithm 33,45 . Here, due to the sequential nature of our reconstruction algorithm, as each intensity diffraction pattern from a new angle of illumination n is received, we instead construct a sequence of Approximants, indexed by n, by solving the problem The gradient descent update rule for this functional where f 0 ½ n ¼ 0 and s is the descent step size and in the numerical calculations was set to 0.05 and the superscript † denotes the transpose. The single-pass, gradient descentbased Approximant was used for training and testing of the RNN with simulated diffraction patterns but with an additional pre-processing step that will be explained in Eq. (7).
We also implemented a denoised TV-based Approximant, to be used only at the testing stage of the RNN with experimental diffraction patterns, where the additional preprocessing step in Eq. (7) also applies. In this case, the functional to be minimized is where the TV-regularization parameter was chosen as κ ¼ 10 À3 , and for x 2 R P Q the anisotropic l 1 -TV operator is with reflexive boundary conditions 79,80 . To produce the Approximants of experimentally obtained diffraction patterns for testing from this functional, we first ran 4 iterations of the gradient descent and ran 8 iterations of the FGP-FISTA (Fast Gradient Projection with Fast Iterative Shrinkage Thresholding Algorithm) 79,81 .

Dynamically weighted moving average
The N Approximants of the strong scattering case form a 4D spatiotemporal sequence f which we process with a Dynamical Weighted Moving Average (DWMA) operation. For the weak scattering case, we omit this operation. The purpose of the DWMA is to smooth out short-term fluctuations, such as experimental artifacts in raw intensity measurements, and highlight longer-term trends, e.g. the change of information conveyed by different forward operators along the angular axis. The resulting DWMA Approximantsf ½1 m have a shorter length M than the original Approximants f ½1 n , i.e. M < N. Also, the weights in the moving average are dynamically determined as follows. follows the convention of an additive attention mechanism 74 . α nm indicates relative importance of f ½1 n with respect tof ½1 m . Here, W m e is a hidden layer assigned for eachf ½1 m , which is subject to be trained for several epochs. The relative importance is determined by computing its associated energy e nm and the softmax function normalizes it. More details are available in the Angular attention mechanism in Materials and methods. Supplementary Information (SI) Section S3 explains why the DWMA is more favorable than the Simple Moving Average (SMA) with fixed and uniform weights, i.e. 1/M.
To be consistent, the DWMA was applied to the original Approximants for both training and testing. In this study, N w = 15, N h = 6, M = 12. These choices follow from the following considerations. We have N = 42 diffraction patterns for each sequence: 21 captured along the x-axis (1 -21; θ x ¼ À10 ; À9 circ; ; þ10 ) and the remaining ones along the y-axis (22 -42; θ y ¼ À10 ; À9 ; ; þ10 ). The DWMA is first applied to 21 patterns from x-axis rotation, which thus generates 6 averaged diffraction patterns, and then it is applied to the remaining 21 patterns from yaxis rotation, resulting in the other 6 patterns. Therefore, the input sequence to the next step in the architecture of Fig. 16c, i.e. to the encoder (see Convolutional encoder and decoder in "Materials and methods" for details), consists of a sequence of M = 12 DWMA Approximantsf ½1 m : In SI Section S4, we discuss performance change due to different ways of numbering DWMA Approximantsf ½1 m entering the neural network. SI Section S5 provides visualization of DWMA Approximants.

Separable-Convolution Gated Recurrent Unit (SC-GRU)
Recurrent neural networks involve a recurrent unit that retains memory and context based on previous inputs in a form of latent tensors or hidden units. It is well known that the LSTM is robust to instabilities in the training process. Moreover, in the LSTM, the weights applied to past inputs are updated according to usefulness, while less useful past inputs are forgotten. This encourages the most salient aspects of the input sequence to influence the output sequence 56 . Recently, the GRU was proposed as an alternative to LSTM. The GRU effectively reduces the number of parameters by merging some operations inside the LSTM, without compromising the quality of reconstructions; thus, it is expected to generalize better in many cases 69 . For this reason, we chose to utilize the GRU in this paper as well.
The governing equations of the standard GRU are as follows: where ξ m ; h m ; r m ; z m are the inputs, hidden features, reset states, and update states, respectively. Multiplication operations with weight matrices are performed in a fully connected fashion.
We modified this architecture so as to take into account the asymmetry between the lateral and axial dimensions of optical field propagation, motivated from the concept of separable convolution in deep learning 82,83 as shown in Fig. 17. This is evident even in free-space propagation, where the lateral components of the Fresnel kernel exp iπ are shift invariant and, thus, convolutional, whereas the longitudinal axis z is not. The asymmetry is also evident in nonlinear propagation, as in the BPM forward model Eq.
(1) that we used here. This does not mean that space is anisotropicof course space is isotropic! The asymmetry arises because propagation and the object are 3D, whereas the sensor is 2D. In other words, the orientation of the image plane breaks the symmetry in object space so that the scattered field from a certain voxel within the object apparently influences the scattered intensity from its neighbors at the detector plane differently in the lateral direction than in the axial direction. To account for this asymmetry in a profitable way for our learning task, we first define the operators W r ; U r ; etc. as convolutional so as to keep the number of parameters down (even though in free space propagation the axial dimension is not convolutional and under strong scattering neither dimension is nonlinear); and we constrain the convolutional kernels of the operators to be the same in the lateral dimensions x and y, and allow the axial z dimension kernel to be different. This approach justifies the term separable convolution, and we found it to be a good compromise between facilitating generalization and adhering to the physics of the problem. We also replaced the tanh activation function of the standard GRU with ReLU activation 84 as the ReLU is computationally less expensive and helpful to avoid local minima with fewer vanishing gradient problems 72,85 . The final form of our SC-GRU dynamics is where * denotes the separable convolution operation.

Convolutional encoder and decoder
CNNs are placed before and after the SC-GRU as encoder and decoder, respectively. This architectural choice was inspired by refs. [86][87][88][89] . The encoder and decoder also utilize separable convolution, in conjunction with residual learning, which is known to improve generalization in deep networks 90 . As in ref. 86 , the encoder and decoder utilize Down-Residual Blocks (DRB), Up-Residual Blocks (URB), and Residual Blocks (RB), whose details can be found in Fig. S5 in SI Section S6; however, there are no skip connections in our case, i.e. this is not a U-net 91 architecture. The encoder learns how to map its input (i.e. the FBP Approximant f 0 n or the DWMA Approximantf ½1 m sequence) onto a low-dimensional nonlinear manifold. For the weak scattering case, the compression factor for both lateral and axial input dimensions is 8, whereas for the strong scattering case, the compression factor is 16 for the lateral input dimensions, but the axial dimension is left intact. This eases the burden on the training process as the number of parameters is reduced; more importantly, encoding abstracts features out of the high-dimensional inputs, passing latent tensors over to the recurrent unit. Letting the encoder for the mth Approximant be symbolized as Enc m Á ð Þ, ξ m ¼ in Eq. (10). The decoder restores the output of the RNN to the native dimension of the object we are reconstructing.

Angular attention mechanism
Each raw image (either a projection under a weak scattering assumption or a diffraction pattern under a strong scattering condition) from a new angle of illumination is combined at the SC-GRU input with the hidden feature from the same SC-GRU's previous output. After N iterations, there are N different hidden features resulting from N illumination angles, as seen in Eq. (10). Since the forward operator under both scattering assumptions is object dependent, the qualitative information that each such new angle conveys will vary with the object. It then becomes interesting to consider whether some angles of illumination convey more information than others.
The analog in temporal dynamical systems, the usual domain of application for RNNs, is the attention mechanism. It decides which elements of the system's state are the most informative. In our case, of course, time has been replaced by the angle of illumination, so we refer to the same mechanism as angular attention: it evaluates the relative importance of information from each illumination angle in generating the overall reconstruction and thus adaptively assigns different weights to every angle on how much attention should be paid to.
Following the convention of an additive attention mechanism 74 , we compute the weight α m from its associated energy e m as output of a neural network with a hidden layer W e as The output of the angular attention as a single representation a is then computed from a linear combination of the hidden features as where a can be also viewed as the expected hidden representation since the weight α m is essentially a probability that a hidden representation h m is taken into account to the angular attention output a. There is an alternative, dot-product attention mechanism 92 , but we chose not to implement it here.

Training the recurrent neural network
For the weak scattering case, 2000 and 400 synthetic amplitude phantoms are used for training and validation, respectively. Projections are acquired by the Radon transform along several angles within the limited angular range, as described in Approximant computations in Materials and methods. The FBP Approximants are obtained by the FBP algorithm from the projections.
For the strong scattering case, 5000 and 500 synthetic layered objects are used for training and validation, respectively. For each object, a sequence of intensity diffraction patterns from the N = 42 angles of illumination are produced by BPM, as described earlier. The Approximants are obtained each as a single iteration of the gradient descent, followed by the DWMA process.
For both scattering cases, all of the architectures are trained with a Training Loss Function (TLF) of negative Pearson Correlation Coefficient (NPCC) 93 : where f andf are a ground truth image and its corresponding reconstruction. In this article, our NPCC function is defined to perform computation in 3D. We use a stochastic gradient descent scheme with the Adam optimizer 94 . The learning rate is set to be 10 −3 initially and halved whenever validation loss plateaued for 5 consecutive epochs. The lower bound is set to be 10 −6 , and the batch size is set to be 10. The computer used for training has Intel Xeon Gold 6248 CPU at 2.50 GHz with 27.5 MB cache, 384 GB RAM, and dual NVIDIA Volta V100 GPUs with 32 GB VRAM 95 , and it took approximately 5 min per each training epoch.
For comparison, we also re-train the 3D-DenseNet architecture with skip connections in ref. 45 with the same training scheme above. This serves as baseline; however, the number of parameters in this network is 0.5 M, whereas in our RNN architecture the number of parameters is 21 M. We also train an enhanced version of the 3D-DenseNet by tuning the number of dense blocks, the number of layers inside each dense block, filter size, and growth rate to match the total number of parameters with that of the RNN, i.e. 21 M. These two versions of the 3D-DenseNet are referred to as Baseline (0.5 M) and Baseline (21 M), respectively, in Figs. 8,9,11, and 12.

Testing procedures and metrics
Test performance was demonstrated with only the simulated projections under the weak scattering condition. The projections were processed with the FBP to generate 100 sequences of FBP Approximants f 0 n for testing the trained network. Under the strong scattering condition, both simulated and experimental diffraction patterns were used for testing, but the patterns were processed differently. The simulated patterns were directly processed with a single-pass gradient descent to generate 50 sequences of original Approximants f 1 ½ n Eq. (4), whereas a simple affine transform was first applied to the raw experimentally obtained intensity diffraction patterns of an actual layered object to correct slight misalignment. We then applied the gradient descent up to 4 iterations and the FGP-FISTA up to 8 iterations to the corrected experimental patterns, to compute one set of TV-based Approximants Eq. (5). Testing process took approximately 300 ms for generating each volumetric reconstruction.
Even though training used NPCC as in Eq. (13). we investigated two additional metrics for testing: PE and the Wasserstein distance 96,97 . We also quantified test performance using the SSIM 98 .
PE is the mean absolute error between two binary objects; in the digital communication community it is instead referred to as Bit Error Rate (BER). It is the first time to our knowledge to use PE as a quantitative metric in tomography. To obtain the PE, we first threshold the reconstructions (see SI Section S7 for details in the thresholding process) and then define PE ¼ # false negatives ð Þ þ# false positives ð Þ total # pixels ð14Þ We found that it oftentimes helps to accentuate the differences between a binary phase ground truth object and its binarized reconstruction as even small residual artifacts, if they are above the threshold, are thresholded to be one, and thus they are taken into account to the probability of error calculation more than they would have been to other metrics. With these procedures, PE is a particularly suitable error metric for the kind of objects we consider in this paper.
PE is also closely related to the two-dimensional Wasserstein distance as we will now show through an analytical derivation. The latter metric involves an optimization process in terms of a transport plan to minimize the total cost of transport from a source distribution to a target distribution. The two-dimensional Wasserstein distance is defined as W p¼1 ¼ min p P; C ¼ min p P ij P kl γ ij;kl C ij;kl s:t: P kl γ ij;kl ¼ f ij ; P ij γ ij;kl ¼ g kl ; γ ij;kl ! 0 where f ij and g kl are a ground truth binary object and its binary reconstruction, i.e. f ij ; g kl ; γ ij;kl 2 0; 1 f g, a coupling tensor P ¼ γ ij;kl , and a cost tensor C ij;kl ¼ x ij À x kl : PE can be reduced to have a similar, but not equivalent, form to that of the Wasserstein distance. For i.j,k,l where γ ij;kl ≠ 0,