Attentional Ptycho-Tomography (APT) for three-dimensional nanoscale X-ray imaging with minimal data acquisition and computation time

Noninvasive X-ray imaging of nanoscale three-dimensional objects, such as integrated circuits (ICs), generally requires two types of scanning: ptychographic, which is translational and returns estimates of the complex electromagnetic field through the IC, combined with a tomographic scan, which collects these complex field projections from multiple angles. Here, we present Attentional Ptycho-Tomography (APT), an approach that drastically reduces the amount of angular scanning, and thus the total acquisition time. APT is machine learning-based, utilizing axial self-Attention for Ptycho-Tomographic reconstruction. APT is trained to obtain accurate reconstructions of the ICs despite the incompleteness of the measurements. The training process includes regularizing priors in the form of typical patterns found in IC interiors, together with the physics of X-ray propagation through the IC. We show that APT with ×12 fewer angles achieves fidelity comparable to the gold standard Simultaneous Algebraic Reconstruction Technique (SART) with the original set of angles. When using the same reduced set of angles, APT also outperforms Filtered Back Projection (FBP), the Simultaneous Iterative Reconstruction Technique (SIRT) and SART. The time needed to compute the reconstruction is also reduced, because inference through the trained neural network is a single forward operation, unlike the iterative nature of these alternatives. Our experiments show that, without loss in quality, for a 4.48 × 93.2 × 3.92 µm³ IC (≃6 × 10⁸ voxels), APT reduces the total data acquisition and computation time from 67.96 h to 38 min. We expect our physics-assisted and attention-utilizing machine learning framework to be applicable to other branches of nanoscale imaging, including materials science and biological imaging.


Introduction
Three-dimensional X-ray imaging enables noninvasive monitoring of objects' interiors with nanoscale resolution. Integrated circuits (ICs) are especially interesting subjects for this operation, for two reasons: first, noninvasive inspection of ICs is important for verifying manufacturing integrity. Second, ICs follow specific design rules, which makes their geometries highly regular and yet highly diverse. These geometrical properties are useful as prior knowledge, enabling vast improvements in practical aspects of the imaging process, such as acquisition time, as we show here.
Prior works have typically used two types of scanning: translational and rotational. The translational scan (ptycho) is inspired by ptychography, i.e. a scanning-based coherent diffraction imaging method for phase retrieval. Ptychography was originally proposed by W. Hoppe (1) to solve the phase problem in Scanning Transmission Electron Microscopy (STEM), where a moving aperture resolves the ambiguity in phase based on translational invariance. The term "ptychography" was coined in the following year (2). Nellist et al. (3) demonstrated resolution improvement in STEM by a factor of 2.5 over the limit imposed by partial coherence, exploiting the redundancy in the ptychographic measurements. As an alternative that does not even require careful aberration correction in the optics, Gerchberg and Saxton (4) introduced a lensless iterative phase retrieval algorithm, now referred to as GS after them. This work was extended to lensless ptychography for extended objects by Faulkner (5). Subsequently, Rodenburg (6) introduced yet another iterative phase retrieval algorithm, the Ptychographical Iterative Engine (PIE), which simultaneously retrieves both the object and the probe function. Thus, the requirement of a high-quality lens for imaging is fundamentally lifted. Further advances by Thibault et al. (7) and Thibault and Guizar-Sicairos (8) led to the Difference Map (DM) algorithm and the Maximum Likelihood algorithm, respectively, for iterative ptychographic reconstruction.
After the ptychographic reconstruction step, a second, angular scan (tomo) is required to retrieve the object's interior, as in tomography. For parallel-beam illumination and under the weak scattering approximation, the measurements are interpreted simply as projections through the object, i.e. they implement the object's Radon transform (9, 10). The inverse Radon transform is typically implemented as a version of the Filtered Back-Projection (FBP) algorithm, first proposed by Bracewell and Riddle (11, 12). Gordon, Bender, and Herman (13) proposed an alternative iterative tomography algorithm, the Algebraic Reconstruction Technique (ART), which also applies to non-parallel illumination beams and works by updating the object estimate to sequentially bring each reconstructed projection into agreement with the corresponding measured projection. Subsequent improvements of this original iterative method were the Simultaneous Iterative Reconstruction Technique (SIRT) (14) and the Simultaneous Algebraic Reconstruction Technique (SART) (15), which consider all projections simultaneously and thus drastically reduce the number of iterations needed for the reconstruction. Maximum Likelihood methods have also been popular for tomography, with the Bouman-Sauer algorithm (16) as one of the most prominent.

Here, we propose a machine learning framework to reduce data acquisition and computation time for IC reconstruction under the X-ray ptycho-tomography geometry. The reduction in data acquisition is compensated by explicit use of prior knowledge of the typical objects being imaged and of the optical physics of the imaging system. Both the acquisition and computation times scale with the number N of tomo-scans. The total angular range θ determines the size of the missing wedge in the Fourier domain and is therefore commensurate with loss of fidelity. Our "gold standard" is a ptycho-tomo reconstruction by SART with N = 349 and θ = ±70.4°. This maximum angle is determined by practical considerations, such as the sample geometry. More details about the gold standard geometry and our approach are available in Methods.
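To make the algebraic reconstruction idea concrete, here is a minimal NumPy sketch of the SART update on a toy two-angle problem; the system matrix, sizes, and relaxation factor are illustrative stand-ins for a real ray-driven projector:

```python
import numpy as np

# Toy parallel-beam setup: a 2x2 image measured along two "angles"
# (horizontal ray sums = row sums, vertical ray sums = column sums).
# A maps the flattened image [x00, x01, x10, x11] to 4 ray sums.
A = np.array([[1., 1., 0., 0.],   # row 0
              [0., 0., 1., 1.],   # row 1
              [1., 0., 1., 0.],   # column 0
              [0., 1., 0., 1.]])  # column 1
x_true = np.array([3., 1., 2., 4.])
b = A @ x_true                     # "measured" projections

def sart(A, b, n_iter=60, relax=1.0):
    """SART update: all rays simultaneously, normalized by row/column sums."""
    row_sum, col_sum = A.sum(axis=1), A.sum(axis=0)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = x + relax * (A.T @ ((b - A @ x) / row_sum)) / col_sum
    return x

x_hat = sart(A, b)
print(np.allclose(A @ x_hat, b))   # True: consistent with all measurements
# With so few angles the solution is not unique (x_true + t*[1,-1,-1,1]
# fits equally well), which is exactly the ambiguity that priors resolve.
```

The non-uniqueness in this toy example is the two-angle analogue of the missing wedge: data consistency alone cannot pin down the object, motivating the regularizing priors used later in the paper.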
To search this two-dimensional space (N, θ), our strategy is as follows: we start with the gold standard nominal values of N and θ. If we reduced N while using a standard reconstruction algorithm, such as the FBP or SIRT mentioned earlier, performance would degrade immediately.
With machine learning, we find that it is possible to regularize for the loss of angular sampling density and still maintain reconstruction fidelity, down to a minimum number N*. Then we start reducing the total angular range, meaning that the sampling now becomes denser. The machine-learning regularizer again manages to maintain approximately even fidelity down to a minimum range θ*. This is our optimal operating point (N*, θ*). The strategy is depicted in Fig. 1b. In principle, this procedure could be repeated to find even tighter operating conditions, but we did not carry that out, as we would expect any further gains to be minimal. That machine learning can maintain even fidelity while the amount of sampled data decreases is, by now, not entirely surprising. The key is the ability of deep neural networks to very effectively capture regularizing priors, especially sparsity, both in supervised mode, as we do here, and in untrained mode (21, 22). Previous demonstrations of supervised learning have been carried out for Fourier ptychography (23, 24) and two-dimensional ptychography (25). We chose the supervised learning mode because we had ample data available from the gold standard ptycho-tomo approach.
APT is described schematically in Fig. 1c. We first invert the far-field diffraction intensities (or ptycho-scans) with an approximate inversion operator. This yields an approximate volumetric estimate of the interior of an IC chip, which we dub the "Approximant" (26). This step utilizes prior knowledge of the underlying imaging physics, pre-processing the input with the physics prior. The Approximant resulting from this pre-processing (or physics-informing) step is defective in the sense that layers are not well separated, because the inversion from diffraction intensities is approximate and only a small fraction of the tomo-scans is used for the computation.
During training, the neural network's weights are optimized with the Approximant as input. Upon completion, the trained neural network gives a refined volumetric reconstruction of the ICs.
The proposed neural network is based on a 3D U-net architecture (27, 28), augmented with multi-head axial self-attention (29). The attention mechanism addresses the lack of spatial resolution in the Approximant: its global-range interactions retrieve information from all layers to resolve each layer's structure. We chose multi-head axial self-attention over full multi-head self-attention (30) to alleviate the computational burden.
We demonstrate that the present method is capable of providing reliable reconstructions of ICs even when the number and the total angular range of tomo-scans are greatly reduced, to N* ∼ 29 and θ* ∼ ±17°, representing improvements of ×12 and ×4.2, respectively. For the reconstruction of an IC chip over the test volume (4.48 × 93.2 × 3.92 µm³), 0.63 hours (or 38 minutes) is sufficient for both data acquisition and reconstruction with our machine learning framework. The improvements work out to an approximate overall ×108 reduction in total (acquisition plus computation) time compared to the current state-of-the-art iterative reconstruction method.

Reducing acquisition and scanning time
The synchrotron beam is delivered onto the sample, and a full lateral scan is carried out to obtain the ptychographic information for each angular orientation of the sample. Repeating for N angles collects tomographic information for the interior's reconstruction.

Regularization and imaging system physics
The reported improvements suggest that the APT algorithm is particularly effective at learning regularizing priors to compensate for the missing information. Fig. 6a shows the power spectral densities of the gold standard, the APT reconstruction, and the baseline tomographic reconstruction methods FBP (11), SIRT (14), and SART (15), all obtained at N* = 29 and θ* = ±16.8°. The missing wedge is evident in the latter three. The qualitative cross-sections in Fig. 6b confirm that the missing wedge effect leads to severe artifacts in the baseline methods, but not in APT.
APT also relies on its input, the Approximant, having carefully taken into account the physics of the imaging system. Unlike earlier works where the illumination on the sample was coherent (31, 32), the synchrotron beam may be considered temporally coherent but is much less coherent spatially. The mutual intensity is expressed as a linear combination of mutually incoherent states, also known as coherent modes (36). Accounting correctly for the synchrotron X-ray's coherence state has been shown to improve spatial resolution and phase contrast in standard ptychography for thin samples (37).
For samples thicker than the depth of focus of the probe, multi-slice reconstruction from simple ptychography has been demonstrated with visible light (38), X-rays (39) and electrons (40). This is the starting point for our Approximant (see Fig. 1a). We form the amplitude-based cost function

L = Σ_{n=1}^{N} Σ_{j=1}^{J_n} Σ_q ( √( Σ_{m=1}^{M} |F ψ^m_{n,j,L}(q)|² ) − √( I_{n,j}(q) ) )²,   (1)

where N is the number of given tomo-scans; J_n the number of ptycho-scans associated with the n-th tomo-scan; M the number of coherent modes; L the number of slices, which for our given depth of focus works out to 5; q denotes the coordinates in reciprocal space; and ψ^m_{n,j,L}(q) and I_{n,j}(q) indicate the wavefield before the L-th slice from the m-th coherent mode and the experimental diffraction intensity at the j-th ptycho- and n-th tomo-scan, respectively. We run two iterations of a gradient scheme on Eq. 1 and obtain the argument ∠O at each one of the L = 5 slices (38, 40). We rotate the result back to the original coordinate system, and average the estimates from all tomo-scan steps to yield the final Approximant. More details can be found in Methods.
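As an illustration of the forward model underlying this cost function, the following NumPy sketch propagates a few coherent modes through a stack of thin phase slices and evaluates the amplitude-based mismatch for a single scan position; all sizes, the propagator constant, and the random fields are toy assumptions, not the experimental values:

```python
import numpy as np

n, L, M = 64, 5, 3                       # grid size, slices, coherent modes
rng = np.random.default_rng(1)

# Thin-slice transmission functions exp(i*phase) for each of the L slices.
slices = np.exp(1j * 0.1 * rng.standard_normal((L, n, n)))

# Mutually incoherent probe modes (random fields standing in for the
# orthogonalized synchrotron modes).
modes = rng.standard_normal((M, n, n)) + 1j * rng.standard_normal((M, n, n))

# Illustrative Fresnel transfer function for the inter-slice spacing.
fx = np.fft.fftfreq(n)
H = np.exp(-1j * np.pi * 0.5 * (fx[:, None]**2 + fx[None, :]**2))

def forward(modes, slices):
    """Propagate each coherent mode through all slices; return far-field
    intensity as the incoherent sum over modes."""
    I = np.zeros((n, n))
    for psi in modes:
        for O_l in slices:
            psi = np.fft.ifft2(np.fft.fft2(psi * O_l) * H)
        Psi = np.fft.fft2(psi)           # far field of the exit wave
        I += np.abs(Psi)**2
    return I

I_meas = forward(modes, slices)                   # simulated "measurement"
I_model = forward(modes, np.ones_like(slices))    # wrong (empty) object

# Amplitude-based mismatch of Eq. 1 for one ptycho-/tomo-scan position.
cost = np.sum((np.sqrt(I_model) - np.sqrt(I_meas))**2)
print(cost > 0)   # the empty object does not explain the data
```

The full cost of Eq. 1 sums this mismatch over all ptycho- and tomo-scan positions; the incoherent sum over modes is what encodes the mixed-state nature of the synchrotron beam.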
The Approximant computation step is the slowest in the pipeline; on our computing hardware (see Methods), it takes 36 minutes when θ = ±70.4° and 26 minutes when θ ∼ ±17°. In addition to the computation time, the spacing between slices in the Approximant is nominally limited by the depth of focus, which is why we only reconstruct L = 5 of them. The number of desired reconstruction slices is much larger, i.e. 280, so we simply dilate the Approximant slices to match it. As a result, the input to the neural network is poor (more in Supplementary Materials). Nevertheless, the subsequent APT architecture learns how to use the multi-slices as input and, as long as N > N* and θ > θ*, produces a high-fidelity final reconstruction with much finer slice spacing.
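A minimal sketch of this slice dilation step, assuming a nearest-neighbor repeat along the depth axis (the actual interpolation scheme may differ), with toy lateral sizes:

```python
import numpy as np

# The Approximant has only L = 5 depth-of-focus-limited slices, while the
# network output needs 280: repeat each slice 280 // 5 = 56 times along depth.
approximant = np.random.rand(5, 64, 64)          # (slices, y, x), toy sizes
dilated = np.repeat(approximant, 280 // 5, axis=0)
print(dilated.shape)                              # (280, 64, 64)
```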

Discussion
APT is trained using the gold standard reconstructions of randomly selected segments from a single IC specimen, which was made available for our experiments. This prompts us to address two related concerns: (1) what can we guarantee about the fidelity of the gold standard and, hence, of our reconstructions vis-à-vis the ground truth, i.e. the physical specimen? and (2) is APT overtrained to this specific IC?
The first concern was partially addressed by Refs. (31, 32), where the design files of the geometrical features were treated as ground truth. (That method was still bound by the assumption that the physical specimen matched the design files; but that was less of a concern, given the size scales involved.) Neither of these algorithms would have worked in the case reported here, because of the great range of feature sizes in the specimen and because the synchrotron X-ray is not spatially coherent. Moreover, the design files for the specimen are not available to us. On the other hand, the gold standard was obtained quite thoroughly, with a 95% spatial overlap factor in the ptycho-scan and N = 349 angles in the tomo-scan. Besides, there are no discernible visual artifacts in the gold standard reconstructions. These facts provide us with reasonable assurance about the fairness of our comparisons in Figs. 2-6.
Regarding the second concern: if new structures are given whose priors are significantly different from the priors learnt here (e.g. features oriented at 45°), then APT would have to be retrained. This is a necessary limitation of our supervised learning approach. The same holds for non-IC objects such as viruses. If, moreover, not enough physical specimens are available for supervised training, then it is possible to train by rigorously simulating the forward propagation of X-rays through the specimen (as Refs. (31, 32) did for visible light) or to use "untrained" methods, such as the deep image prior (21, 41).
The reported best values of N* ∼ 29 and θ* ∼ ±17° are not fundamental, but indicative of the effectiveness of IC geometries acting as regularizing priors. Less complex geometries, smooth and with less content at high frequencies in the missing wedge, could achieve even better reductions, whereas complex structures with smaller features and higher refractive index contrast would be more limited. A full theoretical analysis of how N* and θ* depend on the complexity of the prior is beyond the scope of this work.
Lastly, regarding ICs in particular and planar samples more generally, the total attenuation of the X-rays increases at large angles, which leads to artifacts.It may be compensated computationally, or by scanning the illumination wavevectors on a conical surface.The latter scheme is referred to as laminography (42,43).It is beyond the scope of our present work, but it would be interesting to investigate if approaches similar to the one reported here are applicable.

Methods
Experiment and the gold standard preparation. As a first step toward the gold standard reconstruction, a two-dimensional projection was reconstructed for each tomo-scan with 600 iterations of the least-squares maximum likelihood ptychographic algorithm (44) as implemented in PtychoShelves (45). Raw diffraction intensities of 256 × 256 px² were downsampled by ×2 to accelerate the computation. The ptychographic reconstructions for all 349 tomo-scans were processed with 8 Tesla V100 GPUs in parallel to expedite the process, taking 362.09 hrs for this step.
The projections were then aligned to a tomographic rotation axis, with an additional correction in the form of a phase ramp removal process. Next, a deep neural network pre-trained on similar images of integrated circuits was applied to the aligned projections for upsampling by ×2 (46, 47). The elapsed time of this step was approximately 5 hrs.
Lastly, the final tomographic reconstruction was performed using the 349 upsampled projections with 10 iterations of SART, generating the final three-dimensional reconstruction of the IC sample with an isotropic 14-nm voxel size; this took 1 hr with 8 Tesla V100 GPUs.

Gradient calculation
Considering the mixed-state (spatially partially coherent) nature of synchrotron X-rays and the multi-slice structure of the IC sample, the forward model can be formulated as the standard multi-slice recursion

ψ^m_{n,j,l+1}(r) = P_Δz [ ψ^m_{n,j,l}(r) · O_{n,l}(r) ],   l = 1, . . . , L − 1,

where P_Δz denotes free-space propagation over the inter-slice spacing Δz and O_{n,l}(r) is the l-th slice of the object viewed at the n-th tomo-scan.
The following describes the gradient computation of the loss function in Eq. 1 based on this forward model; in practice it is carried out automatically by PtychoShelves (45). The gradients of the loss function with respect to the wavefield ψ and the complex object O follow from the chain rule applied through the multi-slice recursion. With two iterations of gradient descent on the loss function in Eq. 1, we obtain the multi-slice object estimate for each tomo-scan and subsequently its argument at each one of the L = 5 slices. For the final Approximant, we rotate the results back to the original coordinate system and average the N estimates from all N tomo-scans. Please see Supplementary Materials for visualization. More details on the gradient calculation can be found in Refs. (38, 40).
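As a sanity check on this kind of gradient, the following NumPy sketch differentiates a single-position amplitude loss with respect to the exit wave and verifies the closed-form Wirtinger gradient against finite differences. It is a deliberately simplified stand-in (single slice, single mode, toy sizes) for what PtychoShelves computes automatically:

```python
import numpy as np

# For L(psi) = sum_q (|F psi|(q) - sqrt(I(q)))^2, the Wirtinger gradient
# with respect to psi is F^{-1}[ Psi - sqrt(I) * Psi/|Psi| ] times the
# number of pixels (from numpy's unnormalized forward FFT convention).
rng = np.random.default_rng(2)
n = 8
psi = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
I = rng.random((n, n)) + 0.5                      # synthetic "measurement"

def loss(psi):
    Psi = np.fft.fft2(psi)
    return np.sum((np.abs(Psi) - np.sqrt(I))**2)

def grad(psi):
    Psi = np.fft.fft2(psi)
    return np.fft.ifft2(Psi - np.sqrt(I) * Psi / np.abs(Psi)) * psi.size

g = grad(psi)
# Central finite-difference check of the real part at one pixel: for a real
# perturbation, dL/d(eps) = 2 * Re(grad) at that pixel.
eps = 1e-6
d = np.zeros_like(psi); d[3, 4] = eps
fd = (loss(psi + d) - loss(psi - d)) / (2 * eps)
print(np.isclose(fd, 2 * g[3, 4].real, rtol=1e-4))  # True
```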

Machine learning framework
Our neural network architecture is based on a 3D U-net structure (27,28) augmented with multi-head axial self-attention ("axial self-attention" in short) (29).The U-net directly transfers multi-scale features to its decoder arm to preserve spatial information, and the axial attention augments the features with its global-range self-interactions.
The U-net backbone encoder design was influenced by the well-established ResNet50 architecture (48), with some modifications so that it can accommodate 3D instead of 2D data. The architecture's decoder then upsamples the features by ×2, resulting in isotropic voxels of linear size 14 nm. More details can be found in Supplementary Materials.
The encoder's low-dimensional manifolds are further enhanced by the axial self-attention, which was proposed to reduce the computational complexity of multi-head self-attention ("self-attention" in short) (30). Axial self-attention factorizes 3D self-attention into three 1D axial self-attention modules, thus reducing the complexity from O(N³) to O(3N). Each axial self-attention module attends to voxels along one of the x, y, z axes. Fig. 7 visualizes learned attention weights that quantify the normalized "contribution" of other layers s_j (j = 1, · · · , N) to the layer s_i. We assume that the information of layer s_i is spread along the layers s_j due to lack of spatial resolution; the axial self-attention therefore gathers the scattered information from those layers to resolve layer s_i with global-range interactions. Note that in this paper we used PyTorch instead of the original TensorFlow implementation (29), and our code is publicly available at https://github.com/iksungk/APT.
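A minimal single-head version of one axial module, in plain NumPy (sizes and projection matrices are illustrative; the actual implementation uses PyTorch and multiple heads):

```python
import numpy as np

rng = np.random.default_rng(3)
Z, Y, X, C = 10, 4, 4, 8                 # depth, height, width, channels
feats = rng.standard_normal((Z, Y, X, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention_z(x):
    """Each (y, x) column attends over its Z positions independently,
    instead of over all Z*Y*X positions jointly."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # scores[..., i, j]: contribution of layer j to layer i per column
    scores = np.einsum('iyxc,jyxc->yxij', q, k) / np.sqrt(C)
    attn = softmax(scores, axis=-1)      # rows sum to 1 over the Z axis
    return np.einsum('yxij,jyxc->iyxc', attn, v), attn

out, attn = axial_attention_z(feats)
print(out.shape, np.allclose(attn.sum(-1), 1.0))  # (10, 4, 4, 8) True
```

Analogous modules along y and x complete the factorization; the attention maps `attn` are the quantities visualized in Fig. 7.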

Training and testing environments
To prepare a paired dataset for training and testing, both the Approximant and the gold standard are divided into smaller volumes of 1.792 × 1.792 × 3.92 µm³ with 50% lateral overlap.
Then, we split the paired dataset into two non-overlapping sub-datasets.One set is reserved for training, and the other for testing.The training and test samples were drawn so as to not be correlated accidentally by spatial overlap during the ptycho-and tomo-scan operations.
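The patch extraction above can be sketched as follows, with illustrative array and patch sizes standing in for the real 1.792 × 1.792 × 3.92 µm³ sub-volumes:

```python
import numpy as np

# Divide a volume into laterally overlapping patches: 50% overlap in y and
# x (stride = half the patch size), full extent in depth.
vol = np.arange(16 * 16 * 4).reshape(16, 16, 4).astype(float)  # (y, x, z)
py, px = 8, 8                   # patch size -> stride 4 for 50% overlap
sy, sx = py // 2, px // 2

patches = [vol[i:i + py, j:j + px, :]
           for i in range(0, vol.shape[0] - py + 1, sy)
           for j in range(0, vol.shape[1] - px + 1, sx)]
print(len(patches), patches[0].shape)   # 9 (8, 8, 4)
```

Because adjacent patches share half their lateral extent, the train/test split must be made so that no test patch overlaps a training patch, which is the correlation concern noted above.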
For training, we use the negative Pearson correlation coefficient (NPCC) as the training loss function (31, 32, 49) and the Adam optimizer for stochastic gradient descent (50), with an initial learning rate of 2 × 10⁻⁴, β₁ = 0.9, β₂ = 0.999, and no weight decay. We also update the learning rate according to a polynomial decay rule (51), lr(t) = lr₀ × (1 − t/T)^p, where T = 200 and p = 0.9. We run the training process for 150 epochs and stabilize it with a mini-batch learning strategy (52) with a batch size of 4.
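A NumPy sketch of the NPCC loss and of our reading of the polynomial learning-rate rule (lr₀ = 2 × 10⁻⁴, T = 200, p = 0.9 as stated in the text):

```python
import numpy as np

def npcc(pred, target):
    """Negative Pearson correlation coefficient: -1 is a perfect match
    (up to an affine transform), 0 is no correlation."""
    p = pred - pred.mean()
    t = target - target.mean()
    return -(p * t).sum() / np.sqrt((p**2).sum() * (t**2).sum())

def poly_lr(t, lr0=2e-4, T=200, p=0.9):
    """Polynomial learning-rate decay: lr0 * (1 - t/T)^p."""
    return lr0 * (1.0 - t / T) ** p

x = np.random.default_rng(4).standard_normal(1000)
print(np.isclose(npcc(x, 2 * x + 3), -1.0))   # True: affine match
print(poly_lr(100) < poly_lr(0))              # True: monotonic decay
```

The affine invariance of the NPCC is useful here because the absolute scale of the refractive-index contrast need not be matched exactly, only the structure.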

Quantitative metrics
Because each voxel of an IC is occupied by a single material, even though ICs are printed with various materials such as copper, aluminum, and tungsten, ICs can be comfortably classified into M-ary labels irrespective of the printing material. To simplify further, we binarize the gold standard by thresholding according to the presence of a metal or silicon within each voxel. The gold standard reconstruction, however, may still be ambiguous, especially for longitudinal features, due to the missing wedge in the Fourier domain: the tomographic scheme does not cover the entire angular range, i.e. ±90°. Since the gold standard suffers from extensive errors in these ambiguous layers, we exclude them from our quantitative evaluations as well. More details can be found in Supplementary Materials.
The quantitative comparisons in Figs. 3 and 5 use the PCC (33, 49) and the MS-SSIM with the same weights as in the original reference (34).
The remaining two metrics are the DSC (35) and the BER. The former is a widely accepted similarity measure in image segmentation for comparing an algorithm's output against its reference in medical applications (53, 54). The BER measures the ratio of erroneously classified voxels over the total number of voxels, and it is applicable because of our binarization approach. Both of these metrics are probabilistic in the sense that they involve the estimation of probability density functions. They are obtained as DSC = 2|A ∩ B| / (|A| + |B|), where A and B are the sets of voxels classified as metal in the reconstruction and in the reference, respectively, and BER as the fraction of misclassified voxels over the total voxel count.

For X-rays, the high penetration depth facilitates recovery of information deep inside the sample in the angular sampling scheme. Combining this property with translational scanning for lensless high spatial resolution, Dierolf et al. (17) proposed the Ptychographic X-ray Computed Tomography (PXCT) scheme to determine the volumetric interior of biological specimens with nanoscale detail. Using this technique, Holler et al. (18) experimentally demonstrated noninvasive imaging of ICs produced with 22-nm technology at 14.6-nm resolution. These techniques are limited by the requirement for two types of scanning, angular and translational, and scale badly with object volume. A novel X-ray microscope called the Velociprobe (19) utilizes fly-scan ptychography (20) to significantly reduce the data acquisition time. Still, the total data acquisition and reconstruction time for a typical 100 × 100 × 5 µm³ IC (∼2 × 10¹⁰ voxels) is estimated to be in excess of two months.
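Returning to the two segmentation metrics: a minimal sketch of the DSC and BER computations on toy binarized arrays (illustrative values only):

```python
import numpy as np

def dsc(pred, ref):
    """Dice Similarity Coefficient: 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(pred, ref).sum()
    return 2.0 * inter / (pred.sum() + ref.sum())

def ber(pred, ref):
    """Bit-Error Rate: fraction of misclassified voxels."""
    return np.mean(pred != ref)

ref  = np.array([1, 1, 0, 0, 1, 0], bool)
pred = np.array([1, 0, 0, 0, 1, 1], bool)
print(dsc(pred, ref))   # 2*2 / (3+3) ≈ 0.667
print(ber(pred, ref))   # 2/6 ≈ 0.333
```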
The raw intensities past the sample are recorded by a digital camera detector at each scan position. The details of the experimental collection system are in Methods. The collected raw intensities are then processed in two steps: the first step embeds the physics of X-ray propagation through an Approximant operator (31, 32), while the second step consists of the APT network delivering the final reconstruction, as described earlier. The details of training and operating this computational pipeline are in Methods. As discussed earlier, our approach is to first reduce scanning time by finding the minimum N* and then reduce computation time by finding the minimum θ*. A parameter sweep over N is shown qualitatively in Fig. 2. Four quantitative performance comparisons are shown in Fig. 3, in terms of the following metrics: the Pearson correlation coefficient (PCC) (33), the multi-scale structural similarity index metric (MS-SSIM) (34), the Dice Similarity coefficient (DSC) (35), and the Bit-Error Rate (BER; more details available in Methods). Both analyses indicate that N* ∼ 29, representing a reduction of more than ×12 over the gold standard of N = 349. Reducing N significantly below this value results in noticeable degradation, both qualitatively and quantitatively. Next we fix N = 29 and perform a parameter sweep over θ. Qualitative results are shown in Fig. 4, while the quantitative evaluation according to the same four metrics is in Fig. 5. Both analyses lead to θ* ∼ ±17° as the approximate lower bound before drastic degradation occurs. The savings in data acquisition and computation times are ×12 and ×105, respectively, for total time savings (acquisition plus computation) of ×108.

Integrated circuits produced with 16-nm technology, of size 25.1 × 93.2 × 3.92 µm³, were laterally scanned with 8.8-keV synchrotron X-rays for each tomo-scan with the Velociprobe (19) at the Advanced Photon Source (APS) of Argonne National Laboratory (ANL). 12 coherent modes of the synchrotron X-ray were used for the experiment. Tomo-scans were carried out from -70.4° to 70.4° with an angular increment of 0.4°, and for each tomo-scan, ptycho-scans were recorded on-the-fly at ∼60k lateral positions with a Dectris Eiger X 500K area detector (pixel size: 75 µm, sample-to-detector distance: 1.92 m) at a frame rate of 500 Hz. The elapsed time of this whole data acquisition process (translational and angular) was 12.51 hrs, or 129 seconds per tomo-scan.

Fig. 1: X-ray ptycho-tomography and the implementation of APT. (a) Brief schematic of the X-ray ptycho-tomography geometry, with translational scanning of synchrotron X-rays (ptycho-scans) and symmetric angular scanning of the IC sample with uniform angular increment (tomo-scans). (b) The gold standard uses 349 tomo-scans within the angular range of ±70.4°, but our machine learning framework (APT) uses fewer tomo-scans, optimized in two steps. (c) Diffraction intensities are pre-processed with an approximate inverse operator to generate the Approximant (more details can be found in Methods and Supplementary Materials). One of two non-overlapping portions of the Approximant is used for training with the negative Pearson correlation coefficient (NPCC) as the training loss function, where network weights are updated over several training epochs. For testing, the best trained weights are loaded and fixed to generate outputs over the test volume (4.48 × 93.2 × 3.92 µm³).

Fig. 3: Optimizing the number of tomo-scans - quantitative view. (a) Quantitative comparison from a parameter sweep over the number of tomo-scans (N) with four different quantitative metrics. (b) The number of tomo-scans that optimally balances the performance (N*) is 28.89 on average, where APT reduces the data acquisition and computation time by a factor of 85.

Fig. 7 :
Fig. 7: Learned attention weight visualization. Learned attention weights of multi-head axial self-attention along each of the x, y, z axes. Parentheses contain information on the selected attention head and the position of the layer of interest (blue, s_i) that attends to all layers (s_j, j = 1, 2, · · · , N) with attention weights (red, α^k_ij), showing the importance of s_j to s_i.
Upon completion of the training process, the network is loaded and fixed with the optimal weights, and used to reconstruct the test volume (4.48 × 93.2 × 3.92 µm³), as shown in Figs. 2, 4 and 6. For all computational procedures, i.e. pre-processing, training, and testing, we used the MIT SuperCloud with an Intel Xeon Gold 6248 CPU with 384 GB RAM and dual NVIDIA Volta V100 GPUs with 32 GB VRAM. Once the network was trained, it took 45 seconds to generate the reconstruction over the test volume.