Introduction

Three-dimensional X-ray imaging enables noninvasive monitoring of objects’ interiors with nanoscale resolution. Integrated circuits (ICs) are especially interesting candidates for this kind of imaging, for two reasons: first, noninvasive inspection of ICs is important for verifying manufacturing integrity; second, ICs follow specific design rules, which makes their geometries highly regular and yet highly diverse. These geometrical properties are useful as prior knowledge, enabling vast improvements in practical aspects of the imaging process, such as the acquisition time, as we show here.

Prior works have typically used two types of scanning: translational and rotational. The translational scan (ptycho) is inspired by ptychography, i.e. a scanning-based coherent diffraction imaging method for phase retrieval. Ptychography was originally proposed by W. Hoppe1 to solve the phase problem in Scanning Transmission Electron Microscopy (STEM), where a moving aperture resolves the ambiguity in phase based on translational invariance. The term “ptychography” was coined in the following year2. Nellist et al.3 demonstrated resolution improvement in STEM by a factor of 2.5 over the limit imposed by partial coherence, exploiting the redundancy in the ptychographic measurements. As an alternative that does not even require careful aberration correction in the optics, Gerchberg and Saxton4 introduced a lensless iterative phase retrieval algorithm, now referred to as GS after them. This work was extended to lensless ptychography for extended objects by Faulkner5. Subsequently, Rodenburg6 introduced yet another iterative phase retrieval algorithm called Ptychographical Iterative Engine (PIE) that simultaneously retrieves both the object and the probe function. Thus, the requirement of a high-quality lens for imaging is fundamentally lifted. Further advances by Thibault et al.7 and Thibault and Guizar-Sicairos8 led to the Difference Map (DM) algorithm and Maximum Likelihood algorithm, respectively, for iterative ptychographic reconstruction.

After the ptychographic reconstruction step, the second angular scan (tomo) operation is required to retrieve the object’s interior, as in tomography. For parallel beam illumination and under the weak scattering approximation, the measurements are interpreted simply as projections through the object, i.e. the measurements implement the object’s Radon transform9,10. The inverse Radon transform is typically implemented as a version of the Filtered BackProjection (FBP) algorithm, first proposed by Bracewell and Riddle11,12. Gordon, Bender, and Herman13 proposed an alternative iterative tomography algorithm called Algebraic Reconstruction Technique (ART), which applies also to non-parallel illumination beams and works by updating the object estimate to sequentially bring each reconstructed projection into agreement with the corresponding measured projection. Subsequent improvements of this original iterative method were the Simultaneous Iterative Reconstruction Technique (SIRT)14 and the Simultaneous Algebraic Reconstruction Technique (SART)15, which consider all projections simultaneously and thus drastically reduce the number of iterations for the reconstruction process. Maximum Likelihood methods have also been popular for tomography, with the Bouman-Sauer algorithm16 as one of the most prominent.
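To make the ART update concrete, the following is a minimal NumPy sketch of the sequential (Kaczmarz-type) iteration described above; the system matrix `A`, the relaxation factor, and the iteration count are illustrative assumptions rather than parameters of any specific published implementation.

```python
import numpy as np

def art_reconstruct(A, p, n_iter=10, relax=0.2):
    """Sequential ART (Kaczmarz-type) update: each ray's reconstructed
    projection is brought into agreement with its measured projection.

    A: (n_rays, n_voxels) system matrix; row i holds the intersection
       lengths of ray i with every voxel (works for non-parallel beams too).
    p: (n_rays,) measured projection values.
    """
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        for i in range(A.shape[0]):            # sweep the rays sequentially
            a_i = A[i]
            norm_sq = a_i @ a_i
            if norm_sq == 0.0:
                continue
            residual = p[i] - a_i @ x          # mismatch for this single ray
            x += relax * residual / norm_sq * a_i
    return x
```

SIRT and SART differ mainly in applying such corrections for all (or a whole projection's worth of) rays at once rather than one ray at a time.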

For X-rays, the high penetration depth facilitates recovery of information deep inside the sample in the angular sampling scheme. Combining this property with translational scanning for lensless high spatial resolution, Dierolf et al.17 proposed the Ptychographic X-ray Computed Tomography (PXCT) scheme to determine the volumetric interior of biological specimens with nanoscale detail. Using this technique, Holler et al.18 experimentally demonstrated noninvasive imaging of ICs produced with 22-nm technology at 14.6-nm resolution. PXCT provides noninvasive inspection of fabricated samples, eliminating the need for destructive measures such as delayering, which are required by STEM due to its limited depth access caused by electron scattering. This allows fabs to connect to synchrotron X-ray sources and achieve quality assurance without damaging their products. However, this technique requires two types of scanning, angular and translational, and scales badly with object volume. A novel X-ray microscope called Velociprobe19 utilizes fly-scan ptychography20 to significantly reduce the data acquisition time. Still, the total data acquisition and reconstruction time for a typical 100 × 100 × 5 µm3 IC (2 × 1010 voxels) is estimated to be in excess of two months.

Here, we propose a machine learning framework to reduce the data acquisition and computation time for IC reconstruction under the X-ray ptycho-tomography geometry, providing a noninvasive and efficient solution for inspection purposes. The reduction in data acquisition is compensated by explicit use of prior knowledge of the typical objects being imaged and of the optical physics of the imaging system. Both the acquisition and computation times scale with the number N of tomo-scans. The total angular range θ determines the size of the missing wedge in the Fourier domain and is, therefore, commensurate with the loss of fidelity. Our “gold standard” is a ptycho-tomo reconstruction by SART with N = 349 and θ = ±70.4°. This maximum angle is determined by practical considerations, such as the sample geometry. More details about the gold standard geometry and our approach are available in the “Materials and methods” section.

To search this two-dimensional space (N, θ), our strategy is as follows: we start with the gold standard nominal values of N and θ. If we reduced N while using a standard reconstruction algorithm, such as FBP or SIRT mentioned earlier, performance would degrade immediately. With machine learning, we find that it is possible to regularize against the loss of angular sampling density and still maintain reconstruction fidelity, down to a minimum number N*. Then we start reducing the total angular range, which makes the remaining sampling denser. The machine-learning regularizer again manages to maintain approximately even fidelity down to a minimum range θ*. This is our optimal operating point (N*, θ*). The strategy is depicted in Fig. 1b. In principle, this procedure could be repeated to find even tighter operating conditions, but we did not carry that out, as we would expect any further gains to be minimal.
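As an illustration only, the two-step search can be summarized by the following Python pseudocode; `train_and_score` is a hypothetical stand-in for training APT at a given (N, θ) and measuring test fidelity, and the candidate grids and tolerance are placeholders rather than the values actually swept.

```python
def two_stage_search(train_and_score, N0=349, theta0=70.4, tol=0.02):
    """Sketch of the two-stage (N, theta) search: first thin out the
    tomo-scans, then shrink the angular range at the reduced N."""
    baseline = train_and_score(N0, theta0)

    # Stage 1: reduce the number of tomo-scans N at the full angular range.
    N_star = N0
    for N in (175, 88, 44, 29, 22):                # hypothetical candidate grid
        if train_and_score(N, theta0) >= baseline - tol:
            N_star = N                             # fidelity maintained, keep reducing
        else:
            break                                  # degradation sets in below N*

    # Stage 2: reduce the total angular range theta at N = N_star.
    theta_star = theta0
    for theta in (60.0, 45.0, 30.0, 17.0):         # hypothetical candidate grid
        if train_and_score(N_star, theta) >= baseline - tol:
            theta_star = theta
        else:
            break

    return N_star, theta_star
```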

Fig. 1: X-ray ptycho-tomography and the implementation of APT.
figure 1

a Brief schematic of the X-ray ptycho-tomography geometry, with translational scanning of the synchrotron X-rays (ptycho-scans) and symmetric angular scanning of the IC sample with uniform angular increment (tomo-scans). b The gold standard uses 349 tomo-scans within the angular range of ±70.4°, whereas our machine learning framework (APT) uses fewer tomo-scans, optimized in two steps. c Diffraction intensities are pre-processed with an approximate inverse operator to generate the Approximant (more details can be found in the “Materials and methods” section and Supplementary Information). One of two non-overlapping portions of the Approximant is used for training with the negative Pearson correlation coefficient (NPCC) as the training loss function, where the network weights are updated over several training epochs. For testing, the best trained weights are loaded and fixed to generate outputs over the test volume \((4.48\times 93.2\times 3.92\,{\upmu {\mathrm{m}}}^{3})\)

That machine learning can maintain fidelity while the amount of sampled data decreases is, by now, not entirely surprising. The key is the ability of deep neural networks to very effectively capture regularizing priors, especially sparsity, both in the supervised mode used here and in untrained mode21,22. Previous demonstrations of supervised learning have been carried out for Fourier ptychography23,24 and two-dimensional ptychography25. We chose the supervised learning mode because we had ample data available from the gold standard ptycho-tomo approach.

APT is described schematically in Fig. 1c. We first invert the far-field diffraction intensities (or ptycho-scans) with an approximate inversion operator. This yields an approximate volumetric estimate of the interior of an IC chip, which we dub the “Approximant”26. This step utilizes prior knowledge of the underlying imaging physics, pre-processing the input with the physics prior. The Approximant resulting from this pre-processing (or physics-informing) step is defective in the sense that layers are not well separated, because the inversion from diffraction intensities is only approximate and only a small fraction of tomo-scans is used for the computation. During training, the neural network’s weights are optimized with the Approximant as input. Upon completion of training, the trained neural network gives a refined volumetric reconstruction of ICs.

The proposed neural network is based on a 3D U-net architecture27,28 augmented with multi-head axial self-attention29, which addresses the lack of spatial resolution in the Approximant by exploiting global-range interactions to retrieve information from all layers and resolve each layer’s structure. We choose multi-head axial self-attention over full multi-head self-attention30 to alleviate the computational burden.

We demonstrate that the present method is capable of providing reliable reconstructions of ICs even when both the number and the total angular range of tomo-scans are drastically decreased, to N*~ 29 and θ*~ ± 17°, representing improvements of ×12 and ×4.2, respectively. For the reconstruction of an IC chip over the test volume (4.48 × 93.2 × 3.92 µm3), 0.63 h (or 38 min) is sufficient for both data acquisition and reconstruction with our machine learning framework. The improvements work out to an approximate overall ×108 reduction in total (acquisition plus computation) time compared with the current state-of-the-art iterative reconstruction method. We are confident that this method can be applied to various physical systems in which the forward model can be expressed mathematically. Furthermore, the method is not limited to specific sample geometries and can be extended to other types of samples.

Results

Reducing acquisition and scanning time

The synchrotron beam is delivered onto the sample, and a full lateral scan is carried out to obtain the ptychographic information for each angular orientation of the sample. Repeating this for N angles collects the tomographic information for the interior’s reconstruction. The raw intensities past the sample are recorded by a digital camera detector at each scan position. The details of the experimental collection system are in the “Materials and methods” section. The collected raw intensities are then processed in two steps: the first step embeds the physics of X-ray propagation through an Approximant operator31,32, while the second step consists of the APT network delivering the final reconstruction, as described earlier. The details of training and operating this computational pipeline are in the “Materials and methods” section. As discussed earlier, our approach is to first reduce scanning time by finding the minimum N* and then reduce computation time by finding the minimum θ*. While we acknowledge that jointly optimizing both parameters over a few iterations would be ideal, the Approximant is only a preliminary input to our neural network; it is therefore unnecessary to overcomplicate its derivation, and keeping it simple is more beneficial.

A parameter sweep over N is shown qualitatively in Fig. 2. Four quantitative performance comparisons are shown in Fig. 3, in terms of the following metrics: Pearson correlation coefficient (PCC)33, multi-scale structural similarity index metric (MS-SSIM)34, Dice Similarity coefficient (DSC)35, and Bit-Error Rate (BER, more details available in “Materials and methods” section). Both analyses indicate that N* ~ 29, representing a reduction of more than ×12 over the gold standard of N = 349. Reducing N significantly below this value results in noticeable degradation, both qualitatively and quantitatively.

Fig. 2: Optimizing the number of tomo-scans - qualitative view.
figure 2

Qualitative comparison from a parameter sweep over the number of tomo-scans (N) at two different depths: (a) \(z=0.364\,{\upmu {\mathrm{m}}}\) and (b) \(y=0.532\,{\upmu {\mathrm{m}}}\). The figure shows APT reconstructions with different N over the test volume \((4.48\times 93.2\times 3.92\,{\upmu {\mathrm{m}}}^{3})\)

Fig. 3: Optimizing the number of tomo-scans - quantitative view.
figure 3

a Quantitative comparison from a parameter sweep over the number of tomo-scans (N) with four different quantitative metrics. b The number of tomo-scans at the optimal performance trade-off (N*) is 28.89 on average, at which APT reduces the data acquisition and computation time by a factor of 85

Next, we fix N = 29 and perform a parameter sweep over θ. Qualitative results are shown in Fig. 4, while the quantitative evaluation according to the same four metrics as in the previous section is in Fig. 5. Both analyses lead to θ*~ ± 17° as the approximate lower bound before drastic degradation occurs. While the number of tomo-scans (N) is the primary determinant of computation time, the angular range θ can also have an impact. This is because the depth of the multi-slice estimates \({O}_{{\boldsymbol{r}}}^{\left(n\right)\left[l\right]}\) derived from different tomographic angles varies with the angle, as the optical path length changes accordingly; this affects the speed of computing the Approximant. More information is provided in the “Materials and methods” section. The savings in data acquisition and computation time are ×12 and ×105, respectively, for a total time saving (acquisition plus computation) of ×108.

Fig. 4: Optimizing the angular range - qualitative view.
figure 4

Qualitative comparison from a parameter sweep over the angular scanning range at two different depths: a \(z=1.092\,{\upmu {\mathrm{m}}}\) and b \(y=4.186\,{\upmu {\mathrm{m}}}\). The figure shows APT reconstructions over the test volume \((4.48\times 93.2\times 3.92\,{\upmu {\mathrm{m}}}^{3})\)

Fig. 5: Optimizing the angular range – quantitative view.
figure 5

a Quantitative comparison from a parameter sweep over the angular scanning range at \(N={N}^{* }(=29)\). b The total angular range at the optimal performance trade-off (θ*) is θ* = ±16.93° on average. APT decreases the time for the whole process by a factor of 108

Regularization and imaging system physics

The reported improvements suggest that the APT algorithm is particularly effective at learning regularizing priors to compensate for the missing information. Figure 6a shows the power spectral densities of the gold standard, the APT reconstruction, and the baseline tomographic reconstruction methods FBP11, SIRT14, and SART15, all obtained at N* = 29 and θ* = ±16.8°. The missing wedge is evident in the latter three. The qualitative cross-sections in Fig. 6b confirm that the missing wedge effect leads to severe artifacts in the baseline methods, but not in APT. Further discussions, including a comparison with TV-FISTA36,37, are provided in the Supplementary Information.

Fig. 6: Reconstruction performance comparison with baseline methods.
figure 6

a The figure qualitatively compares the 3D power spectral densities of reconstructions from the gold standard, APT, and the baseline reconstruction methods. Both APT and the baseline reconstruction methods (FBP, SIRT, SART) use the optimal tomo-scans with N* = 29 and θ* = ±16.8°. Only APT has effectively filled in the missing wedges. b Reconstructions from the gold standard and the baseline methods are shown and compared along the xy, xz, and yz planes

APT also relies on its input, the Approximant, which carefully takes into account the physics of the imaging system. Unlike earlier works where the illumination on the sample was coherent31,32, the synchrotron may be considered temporally coherent but is considerably less coherent spatially. The mutual intensity is expressed as a linear combination of mutually incoherent states, also known as coherent modes38. Accounting correctly for the synchrotron X-ray’s coherence state has been shown to improve spatial resolution and phase contrast in standard ptychography for thin samples39.
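For reference, the coherent-mode decomposition referred to here can be written, for a thin object O and with the mode weights absorbed into the mode amplitudes, as (generic notation, not this paper’s multi-slice symbols)

$$J\left({\boldsymbol{r}}_{1},{\boldsymbol{r}}_{2}\right)=\mathop{\sum }\limits_{m=1}^{M}{P}_{m}\left({\boldsymbol{r}}_{1}\right){P}_{m}^{* }\left({\boldsymbol{r}}_{2}\right),\qquad {I}_{{\boldsymbol{q}}}=\mathop{\sum }\limits_{m=1}^{M}{\left|{\mathcal{F}}\left[{P}_{m}\left({\boldsymbol{r}}\right)O\left({\boldsymbol{r}}\right)\right]\right|}^{2}$$

where the modes Pm are mutually incoherent; the thick-sample, multi-slice generalization actually used here appears in Eqs. 2 and 3 of “Materials and methods”.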

For samples thicker than the depth of focus of the probe, multi-slice reconstruction from simple ptychography has been demonstrated with visible light40, X-rays41, and electrons42. This is the starting point for our Approximant, as shown in Fig. 1c; more information on why multi-slice ptychography was used instead of 2D ptychography can be found under “Gradient calculation” in the “Materials and methods” section. We form the cost function

$${{\mathcal{L}}}_{n}=\mathop{\sum }\limits_{j=1}^{{J}_{n}}\sum _{{\boldsymbol{q}}}{\left(\sqrt{{\sum }_{m=1}^{M}{\left|{{\mathcal{P}}}_{d}\left\{{P}_{j,{\boldsymbol{r}},m}^{\left(n\right)\left[L\right]}{O}_{{\boldsymbol{r}}}^{\left(n\right)\left[L\right]}\right\}\right|}^{2}}-\sqrt{{I}_{j,{\boldsymbol{q}}}^{\left(n\right)}}\right)}^{2}\quad \left(n=1,2,\cdots ,N\right)$$
(1)

where N is the number of given tomo-scans; Jn the number of ptycho-scans associated with the n-th tomo-scan; M the number of coherent modes; L the number of slices, which for our depth of focus works out to be 5; q the coordinates in reciprocal space; \({{\mathcal{P}}}_{d}\) the free-space propagator over the distance d; and \({P}_{j,{\boldsymbol{r}},m}^{\left(n\right)\left[L\right]},{I}_{j,{\boldsymbol{q}}}^{\left(n\right)}\) the wavefield before the L-th slice from the m-th coherent mode and the experimental diffraction intensity at the j-th ptycho- and the n-th tomo-scan, respectively. We run two iterations of a gradient scheme on Eq. 1 and obtain the argument \(\angle {O}_{{\boldsymbol{r}}}^{\left(n\right)\left[l\right]}\) at each of the \(l=1,{\cdot \cdot \cdot },L\) slices40,42. We rotate the result back to the original coordinate system, and average the estimates from all tomo-scan steps to yield the final Approximant. More details can be found in the “Materials and methods” section.
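Conceptually, the Approximant assembly can be sketched as follows; this is an illustration only, not the PtychoShelves implementation, and `gradient_step`, `rotate_to_lab_frame`, and the `scan` interface are hypothetical stand-ins for the gradient update on Eq. 1, the coordinate rotation, and the per-angle data.

```python
import numpy as np

def build_approximant(tomo_scans, gradient_step, rotate_to_lab_frame, n_grad_iters=2):
    """Illustrative sketch of the Approximant assembly.

    tomo_scans:          iterable of per-angle records with .initial_object() and .angle
    gradient_step:       callable implementing one gradient update on Eq. (1)
    rotate_to_lab_frame: callable rotating a slice stack back to the lab frame
    (all three are hypothetical stand-ins, not this paper's actual interfaces)
    """
    estimates = []
    for scan in tomo_scans:                        # one entry per tomo-scan n
        O = scan.initial_object()                  # complex L-slice estimate O^{(n)[l]}
        for _ in range(n_grad_iters):              # two iterations on Eq. (1)
            O = gradient_step(O, scan)
        phase = np.angle(O)                        # keep only the argument of each slice
        estimates.append(rotate_to_lab_frame(phase, scan.angle))
    return np.mean(np.stack(estimates), axis=0)    # average over all N tomo-scans
```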

The Approximant computation step is the slowest in the pipeline; on our computing hardware (see “Materials and methods”), it takes 36 min when θ = ±70.4° and 26 min when θ ~ ± 17°. In addition to the computation time, the spacing between slices in the Approximant is nominally limited by the depth of focus, which is why we reconstruct only L = 5 of them. The number of desired reconstruction slices is much larger, i.e. 280, so we simply dilate the Approximant slices to match it. As a result, the input to the neural network is poor (more in Supplementary Information). Nevertheless, the subsequent APT architecture learns how to use the multi-slices as input and, as long as N ≥ N* and θ ≥ θ*, produces a high-fidelity final reconstruction with much finer slice spacing.

Discussion

APT is trained using the gold standard reconstructions of randomly selected segments from a single IC specimen, which was made available for our experiments. This prompts us to address two related concerns: (1) what can we guarantee about fidelity of the gold standard and, hence, our reconstructions vis-à-vis the ground truth, i.e. the physical specimen? and (2) is our APT overtrained to this specific IC?

The first concern was partially addressed by refs. 31,32, where the design files of the geometrical features were treated as ground truth. (That method was still bound by the assumption that the physical specimen matched the design files; but that was less of a concern, given the size scales involved.) Neither of these algorithms would have worked in the case reported here, because of the great range of feature sizes in the specimen and because the synchrotron X-ray is not spatially coherent. Moreover, the design files for the specimen are not available to us. On the other hand, the gold standard was obtained quite thoroughly with the scan step size of 100 nm in the ptycho-scan and N = 349 angles in the tomo-scan. Besides, there are no discernible visual artifacts in the gold standard reconstructions. These facts provide us with reasonable assurance about the fairness of our comparisons in Figs. 2–6.

Regarding the second concern: if new structures are given where the priors are significantly different than the priors learnt here (e.g. features oriented at 45°) then APT would have to be retrained. This is a necessary limitation of our supervised learning approach. The same holds for non-IC objects such as viruses. If, moreover, not enough physical specimens are available for supervised training, then it is possible to train by rigorously simulating the forward propagation of X-rays through the specimen (as refs. 31,32 did for visible light) or use “untrained” methods, such as deep image prior21,43,44.

The reported best values of N* ~ 29 and θ* ~ ± 17° are not fundamental, but indicative of the effectiveness of IC geometries acting as a regularizing prior. Less complex geometries, smooth and with less content at high frequencies in the missing wedge, could achieve even better reductions, whereas complex structures with smaller features and higher refractive index contrast would be more limited. A full theoretical analysis of how N* and θ* depend on the complexity of the prior is beyond the scope of this work.

Lastly, regarding ICs in particular and planar samples more generally, the total attenuation of the X-rays increases at large angles, which leads to artifacts. This may be compensated for computationally, or by scanning the illumination wavevectors on a conical surface; the latter scheme is referred to as laminography45,46. It is beyond the scope of our present work, but it would be interesting to investigate whether approaches similar to the one reported here are applicable.

Materials and Methods

Experiment and the gold standard preparation

Integrated circuits produced with 16-nm technology, of size 25.1 × 93.2 × 3.92 µm3, were laterally scanned with 8.8-keV synchrotron X-rays for each tomo-scan using the Velociprobe19 at the Advanced Photon Source (APS) of Argonne National Laboratory (ANL). Twelve coherent modes of the synchrotron X-ray were used for the experiment. Tomo-scans were carried out from −70.4° to 70.4° with an angular increment of 0.4°, and for each tomo-scan, ptycho-scans were recorded on the fly at 60k lateral positions with a Dectris Eiger X 500K area detector (pixel size: 75 µm, sample-to-detector distance: 1.92 m) at a frame rate of 500 Hz and a scan step size of 100 nm. The elapsed time of the whole data acquisition process (translational and angular) was 12.51 h, or 129 s per tomo-scan.

As a first step toward the gold standard reconstruction, a two-dimensional projection was reconstructed for each tomo-scan with 600 iterations of the least-squares maximum likelihood ptychographic algorithm47 as implemented in PtychoShelves48. The ptychographic reconstructions for all 349 tomo-scans were processed with 8 Tesla V100 GPUs in parallel to expedite the process, taking 362.09 h for this step.

Then, the projections were aligned to a tomographic rotation axis, with an additional correction in the form of phase ramp removal. A deep neural network pre-trained on similar images of integrated circuits was subsequently applied to the aligned projections for upsampling by ×2 (refs. 49,50). The elapsed time of this step was approximately 5 h.

Lastly, the tomographic reconstruction was performed using the 349 upsampled projections with 10 iterations of SART, generating the final three-dimensional reconstruction of the IC sample with an isotropic 14-nm voxel size; this took 1 h with 8 Tesla V100 GPUs.

Gradient calculation

Considering the mixed-state (spatially partially coherent) nature of the synchrotron X-rays and the multi-slice structure of the IC sample, a forward model can be formulated as

$${\psi }_{j,{\boldsymbol{r}},m}^{\left(n\right)\left[L\right]}={O}_{{\boldsymbol{r}}}^{(n)[L]}{{\mathcal{P}}}_{\Delta z}\left[{O}_{{\boldsymbol{r}}}^{(n)[L-1]}{{\mathcal{P}}}_{\Delta z}\left[\cdots {{\mathcal{P}}}_{\Delta z}\left[{P}_{{\boldsymbol{r}}-{{\boldsymbol{r}}}_{j},m}{O}_{{\boldsymbol{r}}}^{(n)[1]}\right]\right]\right]$$
(2)
$${I}_{j,{\boldsymbol{q}}}^{\left(n\right)}=\mathop{\sum }\limits_{m=1}^{M}{\left|{\widetilde{\psi }}_{j,{\boldsymbol{q}},m}^{\left(n\right)\left[L\right]}\right|}^{2}=\mathop{\sum }\limits_{m=1}^{M}{\left|{\mathcal{F}}\left[{\psi }_{j,{\boldsymbol{r}},m}^{\left(n\right)\left[L\right]}\right]\right|}^{2}$$
(3)

where

  • n: the index of tomo-scans \((n=\mathrm{1,2},{\cdot \cdot \cdot },N)\)

  • j: the index of ptycho-scans \((j=\mathrm{1,2},{\cdot \cdot \cdot },{J}_{n})\)

  • l: the index of multi-slices \((l=\mathrm{1,2},{\cdot \cdot \cdot },L)\)

  • m: the index of mixed states \((m=\mathrm{1,2},{\cdot \cdot \cdot },M)\)

  • r: the spatial domain coordinates

  • q: the reciprocal domain coordinates

  • Pr,m: the m-th coherent mode of the synchrotron X-ray probe

  • \({O}_{{\boldsymbol{r}}}^{(n)[l]}\): the l-th slice of the object viewed at the n-th tomo-scan.
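The following NumPy sketch illustrates the mixed-state multi-slice forward model of Eqs. 2 and 3, assuming a standard angular-spectrum free-space propagator and probe modes already shifted to the j-th scan position; it is an illustration of the model, not the PtychoShelves code.

```python
import numpy as np

def propagate(field, dz, wavelength, pixel_size):
    """Angular-spectrum (Fresnel transfer function) free-space propagator P_dz."""
    ny, nx = field.shape
    fy = np.fft.fftfreq(ny, d=pixel_size)
    fx = np.fft.fftfreq(nx, d=pixel_size)
    FY, FX = np.meshgrid(fy, fx, indexing="ij")
    H = np.exp(-1j * np.pi * wavelength * dz * (FX**2 + FY**2))
    return np.fft.ifft2(np.fft.fft2(field) * H)

def multislice_forward(probe_modes, object_slices, dz, wavelength, pixel_size):
    """Mixed-state multi-slice forward model of Eqs. 2 and 3.

    probe_modes:   (M, ny, nx) complex coherent modes, assumed already shifted
                   to the j-th scan position (i.e. P_{r - r_j, m})
    object_slices: (L, ny, nx) complex transmission functions O^{(n)[l]}
    Returns the far-field intensity I_q as an incoherent sum over the modes.
    """
    intensity = np.zeros(probe_modes.shape[1:])
    for P in probe_modes:                              # sum over mixed states m
        psi = P * object_slices[0]                     # exit wave of the first slice
        for O_l in object_slices[1:]:                  # propagate slice to slice, Eq. 2
            psi = O_l * propagate(psi, dz, wavelength, pixel_size)
        intensity += np.abs(np.fft.fft2(psi)) ** 2     # |F[psi]|^2 summed over m, Eq. 3
    return intensity
```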

The following describes the gradient computation of the loss function in Eq. 1 based on the forward model; this computation was carried out automatically with PtychoShelves48. The gradients of the loss function with respect to the wavefields and the complex object follow from the chain rule applied to Eqs. 2 and 3, where

$${P}_{j,{\boldsymbol{r}},m}^{\left(n\right)\left[l\right]}={{\mathcal{P}}}_{\Delta z}\left[{O}_{{\boldsymbol{r}}}^{(n)[l-1]}{P}_{j,{\boldsymbol{r}},m}^{\left(n\right)\left[l-1\right]}\right]$$
(6)
$$\frac{\partial {\mathcal{L}}}{\partial {P}_{{\boldsymbol{r}},m}}=\mathop{\sum }\limits_{n=1}^{N}\mathop{\sum }\limits_{j=1}^{{J}_{n}}\frac{\partial {\mathcal{L}}}{\partial {P}_{j,{\boldsymbol{r}}+{{\boldsymbol{r}}}_{j},m}^{\left(n\right)\left[1\right]}}$$
(7)
$${{\rm{\chi }}}_{j,{\boldsymbol{r}},m}^{\left(n\right)\left[L\right]}={{\mathcal{F}}}^{-1}\left\{\left(1-\frac{\sqrt{{I}_{j,{\boldsymbol{q}}}^{\left(n\right)}}}{\left|{\widetilde{\psi }}_{j,{\boldsymbol{q}},m}^{\left(n\right)\left[L\right]}\right|}\right){\widetilde{\psi }}_{j,{\boldsymbol{q}},m}^{\left(n\right)\left[L\right]}\right\}$$
(8)

With two iterations of gradient descent on the loss function in Eq. 1, we obtain the multi-slice estimate \({\left.{O}_{{\boldsymbol{r}}}^{\left(n\right)\left[l\right]}\right|}_{{N}_{\text{iter}}=2}\) for each tomo-scan and, subsequently, its argument at each of the L = 5 slices. For the final Approximant, we rotate the results back to the original coordinate system and average the N estimates from all N tomo-scans. Please see the Supplementary Information for a visualization. More details on the gradient calculation can be found in refs. 40,42.

The data were pre-processed with multi-slice ptychography in order to properly address the sample at large angles, where 2D ptychography may not be suitable since the effective optical path length becomes longer than the depth of focus of the probe. The depth of focus (DOF) of the probe is approximately \(\frac{\lambda }{2{\left(\text{NA}\right)}^{2}}=\) 2.82 µm. When the sample is rotated to the largest tomographic scanning angle, i.e. 70.4°, the optical path length increases up to \(\frac{3.92}{{{\cos }}\left({70.4}^{\circ }\right)}=11.7\,{{\upmu }}{\text{m}}\simeq 4.46\,{\text{DOF}}\), so 5 slices should be sufficient to address the rotated sample.

Machine learning framework

Our neural network architecture is based on a 3D U-net structure27,28 augmented with multi-head axial self-attention (“axial self-attention” in short)29. The motivation behind self-attention is similar to that of atrous (or dilated) convolution in convolutional neural networks51,52,53, which also seeks to enlarge the receptive field and account for multi-scale features. We incorporate self-attention into our architecture instead of dilated convolution because it generates attention maps that provide a deeper understanding of the neural network’s function. In our network, the U-net directly transfers multi-scale encoded features to its decoder arm to preserve spatial information, and the axial attention augments the features with global-range self-interactions.

The U-net backbone encoder design was influenced by the well-established architecture ResNet5054 with some modifications so that it can accommodate 3D instead of 2D data. The architecture’s decoder then upsamples the features to result in isotropic voxels of linear size 14 nm. More details can be found in Supplementary Information.

The encoder’s low-dimensional manifolds are further enhanced by the axial self-attention, which was proposed to reduce the computational complexity of multi-head self-attention (“self-attention” in short)30. Axial self-attention factorizes 3D self-attention into three 1D axial self-attention modules, thus reducing the complexity from O(N3) to O(3N). Each axial self-attention module attends to voxels along one of the x, y, z axes. Figure 7 visualizes the learned attention weights pij, which quantify the normalized “contribution” of the other layers sj \((j=1,{\cdot \cdot \cdot },N)\) to the layer si. We assume that the information of layer si is spread along the layers sj \((j=1,{\cdot \cdot \cdot },N)\) due to the lack of spatial resolution; the axial self-attention therefore gathers the scattered information from these layers to resolve layer si with global-range interactions. Note that in this paper we used PyTorch instead of the original TensorFlow implementation29; our code is publicly available at https://github.com/iksungk/APT.
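As an illustration of this factorization (not the exact APT module; the channel count, head count, and tensor layout are assumptions), a single 1D axial self-attention block can be written in PyTorch as follows; three such blocks, one per axis, replace full 3D self-attention.

```python
import torch
import torch.nn as nn

class AxialSelfAttention(nn.Module):
    """Minimal sketch of multi-head self-attention restricted to one axis
    of a 3D feature volume (B, C, D, H, W)."""
    def __init__(self, channels, heads=4, axis=-1):
        super().__init__()
        self.axis = axis
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, D, H, W)
        x = x.movedim(self.axis, -1)           # bring the chosen axis last
        B, C, *spatial, L = x.shape
        # treat every line along the chosen axis as an independent sequence
        seq = x.reshape(B, C, -1, L).permute(0, 2, 3, 1).reshape(-1, L, C)
        out, _ = self.attn(seq, seq, seq)      # attend only along this axis
        out = (out.reshape(B, -1, L, C).permute(0, 3, 1, 2)
                  .reshape(B, C, *spatial, L))
        return out.movedim(-1, self.axis)      # restore the original layout
```

Because attention is restricted to sequences of one axis length, the per-block cost grows with that axis length rather than with the full number of voxels.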

Fig. 7: Learned attention weight visualization.
figure 7

Learned attention weights of the multi-head axial self-attention along each of the x, y, z axes. Parentheses contain information on the selected attention head and the position of the layer of interest (blue, si) that attends to all layers (sj \((j=\mathrm{1,2},{\cdot \cdot \cdot },N)\)) with attention weights (red, \({\alpha }_{{ij}}^{k}\)), showing the importance of sj to si

Training and testing environments

To prepare a paired dataset for training and testing, both the Approximant and the gold standard are divided into smaller volumes of 1.792 × 1.792 × 3.92 µm3 with 50% lateral overlap. We then split the paired dataset into two non-overlapping sub-datasets: one reserved for training and the other for testing. The training and test samples were drawn so as not to be accidentally correlated through spatial overlap during the ptycho- and tomo-scan operations. (In the Supplementary Information, we provide comprehensive details on how the datasets were partitioned, along with additional evidence demonstrating the robustness of our algorithm to different partitioning methods.) To mitigate the generalization problem that arises when pre-trained networks fail to perform well on testing data that departs significantly from the training data, we trained the network on a portion of the IC and applied the trained network to the remainder. Transfer learning can be utilized to reduce the amount of training required when working with new ICs that have different design rules.

For training, we use the negative Pearson correlation coefficient (NPCC) as the training loss function31,32,55 and the Adam optimizer for stochastic gradient descent optimization56, with an initial learning rate of 2 × 10−4, β1 = 0.9, β2 = 0.999, and no weight decay. We also update the learning rate according to a polynomial schedule57 as

$${lr}({\rm{epoch}})={lr}\left(0\right)\times {\left(1-\frac{{\rm{epoch}}}{T}\right)}^{p}$$
(9)

where lr(0) = 2 × 10−4. We followed the convention of ref. 58 by setting p = 0.9. Additionally, we aimed to reduce the final learning rate to 1/100 of the initial rate, which was accomplished by setting T = 200. We trained the model for 150 epochs and used a mini-batch learning strategy59 with a batch size of 4 to stabilize the process; this batch size was chosen as optimal given the memory limitations of our machine. Upon completion of training, the optimal weights are loaded and fixed, and the network is used to reconstruct the test volume (4.48 × 93.2 × 3.92 µm3), as shown in Figs. 2, 4, and 6.
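A minimal PyTorch sketch of this training setup is given below; `model` and `train_loader` (yielding Approximant/gold-standard pairs) are placeholders, while the NPCC loss and the polynomial decay of Eq. 9 use the hyperparameters quoted above.

```python
import torch

def npcc_loss(pred, target):
    """Negative Pearson correlation coefficient, averaged over the batch."""
    p = pred.flatten(1) - pred.flatten(1).mean(dim=1, keepdim=True)
    t = target.flatten(1) - target.flatten(1).mean(dim=1, keepdim=True)
    pcc = (p * t).sum(dim=1) / (p.norm(dim=1) * t.norm(dim=1) + 1e-8)
    return -pcc.mean()

# Adam with the settings quoted above; LambdaLR reproduces Eq. (9)
# with T = 200 and p = 0.9.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4,
                             betas=(0.9, 0.999), weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: (1 - epoch / 200) ** 0.9)

for epoch in range(150):                         # 150 training epochs
    for approximant, gold in train_loader:       # mini-batches of size 4
        optimizer.zero_grad()
        loss = npcc_loss(model(approximant), gold)
        loss.backward()
        optimizer.step()
    scheduler.step()                             # polynomial decay per epoch
```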

For all computational procedures, i.e. the pre-processing, training, and testing processes, we used the MIT Supercloud with an Intel Xeon Gold 6248 CPU with 384 GB RAM and dual NVIDIA Volta V100 GPUs with 32 GB VRAM. Once the network was trained, it took 45 s to generate the reconstruction over the test volume.

Quantitative metrics

Because each voxel of an IC is occupied by a single material, ICs can be comfortably classified into M-ary labels irrespective of the printing material, even though they are fabricated with various materials such as copper, aluminum, and tungsten. To simplify further, we binarize the gold standard by thresholding according to the presence of a metal or silicon within each voxel. The gold standard reconstruction, however, may still be ambiguous, especially for longitudinal features, because the tomographic scheme does not cover the entire angular range, i.e. ± 90°, leaving a missing wedge in the Fourier domain. Since the gold standard also suffers from extensive errors in these ambiguous layers, we exclude them from our quantitative evaluations. More details can be found in the Supplementary Information.

The quantitative comparisons in Figs. 3 and 5 use four different quantitative metrics to illustrate different aspects of the reconstructions. The first two are correlative metrics: the PCC55 and the MS-SSIM with the same weights as in the original reference34.

The remaining two metrics are the DSC35 and the BER. The former is a widely accepted similarity measure in image segmentation, used to compare an algorithm’s output against a reference in medical applications60,61. The BER measures the ratio of erroneously classified voxels to the total number of voxels, and it is applicable here because of our binarization approach. Both of these metrics are probabilistic in the sense that they involve the estimation of probability density functions. They are obtained as

$${\rm{DSC}}=\frac{2\cdot {\rm{TP}}}{2\cdot {\rm{TP}}+{\rm{FN}}+{\rm{FP}}}$$
(10)
$${\rm{BER}}=\frac{{\rm{FP}}+{\rm{FN}}}{{\rm{TP}}+{\rm{TN}}+{\rm{FP}}+{\rm{FN}}}$$
(11)

where TP, TN, FP, and FN indicate the numbers of true positives, true negatives, false positives, and false negatives, respectively. For the gold standard, the binary thresholds and prior probabilities \(p\left(0\right),p\left(1\right)\) required for these quantities were estimated by an Expectation-Maximization (EM) procedure. For the tests, we used the Bayes decision rule, with the threshold set where \(p\left(x|0\right)p\left(0\right)=p\left(x|1\right)p\left(1\right)\) and with \(p\left(0\right),p\left(1\right)\) the same as for the gold standard.
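For reference, a direct NumPy computation of these two metrics from already binarized volumes is sketched below (the thresholding itself, performed via EM as described above, is assumed to have been applied beforehand).

```python
import numpy as np

def dsc_and_ber(pred_binary, gold_binary):
    """Dice similarity coefficient (Eq. 10) and bit-error rate (Eq. 11)
    between two binarized volumes of equal shape."""
    pred = pred_binary.astype(bool)
    gold = gold_binary.astype(bool)
    tp = np.sum(pred & gold)       # true positives
    tn = np.sum(~pred & ~gold)     # true negatives
    fp = np.sum(pred & ~gold)      # false positives
    fn = np.sum(~pred & gold)      # false negatives
    dsc = 2 * tp / (2 * tp + fn + fp)
    ber = (fp + fn) / (tp + tn + fp + fn)
    return dsc, ber
```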

Furthermore, we have conducted a resolution analysis of the reconstructions shown in Figs. 3 and 5 using the Fourier ring correlation62, which is detailed in the Supplementary Information.