Low-dose x-ray tomography through a deep convolutional neural network

Synchrotron-based X-ray tomography offers the potential for rapid large-scale reconstructions of the interiors of materials and biological tissue at fine resolution. However, for radiation sensitive samples, there remain fundamental trade-offs between damaging samples during longer acquisition times and reducing signals with shorter acquisition times. We present a deep convolutional neural network (CNN) method that increases the acquired X-ray tomographic signal by at least a factor of 10 during low-dose fast acquisition by improving the quality of recorded projections. Short-exposure-time projections enhanced with CNNs show signal-to-noise ratios similar to long-exposure-time projections. They also show lower noise and more structural information than low-dose short-exposure acquisitions post-processed by other techniques. We evaluated this approach using simulated samples and further validated it with experimental data from radiation sensitive mouse brains acquired in a tomographic setting with transmission X-ray microscopy. We demonstrate that automated algorithms can reliably trace brain structures in low-dose datasets enhanced with CNN. This method can be applied to other tomographic or scanning based X-ray imaging techniques and has great potential for studying faster dynamics in specimens

post-processing. Thus, neither current hardware or software solutions provide a clear path to fast X-ray imaging of radiation sensitive samples at nanometer resolution.
A better approach is to enhance signals from low-dose projections during acquisition itself, avoiding the potential pitfalls described above. Deep learning methods, particularly convolutional deep neural networks (CNNs) 20 , are promising algorithmic approaches for addressing this issue. CNNs have been widely used for image denoising [21][22][23] , super-resolution microscopy [24][25][26][27] , and even post-hoc denoising of low-dose X-ray tomography reconstructions 18 . Despite their promise, CNNs have not yet been used to enhance acquisition data by learning corresponding 'maps' between features in low-dose and high-dose images of the same sample, and applying these learned maps to low-dose projections from the same sample. Since both the training and raw images are collected from the same dataset, it is unnecessary to estimate an additive noise model to correct the data.
In this article, we introduce a CNN-based approach for learning the mapping between a number of pairs of low-dose and high-dose projections. Using a limited number of high-dose training examples, we can then use the trained network to predict high-dose reconstructions from a full-rotation tomographic dataset with short-exposure times. The proposed approach can be applied to a range of low-dose tomography applications (e.g. lab-based CT systems); however, we test it using transmission x-ray microscopy (TXM) 28 . We applied the method to recover adult mouse brain structures (e.g. myelinated axon tracts) in 3D at nanometer length scales (∼50 nm) and successfully demonstrate that the CNN-based method can provide sufficient signal for automated tracing of individual axons. By combining computational imaging approaches with a multi-stage approach for tomography, we demonstrate that high-quality reconstructions can be obtained at a fraction of the radiation dose.

Results
Evaluation with synthetic data. To quantitatively evaluate improvements obtained using our CNN-based approach, we first created a synthetic dataset -a solid cube with 1000 sphere shaped particles randomly distributed throughout a 512 × 512 × 512 volume (see Fig. 1(a) for details). The synthetic dataset served as ground truth and allowed us to model different exposure conditions through adding noise to the data (e.g. shorter exposure times correspond to adding more noise). We generated 721 high-dose projections (P) from 0 to 180 degrees by applying a Radon transform to the object ( Fig. 1(b)) and then added Gaussian noise (5% to 30%) to simulate low dose measurements at variable exposure time ( Fig. 1(c)).
Using low-and high-dose pairs, we trained a CNN to estimate the high-resolution measurements (P o ) when provided with low-dose projections (P n ). Specifically we trained the CNN using the projection pairs at 0° (P o (0) and P n (0)) and 45° (P o (45) and P n (45)) for all noise levels: (f: (P o (0), P o (45)) → (P n (0), P n (45))). To obtain a mean square error between the predicted projections and ground truth dataset of less than 10 −4 (40 epochs), CNN training took about 2 hours (213 s × 40 epochs) with a Nvidia Quadro M5000 GPU. The trained CNN is used to process the full angle P n to obtain the enhanced projections P e . Our results on synthetic data demonstrate how our CNN-based approach is able to remove the background noise and clearly distinguish the structures of the particle phantoms ( Fig. 1), which were smeared by the noise in the low-dose measurements. These improvements are even more significant for the projections with higher noise levels.
We performed tomographic reconstructions for P o , P n and P e to obtain the reconstructed 3D volumes R o , R n and R e . We evaluated the CNN improvements by comparing image quality metrics for these 3D volumes. We first gave a direct visualization of one slice and corresponding histograms for R o , R n and R e (Fig. 2). Then we plotted the histograms of each 3D dataset for R n and R e (Fig. 3). At the last, we quantitatively compared R n and R e with the ground truth by computing two popular criteria for quantifying image quality, the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) 29 of R n /R o and R e /R o (Fig. 4).
As shown in Fig. 2 (right), the image quality of R e is visually close to R o . The corresponding histograms also proved it, see Fig. 2 (left). Compared with the ground truth (R o (20)), R n (20) shows strong noise and the three phases (A: black space; B: gray solid cubic; C: white particles) smeared in the histogram. The histogram of R e05 (20) shows almost the same pattern as of R o (20). After the CNN enhancement of the noisy projections, the reconstruction quality is close to the ground truth.
The histograms of the 3D datasets R n and R e show significant improvements by the CNN enhancement for all the noise levels of our simulations, see Fig. 3. When the noise level is ≥10%, the histograms of R n only show a single peak making segmentation by simple thresholding impossible. The histograms of R e show three distinguished peaks when noise level ≤20%, and two distinguished peaks for the rest. This confirms that the CNN enhanced projections produce the reconstructions with better quality, thus making it possible to distinguish the different phases by simply thresholding. Figure 4 shows the means and standard deviations of the PSNR and SSIM for each 3D volume. The PSNR and SSIM of R n /R o and R e /R o are displayed as red and blue bars, respectively. Our results demonstrate that the PSNR of R e /R o is always much higher than R n /R o . This shows that the image quality of R e is much closer to the ground truth R o than R n . As the noise level increases, the improvements of R e over R n are more significant. The SSIM shows similar behavior, with even larger differences between R e /R o and R n /R o than observed in the PSNR. Even for the extreme noise situation (30%), the CNN can still recover enough signal to produce results with acceptable metrics (PSNR: 2.96 → 16.61, SSIM: 0.02 → 0.71). Our evaluations demonstrate that our CNN-based method produced tomographic reconstructions with consistently better quality, both in terms of their structural information and in their accuracy, even in extremely noisy situations.
Validation on a mouse brain sample. The proposed method was further validated on a small sample of mouse somatosensory (S1) cortex containing myelinated axons. This sample has been stained with lead using a standard ROTO procedure 30 to increase X-ray absorption contrast in the projections. After staining, the sample was embedded in plastic (EPON) to make it more X-ray resistant. After preparing the sample, it was measured with nano-CT using the TXM instrument at the 32-ID beamline of the Advanced Photon Source (APS) at Argonne National Laboratory. We first scanned the sample with 30 s exposure time (high-dose) only for 6 angles (0° to 150°, 30° per projection). We then performed a full tomographic scan (361 projections for 0° to 180°) of the sample with 2 s exposure time (low-dose) per projection. All the projections were acquired with 2k × 2k resolution and a pixel size of 30 nm. These projections were then down-sampled to 1k × 1k for subsequent processing and analysis (60 nm pixel size). The resulting reconstructions from low-dose projections mitigate beam damage, but provide limited contrast needed to resolve the width of myelin in the images.
We used the short-exposure and long-exposure projections pairs of 30° and 120° to train our CNN approach (f: (P 2 (30), P 2 (120)) → (P 30 (30), P 30 (120))). The trained CNN was then used to enhance all of the 361 low-dose projections. We evaluated the improvements of CNN-enhanced projections P e by computing the PSNR and SSIM for P e /P 30 and P 2 /P 30 , as shown in Table 1. The SSIM values of P e /P 30 are much higher than the SSIM values of P 2 /P 30 . The SSIM values after CNN enhancement are about 0.9. Effectively, the CNN makes the image quality of 2 s exposure projections P 2 almost equivalent to the 30 s exposure projections P 30 . The PSNR values also reflect higher image quality with P e /P 30 higher than P 2 /P 30 across the board.
We compared the details of the image quality between P e (90)/P 30 (90) and P 2 (90)/P 30 (90), shown in Fig. 5. The profiles of CNN-enhanced projection show greater overlap to the 30 s projection and much less noise than the 2 s projection. This confirms that our method provides good enhancement in both the horizontal and vertical planes of our projection images, thus providing isotropic enhancement and noise removal. When comparing the SSIM map of the 2 s projection with the 30 s projection, the majority of SSIM values range from 0.4 to 0.6; a small part We also compared the CNN method with median filter and total variation regularization (TV) 31,32 , shown in Fig. 6. The SSIM values of P e /P 30 is higher than P med /P 30 and P tv /P 30 , where P med and P tv denote the projections denoised with median filter and TV regularization respectively. The CNN enhancement performs the best to recover the structural information from the noisy projection.
We performed tomographic reconstructions for P e , P 2 , P med and P tv to obtain R e , R 2 , R med and R tv respectively. As shown in Fig. 7, the slice of R e reveals structures that are barely visible in R 2 . The CNN enhancement improves the reconstruction quality for the low-dose data. On the other hand, R med is rather noisy and R tv is too blurry to extract the structure of the axons. In the CNN-enhanced volume R e , axons can be automatically traced over the extent of the entire sample using methods previously developed for segmenting micro-CT data 33 . The width of myelin can also be estimated in the enhanced data, as shown in Fig. 8. After enhancement, low-dose measurements can be used to estimate the thickness of myelin around the axons in the sample. In addition, tracing and resolving 2D cross-sections of axons is also improved in the CNN-enhanced images (as confirmed through manual annotations of both volumes).

Discussion
We have introduced a new CNN-based method that dramatically reduces the exposure time by a factor of 10 for synchrotron-source X-ray tomography, while retaining structural information. Reducing imaging time decreases sample damage and provides more accurate reconstructions of the interiors of materials and biological samples. We applied this algorithm to both synthetic datasets and reconstructions of the white matter of mammalian brain samples.
As the evaluation of our CNN approach with the synthetic datasets demonstrates, the CNN does not simply 'denoise' the noisy projections. It also learns to distinguish the structural information of the object from the noise. The traditional denoising algorithms always lose spatial resolution if the noise is strong. However, the CNN directly learned the mapping between the noisy images and high-quality images. It can keep the resolution close to the high-quality images of learning data, when processing the tests images. The CNN processed images can be easily segmented.
Using the CNN to enhance the tomographic projections, instead of denoising the final reconstructed images 18 , was presumed to have great advantages. The tests with both the synthetic datasets and the real TXM measurements both confirmed our supposition. The CNN successfully reduces noise, and also makes the low-dose projections show structural information as well as a high-dose projection would. We also presumed fewer training images were required when the data is from the measurement of the same object as the testing data. The local features of different tomographic projections for the same object have great similarity. Despite the small size of the training datasets, they included necessary features to enhance the full-scanning projections. Thus, the accuracy of the predictions based on this training model is ensured. All our tests also proved that.
In our application of this method on a brain sample, we found that even at the low-dose that we used to obtain the results, we still observed some beam damage. Thus, at this limit, it is impossible to increase the dose further without considerable degradation. This suggests that only computational approaches can be used to improve image quality in these settings. It is only after CNN enhancement that axons could be resolved at levels sufficient to measure the thickness of myelin at high rates across the volume. Our approach for computational imaging-based enhancement promises high-resolution and dynamic imaging of brains in the future and can be combined with micro-CT approaches that capture mesoscale neuroanatomy 33 .
The running time of our CNN approach can be easily parallelized and scaled up to GPU clusters. We have tested it on the GPU cluster of Cooley from Argonne Leadership Computing Facility (ALCF). With 48 Tesla K80 dual-GPU nodes, the computing time of the data enhancement procedure can be reduced to less than 10 minutes for most of the tomographic scanning with 2 K resolution. Even including the data processing time, the whole measurement will be faster.
In addition to the good performance obtained through enhancing the low-dose tomographic projections, our CNN approach also has great potential for dynamic measurements of X-ray imaging. It speeds up the  measurement process by the magnitude of 10 with keeping the data quality close to the long-time scanning.
The same principle can also be applied to other X-ray imaging techniques, such as X-ray fluorescence and X-ray ptychography, to reduce dose damage or scanning time.

Methods
Overview of approach. Tomographic reconstructions from short-exposure X-ray images yield noisy images as too few photons are received by the sensor. The signal to noise ratio decreases as photon counts drop and the structural information about the object cannot be reconstructed successfully unless regularization methods or denoising methods are used. Although, the features in X-ray projections of the same object for different exposure times are correlated to some extent, there is no relation readily available. This is due to the complexity of the features in the sample and variations in noise between exposure times. We used a deep CNN to learn image-to-image transformations among projections according to the rule established in the training data sets. The basic idea is based on learning a "transformation" between images taken at short (I s ) and long (I l ) exposures. The learning is done by minimizing the corresponding cost function, With this basic principle, we built a CNN architecture to enhance tomographic data with short-exposure times.
CNN architecture. The proposed CNN architecture was inspired by various image transformation models in literature 27,34 . The input and output of the network are both images, instead of images as inputs and scalar labels as outputs for the typical CNN model 35 . As shown in Fig. 9, the network architecture consists of two principal parts: the image encoder and the data decoder. The image encoder is composed of 8 convolutional layers and a fully connected layer. All convolutional layers use 3 × 3 convolution kernels. We increase the number of convolution kernels for each layer from 16 to 64. Three of the convolutional layers use a 2 × 2 strides to compute the convolution from every 2 pixels of the corresponding images 36 . This reduces the image width and height by half after the convolution. The convolutional layers with and without strides extract multiple features of the input images (W × H) from different scales (W × H to W/8 × H/8). Together with the fully connected layer, the network enforces image information to be sparse with various feature maps. The encoder aims to fit the image information with specific target data.
The second half of the network is the decoder. The encoded data are reshaped to a new image size of (W/8 × H/8). We use 9 deconvolutional layers 37 , which are also called transpose convolutional layers. All deconvolutional layers use 3 × 3 deconvolution kernels. We decrease the number of deconvolution kernels for each layer from 64 to 1. Three of the deconvolutional layers use the 2 × 2 strides, which double the image width and height. These decovolutional with and without strides generate layers of feature maps with different scales. At the end, we use a deconvolutional layer only with one deconvolutional kernel to generate a single output image that matches the target image. Using the deconvolutional layers as the decoder has a fundamental advantage of keeping the image with good resolution. This avoids smoothing the patterns and boundaries of structures by using convolutional layers. The main reason is that the convolutional layers compute the pixels of the new image from each of their 3 × 3 neighbors. Deconvolution is the inverse of this process and can make the output images as sharp as the input images.
Using merge layers is another important way of making the output images sharp. In the decoder part of the network, feature maps generated from the deconvolutional layers are concatenated with the preceding feature maps of the same scale from the encoder part. The proportions of the merged feature maps from different layers are decided by the kernel number of these layers, see Fig. 9. With these merged layers, the feature maps of the decoder part include the features with different resolutions of the input images, including the original resolution of the input images. They can avoid the resolution loss during the down-sampling process of the encoder.
The objective function of the CNN is computed based on the peak signal-to-noise ratio (PSNR): The PSNR can be calculated as: Normalization and patch extraction. To infer high fidelity reconstructions from short-exposure tomographic measurements, we trained a deep CNN to learn the mapping between the two conditions. To prepare the training data, we first scanned the object for a few angles for obtaining images taken at relatively long-exposure times (I l ). Then, we performed full-angle tomographic scanning to obtain projections with shorter exposure times (I s ). We used the long-exposure time projection and its related short-exposure time projection as the training data to fit the CNN:  . Image transformation model of CNN. "Conv" is the abbreviation of convolution kernel. "DeConv" is the abbreviation of deconvolution kernel. The numbers before the "Conv" or "DeConv" are the numbers of kernels. "FC" is the abbreviation of fully connected layer. All the convolutional and deconvolutional layers use 3 × 3 kernels. All the layers use Relu as the activation functions 43  where I s (a) and I l (a) are the projections of angle a. Normally, two or three angles of projections are enough for training. We processed the training data in two steps before tomographic reconstruction: normalization and patches extraction. The normalization was done in two steps: . We extracted overlapped small patches from these normalized images as the final training inputs and outputs using the same approach described in 35 .
The reason and benefit to use the small patches are as following: • Every projection of the same object shows different patterns. We only had one angle view of the image and this was not sufficient to predict all the projections. However, if we extracted small patches from the projections, we could usually find very similar local features from these patches. Thus, the patches from one projection include enough local features to train the network and to predict the rest of the projections. • The overlapped patches increase the number of the training data from a few projection images to ∼10 5 of small images. The distances between the data points become smaller than directly using the full images. This can significantly reduce the probability of overfitting.
After we trained the CNN transformation model, we can save the fitted weights of the network to predict projections with quality close to that of the long-time projections (I l ) from the short-exposure-time projections (I s ). Small patches with the same size as the training data were extracted from these projections (I s ) and used as the input for the trained CNN. We reconstructed these patches to obtain enhanced projections. These enhanced projections were used for the tomographic reconstruction to get the final result.
Tomographic acquisition and reconstruction. The synthetic projections were generated from the simulated phantoms with Radon transform 38 , which is an integral transform whose inverse is used to reconstruct images from X-ray CT scans. The experimental dataset was collected at the Transmission X-ray Microscope (TXM) of sector 32-ID at the Advanced Photon Source 2 and formatted in Scientific Data Exchange standard 39 . We performed tomographic reconstruction with the open-source TomoPy toolbox 31 using the CNN enhanced short-exposure projections and the long-exposure projections. We applied a Fourier grid algorithm (GridRec) with a Parzen filter for reconstruction 40 , because of its good balance of the reconstruction speed and accuracy, however other tomographic reconstruction methods can as well be used.
Axonal segmentation. After reconstructing images of a small mouse brain sample, we applied segmentation methods previously used to segment blood vessels and cell bodies in micro-CT data 33 to segment myelinated axons in the sample. To do this, we first trained a random forest classifier using an interactive tool for segmentation called ilastik 41 using a combination of edge, texture, and intensity-based 3D features. Following training, we applied the classifier to a small reconstructed image volume from a cube of mouse cortex (234 × 400 × 300). We then thresholded the classifier probability estimates (threshold = 0.92) and applied 3D morphological filtering operations to the resulting thresholded data (see 33 for more details). Finally, we visualized the segmented data using the multi-view projection and 3D visualization tool in ITK-Snap 42 .