Restoration of amyloid PET images obtained with short-time data using a generative adversarial network framework

Our purpose in this study was to evaluate the clinical feasibility of deep-learning techniques for F-18 florbetaben (FBB) positron emission tomography (PET) image reconstruction using data acquired in a short time. We reconstructed raw FBB PET data of 294 patients acquired over 20 and 2 min into standard-time scanning PET (PET20m) and short-time scanning PET (PET2m) images. We generated a standard-time scanning PET-like image (sPET20m) from a PET2m image using a deep-learning network. We performed qualitative and quantitative analyses to assess whether the sPET20m images were suitable for clinical use. In our internal validation, sPET20m images showed substantial improvement on all quality metrics compared with the PET2m images. There was a small mean difference between the standardized uptake value ratios of sPET20m and PET20m images. A Turing test showed that physicians could not reliably distinguish generated PET images from real PET images. Three nuclear medicine physicians interpreted the generated PET images with high accuracy and agreement. We obtained similar quantitative results in temporal and external validations. Thus, using deep-learning techniques, we can generate interpretable PET images from the low-quality PET images that result from short scanning times. Although more clinical validation is needed, we confirmed the possibility that short-scanning protocols combined with a deep-learning technique can be used in clinical applications.

Since PET/CT scanners are used in most hospitals, a restoration technique using only PET without MRI information is needed.
In this study, we applied a deep-learning technique to short-scanning FBB PET image restoration. The proposed method uses PET images only, without additional information such as MRI or CT. We performed qualitative and quantitative analyses to evaluate the clinical applicability of the proposed method.

Materials and methods
The Institutional Review Board (IRB) of Dong-A University Hospital reviewed and approved this retrospective study protocol (DAUHIRB-17-108). The IRB waived the need for informed consent, since only anonymized data would be used for research purposes. We used all methods in accordance with the relevant guidelines and regulations.
Patients and F-18 FBB brain PET acquisition. For training and internal validation of our deep-learning algorithm, we retrospectively enrolled 294 patients with clinically diagnosed cognitive impairment who had undergone FBB PET between December 2015 and May 2018. We also randomly collected 30 patients who had undergone FBB PET from January to May 2020 for temporal validation. Of these 30 patients, we excluded two because of insufficient clinical information, so 28 patients were finally included. We excluded patients with head movement during PET scanning. All FBB PET examinations were done using a Biograph mCT flow scanner (Siemens Healthcare, Knoxville, TN, USA). The PET/CT imaging followed the routine examination protocol of our hospital, which is the same method used in the previous study published by our group 12 . We injected 300 MBq of F-18 florbetaben intravenously and started PET/CT acquisition 90 min after the radiotracer injection. A helical CT scan was carried out with a rotation time of 0.5 s at 120 kVp and 100 mAs, without an intravenous contrast agent. A PET scan followed immediately, and the image was acquired for 20 min in list mode. All images were acquired from the skull vertex to the skull base. We reconstructed the 20-min list-mode PET data into a 20-min static image (PET20m) and used it as the full-time ground-truth image. We also reconstructed a short-scanning static PET image (PET2m) using the first 2 min of the list-mode data. We used the same reconstruction parameters for both PET20m and PET2m images.
In addition, we carried out external validation using data obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu). Among the subjects who underwent FBB PET, we randomly selected 60 patients and excluded two because of inconsistency in the brain amyloid plaque load (BAPL) scoring, leaving 58 patients.
The characteristics of all subjects included in this study are summarized in Table 1.
Deep-learning method. Network architecture. We adopted a generative adversarial network that consists of two competing neural networks with an additional pixelwise loss 13 . The schematic diagram of the proposed network is shown in Fig. 1. The generator (G) is trained to generate a synthetic PET20m-like (sPET20m) image from the noisy PET2m image, and the discriminator (D) is trained to distinguish sPET20m images generated by the generator from real PET20m images. In the training procedure, the discriminator enables the generator to produce more realistic sPET20m images 14 . The pixelwise loss is defined as the mean-squared error between sPET20m images and original PET2m images, which prevents the generator from changing small anomalies or structures of the PET2m images during training 15 . The generator is constructed using deep convolutional framelets, which consist of encoder-decoder structures with skip connections 16 . Both the encoder and decoder paths contain two repeated 3 × 3 convolutions (conv), each followed by batch normalization (bnorm) and a leaky rectified linear unit (LReLU) 17,18 . A 2-D Haar wavelet decomposition (wave-dec) and recomposition (wave-rec) are used for down-sampling and up-sampling of the features, respectively 19 . In the encoder path, three high-pass filters after wavelet decomposition skip directly to the decoder path (arrow marked 'skip'), and one low-pass filter (marked 'LF') is concatenated with the features in the encoder path at the same step (arrow marked 'skip & concat'). At the end, a convolution layer with a 1 × 1 window is added to match the dimensions of the input and output images. The numbers below the rectangular boxes in Fig. 1 indicate the number of filters. The architecture of deep convolutional framelets is similar to that of the U-net 20 , a standard multi-scale convolutional neural network (CNN) with skip connections.
The difference is the use of wavelet decomposition and recomposition, instead of max-pooling and un-pooling, for down-sampling and up-sampling, respectively. The additional skip connections of the high-frequency filters help the network learn the detailed relationship between PET2m and PET20m images.
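The wavelet down-/up-sampling step can be sketched in a few lines. The following NumPy snippet is our own illustration, not the authors' code: it performs a single-level 2-D Haar decomposition of an even-sized image into one low-pass band and three high-pass bands at half resolution, and the matching recomposition reconstructs the input exactly.

```python
import numpy as np

def haar_decompose(x):
    """Single-level 2-D Haar decomposition of an even-sized image.

    Returns the low-pass band (LL) and the three high-pass bands
    (LH, HL, HH), each at half the input resolution.
    """
    a = x[0::2, 0::2]
    b = x[0::2, 1::2]
    c = x[1::2, 0::2]
    d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # low-pass (local average)
    lh = (a - b + c - d) / 2.0  # horizontal detail
    hl = (a + b - c - d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def haar_recompose(ll, lh, hl, hh):
    """Inverse of haar_decompose: perfectly reconstructs the image."""
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out
```

Because the Haar transform is orthogonal, no information is discarded at the down-sampling step, which is the key contrast with max-pooling.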
For the discriminator, we adopted a standard CNN without a fully connected layer. The discriminator contains three convolution layers with a 4 × 4 window and a stride of two in each direction, each followed by batch normalization and a leaky ReLU with a slope of 0.2. At the end of the architecture, a 1 × 1 convolution is added to generate a single-channel output.

Datasets for training and internal validation.
From the dataset of the 294 patients' PET images (70 image slices per patient), we randomly split the data into 80% for training and 20% for internal validation, using 236 patients' images as the training dataset and 58 patients' images as the internal validation dataset. The original size of the PET images was 400 × 400 pixels. To improve training effectiveness, we cropped all 400 × 400 images to 224 × 224 pixels around the center of the image in both the horizontal and vertical directions. Only background (i.e., zero-valued) information was removed. We used the cropped images as input and label datasets for the proposed deep-learning network. In the testing procedure, we resized the images restored by the trained generator back to 400 × 400 by adding rows and columns of zeros at the top, bottom, left, and right sides of the images (i.e., zero padding). We did not use data augmentation such as rotation or flipping for training.
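The center-crop and zero-pad steps described above can be sketched as follows (a minimal illustration with function names of our own choosing; the paper does not publish its preprocessing code):

```python
import numpy as np

def center_crop(img, size=224):
    """Crop a square region of `size` pixels around the image centre."""
    h, w = img.shape
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]

def zero_pad(img, size=400):
    """Pad the image back to `size` x `size` with zeros on all sides."""
    h, w = img.shape
    top = (size - h) // 2
    left = (size - w) // 2
    out = np.zeros((size, size), dtype=img.dtype)
    out[top:top + h, left:left + w] = img
    return out
```

As long as the removed border is truly background (all zeros), `zero_pad(center_crop(x))` recovers the original 400 × 400 image exactly.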
Network training. We ran training for 200 epochs using the Adam solver with a learning rate of 0.0002 and a mini-batch size of 10 21 . The network was implemented using TensorFlow on a CPU (Intel Core i9-7900X, 3.30 GHz) and GPU (NVIDIA Titan Xp, 12 GB) system 22 . Training took about 68 h. The network weights were initialized from a Gaussian distribution with a mean of 0 and a standard deviation of 0.01.
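Conceptually, the generator objective combines an adversarial term with the pixelwise mean-squared error described in the architecture section. The NumPy sketch below assumes a standard log-loss adversarial term and a trade-off weight `lam`, neither of which is specified in the paper, so both are illustrative assumptions:

```python
import numpy as np

def generator_loss(d_fake, s_pet20m, pet2m, lam=0.5):
    """Sketch of a combined generator objective.

    d_fake   : discriminator outputs (probabilities) for generated images
    s_pet20m : generated (synthetic) standard-time images
    pet2m    : original short-scanning input images
    lam      : trade-off weight (hypothetical; not stated in the paper)
    """
    eps = 1e-12
    # Adversarial term: push the discriminator toward labelling fakes as real.
    adversarial = -np.mean(np.log(d_fake + eps))
    # Pixelwise term: mean-squared error that discourages the generator
    # from altering small structures of the input image.
    pixelwise = np.mean((s_pet20m - pet2m) ** 2)
    return adversarial + lam * pixelwise
```

When the discriminator is fully fooled (`d_fake` near 1) and the generated image matches the input, both terms vanish and the loss approaches zero.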
Assessment of image quality. We compared the image quality of the PET2m and synthesized sPET20m images with the original PET20m images using the peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and normalized root mean-square error (NRMSE). The SSIM index is defined as SSIM(x, y) = (2 μx μy + C1)(2 σxy + C2) / ((μx² + μy² + C1)(σx² + σy² + C2)), and depends on the parameters K1 and K2 through the constants C1 = (K1 L)² and C2 = (K2 L)², where L is the dynamic range of pixel values 8 . In our study, we chose C1 = (0.0002 × 65535)² and C2 = (0.0007 × 65535)². The proposed method was also compared with the standard U-net method.
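The three metrics can be computed as follows. This sketch is our own: for brevity the SSIM uses a single global window rather than the usual local sliding window, but it plugs in the paper's constants C1 = (0.0002 × 65535)² and C2 = (0.0007 × 65535)²:

```python
import numpy as np

def psnr(ref, img, peak=65535.0):
    """Peak signal-to-noise ratio in dB, for a 16-bit dynamic range."""
    mse = np.mean((ref - img) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def nrmse(ref, img):
    """Root mean-square error normalized by the reference dynamic range."""
    return np.sqrt(np.mean((ref - img) ** 2)) / (ref.max() - ref.min())

def ssim_global(x, y, L=65535.0, k1=0.0002, k2=0.0007):
    """Global (single-window) SSIM with the paper's K1, K2 constants."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

For identical images, SSIM evaluates to 1 and NRMSE to 0; higher PSNR and SSIM and lower NRMSE indicate closer agreement with the ground truth.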
For further analysis, we calculated the standardized uptake value ratio (SUVR) using PMOD 3.6 software (PMOD Technologies, Zurich, Switzerland) 23 . We obtained the transformation matrix of each participant by fusing the CT template of PMOD with the CT image of the participant. PET images were then spatially normalized using the transformation matrix of each participant, and an automated anatomical labeling template of PMOD (Hammers atlas) was applied. We spatially normalized all pairs of sPET20m and PET20m images to the Montreal Neurological Institute (MNI) spatial template and applied the Hammers atlas. From the volumes of interest of the atlas, the representative areas were defined as the striatum, the frontal, parietal, temporal, and occipital lobes, and the global brain. We calculated the SUVRs of the representative areas using the cerebellar cortex as the reference region, and compared the SUVRs of the identical areas between sPET20m and PET20m images.
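Once the images are normalized to the atlas space, the SUVR itself reduces to a ratio of regional means. A minimal sketch (our own illustration, with a hypothetical integer label encoding for the atlas regions) is:

```python
import numpy as np

def suvr(pet, atlas, target_label, reference_label):
    """Mean uptake in a target region divided by the mean uptake in the
    reference region (here, the cerebellar cortex).

    pet   : spatially normalized PET image (array of voxel intensities)
    atlas : integer label map of identical shape (hypothetical encoding)
    """
    target_mean = pet[atlas == target_label].mean()
    reference_mean = pet[atlas == reference_label].mean()
    return target_mean / reference_mean
```

In practice the atlas labels come from the Hammers template applied in PMOD; the function above only shows the final ratio computation.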
Clinical interpretations. For visual interpretation, three nuclear medicine physicians with certification and experience in amyloid PET readings participated (YJ and DY have over 15 years and JE has 4 years of experience in nuclear medicine; all of them also have 4 years of experience in amyloid PET assessment). They were blinded to the clinical data and independently read all PET images of the internal validation dataset.
Turing test. We performed two Turing tests, evaluating all PET images of the internal validation dataset. First, from all the sPET20m and PET20m images, we randomly selected 58 images and presented them to the physicians one by one to decide whether each PET image was real or synthetic (Test 1). Second, we presented a pair of sPET20m and PET20m images of the same patient to the physicians, who were asked to identify the original PET20m image (Test 2). We anonymized all PET images and randomized their order.
BAPL score. We gave all anonymized sPET20m images of the internal validation dataset to the physicians to interpret and score according to the conventional interpretation protocol. All the sPET20m images were classified into three groups according to the BAPL scoring system. The BAPL score is a specialized, predefined three-grade scoring system for F-18 FBB PET in which the physician visually assesses the subject's amyloid deposits in the brain 24 . BAPL scores of 1 (BAPL 1), 2 (BAPL 2), and 3 (BAPL 3) indicate no amyloid load, minor amyloid load, and significant amyloid load, respectively. Therefore, BAPL 1 indicates a negative amyloid deposit, whereas BAPL 2 and BAPL 3 represent positive amyloid deposits. In this study, we treated the BAPL score read from the PET20m images, set by consensus among the three physicians, as the ground-truth score. We measured the accuracy of the BAPL score for each physician. We also analyzed the agreement between the BAPL scores of the sPET20m and PET20m images for each physician.
Temporal and external validations. We further verified our model by measuring PSNR, SSIM, NRMSE, and SUVR in temporal and external validations. The patient characteristics of the temporal and external validation datasets are summarized in Table 1. We performed all analyses of the temporal validation in the same manner as the internal validation. We performed the external validation using a public FBB dataset from ADNI. The ADNI datasets contain a series of four 5-min FBB PET frames. The proposed model, trained on our institutional dataset (i.e., pairs of 2-min and 20-min images), was tested on the first 5-min PET images. A Gaussian filter with 4-mm FWHM was applied to all FBB PET images of the ADNI datasets.

Statistical analysis.
We assessed the intra-observer agreement of the BAPL score between the sPET20m and PET20m images using Cohen's weighted kappa. We calculated the accuracy, sensitivity, and specificity of the interpretations of sPET20m images. We assessed differences in group characteristics using the independent t-test, one-way ANOVA, and chi-squared test. We evaluated the difference in SUVR between sPET20m and PET20m images using the independent t-test or Mann-Whitney U test, and evaluated the relationship between their SUVRs using Pearson's correlation coefficient. We assessed the agreement of the SUVRs of both PET images using the Bland-Altman 95% limits of agreement. We performed the statistical analyses using MedCalc version 16.4 (MedCalc Software, Mariakerke, Belgium) and NCSS 12 (NCSS, LLC, Kaysville, UT, USA). Statistical significance was defined as p < 0.05.

Results
Assessment of image quality. PSNR, SSIM and NRMSE. The original PET2m, PET20m, and sPET20m images and a synthetic image generated by U-net are shown in Fig. 2.
Both the proposed and U-net methods significantly reduce noise, but the U-net produces a slightly blurrier image than the proposed method. For quantitative comparison, we calculated the averaged PSNR, SSIM, and NRMSE for all datasets. The results are summarized in Table 2, which shows that the proposed method had the highest PSNR and SSIM and the lowest NRMSE, whereas the PET2m images showed the worst performance in the internal validation. The proposed model shows similar performance on the temporal validation dataset in terms of PSNR, SSIM, and NRMSE. As shown in Table 2, our method also improved the image quality of the 5-min images in the external validation.

Clinical interpretations for internal validation dataset.
Turing test. Tests 1 and 2 showed similar results (Table 3). Test 1, a test to decide whether the presented single PET image was real or synthetic, showed that, regardless of the duration of clinical reading experience in nuclear medicine, the overall accuracy was not high (44.8% and 63.8%). In Test 2, a test to select a real PET image out of two PET images of the same patient, the more experienced the physicians were in clinical reading, the more often the real PET image was selected (48.3-60.3%). Overall, however, the clinicians did not seem to distinguish well between generated PET images and real PET images.
BAPL score. The three physicians assessed the sPET20m images according to the BAPL scoring system, and no image was of such poor or inadequate quality that it was difficult to interpret. In five, six, and eight patients out of 58, respectively, each physician assessed the BAPL score differently from the ground-truth score. Table 4 shows the accuracy, sensitivity, and specificity for the three physicians. Overall, the mean accuracy, sensitivity, and specificity were 89.1%, 91.3%, and 83.3%, respectively. The confusion matrices are provided in Table 5.
We evaluated the intra-observer agreement using Cohen's weighted kappa by comparing the BAPL scores between the sPET20m and PET20m images. The physicians' Cohen's weighted kappas were 0.902 (DY), 0.887 (YJ), and 0.844 (JE), with a mean value of 0.878.

Discussion
In this study, we investigated the feasibility of a deep-learning-based reconstruction approach using short-time acquisition PET scans. We used PET images acquired for 2 and 20 min as input and target images, respectively. Quantitative and qualitative analyses showed that the proposed method effectively produces synthetic PET images from short-scanning PET images. We calculated image-quality metrics (PSNR, SSIM, and NRMSE) between the synthetic images and the ground-truth images (standard-scanning images) for model evaluation. Overall, the proposed method improved the image quality by suppressing the noise in short-scanning images. Note that the SSIM index depends on the parameters (K1 and K2). In our study, the average SSIM index for the synthetic images increased from 0.8818 to 0.9939 when (K1, K2) were increased from (0.0002, 0.0007) to (0.01, 0.03). Even in this case, however, the differences in the SSIM index were very small. Our deep-learning method also improved the image quality of the 5-min images of the ADNI dataset, even though the test domain differs significantly from our training domain. We adopted the GAN framework with an additional mean-squared loss between the synthetic sPET20m image and the PET2m image. The performance of the proposed network was compared with that of the conventional U-net. The U-net minimizes only the pixelwise loss between the synthetic PET image and the ground-truth (i.e., PET20m) image, resulting in an over-smoothed image, whereas the proposed approach clearly reconstructs the detailed structures of the brain (Fig. 2) 25 . In terms of quality measurements such as PSNR, NRMSE, and SSIM, the proposed method outperformed the U-net. Generating a single synthetic sPET20m image from a PET2m image took only a few milliseconds on the GPU system, which would make the proposed method adequate for clinical use.
Some previous studies have also tried to reduce noise and improve image quality using deep-learning techniques in PET imaging [5][6][7][8][9] . Most of these studies aimed at maintaining the quality of the PET image while reducing the injected dose of radiopharmaceuticals in order to minimize radiation exposure. They showed that low-dose PET images could be restored to a quality comparable to that of original PET images obtained with standard protocols while reducing the conventional radiopharmaceutical dose by up to 99%. However, they all used synthesized low-dose data (i.e., a small amount of data selected from the entire acquisition period), which may differ from measured data obtained from a true low dose. A feasibility study on real data is needed for clinical use. One study restored a low-quality PET image taken in 3 min to match a standard image taken in 12 min 7 . That study differs from ours in that it used simultaneously acquired MRI information to restore image quality. Considering the absence of PET/MRI scanners in most hospitals, the proposed method, which uses PET images only, could be used in general clinical practice. Another study reported that using a 5-min PET image, one frame of 20-min data, without deep-learning methods did not relevantly affect the accuracy of disease discrimination 26 . The advantage of our method is that it can generate PET images resembling full-time scanning images from only 2 min of data taken from any part of the acquisition, regardless of the frame. In our study, no comparison of diagnostic accuracy between PET images obtained by our method and 5-min PET images was performed. However, if PET image reconstruction with short-time data is required, we think that our method, along with the PET imaging method using a single 5-min frame, can broaden the range of options that can be selected according to the situation.
Since amyloid PET images are used in hospitals to care for patients with memory impairment, deep-learning-generated images must be of sufficiently similar quality to the original images to be used for interpretation in the clinic. In this study, we used several methods to decide whether the generated images could be clinically usable. We designed tests to answer the following questions: Can physicians distinguish between PET20m and generated sPET20m images? What is the difference in visual interpretation results? What is the difference in quantitative analysis using the SUVR between the two image types?
When the PET20m and generated sPET20m images were presented at the same time to the three nuclear medicine physicians in charge of clinical reading, the accuracy of selecting the PET20m images was within 40-60%. This suggests that the synthetic PET images generated by our method are almost indistinguishable from the real PET images.

Table 5. Confusion matrices for the interpretation of PET images using the BAPL score between the PET20m and sPET20m images. BS, BAPL score of ground truth; GT, ground truth; sBS, BAPL score of the synthetic PET image.

              Physician 1              Physician 2              Physician 3
         BS1  BS2  BS3  Total     BS1  BS2  BS3  Total     BS1  BS2  BS3  Total
sBS1      13    2    0     15      15    5    0     20      12    4    0     16
sBS2       3   16    0     19       1   13    0     14       4   14    0     18
sBS3       0    0   24     24       0    0   24     24       0    0   24     24
Total     16   18   24     58      16   18   24     58      16   18   24     58

Next, we performed the BAPL scoring test to assess intra-observer agreement and diagnostic accuracy. In our study, Cohen's weighted kappa was above 0.84, which indicates almost perfect intra-observer agreement. We also performed BAPL scoring on the generated PET images and compared the scores with the ground-truth scores. In the strongly positive cases (BAPL 3), all three physicians showed 100% accuracy, but in the negative (BAPL 1) and weakly positive (BAPL 2) cases, between five and eight of the 58 patients were scored false-positive or false-negative. It is already known that amyloid PET itself, even when obtained with a conventional protocol, can be misclassified on visual reading; some studies have reported that about 10% of the results may be inconsistent 27,28 . In addition, some errors from the deep-learning algorithm could be added, so we think that the misclassification increased slightly in our study. We also think that the physician's judgment of how much amyloid uptake counts as positive may influence the visual reading that distinguishes BAPL 1 from BAPL 2.
Few deep-learning studies on subjects similar to ours have evaluated the accuracy of physicians' interpretations. One study showed 89% accuracy when readings were made using deep-learning-generated PET images, which is very similar to our result 6 . To compensate for the weak points of visual reading, the SUVR is used as a quantitative indicator in routine practice to infer the severity or prognosis of the disease 23 . In the generated brain PET images of this study, regional SUVRs were not significantly different from the values of the ground-truth images in negative and positive cases (p > 0.05). In the Bland-Altman analysis, the mean difference was 0.005 in the negative cases and 0.024 in the positive cases, and the limits of agreement of each region were small. That is, our deep-learning model can generate images with SUVR values comparable to those of the original PET images. We obtained similar results in the temporal and external validations, which allowed us to reconfirm this finding. Taken together, these results suggest that the synthetic amyloid PET images generated by our deep-learning method could be used for clinical reading purposes.
Our study has some limitations that need to be considered for clinical use. First, our deep-learning model, trained on FBB PET with 2-min data, should be tested under various acquisition conditions. Using multicenter datasets for training or incorporating domain adaptation techniques could improve image quality, which is part of our future work 29,30 . In this study, to avoid overfitting, we evaluated our model using the ADNI data, a completely different dataset, and our hospital data obtained at a different time from the training dataset. Second, we empirically chose 2-min images as the training dataset for short scanning. However, 2-min PET images may not be optimal, and more rigorous analysis may be needed to choose the proper short-scanning duration. Third, we generated only trans-axial PET images in this study. Although interpretation guidelines for FBB PET recommend using trans-axial PET images for clinical reading, coronal and sagittal PET images have also recently been used for reading. In a follow-up study, we need to apply our deep-learning model to generate all three orthogonal image planes. In addition, the application of a 3-dimensional model and the search for optimal hyperparameters remain problems to be solved in the future.
In conclusion, we presented an image-restoration method using a deep-learning technique to yield clinically acceptable amyloid brain PET images from short-time data. Qualitative and quantitative analyses by means of internal, temporal, and external validations showed that the image quality and quantitative values of the generated PET images were very similar to those of the original images. Although more evaluation and validation are needed, we found that applying deep-learning techniques to amyloid brain PET images can reduce acquisition time and provide interpretable images clinically equivalent to standard images.

Data availability
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.