DeepGhost: real-time computational ghost imaging via deep learning

The potential of random-pattern-based computational ghost imaging (CGI) for real-time applications has been offset by its long image reconstruction time and inefficient reconstruction of complex, diverse scenes. To overcome these problems, we propose a fast image reconstruction framework for CGI, called "DeepGhost", which uses a deep convolutional autoencoder network to achieve real-time imaging at very low sampling rates (10–20%). By transferring prior knowledge from the STL-10 dataset to a physical-data-driven network, the proposed framework can reconstruct complex unseen targets with high accuracy. The experimental results show that the proposed method outperforms existing deep learning and state-of-the-art compressed sensing methods used for ghost imaging under similar conditions. The proposed method employs a deep architecture with fast computation, and tackles the shortcomings of existing schemes, i.e., inappropriate architecture, training on limited data under controlled settings, and employing a shallow network for fast computation.

Scientific Reports | (2020) 10:11400 | https://doi.org/10.1038/s41598-020-68401-8 | www.nature.com/scientificreports/

... relying on basic correlation and probabilistic methods for target detection 14,15. Recently, there have been some interesting studies that explore the potential of DL for GI 16–20. For GI, the most relevant deep neural network model is the denoising autoencoder 21. An autoencoder can be used as an unsupervised feature learner to extract features from high-dimensional data in a systematic fashion. For GI, the autoencoder model can be used to recover a clean image from an undersampled ghost image reconstructed from fewer measurements, thus reducing reconstruction time.
The existing DL methods applied to CGI have limited applicability due to: (a) inappropriate architecture, (b) training on limited data or targets, and (c) employing a shallow network for real-time operation. These schemes can work under controlled settings but fail when tested on a large dataset with complex scenes and measurement noise. For example, in Ref. 16 a stacked neural network model was used, confirming the potential of DL in CGI. The model employs a shallow fully connected network, which is known to have high computational complexity and is prone to overfitting 22. The model seems to work well with the MNIST dataset, but its fully connected architecture is not suitable for complex image analysis. For image analysis, a more apt choice is the convolutional neural network (CNN) 23. The work presented in Ref. 17 proposed a better (autoencoder) model based on a CNN for CGI. However, the network was trained only for a particular object with a limited training dataset, and therefore did not utilize the true power of CNNs.
In this paper, we demonstrate a CGI system that employs a deep convolutional autoencoder network (DCAN) to reconstruct images in real time, using only a photodiode and random binary patterns for target scanning. The proposed DCAN (called "DeepGhost") strikes a balance between depth of layers and computation speed by employing a novel architecture for improved image recovery and fast network convergence. By employing augmentation and transfer learning, the proposed method can image complex unseen targets with high efficiency. Through simulations and experiments, we validate the superiority of our model by comparing it with existing DL 16,17 and state-of-the-art compressive sensing algorithms 24 used for GI under similar conditions.

Results
Simulations. The network architecture for DeepGhost is shown in Fig. 1. The idea is to feed the network with undersampled (10%, 15%, 20%, and 25%) target images (acquired from the CGI setup) and obtain a clear target reconstruction. The proposed network is optimized for the physical imaging setup through exhaustive testing in numerical simulations. For training and testing, the STL-10 25 dataset is used, which comprises 10 classes: monkey, cat, dog, deer, car, truck, airplane, bird, horse, and ship. A sample image from each class is shown in Fig. 2.

Comparison with conventional and CS algorithms. First, the performance of DeepGhost is evaluated through comparison with differential ghost imaging (DGI 26) and compressive sensing methods 24. The DeepGhost model is first trained on the STL-10 dataset (10,000 images), and then evaluated on a validation dataset (1,000 images) that is not seen during training. The same validation dataset is used as target images for the DGI and CS based methods. In this paper, the sampling ratio 'S' is defined as the ratio of the number of measurements to the image size in pixels. For quantitative comparison, the peak signal-to-noise ratio (PSNR) and Structural SIMilarity (SSIM) 27 metrics are used.
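As a rough illustration of these two definitions, the sketch below converts a sampling ratio into a measurement count and computes the PSNR between two images. It assumes 96 × 96 images (the size quoted in the Methods), pixel values scaled to [0, 1], and rounding to the nearest integer; the function names `num_measurements` and `psnr` are illustrative and are not taken from the authors' code.

```python
import math


def num_measurements(sampling_ratio, height=96, width=96):
    """Number of patterns implied by S = measurements / pixels (rounded)."""
    return round(sampling_ratio * height * width)


def psnr(reference, estimate, peak=1.0):
    """PSNR in dB between two equally sized images given as nested lists."""
    squared_errors = [
        (r - e) ** 2
        for row_r, row_e in zip(reference, estimate)
        for r, e in zip(row_r, row_e)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    for s in (0.10, 0.15, 0.20, 0.25):
        print(f"S = {s:.2f} -> {num_measurements(s)} measurements")
```

Under these assumptions, S = 0.2 on a 96 × 96 image corresponds to 1,843 patterns out of 9,216 pixels.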

Results and analysis.
For qualitative comparison, an image from the "monkey" class of the validation dataset is chosen. We evaluate the reconstruction results of the DGI, Sparse, total variation (TV), and DeepGhost algorithms (see details in the "Methods" section) for sampling ratios ranging from 0.1 to 0.25. We use the Sparse and TV algorithms, which are well-known high-performance algorithms, specifically for comparing reconstruction quality. By visual inspection, it can be seen from Fig. 3 that the reconstruction results for TV and DeepGhost are almost identical. For a low sampling ratio of 15%, we get a reasonable target reconstruction for a complex scene using DeepGhost. However, to achieve better results on the overall dataset and diverse scenes, we resort to S = 0.

Results and analysis.
The PSNR over the test set (1,000 images) is computed during training and plotted against training epochs in Fig. 4a. The PSNR of each reconstructed image is calculated with respect to its ground truth counterpart. It can be seen from Fig. 4a that it is very challenging for the GIDL network (the fully connected model of Ref. 16) to recover image details from an undersampled image; it achieves low PSNR values throughout its training. This is easy to understand because fully connected neural networks are not ideal for image analysis. Although they can perform well on simple datasets (e.g., digits), it is difficult for them to achieve satisfactory performance on complex images. Moreover, the training time for the GIDL network is very long compared to DeepGhost due to its fully connected structure. Compared to GIDL, DLGI (the CNN-based model of Ref. 17) employs a better network based on convolutional layers. However, Fig. 4a shows that DeepGhost also outperforms DLGI in image reconstruction quality, reaching high PSNR values within a few epochs.
It is important to highlight that training converges faster for DeepGhost than for both the DLGI and GIDL networks. This indicates that simply using deep networks for image reconstruction may not lead to satisfactory performance. Since DeepGhost uses skip connections along with a deep architecture, it can achieve better results with fast convergence. In view of the long convergence times of the other models compared to DeepGhost, we carry out the comparison at a high learning rate (lr = 0.001). It can be seen from Fig. 4a that the PSNR of DeepGhost fluctuates after ~10 epochs. This is because our network converges faster at a high learning rate than the DLGI and GIDL networks and then begins to overfit. Therefore, we choose a lower learning rate (lr = 0.0001) for DeepGhost training. To further investigate performance differences between these networks, a qualitative comparison is presented in Fig. 4b.
From Fig. 4b, it can be seen that the GIDL network fails to reconstruct complex targets because of its fully connected architecture. Therefore, this kind of network is not suitable for dynamic CGI. Similarly, the DLGI network, which uses a shallow convolutional structure, only roughly estimates the target and fails to provide a clear reconstruction. In contrast, DeepGhost provides much better reconstructions for complex, diverse targets. This superior performance can be attributed to its denoising autoencoder structure with skip connections, which allows a deep architecture at low computational cost. The poor performance of DLGI and GIDL results from their reliance on a simple architecture, a shallow network (to reduce computational time), and validation on limited data.
To evaluate noise robustness, the performance of DeepGhost is compared with DLGI (which gives slightly better reconstruction than GIDL). In this experiment, detection fluctuations are simulated by adding noise (using the awgn() function in MATLAB) to the measurement data (intensity values), resulting in different SNRs. The reconstruction results for the 'bird' image at S = 0.2 are shown in Fig. 5. From the qualitative comparison in Fig. 5, it can be seen that the DLGI network fails to combat noise, giving poor reconstruction quality at different SNRs. This indicates that the convolutional layers of DLGI, which have no mechanism to suppress noise, fail to recover a clean target. On the other hand, the DeepGhost network, based on a denoising autoencoder architecture, learns to suppress noise using its compressing/decompressing stages, recovering clean targets at different SNRs. This noise suppression is further aided by skip connections, which carry high-frequency information across layers to recover fine details that are lost during noise suppression. From the overall comparison, it can be concluded that the DeepGhost model is more suitable for practical CGI than existing networks. The reconstruction results for DeepGhost at different sampling ratios are shown in Fig. 6.
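The noise injection used in this robustness test can be approximated as follows. This is a minimal sketch and not a port of MATLAB's awgn(): it assumes the noise power is set relative to the measured power of the bucket signal, and `add_awgn` and `measured_snr_db` are hypothetical helper names.

```python
import math
import random


def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power ~ snr_db.

    Assumption: the signal power is measured from the data itself.
    """
    signal_power = sum(v * v for v in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    rng = random.Random(0)
    intensities = [rng.random() for _ in range(5000)]  # stand-in bucket values
    for snr in (5.0, 10.0, 20.0):
        noisy = add_awgn(intensities, snr, rng)
        print(f"target {snr:.0f} dB, measured {measured_snr_db(intensities, noisy):.2f} dB")
```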
Physical experiments. The experimental arrangement of the CGI setup is shown in Fig. 7. A series of random binary patterns is projected using a custom-made projection system. Light from the source LED is modulated by a TI DLP6500 DMD. A projection lens with a focusing dial is used to project sharp patterns onto the target plane. Target scenes are printed on A4-sized white paper (using a regular printer). The target is placed at a distance of 500 mm from the plane of projection and detection. Light back-reflected from the scene is collimated onto the photodetector (Thorlabs; 21 mm² active area) by a 5 mm imaging lens. Intensity measurements captured by the photodetector are digitized by a 16-bit data acquisition (DAQ) card (sampling at 2 MS/s). Custom software is used to project patterns and acquire intensity values (using a synchronous trigger) for computation. The rudimentary image reconstructed by the software from undersampled data is passed to DeepGhost for clean reconstruction. The data collection and preparation (of experimental and synthetic data) for training takes a week.
Experiment-1 results. In the first experiment, we directly apply the DeepGhost model trained on the simulation dataset to reconstruct target images taken from random image sources (airplane and dog images 28, the standard mandrill test image, and our university logo). It is observed that applying the simulation-trained model under physical conditions (e.g., noise, target reflectivity) requires the undersampled input to be reconstructed at S = 0.4. Therefore, in this case we capture input images through our CGI (DGI) setup at a 40% sampling rate to obtain a clear target reconstruction. Figure 8 (a,c: good case; b,d: worst case) shows the reconstructed images with the corresponding PSNR and SSIM values. From Fig. 8, it can be seen that the network is able to reconstruct random images from different classes. However, the network is unable to reconstruct all random targets with clarity because of limited training data and limited knowledge of the physical imaging environment. In fact, it is very challenging to optimize a DL model for CGI directly through simulation data for reconstructing diverse random scenes. To counter this problem, we apply augmentation and transfer learning in our experiments.

Experiment-2 results.
In the second experiment, the proposed network is trained on undersampled images acquired from the CGI setup (through DGI for different targets), with their ground truth counterparts set as the training output. To enlarge the limited data acquired from the physical setup, we apply data augmentation (using Keras's DataGenerator module; applying translation, rotation, and added noise to the images). Even though the data can be increased through augmentation, the network is still prone to overfitting. Therefore, we further use transfer learning to make the network highly scalable. Transfer learning provides prior knowledge from the large dataset (obtained during training) to the smaller augmented dataset to perfect imaging under physical conditions. The results for the 'mandrill' test image are presented in Fig. 9. It can be seen that the results from experiment-2 (Fig. 9) are very clear compared to the result from the simulation-based model (Fig. 8b). The results on the validation dataset, shown in Fig. 10, are understandably consistent. Overall, it is observed that simple targets with a plain background are easily reconstructed at S = 0.2. However, for some complex targets (e.g., Fig. 10a,d), better image quality is achieved at a slightly higher sampling ratio (Fig. 11). This is due to (1) practical system noise that can blur reconstructed images by corrupting feature extraction and/or (2) complex image features of random unseen images. The overall results indicate that the reconstruction quality at a 20% sampling rate using binary-random-pattern-based CGI is very promising.

It can be seen from Table 1 that DeepGhost can achieve real-time frame rates (fps), in contrast to conventional methods, which carry a high reconstruction overhead.
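Returning to the augmentation step used in Experiment-2 (translation, rotation, and added noise), the idea can be sketched in plain Python as below. This is not the Keras pipeline the authors used: rotations are restricted to multiples of 90 degrees to avoid interpolation, and the shift range, noise level, and helper names (`translate`, `rotate90`, `add_noise`, `augment`) are assumptions chosen for illustration.

```python
import random


def translate(image, dx, dy, fill=0.0):
    """Shift an image by (dx, dy) pixels, padding exposed pixels with `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = image[y][x]
    return out


def rotate90(image):
    """Rotate an image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def add_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip the result to [0, 1]."""
    return [
        [min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
        for row in image
    ]


def augment(image, rng, max_shift=2, sigma=0.05):
    """Return one randomly rotated, shifted, and noised copy of a square image."""
    out = image
    for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
        out = rotate90(out)
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return add_noise(translate(out, dx, dy), sigma, rng)
```

When augmenting input/ground-truth pairs, the same geometric transform would have to be applied to both images, with noise added only to the undersampled input.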

Methods
Principles and methods of CGI. In computational ghost imaging, a target scene $O(x, y)$ is reconstructed by correlating a series of modulation patterns $P_i(x, y)$ with intensity measurements $S_i$ at the bucket detector. In the standard correlation form, the target scene can be reconstructed by 29:

$$O(x, y) = \langle S_i P_i(x, y) \rangle - \langle S_i \rangle \langle P_i(x, y) \rangle \qquad (1)$$

where $S_i$ is the ith measurement, $P_i$ is the ith modulation pattern, and the ensemble average for $N$ iterations is given by:

$$\langle S_i \rangle = \frac{1}{N} \sum_{i=1}^{N} S_i$$

To reconstruct a high-quality image, a large number of measurements is required. To improve the performance of correlation-based GI, DGI has been proposed 26. Figure 3 shows images reconstructed using DGI, defined in its standard form by:

$$O_{\mathrm{DGI}}(x, y) = \langle S_i P_i(x, y) \rangle - \frac{\langle S_i \rangle}{\langle R_i \rangle} \langle R_i P_i(x, y) \rangle \qquad (2)$$

where $R_i$ is the reference signal. It is evident that even with these methods, GI still requires a large number of measurements (long imaging time) to produce a quality image.
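The two reconstructions can be tried numerically on a toy scene. The sketch below is written under stated assumptions rather than as the authors' implementation: it uses the standard estimators ⟨SP⟩ − ⟨S⟩⟨P⟩ for correlation-based GI and ⟨SP⟩ − (⟨S⟩/⟨R⟩)⟨RP⟩ for DGI, takes the reference signal R_i to be the sum of the ith pattern, ignores detector noise, and works on a flattened 8 × 8 object. All function names are illustrative.

```python
import math
import random


def simulate_measurements(obj, n_patterns, rng):
    """Project random binary patterns onto a flattened object.

    Returns the patterns and the noiseless bucket signals S_i = sum(O * P_i).
    """
    patterns, signals = [], []
    for _ in range(n_patterns):
        p = [rng.randint(0, 1) for _ in obj]
        patterns.append(p)
        signals.append(sum(o * v for o, v in zip(obj, p)))
    return patterns, signals


def reconstruct(patterns, signals):
    """Return (correlation GI, differential GI) estimates of the object."""
    n = len(signals)
    n_pix = len(patterns[0])
    refs = [sum(p) for p in patterns]  # assumed reference signal R_i
    mean_s = sum(signals) / n
    mean_r = sum(refs) / n
    mean_p = [0.0] * n_pix
    mean_sp = [0.0] * n_pix
    mean_rp = [0.0] * n_pix
    for p, s, r in zip(patterns, signals, refs):
        for k, v in enumerate(p):
            mean_p[k] += v / n
            mean_sp[k] += s * v / n
            mean_rp[k] += r * v / n
    cgi = [sp - mean_s * mp for sp, mp in zip(mean_sp, mean_p)]
    dgi = [sp - (mean_s / mean_r) * rp for sp, rp in zip(mean_sp, mean_rp)]
    return cgi, dgi


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    rng = random.Random(1)
    # 8 x 8 scene containing a bright 4 x 4 square, flattened row by row.
    scene = [1.0 if 2 <= r < 6 and 2 <= c < 6 else 0.0
             for r in range(8) for c in range(8)]
    for n in (64, 512, 3000):
        pats, sigs = simulate_measurements(scene, n, rng)
        cgi, dgi = reconstruct(pats, sigs)
        print(n, round(pearson(cgi, scene), 3), round(pearson(dgi, scene), 3))
```

Running the demo with increasing pattern counts shows how strongly both estimators depend on the number of measurements, which is the limitation the text describes.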
To reduce reconstruction time for CGI, compressive sensing methods have been applied to ghost imaging 11,30,31. CS theory allows an object (target scene) $O(x, y)$ to be reconstructed from a set of undersampled measurements $S$, assuming that the object is sparse in a fixed basis. For evaluation, we process our GI data with two commonly used priors for natural images: the sparse prior and the total variation (TV) regularization prior. The sparse representation prior 32 considers a natural image to be represented by an orthogonal basis (discrete cosine transform) matrix $D$ and a coefficient vector $c$. The reconstruction for CGI is achieved by minimizing an $\ell_1$-norm objective on $c$ in augmented Lagrangian form, where $y$ is the Lagrange multiplier and $\mu$ is the balancing parameter. This $\ell_1$-minimization problem can be solved using the augmented Lagrange multiplier (ALM) method 33. The TV regularization prior is related to the gradient of an image. If $G$ is the gradient matrix of an image, the TV-prior-based reconstruction is obtained by solving a corresponding minimization on the image gradient.

DeepGhost. The proposed deep convolutional autoencoder architecture is shown in Fig. 1. The network employs convolutional layers with trainable filters for extracting features and filtering corruptions from the image. The encoding stages use 32, 64, and 128 (Conv2D) filters for scaling down the data. The compressed data is grouped at an "intermediate" layer with 256 conv-filters. The decoding stages use 128, 64, and 32 filters for reconstructing the encoded image. The output is reconstructed using a single conv-filter at the end. To visualize data processing at each layer, the feature maps for an unseen target (pepper test image) through the network pipeline are shown in Fig. 12. To prevent the network from operating in saturated or dead regions of the activation, the network is initialized with Xavier initialization 34. After every convolutional layer, a batch normalization layer 35 is used to improve training efficiency.
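To make the filter counts quoted for DeepGhost concrete, the following sketch traces feature-map shapes through an encoder-decoder with 32, 64, and 128 encoding filters, a 256-filter intermediate layer, mirrored decoding stages, and a single output filter. The spatial sizes are an assumption: the sketch supposes a 96 × 96 input, size-preserving convolutions, one 2 × 2 max-pooling per encoding stage, and one 2× up-sampling per decoding stage, none of which is specified in the text above. `trace_shapes` is a hypothetical helper.

```python
def trace_shapes(size=96, encoder_filters=(32, 64, 128), bottleneck=256):
    """Return (stage, height, width, channels) tuples along the pipeline."""
    shapes = [("input", size, size, 1)]
    for i, filters in enumerate(encoder_filters, start=1):
        size //= 2  # assumed 2x2 max-pooling after each encoding stage
        shapes.append((f"encode{i}", size, size, filters))
    shapes.append(("intermediate", size, size, bottleneck))
    for i, filters in enumerate(reversed(encoder_filters), start=1):
        size *= 2  # assumed 2x up-sampling in each decoding stage
        shapes.append((f"decode{i}", size, size, filters))
    shapes.append(("output", size, size, 1))  # single conv-filter at the end
    return shapes


if __name__ == "__main__":
    for stage, h, w, c in trace_shapes():
        print(f"{stage:>12}: {h} x {w} x {c}")
```

Under these assumptions the intermediate layer operates on 12 × 12 feature maps, and encoder and decoder stages with equal spatial size are the natural places for skip connections.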
The data along the pipeline is scaled into different dimensions using max-pooling and up-sampling operations. To counter over-fitting, Gaussian noise layers are used to apply regularization through additive Gaussian noise in the hidden layers. The image reconstruction quality is improved by training the network with noisy data passed via skip connections between stages of similar scale. The nonlinearity between layers is created using a nonlinear activation (ReLU). In general, the autoencoder serves the purpose of image denoising. If $O(x, y)$ is assumed to be the target, then the target imaged by CGI using undersampled measurements is a corrupted version of the target with added noise, $g(O(x, y)) + n$, represented by $\tilde{O}(x, y)$. The inverse problem of recovering the original image from an undersampled image is solved by applying DL. Through training, the network learns an end-to-end mapping from $\tilde{O}(x, y)$ to $O(x, y)$. For the reconstructed target $\hat{O}(x, y)$, the network is trained on a set $S$ = {DGI undersampled, ground truth} to minimize a loss function between $\hat{O}(x, y)$ and $O(x, y)$.

The network is fed with an undersampled ghost image reconstructed from CGI data using the iterative DGI algorithm (Eq. (2)). For further time reduction and fast reconstruction, a compressive sensing algorithm can also be used to preprocess CGI data 17. The network parameters are updated using adaptive moment estimation (Adam) optimization 36 with standard backpropagation on mini-batches drawn from $S$. The learning rate for each layer is $10^{-4}$. The proposed network is trained on gray-scaled 96 × 96 STL-10 25 images. All images are preprocessed using a standard normalization procedure. The training set has 10,000 images, whereas the test and validation sets have 1,000 images each. The network is implemented with Keras (TensorFlow support) on an Intel i7 CPU with 32 GB memory.
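The optimizer update used in training can be illustrated on a toy problem. The sketch below applies the Adam update rule with a learning rate of 10⁻⁴ to a single scalar weight. It is a sketch under assumptions: the loss is taken to be a mean squared error, the remaining hyperparameters (β1 = 0.9, β2 = 0.999, ε = 10⁻⁸) are the defaults from the Adam paper rather than values reported here, and `adam_step` and `mse_gradient` are hypothetical names.

```python
import math


def adam_step(theta, grad, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter and return the new value."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # 1st moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad   # 2nd moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


def mse_gradient(w, inputs, targets):
    """Gradient of mean((w * x - t)^2) with respect to w (assumed MSE loss)."""
    return sum(2 * (w * x - t) * x for x, t in zip(inputs, targets)) / len(inputs)


if __name__ == "__main__":
    # Toy mapping: degraded pixel values are half of the clean values,
    # so the ideal gain is w = 2.
    degraded = [0.1, 0.2, 0.3]
    clean = [0.2, 0.4, 0.6]
    w, state = 1.0, {"t": 0, "m": 0.0, "v": 0.0}
    for _ in range(1000):
        w = adam_step(w, mse_gradient(w, degraded, clean), state)
    print(f"gain after 1000 steps: {w:.4f}")
```

Because Adam normalizes the gradient by its running magnitude, each early step moves the weight by roughly the learning rate, which is why a small rate such as 10⁻⁴ gives slow but steady progress.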

Conclusion
In this paper, we demonstrate a DL-based imaging framework to improve the performance of random-pattern-based CGI. DL can learn features from a large dataset and is more flexible than CS optimization techniques based on fixed priors and rigid calculations. The proposed method is capable of reconstructing a good-quality 96 × 96 target with 80% compression at 4–5 Hz frame rates. Optimizing random-pattern-based CGI for real-time application is very challenging because of its long reconstruction time. Even if the reconstruction time is reduced by means of undersampling, the reconstruction quality of undersampled CGI (through CS or DL) for diverse unseen targets is poor. The main objective of this paper is to reconstruct diverse unseen targets with accuracy. By importing prior knowledge from a large dataset and training a network on physical data, this objective is achieved. The core component of our imaging framework is the DCAN. The network uses an encoding-decoding architecture combined with skip connections to reconstruct a good-quality image from an undersampled input. Deep learning combined with GI is a good choice for avoiding complex methods that fail to reap the benefits of GI, i.e., reduced cost and simplicity. By further training our algorithm on a larger dataset (more classes), we can enhance its feature learning ability, which would increase reconstruction reliability and quality. Experimental results show that the proposed method achieves better performance than compressive sensing and existing deep learning methods used for computational ghost imaging.

Figure 12. End-to-end visualization of activation feature maps at different layers in the network. The SSIM plot for different standard test targets quantifies SSIM at different layers. The SSIM increases when the decoding layers start reconstructing.