Deep-learning-assisted Fourier transform imaging spectroscopy for hyperspectral fluorescence imaging

Hyperspectral fluorescence imaging is widely used when multiple fluorescent probes with close emission peaks are required. In particular, Fourier transform imaging spectroscopy (FTIS) provides unrivaled spectral resolution; however, its imaging throughput is very low due to the amount of interferogram sampling required. In this work, we apply deep learning to FTIS and show that the interferogram sampling can be reduced by an order of magnitude without noticeable degradation in image quality. For the demonstration, we use bovine pulmonary artery endothelial cells stained with three fluorescent dyes and 10 types of fluorescent beads with close emission peaks. Further, we show that the deep learning approach is more robust to translation stage error and environmental vibrations. Thereby, the He-Ne correction, which is typically required for FTIS, can be bypassed, reducing the cost, size, and complexity of the FTIS system. Finally, we construct neural network models using Hyperband, an automatic hyperparameter selection algorithm, and compare their performance with our manually optimized model.

Fluorescence imaging allows for direct observation of various organelles in a biological specimen with high resolution and contrast. It typically uses fluorescent dyes that bind to different key targets/organelles of cells, or fluorescent proteins (FPs) that are fused to protein targets in living cells1. A dichroic filter tuned for the characteristic excitation and emission bands of the fluorophore is typically required. For the observation of more than one fluorophore, a dichroic filter with multiple passbands or a set of filters mounted on a filter wheel is typically adopted. Many FPs commonly used in live-cell imaging have overlapping emission spectra, which limits the number of FPs that can be used simultaneously2. Hyperspectral imaging, which combines imaging and spectroscopy, allows for using a multitude of fluorophores with close emission peaks3,4. It also allows for accurate detection and quantification of target fluorescence signals in a tissue with a highly autofluorescent background5.
For hyperspectral imaging with a spectral resolution of 10 nm or below, an acousto-optic tunable filter (AOTF), a liquid crystal tunable filter (LCTF), or Fourier transform spectroscopy (FTS) is typically combined with an imaging device. An AOTF can scan the entire visible wavelength range in several seconds or randomly access any wavelength in the range6. An LCTF is slower but can provide superior image quality7. AOTF provides a narrower spectral bandwidth (1.5-4.1 nm) than LCTF (4.5-19 nm)7. FTS is the preferred method when a high spectral resolution is required, the signal is weak, or both. In contrast to an LCTF or AOTF, which passes only a narrow spectral band of interest, FTS uses the entire spectrum of the interrogated light for each sampled data point; thus, the sensitivity of FTS is very high8,9. This Fellgett advantage is especially useful when the target fluorescence signal is weak. For FTS, a Michelson interferometer or a Sagnac interferometer is typically used, which splits the interrogated light into two beams, and a series of intensity data is acquired for varying optical path differences (OPDs)10. The Fourier transform of the interferogram can be related to the spectral profile of the interrogated light. Fourier transform imaging spectroscopy (FTIS) combines FTS with an imaging device to measure the spectrum at each pixel of an image. FTIS has been used for various applications that require the simultaneous use of multiple fluorophores. For example, it has been used to classify seven fluorophores with overlapping emission spectra in immunofluorescence-stained tissue samples3. FTIS is the gold standard method for spectral karyotyping, which uses combinations of 4-5 fluorophores to label the 24 human chromosome types11.
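As a brief illustration of the FTS principle described above, the sketch below (a simplified example, not the authors' code) recovers the spectrum of a monochromatic 633 nm source from a uniformly sampled interferogram via the FFT; the 100 nm OPD step matches the acquisition settings described later.

```python
import numpy as np

def fts_spectrum(interferogram, opd_step):
    """Recover a spectral magnitude profile from a uniformly sampled
    interferogram via the FFT (the core of conventional FTS)."""
    centered = interferogram - np.mean(interferogram)  # drop the DC offset
    spectrum = np.abs(np.fft.rfft(centered))
    # Each FFT bin corresponds to a wavenumber (1/wavelength) in cycles/m.
    wavenumbers = np.fft.rfftfreq(len(interferogram), d=opd_step)
    return wavenumbers, spectrum

# A monochromatic 633 nm source yields a cosine interferogram in OPD;
# its recovered spectrum should peak at a wavenumber of 1/633e-9.
opd_step = 100e-9                       # 100 nm OPD increment per sample
opd = np.arange(1000) * opd_step
interferogram = 0.5 + 0.5 * np.cos(2 * np.pi * opd / 633e-9)
wavenumbers, spectrum = fts_spectrum(interferogram, opd_step)
peak_wavelength = 1.0 / wavenumbers[np.argmax(spectrum)]  # close to 633 nm
```

For real data, phase correction and apodization would also be applied; this sketch shows only the interferogram-to-spectrum relation.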
One key disadvantage of FTIS is its low throughput due to the large number of interferogram images that must be collected to reconstruct the spectrum. Typically, more than 1000 images are acquired, which can take tens of seconds even with a high-speed camera. Several efforts have been made to reduce the sampling number with methods such as compressed sensing12. Since the interferogram measurements of FTIS are taken in the Fourier space, the signal measurement procedure of FTIS satisfies the incoherence property required by compressed sensing13,14. Here, we show that deep learning can significantly reduce the required FTIS sampling number. As with FTS, FTIS records interferograms at uniform intervals of OPD; i.e., the distances between the moving mirror's positions where the interferograms are recorded are assumed to be the same. However, due to environmental vibrations and translation stage error, the actual mirror position where each interferogram is recorded differs from the target position. To correct for this error, a reference laser (typically a He-Ne laser) with a sharp peak at a known spectral position is inserted into the same beam path as the interrogated light, and its interferogram is used to find the true OPD (i.e., the actual mirror position) for each sampled data point. This so-called He-Ne correction is crucial for accurate reconstruction of the spectrum using FTS, but requires additional optical components. Here we show that deep-learning-assisted FTIS obviates the He-Ne correction; thereby, it can reduce the complexity and footprint of the FTIS system. Deep learning-based approaches have been shown to outperform traditional methods in a wide array of fields, including imaging15 and natural language processing16. For example, deep learning has been applied to improve the resolution of optical microscopy17, scanning electron microscopy18, and multispectral imaging19.
The enhanced imaging performance has been used to accurately identify components of images that are indicative of specific diagnoses15,20,21.

Materials and methods
Materials. To

Data acquisition. Before the experiment, the zero OPD position, where the interferogram had the maximum value, was found. To image BPAE cells, the power of each excitation LED was adjusted to produce approximately the same fluorescence intensity levels for all the fluorophores. For each field of view (FOV), 1000 interferogram images were recorded while moving the translation stage in steps of 50 nm; this step size corresponds to 100 nm in terms of the OPD. The images were recorded with a camera EM gain of 300 and an exposure time of 0.01 s. While the camera captured the images, the reference interferogram was simultaneously collected with a photodiode. Each measurement took 23 seconds per set of 1000 images. A total of 30 sets of images (i.e., FOVs) were collected. For the fluorescent beads, the power of the excitation LED and the exposure time of the camera were adjusted to prevent pixel saturation for the scarlet bead, which produced the strongest fluorescence intensity. The same setting was used for all the other bead types. Each FOV contained a minimum of three beads. For the training data, interferograms from 10 FOVs were collected for each fluorescent bead type. For the test data, interferograms from 20 FOVs with mixed fluorescent beads were collected. For each FOV, we selected the sample region using a binary mask, which was obtained by applying a threshold to the maximum projection of the raw interferogram images. Then, for each pixel in the sample region, we extracted the raw interferogram, and the maximum intensity was saved for later use. Each interferogram was detrended and normalized so that the peak was at the center, with a mean of 0.5 and a maximum of 1.
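The per-pixel preprocessing described above (detrending, centering the interferogram peak, and rescaling to a mean of 0.5 and a maximum of 1) can be sketched as follows; the function name and the synthetic test signal are illustrative, not taken from the authors' code.

```python
import numpy as np

def preprocess_interferogram(raw):
    """Detrend, center the burst peak, and rescale so that the mean is
    0.5 and the maximum is 1, as described in the text."""
    x = np.asarray(raw, dtype=float)
    n = len(x)
    # Linear detrend: subtract the least-squares line through the data.
    t = np.arange(n)
    slope, intercept = np.polyfit(t, x, 1)
    x = x - (slope * t + intercept)
    # Circularly shift so the interferogram peak sits at the center.
    x = np.roll(x, n // 2 - int(np.argmax(x)))
    # Affine rescale: mean -> 0.5 and maximum -> 1 simultaneously.
    scale = 0.5 / (x.max() - x.mean())
    return scale * (x - x.mean()) + 0.5

# Synthetic interferogram: a burst at sample 300 on a noisy baseline.
rng = np.random.default_rng(0)
burst = 5.0 * np.exp(-0.5 * ((np.arange(1000) - 300) / 10.0) ** 2)
processed = preprocess_interferogram(burst + rng.normal(size=1000))
```

After this step the peak sits at the center sample, and the mean/maximum constraints are met exactly by the affine rescale.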

Scientific Reports
For the BPAE cell imaging, we trained a neural network (NN) to predict the normalized channel intensity (NCI), the area under each fluorescence emission band divided by the total area of all three emission bands, which is shown in Fig. 1b. The ground-truth NCI values were computed from the interferograms using the conventional FTS method (including the He-Ne correction). For each channel (i.e., emission band), the 20,000 interferograms producing the highest NCI values were selected, resulting in 60,000 interferograms from each FOV. Out of 30 FOVs, the datasets for 28 FOVs were used for training; the datasets for the two remaining FOVs were set aside for validation and testing. The training data was randomly mixed so that each mini-batch contained data from multiple FOVs, multiple locations on each sample, and different fluorescent signal types. The training data was augmented by adding three types of error: the peak location was shifted from the center by an amount randomly sampled between −3 and 3 samples, the mean was shifted from 0.5 by an amount sampled from a normal distribution with a mean of zero and a standard deviation of 0.05, and noise sampled from a normal distribution with a mean of zero and a standard deviation of 0.05 was added to each point in each interferogram.
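The three augmentation steps above can be sketched as follows (the function name is our own; the shift range and noise parameters are those stated in the text):

```python
import numpy as np

def augment(interferogram, rng):
    """Apply the three augmentations described in the text: a random
    peak shift in [-3, 3] samples, a mean shift drawn from N(0, 0.05),
    and per-point Gaussian noise drawn from N(0, 0.05)."""
    x = np.roll(interferogram, int(rng.integers(-3, 4)))  # peak shift
    x = x + rng.normal(0.0, 0.05)                         # mean shift
    return x + rng.normal(0.0, 0.05, size=x.shape)        # additive noise

rng = np.random.default_rng(42)
clean = 0.5 + 0.5 * np.cos(np.linspace(0.0, 20.0 * np.pi, 1000))
noisy = augment(clean, rng)
```

Applying such perturbations during training exposes the NN to the stage jitter and noise it will encounter in uncorrected measurements.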
For the classification of the 10 types of fluorescent beads, 1000 randomly selected pixels from each training sample FOV were saved, resulting in 10,000 interferograms for each fluorescent bead, or 100,000 in total. The average spectrum of each fluorescent bead was computed using the He-Ne-corrected data and the non-uniform fast Fourier transform (NUFFT) over the range of 400-800 nm. Because the training dataset for each bead type was acquired separately, the ground-truth labels are known. To determine the ground-truth labels of the mixed test data, the MSE between the test-data pixel spectrum and each of the 10 averaged training-data spectra was calculated, and the dye type producing the lowest MSE was assigned to the pixel. This method produced over 99.99% accuracy when applied to the training dataset. Once all of the labels were computed, the He-Ne-corrected interferograms were discarded, and only the raw interferograms were used by the NN.
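The labeling rule above, assigning each pixel the bead type whose average training spectrum gives the lowest MSE, can be sketched with a toy example (the three Gaussian "spectra" are illustrative, not measured bead spectra):

```python
import numpy as np

def assign_labels(pixel_spectra, mean_spectra):
    """Assign each pixel the class whose average spectrum yields the
    lowest MSE.  pixel_spectra: (n_pixels, n_wl);
    mean_spectra: (n_classes, n_wl)."""
    err = ((pixel_spectra[:, None, :] - mean_spectra[None, :, :]) ** 2).mean(axis=2)
    return err.argmin(axis=1)

# Toy check with three synthetic Gaussian bead spectra over 400-800 nm.
wl = np.linspace(400.0, 800.0, 200)
refs = np.stack([np.exp(-0.5 * ((wl - c) / 20.0) ** 2) for c in (500, 550, 600)])
pixels = refs[[2, 0, 1]] + 0.01   # known ordering plus a small offset
labels = assign_labels(pixels, refs)  # recovers [2, 0, 1]
```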
Deep learning model. Figure 3 shows a schematic of the 1D convolutional neural network (1D CNN) used for the BPAE cell imaging. After each convolutional block, there is a max pooling layer, which was increased to 10. Although we tested several configurations, using several layers of small kernels allowed us to detect more complex features at a lower cost than using larger kernels, which is consistent with previous works, for example, Simonyan and Zisserman22. The commonly used ReLU activation function was applied at each of the hidden layers; ReLU activation is widely used for deep CNNs as it introduces nonlinearity while being more computationally efficient than other nonlinear activation functions such as tanh23,24. Regularization reduces validation and test loss at the expense of training loss, leading to better generalization. Here, we chose to use dropout and L2 regularization, often referred to as "weight decay", with a constant value in each layer. L2 regularization penalizes large weights without leading to the additional model sparsity that results from L1 regularization25. The Adam adaptive learning rate optimization algorithm was selected for weight optimization, as it is known to perform well with stochastic gradient descent26. The loss for the cost function was based on the mean absolute error (MAE) instead of the mean squared error (MSE), because MAE-based loss penalizes small prediction errors relatively more heavily than MSE-based loss, which disproportionately penalizes large prediction errors. With our BPAE cell dataset, the models trained with MAE-based loss tended to synthesize images closer to the ground truth.
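A minimal Keras sketch of such a 1D CNN is shown below. The exact filter counts, kernel sizes, and pool size are illustrative assumptions (the text reports only the ranges explored), and the sigmoid output for the three NCI values is likewise our choice:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_1d_cnn(n_samples, n_channels=3, l2=1e-4, dropout=0.1):
    """Illustrative 1D CNN: stacked small-kernel convolutions, max
    pooling, ReLU, dropout + L2 regularization, Adam, and MAE loss."""
    model = tf.keras.Sequential([
        layers.Input(shape=(n_samples, 1)),
        # Small kernels stacked in sequence detect complex local
        # fringe features at a lower cost than one large kernel.
        layers.Conv1D(32, 4, activation="relu",
                      kernel_regularizer=regularizers.l2(l2)),
        layers.Conv1D(32, 4, activation="relu",
                      kernel_regularizer=regularizers.l2(l2)),
        layers.MaxPooling1D(pool_size=10),
        layers.Dropout(dropout),
        layers.Flatten(),
        layers.Dense(16, activation="relu"),
        # One output per emission band, each an NCI value in [0, 1].
        layers.Dense(n_channels, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="mae")
    return model

model = build_1d_cnn(50)  # e.g., N = 50 sampled interferogram points
```

For the bead classifier described later, the output layer would instead have 10 nodes with Softmax activation and an MSE-based loss.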
Training of NN with 1D interferogram. The 1D CNN was trained using the standard training procedure shown in the lower middle area of Fig. 2. For the BPAE cell imaging, training data is fed to the NN, which outputs predictions. These predictions are compared with the ground-truth NCI values to compute the loss, and the weights of the 1D CNN are optimized to minimize the loss. Our training set contained 1,680,000 interferograms from 28 FOVs. At each epoch, i.e., each pass of the 1D CNN over the mini-batches covering the entire dataset, the MAE and loss for the training data were calculated and written to a separate file; the same was done for the validation set, which consisted of 30,000 interferograms (the 10,000 producing the highest NCI values for each channel) from a separate FOV. These values are evaluated by the early stopping algorithm. Early stopping is important as it functions as a source of regularization in a synergistic way alongside L2 regularization17. The early stopping algorithm observes the loss on the validation data and the epoch number. The training stops if either of the following conditions has been met: (1) the current epoch has reached the maximum number of desired epochs, or (2) the loss at the current epoch has not decreased below the minimum recorded loss within a set number of epochs, referred to as the patience. We used a maximum iteration number of 100 and a patience of 5. When an early stopping condition was met, the training stopped, and the model with the lowest validation MAE was saved. For the training and testing of the NN, we used a workstation with two GPUs (NVIDIA Quadro P6000). The MirroredStrategy function built into TensorFlow was used for synchronous training across the multiple GPUs on a single workstation. For the classification of the 10 fluorescent beads, the same procedure was used to train the neural network, with a few exceptions.
First, the main difference in the architectures was that the output layer had 10 output nodes with the Softmax activation function. Second, the classification accuracy was monitored instead of the MAE. Finally, when testing a few architectures, the validation accuracy was observed to be much higher when the cost function used MSE rather than the MAE used in the BPAE cell regression. For this reason, MSE was used for the cost function in the NN classifying the 10 fluorescent beads.
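The early-stopping setup described above maps directly onto Keras callbacks. The sketch below uses a tiny stand-in model and random data purely to exercise the mechanics (patience of 5, at most 100 epochs, best weights kept):

```python
import numpy as np
import tensorflow as tf

# Stop when the validation loss has not improved for 5 epochs
# (the patience) or after 100 epochs, keeping the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# Tiny stand-in model and random data; the real model and data are
# as described in the text.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(3, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mae")

rng = np.random.default_rng(0)
x, y = rng.normal(size=(64, 8)), rng.uniform(size=(64, 3))
history = model.fit(x, y, validation_split=0.25, epochs=100,
                    batch_size=16, callbacks=[early_stop], verbose=0)
```

For multi-GPU training, the model construction and compilation above would be wrapped in a `tf.distribute.MirroredStrategy().scope()` block, corresponding to the MirroredStrategy usage described in the text.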
Hyperparameter selection. The hyperparameters for our NN models were manually selected while monitoring the MAE or MSE values for the training data and validation data, as shown in Fig. 4. For each interferogram sampling case, the model capacity (the number of trainable weights and biases) was increased until overfitting was observed. More specifically, each initial model started with only one convolutional layer consisting of a small number of filters and a small kernel size, a max pooling layer with a large pool size, and the output layer. Each new model introduced a new layer, more parameters per layer, or a decrease in pool size in a nonuniform fashion, intending to make small changes and slightly increase the number of parameters. The number of filters per convolutional layer ranged from 32 to 128, the kernel size ranged from 4 to 8, the pool size ranged from 100 down to 4, and the dense layer sizes were varied as well.

Image synthesis. For the FTS-computed spectrum, the cell image was synthesized by converting each wavelength to an RGB value and then taking the sum of the RGB values weighted by the spectral intensity. The outputs from the NN, labeled "Predictions" in the flow chart in Fig. 2, represent the signal weight, between 0 and 1, of each fluorescent band region. These predictions were each multiplied by their respective average spectrum (obtained from the training data) and combined, resulting in a synthesized spectrum for each pixel. The maximum intensity of the raw interferogram saved at each pixel was used to adjust the scale of the synthesized spectrum. The adjusted spectrum for each pixel was converted to RGB values in the same way as described earlier, resulting in an RGB image.
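The per-pixel synthesis described above, weighting each average emission spectrum by the corresponding NN prediction and rescaling by the pixel's maximum raw intensity, can be sketched as follows. The Gaussian mean spectra and the particular prediction values are hypothetical stand-ins:

```python
import numpy as np

def synthesize_pixel_spectrum(predictions, mean_spectra, max_intensity):
    """Weight each fluorophore's average spectrum by the NN prediction
    (a value in [0, 1]), sum them, and rescale by the pixel's maximum
    raw interferogram intensity."""
    spectrum = (predictions[:, None] * mean_spectra).sum(axis=0)
    return spectrum * max_intensity

# Hypothetical average emission spectra over 400-800 nm, loosely
# modeled on DAPI-, Alexa Fluor 488-, and MitoTracker-like peaks.
wl = np.linspace(400.0, 800.0, 200)
mean_spectra = np.stack([np.exp(-0.5 * ((wl - c) / 25.0) ** 2)
                         for c in (461, 518, 599)])
pred = np.array([0.7, 0.2, 0.1])  # hypothetical NCI prediction for one pixel
spec = synthesize_pixel_spectrum(pred, mean_spectra, max_intensity=1200.0)
```

Converting `spec` to an RGB value (summing wavelength-to-RGB contributions weighted by spectral intensity, as in the text) then yields the synthesized image pixel by pixel.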

Results and discussion
Providing the fluorescence spectrum for each pixel, hyperspectral imaging allows us to use various fluorophores with close emission spectra and to distinguish target fluorescence signals from the autofluorescence background. With the superior spectral resolution of FTIS, we can increase the number of fluorophores that can be used simultaneously, thereby increasing the imaging throughput and obviating the need for sample washing. Here we show that our deep learning-based approach can increase the throughput by an order of magnitude while minimally sacrificing the accuracy of measurement. Figure 5 shows the images synthesized with FTS for different sampling numbers: (a) 1000 and (b) 50. The raw interferogram for each pixel was He-Ne corrected before being processed with the FTS algorithm. Figure 5a is the ground truth that we use for comparison. In Fig. 5b, reconstructed from 50 sampled points, the image has completely lost the ability to distinguish between the blue DAPI (nucleus) and green Alexa Fluor 488 phalloidin (F-actin) fluorescent dyes. Figure 6 shows the FTS-synthesized images without the He-Ne correction. For N = 1000 (Fig. 6a), the nucleus is shown in an incorrect color, and the F-actin and mitochondria are indistinguishable. For N = 50 (Fig. 6b), all three fluorescent dyes are indistinguishable. Comparing these images with those in Fig. 5, the importance of the He-Ne correction in conventional FTS can be clearly seen. Figure 7 shows the images synthesized with the NN described earlier. The interferograms used for the training, validation, and testing of the NN were not corrected with the He-Ne data. Even though the He-Ne correction was not applied, the NN-synthesized image shown in Fig. 7a, which corresponds to N = 1000, looks very similar to the ground truth, Fig. 5a. The image shown in Fig. 7b is also almost the same as the ground truth except for some speckles.
This is remarkable considering that it was synthesized from 20 times less data than the ground truth and without the He-Ne correction. Figure 8a shows the MSE for varying sampling numbers and the various synthesis methods: FTS without He-Ne correction, FTS with He-Ne correction, and the NN without He-Ne correction. FTS without He-Ne correction starts heavily degraded, with a high MSE (> 0.14) even for N = 1000. The MSE of the FTS reconstruction with He-Ne correction is very low for N = 1000; however, it quickly degrades to near 0.08 as the sampling number decreases. In contrast, the NN without He-Ne correction shows a small MSE at the full interferogram sampling of N = 1000, and the MSE stays below 0.04 as the sampling number decreases. For N = 50, the MSE of the NN without He-Ne correction is less than half of that of the corresponding FTS with He-Ne correction. While the images computed by FTS with and without He-Ne correction are heavily degraded with less sampling, the images synthesized by the NN remain intact at the sampling amount of N = 50. The limit of our approach appears to be at N = 50, which corresponds to reducing the sampling number by 95%.
The results presented earlier were obtained using a manually optimized NN. For comparison, we built NNs using Hyperband, an automatic hyperparameter selection algorithm. Figure 9 shows the hyperparameters of the top 3 Hyperband models, which have a few key differences. Model 1 does not use pooling, which may reduce the ability to detect translations in patterns25, but uses the highest dropout rate of 20%. Model 2 is the only model to use L2 regularization without dropout; to compensate for omitting dropout, the L2 regularization in Model 2 is two orders of magnitude higher than in the other models, which also feature dropout. Model 3 features 2 standard convolutional blocks, each with max pooling, which may provide the ability to detect pattern translation25. The manually selected model uses only one convolutional block consisting of 4 convolutional layers and a max pooling layer with the largest pool size of the 4 models. It also uses significantly fewer nodes in the fully connected layers. Using the models shown in Fig. 9, we applied k-fold cross-validation with 10 folds using only the training and validation sets. Table 1 shows the k-fold cross-validation results for the N = 50 case for the models found using Hyperband compared with the manually selected model. From Table 1, we see that Model 3 has the lowest validation error (MAE) of all the models. For this N = 50 case, the Hyperband model search space spanned between 1,083 and 1,015,075 trainable parameters. The model resulting from manual hyperparameter selection consisted of only 4,460 trainable parameters, an order of magnitude below all three models selected by Hyperband. This manually selected model resulted in the highest mean k-fold cross-validation error of the 4 models; however, its test error is greater than that of the best-performing model, Model 3, by only 8%.
In contrast, Models 1 and 2 produced about the same validation error as Model 3 but significantly higher test errors: 15% and 36%, respectively. The good generalization of the manually selected model is also reflected in the smallest standard deviation of the validation error and may be attributed to the small number of parameters (i.e., low capacity). Since the Hyperband models have an order of magnitude more parameters than the manually selected model, and more than double the standard deviation of validation error, they appear to be memorizing some of the training set. Therefore, we proceeded with the manually selected model. To further demonstrate the performance of deep-learning-assisted FTIS, we demonstrated the technique using 10 fluorescent microspheres with close emission peaks and overlapping spectra, shown in Fig. 10. For the N = 50 case, the best classification accuracy achieved was 85%. This low accuracy is attributed to the close emission spectra of some beads. We were able to achieve a very high classification accuracy of 97.8% for N = 100, which is 10 times less sampling than conventional FTS. The confusion matrix in Table 2 shows that the main source of error comes from yellow bead pixels being mistaken for orange bead pixels, and red-orange bead pixels being confused with carmine bead pixels. Looking at the average spectra of these beads shown in Fig. 10, the yellow and orange spectral peak locations are indeed very close relative to the other fluorescent dyes. The carmine and red-orange spectra are not only close but also have similar bandwidths, which could have contributed to the confusion. Regardless, the 1D CNN is able to classify 8 of the 10 fluorescent beads very accurately, and all 10 fluorescent beads acceptably, with only 1/10th of the data sampling amount.
Figure 11 shows a test sample containing each type of fluorescent bead and the classification of each pixel interferogram for the N = 100 case. We observe that the beads are classified very accurately, and the only errors appear near the edges of some beads.
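The 10-fold cross-validation used in the model comparison above can be sketched with a simple index-splitting helper (numpy-only, illustrative):

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train, validation) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

# Each fold holds out 1/k of the data for validation; the mean and
# standard deviation of the k validation errors are then compared
# across models, as in Table 1.
splits = list(kfold_indices(100, k=10))
```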
The significant reduction of interferogram sampling by 10 to 20 times is comparable to the performance demonstrated with compressed sensing, which required 1/16 of the data traditionally needed28 or 1/9 of random samples from the original dataset29. Reducing the sampling size allows FTIS-based approaches to be readily used for hyperspectral fluorescence imaging, allowing more fluorescent dyes to be used, including dyes with close emission spectra. Obviating the need for He-Ne correction, we can eliminate several optical elements and thereby reduce the cost, size, and complexity of the FTIS system. We observed that the MAE loss worked better for the reconstruction of BPAE cells, which required a regression-type NN prediction, whereas the MSE loss worked better for the classification of the 10 bead types. The superior performance of the MAE loss may be attributed to its lower sensitivity to outliers; however, a more extensive study would be needed to confirm this, which is left as future work. The NN prediction is applied to each pixel and is capable of distinguishing multiple fluorophores mixed at different concentrations in the pixel volume. Changing the labeling protocol would not affect the NN accuracy unless it significantly alters the emission spectrum of each fluorophore. The noise that can affect the accuracy of the NN most is the Poisson noise due to a low fluorescence signal. Notably, FTIS provides a higher signal-to-noise ratio than LCTF- or AOTF-based hyperspectral imaging due to the Fellgett advantage. Optical aberrations may affect the spatial registration of the fluorescence signal; however, they will not affect the NN prediction. It is well established that the He-Ne correction can compensate for stage error and environmental vibrations in the FTS reconstruction. We also confirmed this by comparing the MSE/MAE values of FTS with and without He-Ne correction (Fig. 8).
For all sampling numbers, the NN produced MSE/MAE values much lower than those of FTS without He-Ne correction. This confirms that the NN is more robust to stage error and environmental vibrations than FTS without He-Ne correction. A more systematic study of the relationship between the actual noise level and the accuracy of FTS and of the NN is left as future work.

Conclusion
In this paper, we have demonstrated hyperspectral fluorescence imaging by combining deep learning and FTIS. The images synthesized by the NN with a 10-20 times reduction in sampling accurately matched the ground truth. Using triple-labeled bovine pulmonary artery endothelial cells and 10 types of fluorescent beads with close emission peaks, we demonstrated the capabilities of our approach. While greatly reducing the required sampling, we also bypass the need for He-Ne correction, eliminating several optical elements and thereby reducing the cost, size, and complexity of the FTIS system. The developed system can be used, with much higher throughput, in a wide range of applications where several fluorescent dyes with close emission spectra must be used.

Figure 11. Synthesized image of 10 types of fluorescent beads using classification predictions by the convolutional neural network with N = 100 non-He-Ne-corrected interferograms at each pixel.