Generative adversarial network enables rapid and robust fluorescence lifetime image analysis in live cells

Fluorescence lifetime imaging microscopy (FLIM) is a powerful tool to quantify molecular compositions and study molecular states in complex cellular environment as the lifetime readings are not biased by fluorophore concentration or excitation power. However, the current methods to generate FLIM images are either computationally intensive or unreliable when the number of photons acquired at each pixel is low. Here we introduce a new deep learning-based method termed flimGANE (fluorescence lifetime imaging based on Generative Adversarial Network Estimation) that can rapidly generate accurate and high-quality FLIM images even in the photon-starved conditions. We demonstrated our model is up to 2,800 times faster than the gold standard time-domain maximum likelihood estimation (TD_MLE) and that flimGANE provides a more accurate analysis of low-photon-count histograms in barcode identification, cellular structure visualization, Förster resonance energy transfer characterization, and metabolic state analysis in live cells. With its advantages in speed and reliability, flimGANE is particularly useful in fundamental biological research and clinical applications, where high-speed analysis is critical.


Supplementary Methods
Generative adversarial network structure and training. In this work, we trained a deep neural network using a generative adversarial network (GAN) 1 . GAN consisted of two subnets, generator (G) and a discriminator (D), which were trained via an adversarial process. G mimicked the ground-truth fluorescence decay histogram while D returns an adversarial loss to the G-output fluorescence decay. The cost functions for the gradient descent update of generator (Gcost) and discriminator (Dcost) were designed as follows, respectively.

Goutput.
While GAN had shown great success, there exist several problems, including the difficulty to achieve Nash equilibrium 2 vanish gradient due to the use of loss functions in equation 3 (1,2). Wasserstein generative adversarial network (WGAN) 4 was used to improve the G performance. Wasserstein distance instead of Jensen-Shannon divergence was used to train GAN by evaluating the distance of the distributions between the real data and the generated data. The improved training process could offer strong enough gradients to train the generator than that of the original GAN. The cost functions of WGAN for the gradient descent update of G and D were as follows, Where ( ) is a 1-Lipshitz function which is defined in equation (5). To approximate the function for Wasserstein distance calculation, we trained D parameterized with weights lying in a compact space determined by clipping parameters. Higher value of D output refers to the ground-truth data; while lower value refers to the lowphoton-count histogram. This modification of loss function helps to stabilize the training schedule and ensure that training process can lead the deep learning model to converging.
Generative model (G). Residual neural network 5 and convolution neural network are artificial neural networks (ANNs) architecture that were first proposed for image recognition, easing the training of networks that are substantially deeper than those used previously. A similar network architecture has also been successfully applied in recent image reconstruction and image synthesis applications 6 . The structure of the generative network used in this work is demonstrated in Fig. 1 The dimension of the output of each ReLU activation function is reduced by AveragePooling layer. Then a multitask neural network with hard parameter sharing converts the high dimensional flattened output into three tasks, and each task corresponds to each lifetime parameter (for example, bi-exponential decay model). The last network, decoding layer, termed multilayer perceptron with the activation functions as tanh() to force the range of the output to lie within -1 and 1, maps the 3 tasks into 256 channels of output that corresponds to the fluorescence decay histogram. Instead of learning a direct mapping toward ground-truth fluorescence decay histogram, we reframe the process with the residual learning framework by introducing a residual connection between one of the inputs, normalized low-photon-count decay histogram, and the model's output.

Discriminative model (D).
The structure of the discriminative model begins with a densely connected neural network with 128 nodes. Then the output is fed into other densely-connected neural networks with 64, 8, and 1 node. All layers instead of the last one has a sigmoid activation function, which forces the output to lie within the range from 0 to 1, defined as follows, The last layer has the linear activation function to output the score corresponding to the input histogram fed into D.

Estimator model (E).
On top of the components in the GAN framework, here we also constructed an estimator that focused on predicting the lifetime parameters of interest from the high-photon-count fluorescence decay histogram.
In flimGANE analysis, the output of the generator, normalized ground-truth mimicking fluorescence decay histogram, would be fed into the estimator model, E, to infer the lifetime values. This holds a great potential to improving the predictive power of the lifetime for more than two species. The structure of the estimator model begins with two densely connected neural networks with 64, and 128 nodes for incoming IRF and high-photoncount fluorescence decay histogram, respectively. These two outputs are concatenated by a concatenation layer. The output of the concatenation layer is fed into the multi-task neural network (n = 3) with hard parameter sharing, and a multilayer perceptron with a single hidden layer, whose output is the corresponding fluorescence lifetime parameters. The loss function for E to be trained is defined as follows, Where ,̂ represent the predicted and the ground-truth lifetime parameters, respectively. All the estimative models for specific applications were pre-trained for 5000 iterations, which took ~6 minutes to train in estimative model training stage. The generative models integrated with estimative models were then trained for 500 iterations, which took ~42 minutes to train in flimGANE combination training stage. Training without the discriminative loss and predictive cost can result in over-smoothed images, as the generative model optimizes only a specific group of statistical metrics. Therefore, it is imperative to incorporate discriminator to train the generative model well.  The photon counts were acquired with fastFLIM unit to build up the phase histogram. Here the laser repetition period we used is 50 ns, and it is divided into 256 bins.

Correction of the temporal shift between instrument response function (IRF) and acquired
where ℎ( ) is the fluorescence decay histogram acquired from the experiment at the corresponding bin .
We hypothesized that for certain lifetime, the difference of CoM between IRF and the fluorescence decay histogram should be the same. To verify the above hypothesis, we generated 250 simulated decay histograms under 150, 500, 1500, 5000 photon counts for various fluorescence lifetime values and calculated the mean and the standard deviation of the difference of CoM. Supplementary Fig. 8 shows the calibration line for the distance of CoM between IRF and fluorescence decay histogram. Such a calibration line is employed to determine the amount of the temporal shift.
Implementation of center of mass evaluation (CoME) and trained flimGANE. To eliminate the "temporal shift" effect that may cause inaccurate lifetime estimation, this parameter was obtained by "center of mass" analysis with the control experiment and then employed to calibrate the measurement for each pixel. With IRF and the calibrated fluorescence decay histogram as model inputs, flimGANE outputs two lifetime components and its relative ratio for each pixel, forming the FLIM image accurately and rapidly.

Fluorescence lifetime fitting with time-domain least-squares estimation (TD_LSE).
Assume the fluorescence decay histogram contains either only one or two species: The expected measurement is the convolution of the fluorescence decay histogram and the IRF, where τ1 represents short lifetime, τ2 represents long lifetime, α1 is fraction amplitude of short lifetime species, and IRF is obtained from the experiments. The best fit parameters are determined by minimizing the sum of squared where y is the data to be fitted.

Fluorescence lifetime fitting with time-domain maximum likelihood estimation (TD_MLE). The observed
fluorescence decay histogram is given by equation (14), where represents the instrument response function, τ1, τ2 are the shorter and longer fluorescence lifetime, and α1 is the fraction amplitude of shorter lifetime. The goal of MLE is to find the parameters that maximize the likelihood function, which is equivalent to minimizing the negative likelihood function. For the case of fluorescence lifetime analysis, we aim to minimize the negative Poisson log-likelihood function (-L), which is defined as follows, where, ni represents the number of the detected photons in i th bin and mi represents the expected measurement in i th bin.

Fluorescence lifetime fitting with digital frequency-domain least-squares estimation (DFD_LSE).
Due to the finite response of a system, the measured decay signal f'(t) is the convolution of the fluorescence decay histogram and the IRF plus the noise, n(t), Thus, accurate FLIM data analysis typically requires the calibration of the IRF. In time-domain FLIM, the IRF can be measured by recording the scattered excitation light when using one-photon excitation, or the second-harmonic generation (SHG) signals for two-photon excitation by using a sample that yields strong SHG, such as urea crystal.
In DFD-FLIM, the IRF is typically calibrated with a known fluorescence lifetime standard.
In TCSPC FLIM, a decay histogram f'(t) is recorded at each pixel location. To present the data in frequency-domain, we take the Fourier Transform of the fluorescence decay histogram to get real and imaginary components.
Notice that in frequency-domain, the convolution with the IRF is a simple multiplication of the IRF vector, and the noise is a simple addition of the noise vector. During the calibration procedure, we simply subtract the noise vector and divide the denoised vector by the IRF vector to get the emission fluorescent vector. If the emission fluorescent vector is from the standard calibration single component dye, the phasor will be on the semi-circle after the calibration. The later measurement of the unknown samples will be always noisy and IRF-free for finding the unknown fluorescent lifetime(s). The Fourier Transform of the fluorescence decay histogram can be separated into real number, g(ω), and imaginary number, s(ω).
Therefore, after calibration, each fluorescent decay histogram (without IRF and noise) of each pixel can be plotted as a single point, termed phasor, in the phasor plot by applying the sine and cosine transforms to the measured decay data, where the modulation frequency ω is the laser repetition angular frequency and is calculated by multiplying the laser repetition rate with 2π.
By solving the integral, we can then derive the following relationships between the phasor and the lifetime, where N is the number of the fluorescent species, and fi is the fractional contribution of the i th species with the fluorescence lifetime .

Reduced chi-square (χ 2 ) for time-domain and digital frequency-domain data.
In time-domain fitting process, the goodness-of-fit is evaluated by the reduced χ 2 , which is defined as where n is the number of acquisition bin, p is the number of the fitting parameters, N(tk) is the number of photons collected at bin k, and Nc(tk) is the number of photons at bin k calculated from the fitting parameters and the decay model.
The reduced χ 2 for digital frequency-domain fitting is defined as where n is the number of modulation frequencies, p is the number of the fitting parameters, φk and mk are the measured phase shift and modulation ratio at frequency k, φck and mck are the calculated phase shift and modulation ratio at frequency k using the fitting parameters and the selected model, σφk and σmk are the uncertainties in the phase shift and modulation ratio values at frequency k, respectively.
Pre-processing of FLIM data. Calibrating the phasor plot is the only procedure for digital frequency-domain fitting.
The DFD-FLIM data measurements at each pixel location are composed of both the phase delay (φ) and the amplitude modulation ratio (m). The DFD-FLIM data at each pixel can be mapped to a single point called "phasor" in the phasor plot through a transform defined below, where ω is the modulation frequency, g(ω) and s(ω) represent the values at the two coordinates (g(ω), s(ω)) of the phasor plot.
In order to establish the correct scale for the phasor analysis, the coordinates of the phasor plot need to be calibrated using a standard sample of known lifetime. This will include the calibration of the IRF and the background noise.
In DFD-FLIM, this is done during experimental calibration prior to the data acquisition. There is no need to measure the IRF explicitly. The calibration procedure subtracts the noise and divides denoised data by the IRF to reveal the true fluorescent emission component(s). In time-domain FLIM, a directly recorded IRF data for the zero lifetime (scatter) will be measured. The IRF will then be used for lifetime fitting, with a convolution involved, and model analysis.
Post-processing of FLIM (intensity) image. Either a median or Gaussian filter can be applied before analyzing the data in phasor plot. The degree can be chosen from 0 to 10. When degree is 0, no filter is used. When the degree is 1, it runs through all the pixels once with a 3x3 filter. When degree is 2, it runs through all the pixels twice with a 3x3 filter.
For lifetime fitting, we can bin the data of each pixel to reduce the uncertainty. If the bin number is 0, it is the single pixel histogram or phase and modulation. If the bin number is 1, it borrows left, right, up and down 1 pixel, so total is 3x3 pixels as the selected pixel is at the center. In this work, the bin number is set to be 0.
Calibration of a FLIM system. Before imaging biological samples, it is necessary to calibrate the FLIM system using a fluorescence lifetime standard. The fluorescence lifetime for many fluorophores has been established under standard conditions, and any of these probes can be used for the calibration of the FLIM system. Since the fluorescence lifetime of a fluorophore is sensitive to its environment, it is critical to prepare the standards according to the conditions specified in the literature, including the solvent and the pH. It is also important to choose a standard fluorophore with excitation, emission, and fluorescence lifetime properties that are similar to those of the fluorophore used in the biological samples.

Quality index (QI) quantification for simulated decay histogram.
To calculate the QI for the fluorescence decay histogram, we assume the amounts of signals is the total photon counts, and the amount of noise is the deviation of fluorescence decay histogram from the ground-truth decay histogram. The relationship can be described as the following equations: where S represents the total photons of the fluorescence decay histogram, yi is the fluorescence decay histogram on i th bin, and ri represents the ground-truth fluorescence decay histogram on i th bin.
Structural similarity index (SSIM) 8 . SSIM is a method used for measuring the similarity between two images.
First, two images were smoothed with Gaussian filter (σ = 1.5 pixels). Then SSIM index was obtained for each window of a pair of sub-images. It was calculated for squared windows, centered at the same pixel (dx, dy) of two images. The length of a side of the square window was eleven pixels. The SSIM (x, y) was then obtained for two windows in two images (x, and y) as follows, SSIM( , ) = where, μx, μy are the average of pixel intensity of window x and y, σx, σy, σxy are standard deviation of window x and y, and covariance of two windows, c1, and c2 are (0.01 x L) 2 and (0.03 x L) 2 , respectively, and L equals to the data range of the lifetime image. Then SSIM(x,y) will be averaged over all the area. The average of SSIM, SSIM ̅̅̅̅̅̅̅ , will be outputted, serving as "SSIM" for each pair of images in this work. 9 . PSNR is a method used for measuring the quality of the image comparing to the reference image. Higher PSNR (higher score) represents better image quality.

Mean-squared errors (MSE) and peak signal-to-noise ratio (PSNR)
Given the reference n × m FLIM image I and the reconstructed FLIM image K, where, MAXI is the maximum possible lifetime value of the reference image I. 10  where, ( ⃗ , ; ⃗⃗ , | , ) and ( ⃗ , ; ⃗ , | , ) represent the information that can ideally be extracted from a particular channel, , in the reference and test images respectively. Apparent and average lifetime for mixture of fluorophores. We referred to technique note from ISS (http://www.iss.com/resources/pdf/technotes/FLIM_Using_Phasor_Plots.pdf) to define the apparent lifetime and average lifetime 11,12 . Given the multi-exponential decay model defined as follows,

Visual information fidelity (VIF)
where, τi represents the lifetime, αi is the amplitude of the components at t = 0, also called pre-exponential factor, and n is the number of the components. Fractional contribution, fi, of each decay histogram to the steady-state intensity is determined by  Table  2 and Supplementary Fig. 21a). 500 simulated decays were generated for each , with  ranging from 0.01-0.30 ns (i.e.,  = 2.00, 2.01, … 2.30 ns). The mean lifetime was extracted from 500 lifetime estimates (by Gaussian fitting of the histogram) and each  was repeatedly tested for 20 times. Using the p-value < 0.05 in two-sided KS test as a criterion, if the p-value was less than 0.05 for 70% of the repeated tests, we determined that there was a significant difference between two lifetime distributions. For instance, 2.00 ns decay and 2.03 ns decays were indistinguishable as their p-value was 0.33 (Supplementary Fig. 21b). In contrast, 2.00 ns decay and 2.15 ns decay were distinguishable with a p-value of 1.95×10 -16 (Supplementary Fig. 21c). We noticed in all 5 photon-count conditions, flimGANE and TD_MLE had the same discriminability (Supplementary Fig. 21d). As expected, higher photon counts per pixel improved the discriminability.
flimGANE outperforms TD_MLE at ultra-low-photon-count conditions. To ensure a stable training process, in this work, Wasserstein GAN algorithm (WGAN 4 ) was implemented, which changes the value function to be based on Wasserstein distance W(pr, pg) (equation (39)): where ∏( , ) represents the set of all joint distributions ( , ) whose marginals are and , respectively.
In principle, W represents the lowest work to convert one distribution into another. WGAN can resolve the issue of vanishing gradient, thus eliminating the factor that could destabilize training 4 . Indeed, in our simulations, we have successfully trained the model to learn to reconstruct the ground-truth "mimicking" fluorescence decay histograms ( Supplementary Fig. 16), which could not be achieved by the classical GAN.
Although both GAN and MLE can be regarded as an optimization problem, they are fundamentally different.
MLE requires a likelihood function while GAN does not. Maximum likelihood approach is equivalent to minimizing the KL divergence (equations (40-43)). Assuming represents the unknown parameters to be estimated (e.g., 1 , 1 , 2 in fluorescence lifetime analysis) and ̂ is a set of values that maximizes the likelihood, determined by the estimated decay histogram pθ(y) and the observed data, and match pθ(y) with the ground-truth decay histogram pθ*(y): Instead of maximizing a positive function, we can minimize the corresponding negative function, where ̂ can be formulated as follows: The KL divergence between pθ*(y) and pθ(y) is: The first term in equation (43) Considering Σ1=Σ2=wIk and μ1≠μ2, the trace term in Wasserstein is 0 and the trace term in the KL divergence will be 0 when combined with the −k term and the log-determinant ratio, so these two quantities become: The Wasserstein distance does not change when the variance changes since the Wasserstein distance is a distance function in the joint support spaces of the two probability measures. In contrast, the KL divergence does change with the variance because the KL divergence is a divergence and this divergence changes based on the spread of the distributions. While these two measures (W and KL) have advantages and disadvantages in different applications 17,18,19 , when handling the low-photon-count data, our simulations showed more reliable results from Wasserstein distance calculation-based training process, which led to the better performance of flimGANE in analyzing low-photon-count data.
Here we prepared simulated decay histograms (ranged from 1 to 5 ns) with different photon counts (50, 100, 500 and 1500 photons per pixel) and their corresponding "artificial" ground-truth decay histograms (Goutput) were generated from Generator. The Goutput lifetimes were determined by fitting with TD_MLE (termed TD_MLEG_output).
The accuracy of lifetime estimation with TD_MLE, TD_MLEG_output and flimGANE was evaluated in Supplementary Fig. 17. TD_MLEG_output produces results better than those of TD_MLE under ultra-low-photoncount condition, validating that the Generator in flimGANE can reliably reconstruct the "artificial" ground-truth decay histograms. As expected, all the lifetime determination methods generate more accurate results with more photon counts in the decay histogram.
Supplementary Fig. 5 The performance of different analysis methods was evaluated by MSE and SSIM. The reconstructed "UT BME" FLIM images obtained by different analysis methods were evaluated by mean squared error (MSE, threshold = 0.21) and structural similarity index (SSIM, threshold = 0.98). a-b The performance of different analysis methods was evaluated under eight conditions, 50, 60, 80, 100, 150, 500, 1000, 1500 photon counts. Under ultra-low-photon-count condition (50-100 counts), the MSE and SSIM between the reconstructed FLIM images from flimGANE and the ground-truth FLIM images were less than 0.21 and larger than 0.98. TD_MLE provided MSE less than 0.21 and SSIM larger than 0.98 under low-photon-count condition. TD_LSE and DFD_LSE could provide MSE less than 0.2 as the photon counts were greater than 500. Supplementary Fig. 6 The performance of different analysis methods was evaluated by quality index (QI). flimGANE outperforms other analysis methods in fluorescence lifetime estimation from low-photon-count (noisy) decay histogram. The fluorescence decay histograms were simulated with 2 ns as pre-defined lifetime and various photon-count conditions (100 ~ 10k counts), generating the simulated decay histogram with quality index (QI) ranging from 0.5 to 13.0. The tolerance of the lifetime estimate is 0.25 ns (red box). We can observe that flimGANE enables accurate lifetime estimation under poor QI condition as 1, which is much smaller than TD_MLE (1. The analysis results in Supplementary Fig. 7 were plotted with distinct photon-count conditions (error bars, standard deviation, n = 250). It was demonstrated that the calibration is independent of the photon counts. The mean of the difference of CoM between IRF and decay histograms was the same at varying conditions. Moreover, it is expected to observe a higher variance at low-photon-count condition.