Coupling a recurrent neural network to SPAD TCSPC systems for real-time fluorescence lifetime imaging

Fluorescence lifetime imaging (FLI) has been receiving increased attention in recent years as a powerful diagnostic technique in biological and medical research. However, existing FLI systems often suffer from a tradeoff between processing speed, accuracy, and robustness. Inspired by the concept of Edge Artificial Intelligence (Edge AI), we propose a robust approach that enables fast FLI with no degradation of accuracy. This approach couples a recurrent neural network (RNN), which is trained to estimate the fluorescence lifetime directly from raw timestamps without building histograms, to SPAD TCSPC systems, thereby drastically reducing transfer data volumes and hardware resource utilization, and enabling real-time FLI acquisition. We train two variants of the RNN on a synthetic dataset and compare the results to those obtained using center-of-mass method (CMM) and least squares fitting (LS fitting). Results demonstrate that two RNN variants, gated recurrent unit (GRU) and long short-term memory (LSTM), are comparable to CMM and LS fitting in terms of accuracy, while outperforming them in the presence of background noise by a large margin. To explore the ultimate limits of the approach, we derive the Cramer-Rao lower bound of the measurement, showing that RNN yields lifetime estimations with near-optimal precision. To demonstrate real-time operation, we build a FLI microscope based on an existing SPAD TCSPC system comprising a 32\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times $$\end{document}×32 SPAD sensor named Piccolo. Four quantized GRU cores, capable of processing up to 4 million photons per second, are deployed on the Xilinx Kintex-7 FPGA that controls the Piccolo. Powered by the GRU, the FLI setup can retrieve real-time fluorescence lifetime images at up to 10 frames per second. The proposed FLI system is promising and ideally suited for biomedical applications, including biological imaging, biomedical diagnostics, and fluorescence-assisted surgery, etc.


Introduction
Fluorescence lifetime imaging (FLI) is an imaging technique for the characterization of molecules based on time they decay from an excited state to the ground state [1].Compared with fluorescence intensity imaging, FLI is insensitive to excitation intensity fluctuations, variable probe concentration, and limited photobleaching.Besides, through the appropriate use of targeted fluorophores, FLI is able to quantitatively measure the parameters of the microenvironment around fluorescent molecules, such as pH, viscosity, and ion concentrations [2,3].With these advantages, FLI has wide applications in the biological sciences, for example to monitor protein-protein interactions [4], and plays an increasing role in medical and clinical settings such as visualization of tumor margins [5], cancerous tissue detection [1,6], and computer-assisted robotic surgery [7,8].
Time-correlated single-photon counting (TCSPC) is popular among FLI systems due to its superiority over other techniques in terms of time resolution, dynamic range, and robustness.In TCSPC, one records the arrival time of individual photons when emitted by molecules upon photoexcitation [9,10,11].After repeated measurements, one can construct a histogram of photon arrivals, which closely matches the true response of molecules, thus enabling the extraction of FLI, as shown Figure 1.The instrumentation of a typical TCPSC FLI system features a confocal setup, including a single-photon detector, a dedicated TCSPC module for time tagging, and a PC for lifetime estimation [12,9].Such systems are mostly unsuitable for rising clinical applications such as non-invasive monitoring, where a miniaturized and fast TCSPC system is desired [13].Besides, the large amount of data generated by TCSPC lays a great burden on data transfer, data storage, and data processing.A powerful PC, sometimes equipped with dedicated GPUs, is required to acquire and process TCSPC data.TCSPC requires photodetectors with picosecond time resolution and single-photon detection capability.In the last decade, single-photon avalanche diodes (SPADs) have been used successfully in TCSPC systems and, with the advent of CMOS SPADs, the expansion of these detectors into high-resolution image sensors for widefield imaging has been accomplished successfully [14].Several reviews of the use of SPADs in biophotonics have recently appeared [15,16,17].
Least-square (LS) fitting and maximum likelihood estimation (MLE) are widely used for fluorescence lifetime estimation [18,19,20].These two methods rely on iterations to achieve high accuracy, but they are time-consuming since computationally expensive.Various non-fitting methods have been proposed to tackle these problems but often compromise on other specifications, among which the Center-of-Mass method (CMM) is a typical one.CMM is a simple, fast, and photon-efficient alternative, which has been already applied in some real-time FLI systems [21,22,23].However, it is very sensitive to background noise, and the estimation is biased due to the use of truncated histograms [24].
Neural networks provide a new path to fast fluorescence lifetime extraction [25].The first neural network-based model for fluorescence lifetime estimation was proposed in 2016, where higher accuracy and faster processing than LS fitting were reported [26].Since then, several neural network architectures, including fully connected neural network (FCNN), convolutional neural network (CNN), and generative adversarial network (GAN) solutions have been explored for this end [27,28,29,30,31].These techniques showed the ability to resolve multi-exponential decays and achieve accurate and fast estimation even in low photon-count scenarios.Apart from fluorescence lifetime determination, these neural networks can extract high-level features and can be integrated into a large-scale neural network for end-to-end lifetime image analysis such as cancerous tissue margin detection [32] and microglia detection [33].
In this work, we propose to adopt the paradigm of edge artificial intelligence (Edge AI), constructing a recurrent neural network (RNN) -coupled SPAD TCSPC system for real-time FLI.We train and test variants of RNNs for lifetime estimation and deploy them on FPGA to realize event-driven and near-sensor processing.The working principle is illustrated in Figure 1.Upon the arrival of photons, the timestamp is processed by the RNN directly without histogramming.From photon detection to lifetime estimation, the whole system is integrated into a miniaturized device, which achieves reduced data transfer rates.With the flexibility to retrain neural networks, the same system can be easily reused for other very different applications, such as classification.

Results
The proposed system comprises a SPAD image sensor with timestamping capability coupled to an FPGA for implementation of neural networks in situ.In this section we describe the utilized RNN, its training, and the achieved results.We also describe the upperbounds on accuracy that were derived to contextualize the results obtained with the RNNs.

RNNs trained on Synthetic Datasets and Performance
We train and evaluate RNNs on synthetic datasets.Three RNN variants, namely simple RNN, gated recurrent unit (GRU) [34], and long short-term memory (LSTM) [35], are adopted.These RNNs are constructed with 8, 16, and 32 hidden units, respectively.LS fitting and CMM are also benchmarked.The metrics used for evaluation are the root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE).A common range of lifetime is covered here, which is from 0.2 to 5 ns, and we assume a laser repetition frequency of 20 MHz.
We start with a simple situation in which background noise is absent.All the tests are run on a PC with 32-bit floating point (32FP) precision.The results are presented in Table 1.One can observe that CMM achieves the lowest error in MAE and MAPE, GRU-32 achieves the lowest RMSE error, and GRU-32 and LSTM-32 have very similar performance to CMM.The performance of CMM itself is understandable.In this case, background noise is not considered, and the repetition period is 10 times the longest lifetime.Under these conditions, CMM is very close to the maximum likelihood estimator of the lifetime.When comparing Simple RNN, GRU, and LSTM, one can observe that GRU In a traditional TCSPC FLI system, the sample is excited by a laser repeatedly, and the emission photons are detected and time-tagged.A histogram is gradually built on these timestamps, from which the lifetime can be extracted after the acquisition is completed.In our proposed system, upon the receiving of a photon, the timestamp is fed into the RNN immediately.The RNN updates the hidden state accordingly and idles for the next photon.The schematic and formula of simple RNN are shown here.At timestep n, the RNN takes the current information x n and the past information h n−1 as input, then updates the memory to the current information h n and gives out a prediction y n .1: RNN models are trained and tested on a synthetic dataset, where the fluorescence decay model is monoexponential, lifetime ranges from 0.2 and 5 ns, laser repetition frequency is 20 MHz, and background noise is not considered.Their performance is benchmarked against Least-square (LS) fitting and Center-of-Mass method (CMM).RMSE: root mean squared error, MAE: mean absolute error, MAPE: mean absolute percentage error.outperforms LSTM by a small margin, and both of them perform much better than Simple RNN.As we can see with the decrease in model size, errors increase accordingly.

Model
Background noise is often inevitable during fluorescence lifetime imaging, especially in diagnostic and clinical setups where the interruption to existing workflows is supposed to be minimized [13].In our FLI system, it is estimated that at least 1% of the collected timestamps are from background noise.Therefore, we study the performance of each method under varying background noise levels.For simplicity, only LSTM-32 is used to compare with benchmarks.LSTM-32 is trained on a synthetic dataset, where 0 to 10% uniform background noise is added to the samples randomly.Here we also illustrate the result of CMM with background noise subtraction, assuming that the number of photons from background noise is known, though it is often not the case in real-time FLI systems.Two synthetic datasets are built for evaluation, where the background noise ratios are 1% (SNR=20dB) and 5% (SNR=12.8dB),respectively.The results are presented in Table 2.We can see that LSTM-32 outperforms other methods in all metrics and scenarios.Combined with Table 1, one can observe that errors increase when the background noise increases for all the methods.However, LSTM and LS fitting are more robust to background noise, while CMM is extremely sensitive to it.This finding is in agreement with previous studies [36,1].

Cramer-Rao Lower Bound Analysis
To compare the performance with the theoretical optima, the Cramer-Rao lower bound (CRLB) is calculated of the accuracy of the lifetime estimate with an open source software [36], given the setting parameters.The variance of the lifetime estimation methods is calculated from Monte Carlo experiments.As for CMM and RNN, 3000 samples are used; as for the least-squares method, 1000 samples are used to reduce running time.
The relationship between lifetime and the relative standard deviation of the different estimators is shown in Figure 2a, where the photon count is 1024.One can observe that the variance of CMM and LSTM-32 almost reaches the CRLB, which suggests that CMM and LSTM-32 are near-optimal estimators.Considering that the laser repetition period is much longer than the lifetime and that background noise is not included, it is understandable that CMM reaches the CRLB, since it is approximately a maximum likelihood estimator.LS fitting performs worse than CMM and LSTM-32, which is likely due to the underlying assumption of Gaussian errors.
The relationship between the number of photons and the relative standard deviation of the different estimators is shown in Figure 2b, where the lifetime is set at 2.5 ns.Similar to Figure 2a, the relative standard deviations of CMM and LSTM-32 almost reach the CRLB, while the least square fitting performs worse.This result suggests that CMM and LSTM-32 are efficient estimators over different photon inputs, achieving excellent photon efficiency.They only need less than half of the data to obtain similar results as LS fitting.
We also analyze the CRLB with background noise.The results are shown in Figure 2. Comparing Figure 2c and Figure 2e with Figure 2a, we can see the CRLB is lifted a bit in the presence of background noise.The relative standard deviation of LS fitting stays almost unchanged, and that of LSTM-32 increases slightly but is still much better than LS fitting.As for CMM, one can see that the relative standard deviation increases dramatically at shorter lifetimes, which suggests that CMM is very sensitive to background noise for short lifetimes.By comparing Figure 2d and Figure 2f with Figure 2b, we find that the relative standard deviation does not vary with 1% background noise.With 5% background noise, however, CMM shows a clear degradation of performance, its relative standard deviation getting close to the one of LS fitting.

Performance on Experimental Dataset
To verify the performance of RNNs, which are purely trained on synthetic datasets, on real-world data, the RNNs are tested on experimental data along with CMM and LS fitting as benchmarks.We prepare a fluorescence lifetimeencoded microbeads sample and acquire the TCSPC data with a commercial confocal FLIM setup.It is estimated that the background noise is below 1%.The LSTM-32 trained with 0% to 10% background noise dataset is used.The corresponding results are shown in Figure 3.
The histograms of the three samples share a similar shape.As for LS fitting, an instrument response function (IRF) is estimated from histograms of all pixels and then shared among them, which accounts for its good performance   here.The result of CMM has a 2 ns bias, which is corrected by the estimated IRF.It is worth noting that for the first peak in the histogram, LSTM shows a sharper Gaussian shape, which confirms LSTM's good performance under low fluorescence intensity and short lifetime.

Real-time FLIM Setup with on-FPGA RNN
We further built a real-time FLIM system by utilizing a SPAD array sensor with on-chip time-to-digital converters (TDCs) and deploying the aforementioned RNNs on FPGA for near-sensor processing.The schematic of our setup is shown in Figure 4.The 32×32 Piccolo SPAD sensor developed at EPFL [37,38] is utilized, on which 128 TDCs offer 50 ps temporal resolution.The sensor is controlled by a Kintex-7 FPGA, where four GRU cores are implemented for lifetime estimation.The four GRU cores are able to process up to 4 million photons per second.While the data transfer rate to the PC is 20Mb/s for histogram mode and 80Mb/s for raw mode, it reduces to only 240kb/s when applying the proposed RNN-based lifetime estimation method.
We prepare a sample containing fluorescent beads with a lifetime of 5.5 ns.The sample is measured by our system in real-time at 5 frames per second.During the imaging, we move the sample plate forward to observe the movement of beads in the images.The result is shown in Figure 5.The lifetime images are also displayed in rainbow scale.The average photon count for the beads is around 500 per pixel.This illustrates that our system can capture the movement of beads and provide an accurate lifetime estimation.One can also observe that there are some outliers, e.g.dark blue dots and red dots among the green beads.Apart from statistical fluctuations, RNN-based lifetimes tend to be lower when there are not enough photons, which explains why the blue dots are mostly darker than the surrounding pixels.

Discussion
The proposed on-FPGA RNN removes the need for histogramming altogether by taking raw timestamps as input directly, which released hardware resources on FPGA or PC and significantly reduced the burden on data transfer and data processing.The analysis of synthetic data and CRLB shows that RNN, as a data-driven method, reaches excellent accuracy and robustness compared to its competitors, while retaining higher photon efficiency.
The performance of the system can be further improved by using a larger SPAD sensor and by accommodating more RNN cores on the FPGA.A more powerful FPGA or even dedicated neural network accelerators can be used to accommodate more RNN cores.More efficient quantization and approximate schemes can also be explored to reduce resource utilization and latency.In addition, these GRU cores can be further optimized by VHDL implementation.In  the future, the RNN cores could be implemented on ASIC and stacked together with SPAD arrays by means of 3-D stacking technology, realizing in-sensor processing [39].
Though the proposed system is only used for FLI to this date, it can be easily adapted for other applications by retraining the RNN.It can be further combined with other large-scale neural networks for high-level applications, where the output of the RNN is composed of high-level features learned by neural networks automatically, and it serves as input for other neural networks.The existing FLI-based high-level applications such as margin assessment [5,32] could also be directly incorporated into our system.

Dataset Synthetic Dataset
A simulation that well captures the features of the real scene is the key to constructing synthetic datasets.To accurately model a real FLI system, we take fluorescence decay, instrument response, background noise, and dark counts into account.The latter two are often neglected in previous studies.However, in several scenes such as fluorescence-assisted surgery exist strong background noise which can not be simply ignored.Different from existing NN-based methods, which take histograms as input, we generate synthetic datasets on the timestamp level.Assuming that at most one photon reaches the detector in every repetition period (i.e. the pile-up effect is not considered), the timestamps t, namely the arrival time of photons, are modeled as: where 1 is the indicator function, k is the component indicator, t f luo is the fluorescence time delay, t irf is the instrument response time delay, and t bg is the arrival time of background noise or dark counts.
The component indicator k is a random variable with categorical distribution, representing the source of the incoming photon, which can be either a component of the fluorescence decay or background noise.The probability density function (PDF) of k is where p i represents the normalized intensity of fluorescence or background noise.
The fluorescence time delay t f luoi is subject to an exponential distribution.Its PDF is: where τ i is the lifetime of the fluorescence decay.
The instrument response time delay t irf is subject to a Gaussian distribution.Its PDF is: where t 0 is the peak position, and σ can be represented by full width at half maximum(FWHM): The time of arrival of background noise t bg is subject to a uniform distribution.Its PDF is: where T is the repetition period.
Given a set of the above parameters, synthetic datasets can be generated with different lifetime ranges, background noise ratios, and components of fluorescence.In this work, the FWHM is assumed to be 167.3ps, in accordance with previous studies [29,30]; t 0 for each sample is generated from a uniform distribution from 0 to 5 ns.To train models in the presence of background noise, p N for each sample is generated from a uniform distribution from 0 to 10%.Each dataset contains 500,000 samples, and each sample contains 1024 timestamps.

Cramer-Rao Lower Bound
Cramer-Rao lower bound (CRLB) gives the best precision that can be achieved in the estimation of fluorescence lifetime [43,44,36].Mathematically, CRLB expresses a lower bound of variance of estimators and it is proportional to the inverse of the Fisher information: where f (x; θ)) is the PDF and J is the Fisher Information, which is defined as: The CRLB is calculated with open-source software [36].

FPGA Implementation
Quantization is an effective way to reduce resource utilization and latency on hardware.In common deep learning frameworks, such as PyTorch or Tensorflow, model weights and activations are represented by 32-bit floating point numbers.However, it would be inefficient to perform operations for floating point numbers with such bitwidth.We aim to quantize the 32-bit floating point numbers with fixed-point numbers and to reduce the bitwidth as much as possible, while maintaining the same model behavior.
Both PyTorch and TensorFlow provide tools of quantization for edge devices, namely PyTorch Quantization and TensorFlow Lite.However, the quantized models rely on their own libraries to run, and the quantized weights cannot be exported.Therefore, we use Python and an open-source fixed point number library to realize a quantized GRU for evaluation.We compare 8-bit, 16-bit, and 32-bit fixed-point numbers to quantize weights and activations separately.
The results show that the weights can be quantized to 16-bit fixed point numbers without a significant accuracy drop, and to 8-bit fixed point numbers with an acceptable accuracy drop.Activations can be quantized to 16-bit fixed point numbers without a significant accuracy drop, but 8-bit fixed point quantization will lead the model to collapse.Besides the fixed point precision, we find that the rounding method has a great impact on the performance.Truncating, often the default rounding method, brings larger error.Fixed point numbers with convergent rounding have almost the same behavior as floating point numbers.
The quantized GRU model is then implemented on FPGA.For convenience, the GRU is written in C++ and compiled to Vivado IP with Vitis HLS.The whole model is divided into two parts: a GRU core and an FCNN.The GRU core is designed to be shared among a group of pixels, and the FCNN will be run sequentially for each pixel after integration.Upon receiving a timestamp, GRU core loads hidden states from block RAMs (BRAMs), updates the hidden states, and sends them back to BRAM.After the integration of each repetition period, the FCNN loads the hidden state from BRAM, and streams the estimated lifetime to a FIFO.

Experimental Setup
A real-time FLI microscopy (FLIM) system with our SPAD sensor and on-FPGA RNN is built, which is shown in Figure .4. The microscope is adapted from the sa confocal microscope system (Single-Channel Cerna ® Confocal Microscope System), though it is only used for widefield imaging in this work.The same fluorescent bead samples are measured, hence a 480 nm pulsed laser (PicoQuant) is utilized.A set of filters is adopted for fluorescence imaging.The excitation filter (Thorlabs FITC Excitation Filter) has a central wavelength of 475 nm with a bandwidth of 35 nm.The emission filter is a long-pass filter (Thorlabs Ø25.0 mm Premium Longpass Filter) with a cut-on wavelength of 600 nm.The dichroic filter (Thorlabs GFP Dichroic Filter) has a reflection band from 452 to 490 nm and a transmission band from 505 nm to 800 nm.
The Piccolo system is used for single-photon detection and time tagging [38].The complete system, along with its components and a micrograph of the Piccolo chip is shown in Figure 4. Piccolo provides 50-ps temporal resolution and 47.8% peak photon detection probability (PDP).Versions with microlenses are available as well, to improve the light The incoming timestamps are serialized and corrected and then sent to GRU cores.The hidden states of GRU are stored in the BRAM.Upon the arrival of a timestamp, the hidden state of the corresponding pixel is read by the GRU core, then the hidden state is updated and written back to the BRAM.After the integration period, the final hidden states are read by FCNN cores and the lifetime is estimated and sent to PC. collection efficiency.The median dark count rate (DCR) is 113 cps (per pixel at room temperature).A Xilinx FPGA was used to communicate with the PC and control the sensor.To minimize the system and reduce latency, the RNNs were deployed on the same FPGA.
The schematic of the FPGA design is shown in Figure 6.Four computation units are realized, each of which is in charge of a quarter of the sensor (32 by 8 pixels).The timestamps, sent to FPGA in parallel, are serialized and distributed to four computation units based on their SPAD IDs.Each computation unit is composed of one GRU core, one two-layer fully connected neural network (FCNN) core, and one BRAM.The computation speed is mainly limited by the latency of the GRU core, which is 1.05 ns when employing a 160 MHz clock.The photons that arrive when computation units are busy are simply discarded.The four computation units together are capable of processing up to 4 million photons per second.

Figure 1 :
Figure1: In a traditional TCSPC FLI system, the sample is excited by a laser repeatedly, and the emission photons are detected and time-tagged.A histogram is gradually built on these timestamps, from which the lifetime can be extracted after the acquisition is completed.In our proposed system, upon the receiving of a photon, the timestamp is fed into the RNN immediately.The RNN updates the hidden state accordingly and idles for the next photon.The schematic and formula of simple RNN are shown here.At timestep n, the RNN takes the current information x n and the past information h n−1 as input, then updates the memory to the current information h n and gives out a prediction y n .

Figure 3 :
Figure 3: Comparison of LSTM, CMM, and LS Fitting on experimental data.The sample contains a mixture of fluorescent beads with three different lifetimes (1.7, 2.7, and 5.5 ns).The fluorescence lifetime images are displayed using a rainbow scale, where the brightness represents photon counts and the hue represents lifetimes.The lifetime histograms among all pixels are shown below.Most pixels are assumed to contain mono-exponential fluorophores.Two or three lifetimes might be mixed at the edge of the beads.

Figure 4 :
Figure 4: Real-time FLIM system based on the Piccolo 32×32 SPAD sensor and on-FPGA RNNs.The main body of the microscope is from a single-channel Cerna ® Confocal Microscope System (ThorLabs, Newton, New Jersey, United States).On the top is the Piccolo system, composed of the SPAD sensor itself, motherboard, breakout board, and FPGA.The SPAD sensor has 32×32 SPADs and 128 on-chip TDCs, offering 50 ps temporal resolution.The FPGA is programmed to control the SPAD sensor and communicate with PC through USB 3. The RNN is also deployed on the same FPGA.

Figure 5 :
Figure 5: Real-time lifetime image sequence from our FLIM system.The sample contains fluorescent beads with a 5.5 ns reference lifetime.(See the full video in the Supplementary Material)

Figure 6 :
Figure6: Schematic of the proposed real-time FLIM system with on-FPGA GRUs.A pulsed laser illuminates the sample repeatedly and sends a reference signal to Piccolo to reset TDCs at the same time.The emitted fluorescence photon is then detected by Piccolo and time-tagged, whose arrival time is sent to FPGA in parallel for further processing.The incoming timestamps are serialized and corrected and then sent to GRU cores.The hidden states of GRU are stored in the BRAM.Upon the arrival of a timestamp, the hidden state of the corresponding pixel is read by the GRU core, then the hidden state is updated and written back to the BRAM.After the integration period, the final hidden states are read by FCNN cores and the lifetime is estimated and sent to PC.

Table 2 :
Performance of LS fitting, CMM, CMM with background subtraction, and LSTM-32 in the presence of 1% and 5% background noise.*CMM with background noise subtraction.LSTM-32 is trained on a dataset including 0% to 10% random background noise for generalization in different scenarios.
Relative standard deviation of different methods and CRLB as a function of the number of detected photons for a fixed lifetime of 2.5 ns.No background noise.
(e) Relative standard deviation of different estimation methods and CRLB as a function of lifetime for a fixed number of detected photons (1024).5% background noise.(f)Relative standard deviation of different methods and CRLB as a function of the number of detected photons for a fixed lifetime of 2.5 ns. 5% background noise.