Coded aperture snapshot spectral imaging fundus camera

Spectral imaging holds great promise for the non-invasive diagnosis of retinal diseases. However, to acquire a spectral datacube, conventional spectral cameras require extensive scanning, leading to a prolonged acquisition. Therefore, they are inapplicable to retinal imaging because of the rapid eye movement. To address this problem, we built a coded aperture snapshot spectral imaging fundus camera, which captures a large-sized spectral datacube in a single exposure. Moreover, to reconstruct a high-resolution image, we developed a robust deep unfolding algorithm using a state-of-the-art spectral transformer in the denoising network. We demonstrated the performance of the system through various experiments, including imaging standard targets, utilizing an eye phantom, and conducting in vivo imaging of the human retina.

Retinal imaging is crucial for the detection and management of ophthalmic diseases, such as age-related macular degeneration (AMD) 1 and glaucoma 2 . The current standard-of-care retinal imaging technologies encompass color fundus photography, scanning laser ophthalmoscopes (SLO) 3 , and optical coherence tomography (OCT) 4 . Despite being extensively used in clinics, these techniques measure only spatial information of the retina. In contrast, spectral imaging captures light in three dimensions, i.e. acquiring both spatial coordinates (x, y) and wavelengths (λ) of a scene simultaneously. The rich information could be used to classify the underlying components of the object. Originally developed for remote sensing 5 , spectral imaging has gained increasing popularity in medical applications, including retinal imaging 6 . The overall rationale of using a spectral camera in retinal imaging is that the ocular tissue's endogenous optical properties, such as absorption and scattering, change during the progression of a retinal disease, and the spectrum of light emitted from tissue carries quantitative diagnostic information about tissue pathology.
To measure a spectral datacube (x, y, λ), conventional spectral imaging cameras rely on scanning, either in the spatial domain, such as using a slit scanning spectrometer 7 , or in the spectral domain, such as using a liquid-crystal-tunable-filter 8 . The scanning mechanism typically leads to a prolonged acquisition, making these techniques prone to motion artifacts. Furthermore, the data acquired from sequential measurements need to be registered in post-processing, a complicated procedure that is sensitive to motion and image signal-to-noise ratio (SNR). Particularly in retinal imaging, post-acquisition registration can result in artifacts due to tissue movement between successive images caused by arterial pulses as well as changes in the lens-eye geometry 9 . Additionally, it is challenging to keep a patient fixating on a target for an extended period of time.
A snapshot spectral imaging system can avoid all these problems and provide an ideal solution for obtaining retinal spectral data. In this category, representative techniques include computed tomographic imaging spectrometer (CTIS) 9,10 , the four-dimensional imaging spectrometer (4D-IS) 11 , image mapping spectrometer (IMS) 12,13 and coded aperture snapshot spectral imagers (CASSI) 14,15 . Among these methods, only CASSI can measure a large-sized spectral datacube because it uses compressive sensing to acquire data 16,17 , leading to a high resolution along both spatial and spectral dimensions. CASSI uses a coded aperture (mask) and a dispersive element to modulate the input scene, and it captures a two-dimensional (2D), multiplexed projection of a spectral datacube. Provided that the spectral datacube is sparse in a given domain, it can faithfully be reconstructed with a regularization method. Essentially, just one spectrally dispersed projection of the datacube that is spatially modulated by the aperture code over all wavelengths is sufficient to reconstruct the entire spectral datacube 14 . Therefore, the acquisition efficiency of CASSI is significantly higher than non-compressive methods that directly map spectral datacube voxels to camera pixels, such as IMS.
In this work, we developed a spectral imaging device based on CASSI and integrated it with a commercial fundus camera. The resultant system can capture a 1180 × 1100 × 35 (x, y, λ) spectral datacube in a snapshot with a 17.5 μm and 15.6 μm resolution along the horizontal and vertical axes, respectively. The average spectral resolution is 5 nm from 445 nm to 602 nm. Moreover, to enable fast and high-quality image reconstruction, we developed a deep-learning-based algorithm. Once trained, our algorithm can reconstruct a megapixel CASSI System principle and method Optical setup and system model. The schematic and photograph of the system are shown in Fig. 1a and c, respectively, where we couple a CASSI system to a commercial fundus camera (Topcon TRC-50dx) through its top image port. The illumination light is provided by the internal halogen lamp of the fundus camera. After being reflected by an annular mirror, a doughnut-shaped beam is formed and refocused onto the subject's eye pupil, forming uniform illumination at the retina. The reflected light is collected by the same optics and forms an intermediate image at the top image port of the camera. This image is then passed to the CASSI system. In the CASSI system, we use two achromatic lenses (AC254-50, f = 50 mm, Thorlabs) to relay the input image to a coded mask with a random binary pattern. The mask was fabricated on a quartz substrate with chrome coating by photolithography (Frontrange-photomask), as shown in Fig. 1d. The smallest coded feature of the pattern is sampled by approximately 2 × 2 pixels on the camera (11.6 μm), leading to a maximum solution of 1180 × 1100 pixels in the reconstructed image. The spectral range of the system is 450-600 nm, bandlimited by a combination of a 450 nm long pass filter (FELH0450, Thorlabs) and a 600 nm short pass filter (FESH0600, Thorlabs). Next, a 4f system consisting of an objective lens (4× /0.1 NA Olympus objective) and an achromatic lens (AC254-100, f = 100 mm, Thorlabs) relays the coded image to a CMOS image sensor (acA2500-60 um, Basler). To disperse the image, we position a round wedge prism (PS814-A, 10° Beam Deviation, Thorlabs) at the Fourier plane of the 4f relay system.
The CASSI measurement can be considered as encoding high-dimensional spectral data and mapping it onto a 2D space. As is shown in Fig. 1b, the coded aperture modulates the spatial information over the entire spectrum. The image right after the coded aperture is where f 0 (x, y; ) represents the spectral irradiance of the image at the coded aperture, and T(x, y) denotes the transmission function of the coded aperture.
After passing the coded aperture, the spatially modulated information is spectrally dispersed by a prism. The spectral irradiance at the camera plane is where D( ) denotes the nonlinear wavelength dispersion function of the prism.
The resultant intensity image captured at the detector plane is the superposition of multiple images of the spatially modulated scene at wavelength-dependent locations, which can be expressed as www.nature.com/scientificreports/ Because the detector plane is spatially discretized with a pixel pitch of , Y x, y is sampled across the entire 2D dimension of the detector plane. The measurement at the (m, n) pixel can be written as where g mn denotes the measurement noise at the (m, n) pixel, and rect is a rectangular function.
After discretizing the spectral information into L bands, the discrete measurements from the camera pixel can be written as where f mnk and T mn are the discretized representations of the source spectral irradiance and the coded aperture pattern, respectively.
Deep unfolding reconstruction algorithm. Despite being simple in hardware, the image reconstruction of CASSI can be computationally extensive when using conventional iterative algorithms like Two-step iterative shrinkage/ thresholding (TwIST). Additionally, conventional iterative algorithms usually have two steps in each iteration: physical projection and hand-crafted priors [18][19][20] . The mismatch between the hand-crafted priors and real data often results in poor image quality. Recently, deep learning methods have been used in CASSI to improve reconstruction quality and reduce the time cost [21][22][23] . However, most deep learning methods fail to incorporate the physical information of the system, such as the mask pattern, into the model and treat reconstruction as a "black box" during the training phase. Therefore, these methods lack robustness and are prone to artifacts. To solve this problem, we combined the advantages of physical projection in conventional iterative reconstruction and the strong denoising ability of deep learning and developed a deep unfolding network for CASSI reconstruction. Because the mask information is mainly processed in the projection operation, our algorithm exhibits strong robustness to variations in mask patterns.
To reconstruct the original spectral datacube from the 2D CASSI measurement, we first vectorize the image and rewrite Eq. (5) in a matrix form: Given the measurement y and matrix , there are two optimization frameworks to predict the original spectral scene f : the penalty function method and the augmented Lagrangian (AL) method. Because the AL method outperforms the penalty function method, as shown in the previous studies 24-26 , we use the AL method to solve the above inverse problem: where f , 1 , and γ 1 denote the prior regularization, Lagrangian multiplier, and penalty parameter, respectively. Equation (7) can be further written as To solve Eq. (8), we adopt an alternating direction method of multipliers (ADMM) method [27][28][29] . According to ADMM, Eq. (8) can be stratified into two subproblems and solved iteratively where v is an auxiliary variable, and the superscript i denotes the iteration index. Equation (9) is a classical denoising problem, which can be solved by a denoising prior such as total variation, wavelet transformation, or denoising network. Herein we use a deep unfolding network and a state-of-the-art spectral transformer for denoising 30 .
Equation (10) has a closed-form solution 29 , which is termed projection operation www.nature.com/scientificreports/ Due to the special structure of , which consists of a diagonal block matrix, as shown in Ref. 20 , Eq. (11) can be solved in one shot. Therefore, f can be solved by multiple iterations of spectral-transformer (denoising) and projection operation, as shown in Fig. 2. Here f 0 is written as: Figure 2 shows the iterative architecture of the deep unfolding algorithm. In the projection operation stage, f i is calculated from v i according to Eq. (11), where y , γ i 1 , γ i 2 , , i 1 , and i 2 are inputs (For convenience, we use FV i package to represent these inputs in Fig. 2). In the denoising operation stage, v i is calculated from f i−1 according to Eq. (9), where γ i 2 , ′ , and i 2 are inputs (we use VF i package to represent these inputs in Fig. 2). Additionally, the spectral transformer is used as the denoiser, which uses matrix ′ to guide the transformer. ′ is transformed by a convolution neural network (CNN) based on γ i 2 , , y: The Lagrangian multiplier can be updated as and the penalty parameters γ i 1 , γ i 2 can be trained in deep unfolding.
Training. We used PyTorch 31 to train our model on an NVIDIA RTX3090 GPU. The mean square error (MSE) was selected as the loss function. For training, we adopt an Adam optimizer 32 and set the number of iterations in deep unfolding as four, the mini-batch size as one, and the spatial resolution as 256 × 256. Although we trained the model using images with 256 × 256 pixels, we can apply it to images of any dimensions after rescaling. The initial learning rate was set to 1×10 −4 . After the first five epochs, the learning rate decays at a rate of 0.9 every 15 epochs. The total number of epochs is 200, and the training time is about 60 hours.

Results and discussion
Rainbow object. We first validated our system by imaging an object illuminated with rainbow light. As shown in Fig. 3a, we positioned a linear variable visible bandpass filter (LVF) (88365, Edmund optics) in front of a broadband halogen light source. In LVF, the thickness of the coating varies linearly along one dimension of the filter, resulting in a linear and continuous variation in spectral transmission. The spectral resolution of the LVF is 7-20 nm. To project a broad spectrum to the field of view, we used a lens pair with the ratio of their focal lengths of 3.3:1 (MAP1030100-A, Thorlabs) to demagnify the linear filter and relayed it to an intermediate plane, where a letter object was located. Therefore, each lateral location of the object exhibited a distinct color. Figure 3b and c show the raw image and the reconstructed panchromatic image, respectively. Ten representative spectral images from a total of 35 channels are shown in Fig. 3d.
Spatial and spectral resolutions. We quantified the spatial resolution of the system by imaging a USAF resolution target. We positioned the resolution target at the back focal plane of a lens that mimicked the crystalline lens in the eye and located the combined model in front of the fundus camera. Meanwhile, rather than using the theoretical mask pattern as a prior for reconstruction, we experimentally captured the coded mask image under uniform monochromatic illumination (Fig. 1d) and used this data to improve the reconstruction accuracy.
We illuminated the USAF target in the transmission model with broadband light (450-600 nm). The reconstructed spectral images are shown in Fig. 4a, and a zoomed-in view of the 532 nm channel is shown in Fig. 4b. We also show the corresponding raw measurement in Fig. 4c, where the spatio-spectral crosstalk is clearly visible. After reconstruction, we successfully removed the spatial modulation pattern and restored high-resolution images in all spectral channels.
To quantify the spatial resolution, we zoomed in on the central part of the reconstructed 532 nm channel and plotted the intensities across the dashed lines in the image (Fig. 4d). The image contrast is defined as I max −I min I max +I min , where I is the intensity. We calculated the image contrast for each group of bars within the field of view (FOV). Given a threshold of 0.4, Group 6 element 1 horizontal bars and group 5 element 6 vertical bars are minimally (12) i Figure 2. Deep unfolding algorithm for CASSI reconstruction. The projection operation is formulated in Eq. (11), and the spectral transformer is depicted in Ref. 30 . L is the total number of iterations. The spectral resolution of the system is determined by the size of the smallest feature on the coded mask, the camera pixel size, and the spectral dispersion power of the prism 12 . In our system, the smallest coded feature corresponds to two camera pixels. The spectral dispersion across this distance determines the spectral resolution. Because the prism has nonlinear spectral dispersion, its spectral dispersion power varies as a function of wavelength. We measured the wavelength-dependent spectral resolution by imaging a blank FOV uniformly  www.nature.com/scientificreports/ illuminated by monochromatic light of varied wavelengths. The dispersion distances (in pixels) between the coded mask images of adjacent wavelengths at four spectral bands are shown in Table 1. Because the smallest coded feature is mapped to two camera pixels, the spectral resolution is double the pixel dispersion.

Standard eye phantom.
To further validate the system in retina imaging, we imaged a standard eye phantom (Wide field model eye, Rowe Technical design), which has both a realistic eye lens and a vasculature-like pattern painted at the back surface. A direct image of the retina obtained with a reference camera is shown in Fig. 5a. We positioned the eye phantom in front of the fundus camera (Fig. 5b), illuminated it with the camera's internal halogen lamp, and captured the retina image using CASSI in a snapshot. The reconstructed spectral channel images of two ROIs are shown in Fig. 5c, showing a close resemblance to the corresponding regions in the reference image (Fig. 5a). Furthermore, to quantitatively evaluate the spectral accuracy, we measured the field-averaged spectrum using a benchmark fiber spectrometer (OSTS-VIS-L-25-400-SMA, Ocean Optics) as the reference. The CASSI reconstructed spectrum matches well with the ground truth (Fig. 5d).  www.nature.com/scientificreports/ Spectral imaging of the retina in vivo. To demonstrate our system in vivo, we imaged the retina of a 23-year-old healthy female volunteer. Before the experiment, the subject's pupil was dilated using mydriatic (Mydriacyl, 1%). We operated the camera in the reflectance mode and utilized the internal light source of the camera for illumination. The snapshot measurement and reconstructed spectral image are displayed in Fig. 6a and b, respectively. Next, we calculated the reflectance spectrum S r at a vessel location by averaging the pixels' spectra in the circled area (A in Fig. 6a). To compute the absorption spectrum, we measured the illumination spectrum S i of the fundus camera's internal lamp by placing a white paper in front of the retinal camera's front lens and averaging the reflectance spectra of pixels within the field of view. The absorption spectrum S a (Fig. 6c) was then calculated by subtracting the normalized reflectance spectrum S r from the normalized lamp's illumination spectrum S i : The resulting spectrum (Fig. 6c) exhibits two prominent peaks, which correspond to the peak absorptions of oxyhemoglobin at approximately 540 nm and 575 nm 33 . Figure 3 demonstrates the capability of our system in resolving the overlap between spatial and spectral information. By encoding each lateral coordinate of the planar letter object with a unique color, we successfully reconstructed the images of individual letters in different spectral channels (487 nm, 499 nm, 517 nm, and 527 nm for the letter 'UCLA'). Our deep unfolding reconstruction network effectively removes the spatial modulation pattern caused by the coded aperture in all spectral channels.

Discussion
To further enhance the spatial resolution of our proposed system, several strategies can be employed. One approach is to utilize an objective lens with a higher numerical aperture (NA). Additionally, using a mask with smaller features and a camera with a smaller pixel size while maintaining the 2 × 2 sampling rate can also improve spatial resolution. Another contributing factor to the quality of reconstruction is the camera sensor.
In our system, we employed a machine vision camera with limitations such as low quantum efficiency and high readout noise. Overcoming these limitations can be achieved by selecting a scientific monochromatic sensor with high quantum efficiency and superior noise performance. Such improvements are particularly beneficial for clinical applications. Figures 5 and 6 present the reconstruction results of the eye phantom and in vivo retina, respectively. The spectrum of the eye phantom aligns well with the ground truth illumination light source. Moreover, we successfully acquired the absorption spectral signatures of oxyhemoglobin through in vivo imaging. In terms of image quality, the reconstructed vascular structure in the eye phantom exhibits higher contrast compared to in vivo imaging. This is due to the relatively high reflection of the painting material in the eye phantom, resulting in a distinct contrast between the vascular structure (bright regions) and the background (dark regions). However, in the case of in vivo imaging, the low reflectance of the human retina poses a more challenging problem for . www.nature.com/scientificreports/ conventional iterative algorithms like TwIST 18 . Leveraging the denoising capabilities of the spectral transformer module, our deep unfolding network demonstrates robustness under various imaging conditions. Furthermore, our deep-learning-based algorithm can reconstruct the entire megapixel datacube within 60 s, a significant improvement compared to the conventional iterative algorithms that would typically require 20 min. While compromising the snapshot capability and capturing multiple shots with different mask patterns is another method to improve performance, it necessitates additional computational time and hardware components such as a spatial light modulator (SLM) for dynamically changing the mask patterns 34 .

Conclusions
In summary, we developed a snapshot spectral retinal imaging system by integrating a CASSI system with a fundus camera. We also developed a deep unfolding method for fast and high-quality image reconstruction. The resultant system can acquire a large-sized spectral datacube in the visible light range in a single exposure.
The system performance has been demonstrated with standard targets, eye phantom and human retina imaging test. In the in vivo human retinal imaging experiment, the absorption spectral signatures of oxyhemoglobin were successfully acquired by using our proposed system. Seeing its high-resolution snapshot imaging advantage, we expect our method can find broad applications in retinal imaging.

Data availability
Data underlying the results presented in this paper may be obtained from the corresponding author upon reasonable request.