Video-rate hyperspectral camera based on a CMOS-compatible random array of Fabry–Pérot filters

Hyperspectral (HS) imaging provides rich spatial and spectral information and extends image inspection beyond human perception. Existing approaches, however, suffer from several drawbacks such as low sensitivity, resolution and/or frame rate, which confines HS cameras to scientific laboratories. Here we develop a video-rate HS camera capable of collecting spectral information on real-world scenes with sensitivities and spatial resolutions comparable with those of a typical RGB camera. Our camera uses compressive sensing, whereby spatial–spectral encoding is achieved with an array of 64 complementary metal–oxide–semiconductor (CMOS)-compatible Fabry–Pérot filters placed onto a monochromatic image sensor. The array affords high optical transmission while minimizing the reconstruction error in subsequent iterative image reconstruction. The experimentally measured sensitivity of 45% for visible light, the spatial resolution of 3 px for 3 dB contrast, and the frame rate of 32.3 fps at VGA resolution meet the requirements for practical use. For further acceleration, we show that AI-based image reconstruction affords operation at 34.4 fps and full high-definition resolution. By enabling practical sensitivity, resolution and frame rate together with compact size and data compression, our HS camera holds great promise for the adoption of HS technology in real-world scenarios, including consumer applications such as smartphones and drones.

A hyperspectral camera based on a random array of CMOS-compatible Fabry–Pérot filters is demonstrated. The hyperspectral camera exhibits performance comparable with that of a typical RGB camera, with 45% sensitivity to visible light, a spatial resolution of 3 px for 3 dB contrast, and a frame rate of 32.3 fps at VGA resolution.

Hyperspectral (HS) cameras, owing to their ability to provide rich spectral information at each pixel of an image, have stimulated advanced sensing applications, including machine vision, food and agricultural analysis, environmental monitoring and healthcare [1][2][3][4][5][6][7][8][9][10][11][12][13]. Although three general types of HS cameras (spatial-scanning 14, spectral-scanning 15 and snapshot 16,17) are commercially available, their adoption in industrial and consumer applications is hindered by fundamental drawbacks such as low sensitivity, resolution and/or frame rate compared with an RGB camera (Supplementary Section 1). The emerging field of computational imaging has the potential to enhance the sensing capability of HS cameras through optimal hardware design and image post-processing. Compressed sensing is a technique for efficiently acquiring signals from under-sampled measurements, and has contributed to improving the performance of many sensing systems 18. Recent demonstrations include an on-chip spectroscopic device [19][20][21][22][23][24][25][26] and a coded aperture snapshot spectral imager [27][28][29][30][31]. However, no video-rate HS camera has been made with RGB camera-comparable sensitivity and resolution, which has held back the potential of HS imaging in industrial and consumer applications.
In this Article, we demonstrate an HS camera that exploits efficient acquisition of the spatial and spectral information compressed in two-dimensional sensor signals. Our camera shows a sensitivity of 45% for visible light, a spatial resolution of 3 px for 3 dB contrast, and a frame rate of 32.3 fps at VGA resolution (0.3 MP). These metrics are comparable with an equivalent RGB camera and thus meet the requirements for practical use.
In our HS imaging method (Fig. 1a), light from an object is spatially and spectrally encoded by a measurement matrix, followed by image reconstruction based on spatial sparsity of the object. The measurement matrix is implemented as a coded mask of spatially random transmittance patterns with a small correlation between different wavelengths, that is, spectral randomness (conceptually shown in Fig. 1b and Supplementary Fig. 1). As shown in Fig. 1c, the coded mask is directly placed onto a monochromatic image sensor; alternatively, for mass production, it can be monolithically integrated onto the sensor using a complementary metal-oxide-semiconductor (CMOS)-compatible process (see Methods). The mask is an array of square cells with a pixel pitch of 5.5 μm (inset of Fig. 1c), matched with that of the image sensor.
Encoded light from the object is thus captured as a single monochromatic image (compressed image), and image reconstruction is performed assuming the spatial sparsity of the object. The compressed sensing process and the image reconstruction under the sparsity assumption can be expressed as follows:
$$ g = A f = \sum_{i=1}^{N} A_i f_i \tag{1} $$
$$ \min_{f} Q(f), \qquad Q(f) = \tfrac{1}{2} \lVert g - A f \rVert_2^2 + \tau \, \mathrm{TV}(f) \tag{2} $$
where N is the number of wavelength bands (with A_i and f_i denoting band i), g is the compressed image, A is the measurement matrix based on the transmittance patterns of the coded mask, f is the reconstructed image, τ is the regularization coefficient and TV represents total variation. In this work, equation (2) is solved by an iterative method based on a two-step iterative shrinkage thresholding (TwIST) algorithm 32. Solving equation (2) minimizes the cost function Q(f) to find the reconstructed image f through a TV denoising (regularization) process, whose strength relative to the first L2-norm fidelity term is set by τ. A smaller value of τ performs weaker TV denoising to maintain the sharp edges of the reconstructed image, but makes the reconstruction process more unstable against noise on the compressed image; the spatial resolution is thus essentially limited by sensor noise 33.
Letter https://doi.org/10.1038/s41566-022-01141-5
Fig. 1 (caption, partially recovered): The coded mask shows spatially random transmittance patterns with a small correlation between different wavelength bands, thereby realizing spatially and spectrally random sampling. c, A photograph of our HS camera with the coded mask directly placed onto the surface of a monochromatic image sensor. The inset shows a microscope image of the mask, which is an array of square cells with a pixel pitch of 5.5 μm, matched with that of the image sensor.
Although equation (2) is an underdetermined linear system, its convergence is mathematically ensured with a non-zero reconstruction error if A is incoherent for the object signal to be measured (Supplementary Section 2). In this case, incoherence of the measurement matrix is achieved by a spatially and spectrally random transmittance of a coded mask. As shown in Fig. 1b, the mask is designed to show spatially random transmittance patterns with a small correlation between different wavelength bands, that is, spectral randomness.
To design an incoherent coded mask for low-error image reconstruction, we started by conducting numerical simulations to study the reconstruction error as a function of the spatial and spectral randomness of the mask (Supplementary Section 2). Virtual coded masks were numerically generated with varying spatial (spectral) randomness while ensuring perfect randomness in the spectral (spatial) dimension, so that the spatial (spectral) randomness was separately evaluated. The reconstruction error arises not only from the mask's lack of spatial and spectral randomness, but also from noise on the compressed image. Our goal here was therefore to design a mask whose reconstruction error does not exceed 2%, given that the image sensor used in this work (ams, CMV2000-3E12M1PP) shows a signal fluctuation of around 2% (Supplementary Fig. 8).
The spatial randomness of the mask was characterized for each wavelength band using an evaluation index σ/μ, where the standard deviation (σ) and the average transmittance (μ) are obtained from the histogram of the transmittance pattern. As shown in Supplementary Fig. 2, the reconstruction error monotonically decreases as σ/μ increases and becomes smaller than 2% for σ/μ > 0.1. The spectral randomness of the mask was instead characterized by calculating the two-dimensional correlation coefficients between the wavelength bands i and j as:
$$ r_{ij} = \frac{\sum_{m}\sum_{n} \left(i_{mn} - \bar{i}\right)\left(j_{mn} - \bar{j}\right)}{\sqrt{\left(\sum_{m}\sum_{n} \left(i_{mn} - \bar{i}\right)^{2}\right)\left(\sum_{m}\sum_{n} \left(j_{mn} - \bar{j}\right)^{2}\right)}} $$
where i_mn (j_mn) is the cell transmittance at location (x, y) = (m, n) in the mask and ī (j̄) is the average transmittance. The non-diagonal components (i ≠ j) represent indices for the spectral randomness, whereas the diagonal components (i = j) are naturally equal to 1. As shown in Supplementary Fig. 3, the reconstruction error is suppressed at ~1% for r_ij < 0.9 (i ≠ j).
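The two randomness indices described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' evaluation code: the mask here is a hypothetical stand-in drawn from a uniform distribution, and the array layout (bands, rows, columns) and function names are assumptions.

```python
import numpy as np

def spatial_randomness(mask):
    """Return sigma/mu of the transmittance pattern for each band.

    mask: array of shape (bands, rows, cols) holding cell transmittances.
    """
    flat = mask.reshape(mask.shape[0], -1)
    return flat.std(axis=1) / flat.mean(axis=1)

def spectral_correlation(mask):
    """Return the matrix r_ij of 2D correlation coefficients between bands."""
    flat = mask.reshape(mask.shape[0], -1)
    # Pearson correlation of the flattened patterns equals the 2D
    # correlation coefficient between the two transmittance maps.
    return np.corrcoef(flat)

# Hypothetical stand-in for a measured mask: 20 bands of 64 x 64 cells.
rng = np.random.default_rng(0)
mask = rng.uniform(0.1, 0.9, size=(20, 64, 64))

ratio = spatial_randomness(mask)
r = spectral_correlation(mask)
off_diag = r[~np.eye(r.shape[0], dtype=bool)]

# Design targets quoted in the text: sigma/mu > 0.1 and r_ij < 0.9 (i != j).
print("sigma/mu target met:", bool(np.all(ratio > 0.1)))
print("r_ij target met:", bool(np.all(off_diag < 0.9)))
```

For a measured mask, the same two functions would be applied to the calibrated transmittance patterns instead of the synthetic array.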
To realize a coded mask that satisfies σ/μ > 0.1 and r_ij < 0.9 (i ≠ j), and ensure a reconstruction error below 2%, Fabry-Pérot filters are used to form the square cells in the mask. Figure 2a shows the schematic cross-section of the Fabry-Pérot filters, which consist of distributed Bragg reflectors (DBRs) and a cavity layer on a SiO2 substrate. Considering the required randomness and fabrication process complexity, 64 Fabry-Pérot filters are designed; half (32) of the filters have top and bottom DBRs with a cavity layer in between, whereas the other half (32) have only a bottom DBR with a cavity layer. By progressively varying the cavity thickness in 32 steps, the 64 filters show unique transmission spectra. More filters (for example, 128) would only result in a minor improvement to the spectral resolution, but such a design dramatically increases the difficulty of the fabrication process (Supplementary Fig. 5).
In Fig. 2b, the light (dark) grey curves show the simulated spectra of the Fabry-Pérot filters with (without) top DBRs, whereas the red, green and blue spectra correspond to the filters highlighted in Fig. 2a. The average transmittance of the 64 filters is 46.4% for the visible wavelength range (from λ = 450 nm to 650 nm). The coded mask comprising the 64 filters was then fabricated and integrated onto the image sensor, and the transmittance patterns were measured to determine the measurement matrix through a calibration process (see Methods). In the calibration, the wavelength bands were set from λ = 450 nm to 650 nm with a sampling interval Δλ of 10 nm (a total of 20 bands). Figure 2c shows an experimentally measured histogram and transmittance pattern (inset) of the fabricated mask at λ = 550 nm. The σ/μ values are plotted in Fig. 2d, where σ/μ gradually decreases at longer wavelengths because the Fabry-Pérot filters show broader transmission peaks, leading to larger μ and smaller σ. The measured σ/μ is larger than 0.2 for all bands, corresponding to a reconstruction error of 1-1.5% (Supplementary Fig. 2). Figure 2e shows the experimentally measured r_ij of the fabricated mask, where the yellow line indicates a contour at r_ij = 0.9. The fabricated mask satisfies r_ij < 0.9 (i ≠ j), corresponding to ~1% reconstruction error, as expected from Supplementary Fig. 3.
We next evaluate the sensitivity of our HS camera by imaging objects under average office lighting with 550 lux, which is a relatively dim environment compared with natural scenes. Figure 3a,b shows photographs captured by our HS camera and a standard RGB camera (see Methods). The HS image obtained by our HS camera is rendered as a synthetic RGB image, showing a different hue from that of the RGB camera. The three expanded images in Fig. 3a,b indicate that the image quality of our HS camera is qualitatively comparable with that of the RGB camera. Although our HS camera loses spatial information owing to TV denoising, the reconstruction error at the edges is not clearly observed. A more detailed investigation on object boundaries is performed below (Fig. 4 and Supplementary Section 7). Quantitatively, the average transmittance (that is, the effective sensitivity) is evaluated with reference to the monochromatic camera transmittance of 100%, as the monochromatic camera uses no spectral filter. The effective sensitivities are 45% for our HS camera and 37% for the RGB camera over the 450-650 nm wavelength range, much higher than that of a conventional HS camera (typically ~5%); the raw images used to calculate the sensitivities are shown in Supplementary Fig. 11. The experimental sensitivity of our HS camera is close to the simulation result (46.4%), comparable with that of the RGB camera. From three expanded images in Fig. 3a,b, signal intensity is obtained in each band and plotted in Fig. 3c. Our HS camera shows spectral information of 20 bands, whereas the RGB camera shows only three bands. An intensity dip of LED lighting at λ = 485 nm is clearly observed by our HS camera, which is not detected by the RGB camera. As demonstrated below (Fig. 4 and Supplementary Video 1), our HS camera enables video-rate (>30 fps) acquisition of 20 bands of spectral information.
The spatial resolution is experimentally visualized in Fig. 3d,e and evaluated as modulation transfer function (MTF) curves in Fig. 3f by imaging a ladder resolution chart. The MTF curves indicate that the spatial resolution of our HS camera is comparable with the RGB camera, exhibiting 3 dB contrast (normalized peak height of 0.5) obtained at the spatial frequency of three cycles per millimetre. The spatial resolution of three cycles per millimetre indicates that 3 px are required to separate adjacent lines with 3 dB contrast, as the millimetre/pixel conversion ratio is 1,150 px/130 mm, as evidenced from Fig. 3d,e. This result is reasonable because edge smoothing occurs in our HS camera during the TV denoising process, whereas the RGB camera with a Bayer filter mosaic also loses spatial information due to 2 × 2 px convolution and inter-pixel crosstalk. Further evaluation of the spatial resolution using a star resolution chart is performed in Supplementary Section 8.
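The conversion from three cycles per millimetre to roughly 3 px can be checked with a short calculation. The sketch below uses only the figures quoted in the text (1,150 px over 130 mm and three cycles per millimetre); the variable names are illustrative.

```python
# Pixel scale of the resolution-chart image, as quoted from Fig. 3d,e.
px_per_mm = 1150 / 130

# Spatial frequency at which the MTF falls to 3 dB contrast.
cycles_per_mm = 3.0

# Number of pixels spanned by one cycle (one line pair) at that frequency.
px_per_cycle = px_per_mm / cycles_per_mm
print(f"{px_per_cycle:.2f} px per cycle")  # prints "2.95 px per cycle"
```

One cycle therefore spans just under 3 px, which is the sense in which the text quotes a resolution of 3 px for 3 dB contrast.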
The spectral accuracy of our HS camera is experimentally evaluated by using standard colour samples, as shown in Fig. 3g. The continuous lines in Fig. 3g represent ground truth spectra obtained using a spectral-scanning HS camera (see Methods), whereas the symbols represent three typical spectra obtained by our HS camera. Note that white balance was performed for the spectra in Fig. 3g by using a diffuse reflectance target (SphereOptics, SG3151) to compensate for the spectrum of the LED lighting. The results are in good agreement, as evidenced by further testing all eight colours and obtaining an absolute average error of 2.2% over 20 bands (Supplementary Section 9). The experimentally measured error is a reasonable value considering the sensor noise (2%) and the simulated reconstruction errors arising from the spatial and spectral randomness (1-1.5% and ~1%, respectively).
To further characterize the performance of the image reconstruction, numerical simulations were performed using HS image datasets (CAVE 34 , Tokyo Tech 35 and Manchester 36 ), as shown in Fig. 4. In the simulation, the compressed image was first synthesized from each dataset following equation (1) with the experimentally obtained measurement matrix; the corresponding HS images were then obtained. Here, no signal fluctuation was assumed because the signal fluctuation arises from an image sensor, but not from the compression/reconstruction processes. The impact of sensor noise on the spectroscopic performance and image quality is separately discussed in Supplementary Section 5.
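The synthesis of a compressed image from an HS data cube can be sketched as follows. The sketch assumes that equation (1) acts as a per-pixel, per-band weighting by the mask transmittance followed by a sum over the N bands, which is consistent with a mask whose cell pitch matches the sensor pixels but is not confirmed by the text; the array shapes and function names are illustrative. The adjoint operator is included because iterative solvers of the TwIST family typically need it.

```python
import numpy as np

def compress(A, f):
    """Forward model: sum the band images f_i weighted by the mask patterns A_i.

    A, f: arrays of shape (N, H, W). Returns the compressed image of shape (H, W).
    """
    return np.sum(A * f, axis=0)

def adjoint(A, g):
    """Adjoint of compress: spread the 2D image g back over the N bands."""
    return A * g[None, :, :]

# Synthetic stand-ins for a measured matrix and a dataset cube (assumptions).
rng = np.random.default_rng(1)
N, H, W = 20, 48, 64
A = rng.uniform(0.0, 1.0, size=(N, H, W))
f = rng.uniform(0.0, 1.0, size=(N, H, W))

g = compress(A, f)
print(g.shape)  # (48, 64): one monochromatic frame encodes all 20 bands
```

In a simulation of the kind described above, f would be replaced by a cube from one of the datasets and A by the calibrated measurement matrix.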
We first evaluated the image reconstruction quality. The rows of Fig. 4a show the ground truth images, the reconstructed images and the absolute error maps simulated using the datasets (CAVE, Tokyo Tech and Manchester, along the columns). The ground truth and [...] Fig. 4a. The reconstruction error is localized at the edges of structures where the spatial sparsity is low (Supplementary Section 6); however, the absolute error is still below 10% at the edges of structures, which is comparable with previous reports 37,38. Such an error at the edges can be further reduced by optimizing the reconstruction algorithm 39.
We also evaluated the convergence and the frame rate of the image reconstruction. Figure 4b shows the simulated image quality (that is, the peak signal-to-noise ratio, PSNR) as a function of the iteration step of the image reconstruction algorithm. The PSNR saturates to more than 30 dB after around 50 iteration steps for all datasets. These simulated PSNR values are comparable with those in past reports 30,31,37,38,40.
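For reference, a common definition of PSNR can be sketched as below. The text does not state the exact definition used (for example, whether PSNR is computed per band and then averaged), so the whole-cube mean squared error and the unit peak value are assumptions.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    mse = np.mean((np.asarray(reference) - np.asarray(estimate)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# A uniform error of 0.1 on a unit-peak image gives an MSE of 0.01,
# that is, a PSNR of 20 dB.
reference = np.zeros((4, 4))
estimate = np.full((4, 4), 0.1)
print(round(psnr(reference, estimate), 6))  # 20.0
```

Under this definition, a PSNR above 30 dB corresponds to a root-mean-square error below about 3.2% of the peak value.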
As our HS camera enables a fast shutter speed (~10 ms at 550 lux with f/4.0 aperture), the image reconstruction is slower than the shutter and limits the frame rate of our HS camera in most cases. The frame rates (limited by the post-image reconstruction process, see Methods) were experimentally measured using an iterative method with 50 iteration steps. Figure 4c shows the frame rates as a function of the number of pixels, that is, the size of the measurement matrix in equation (2), revealing an exponential dependency. With the commercially available single graphics processing unit (GPU) that we used here (see Methods), the frame rate reaches 32.3 fps at VGA resolution (640 × 480 pixels) and 7.14 fps at full-HD resolution (1,920 × 1,080 pixels) (see Supplementary Video 1). For further acceleration, non-iterative image reconstruction is also attempted with the help of AI, demonstrating video-rate (34.4 fps) operation at full-HD resolution.
In conclusion, we demonstrated a video-rate HS camera with RGB camera-comparable sensitivity and resolution. Our HS camera was fabricated by integrating a CMOS-compatible spatially and spectrally random coded mask onto a monochromatic image sensor. The measured filter transmittance was 45% for visible light, and the spatial resolution was 3 px for 3 dB contrast, thus making the camera comparable with a standard RGB camera. Hyperspectral images with an absolute error of 2.2% on average were obtained through iterative image reconstruction assuming spatial sparsity. A frame rate of 32.3 fps is achieved for VGA resolution by using iterative image reconstruction and further enhanced to full-HD resolution by using AI-based reconstruction. The advantages of our HS camera (that is, practical sensitivity, compact size and data compression) hold great promise for the adoption of HS imaging technologies in various scenarios, including consumer applications such as smartphones, drones and Internet of Things (IoT) devices.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41566-022-01141-5.

Design of the coded mask
Rigorous coupled-wave analysis (RSoft, DiffractMOD) was used to simulate the transmittance of the Fabry-Pérot filters with different cavity thicknesses. The refractive indices used in the simulation were determined by spectroscopic ellipsometry of the deposited films.
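To illustrate why stepping the cavity thickness yields distinct transmission spectra, the following sketch uses the textbook Airy formula for an ideal, lossless Fabry-Pérot cavity at normal incidence. It is not a substitute for the rigorous coupled-wave simulation used in this work: the cavity index, mirror reflectance and thickness range are hypothetical values chosen only for illustration, and the reflection phase of the DBRs is ignored.

```python
import numpy as np

def airy_transmittance(wavelength_nm, thickness_nm, n_cavity=2.0, mirror_R=0.8):
    """Transmittance of an ideal Fabry-Perot cavity (Airy function).

    n_cavity and mirror_R are hypothetical values, not taken from the paper.
    """
    # Round-trip phase accumulated in the cavity at normal incidence.
    delta = 4.0 * np.pi * n_cavity * thickness_nm / wavelength_nm
    # Coefficient of finesse for two identical mirrors of reflectance R.
    F = 4.0 * mirror_R / (1.0 - mirror_R) ** 2
    return 1.0 / (1.0 + F * np.sin(delta / 2.0) ** 2)

wavelengths = np.linspace(450.0, 650.0, 201)   # visible range used in the text
thicknesses = np.linspace(100.0, 200.0, 32)    # 32 hypothetical cavity steps

# One spectrum per cavity thickness: shape (32, 201).
spectra = np.array([airy_transmittance(wavelengths, d) for d in thicknesses])

# The peak wavelength shifts with thickness (resonance at lambda = 2 n d / m).
peaks = wavelengths[np.argmax(spectra, axis=1)]
print(peaks[:5])
```

With these assumed values, a 137.5-nm cavity resonates at 550 nm in first order, and thicker cavities move the peak towards longer wavelengths until the next order enters the band.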

Fabrication of the coded mask
The fabrication process started with an ion-assisted physical vapour deposition of a bottom DBR and a cavity on a 15 mm diameter SiO 2 wafer with a thickness of 625 μm. The cavity layer was patterned using a mask with a random pattern at the pixel level by standard photolithography and etching process. By changing the random pattern and etching depth, the process was repeated, thereby forming 32 different cavity thicknesses randomly distributed over the mask. A top DBR was deposited onto the patterned cavity and then removed using a mask with a random pattern for half of the pixels.

Integration of the coded mask on the image sensor
Using optical adhesive, the coded mask was directly attached onto a monochromatic image sensor (ams, CMV2000-3E12M1PP), with the top DBR and the sensor surface facing each other. The air gap was controlled by introducing 5-μm-thick epoxy resin (SU-8) pillars between the mask and the sensor surface. Owing to the spatial randomness of the mask, the image quality was robust against lateral misalignment, whereas rotation was controlled to be nearly zero to avoid undesirable moiré interference between the pixel pitches of the mask and the sensor. A large mismatch of the pixel pitch between the filters and the image sensor results in spatial averaging of the filter transmittance, thereby deteriorating the spatial randomness. For mass production, the coded mask was also monolithically integrated on an 8-inch diameter image sensor wafer by using a CMOS-compatible process at Imec (see ref. 16 for more detail). The preliminary result of our prototyping shows performance comparable with that of this work, with an absolute average error of 2.6% over 20 bands. As the coded mask can be implemented by both on- and off-chip processes, our approach is readily applicable to different types of sensors at various wavelengths.

Calibration of the camera
The measurement matrix of our HS camera was experimentally determined as follows: monochromatic light with a full-width at half-maximum of 10 nm was obtained from a supercontinuum laser (Leukos, Rock-450-5) with a monochromator (Shimadzu, SPG-120S-REV), and was introduced into an integrating sphere (Labsphere, HAS-08L). The spatially uniform monochromatic illumination from the exit of the integrating sphere was then captured by our HS camera. By changing the wavelength band, the transmittance patterns of the coded mask were measured to determine the measurement matrix in equations (1) and (2). Because the measurement matrix also absorbs the properties of the optical system, the lens settings (focal length and F number) during imaging should be the same as those in the calibration process. By preparing presets of the measurement matrix associated with the different calibration settings, the lens settings and the wavelength bands can be changed in the post-image reconstruction process.
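One way to turn such flat-field captures into per-band transmittance patterns is sketched below. The normalization is an assumption: the text does not say how the captured frames are scaled, so the sketch divides each masked frame by a reference frame of the same illumination taken without the mask, after optional dark-frame subtraction. All names and array shapes are illustrative.

```python
import numpy as np

def measurement_matrix(mask_frames, reference_frames, dark=0.0, eps=1e-6):
    """Estimate per-band transmittance patterns from flat-field captures.

    mask_frames:      (N, H, W) frames captured through the coded mask.
    reference_frames: (N, H, W) frames of the same illumination without the
                      mask (an assumed normalization, not stated in the text).
    dark:             dark level (scalar or array) subtracted from both.
    """
    signal = np.asarray(mask_frames, dtype=float) - dark
    reference = np.asarray(reference_frames, dtype=float) - dark
    # Guard against division by zero in dead or unilluminated pixels.
    return signal / np.maximum(reference, eps)

# Synthetic check: recover a known transmittance pattern.
rng = np.random.default_rng(2)
true_T = rng.uniform(0.1, 0.9, size=(20, 32, 32))
reference = np.full((20, 32, 32), 200.0) + 10.0   # flat field plus dark level
masked = true_T * 200.0 + 10.0

A = measurement_matrix(masked, reference, dark=10.0)
print(np.allclose(A, true_T))  # True
```

Storing one such array per lens setting would correspond to the presets of the measurement matrix mentioned above.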

Characterization of the camera
An RGB camera (Ximea, MQ022CG-CM) with an equivalent sensor and a Bayer filter mosaic (ams, CMV2000-3E5C1PP) was used for fair comparison of the sensitivity and resolution. The objects in Fig. 3a,b were illuminated by standard LED lighting with an illuminance of 550 lux in total. To capture the objects, the exposure time, digital gain, analogue gain and gamma were set to 20 ms, 3.2 dB, 1 dB and 1.0, respectively. The ladder resolution chart printed on white paper and the colour samples in Fig. 3c,d,f were illuminated by a pair of wideband LED lights (effilux, EFFI-FLEX-HSI) with an illuminance of 550 lux in total. To capture the resolution chart, the exposure time, digital gain, analogue gain and gamma were set to 5 ms, 3.2 dB, 1 dB and 1.0, respectively. To capture the colour samples on black paper, the exposure time, digital gain, analogue gain and gamma were set to 20 ms, 3.2 dB, 1 dB and 1.0, respectively. For the resolution chart and MTF curves, the HS image at λ = 550 nm was compared with the green component of the RGB camera. The spectra of the colour samples were obtained from the HS image by averaging 50 × 50 pixels at the centre of each sample. For the ground truth, a spectral-scanning HS camera was assembled using bandpass filters (Thorlabs, FB'XXX'-10, where 'XXX' is from 450 nm to 650 nm in 10 nm steps) and a monochromatic camera with the same image sensor (ams, CMV2000-3E12M1PP). A 16-mm fixed focus objective lens (Edmund optics, no. 59-870) was used with f/4.0 aperture in all scenes.

Iterative/AI-based image reconstruction
Iterative image reconstruction was developed based on the TwIST algorithm 32, but we modified and optimized the algorithm for a much faster convergence rate in a GPU environment (NVIDIA, GeForce RTX 2080Ti); TV was used for the regularization term with a coefficient τ of 0.5. AI-based image reconstruction was developed using a convolutional neural network-based method 38. To further improve the frame rate, the size of the convolutional neural network model was reduced while maintaining the image reconstruction quality, followed by model optimization using NVIDIA TensorRT. With the number of parameters reduced from 31 million to 2 million, the AI-based method exhibits 206.6 fps at VGA resolution (640 × 480 pixels), 76.9 fps at HD resolution (1,280 × 720 pixels) and 34.4 fps at full-HD resolution (1,920 × 1,080 pixels) in the same GPU environment as the iterative method. The image quality of the AI-based method was also evaluated using the CAVE database 34, where the HS image sets were divided into training and test groups with a ratio of 80:20. Under this training condition, the AI-based method shows a PSNR of 29.21 dB on average, which is similar to previous attempts 30,31,37,38,40. Additional training naturally improves the image quality, reaching the same level as that of the iterative method.

Data availability
The main data supporting the findings of this study are available in Supplementary Information. Further data that support the findings of this study are available from the corresponding authors on reasonable request.