Spectral imaging with deep learning

The goal of spectral imaging is to capture the spectral signature of a target. Traditional scanning methods for spectral imaging suffer from large system volume and low image acquisition speed for large scenes. In contrast, computational spectral imaging methods rely on computation to reduce system volume, but still endure long computation times for iterative spectral reconstruction. Recently, deep-learning techniques have been introduced into computational spectral imaging, offering fast reconstruction, high reconstruction quality, and the potential to drastically reduce system volume. In this article, we review state-of-the-art deep-learning-empowered computational spectral imaging methods. They are divided into amplitude-coded, phase-coded, and wavelength-coded methods, according to the light property used for encoding. To support future research, we have also organized publicly available spectral datasets.


Introduction
With the ability to acquire distinctive information in both the spatial and spectral domains, spectral imaging technology has vast applications in remote sensing 1 , medical diagnosis 2 , biomedical engineering 3 , archeology and art conservation 4 , and food inspection 5 . Traditional methods of spectral imaging include whiskbroom scanning, pushbroom scanning, and wavelength scanning. Whiskbroom spectroscopy scans pixel by pixel. A widely acknowledged example is the Airborne Visible/Infrared Imaging Spectrometer 6,7 , which implemented the whiskbroom approach on aircraft for Earth remote sensing. A pushbroom scanning system uses an entrance slit and builds the image line by line. The Hyperspectral Digital Imagery Collection Experiment instrument 8,9 implemented pushbroom imaging optics with a prism spectrometer, offering a good capability for remote sensing. Wavelength scanning methods capture spectral image cubes by swapping narrow bandpass filters in front of the camera lens or by using electronically tunable filters 10,11 . These typical scanning spectral imaging approaches are illustrated in Fig. 1.
However, traditional scanning methods suffer from slow spectral image acquisition because of the time-consuming scanning mechanism. As a consequence, they are not applicable to large scenes or dynamic recording. To solve this problem, researchers started to explore snapshot spectral imaging methods 12 . Early endeavors include integral field spectrometry, multispectral beam splitting, and the image-replicating imaging spectrometer, as mentioned in ref. 12 . Although these methods achieve multispectral imaging by splitting light, they cannot obtain a large number of spectral channels and require bulky optical systems.
With the development of compressed-sensing (CS) theory 13,14 , compressive spectral imaging has received growing attention from researchers because of its elegant combination of optics, mathematics, and optimization theory. It can perform spectral imaging from fewer measurements, which is essential in resource-constrained environments. Compressive spectral imaging techniques often use a coded aperture to block or filter the input light field, which is the encoding process in the compressive sensing pipeline. As the name indicates, this process compresses information; it is flexible in design and provides prior knowledge for later reconstruction. Unlike the hardware-based encoding, the decoding process requires computation via designed algorithms. The traditional reconstruction approach is iterative, using the designed measurement model of the encoding process and other prior knowledge. As a consequence, the decoding procedure is computationally expensive and can take minutes or even hours for spectral reconstruction. Furthermore, quality degradation when using fewer measurements also limits its application in resource-constrained environments.
While using a coded aperture for amplitude encoding has shown the capability of spectral imaging from fewer measurements, the reduced light throughput and large system volume make it unsuitable for practical applications. To overcome this drawback, phase-coded spectral imaging 15,16 was developed to improve light throughput and reduce system volume. Its main idea is to use a carefully designed thin diffractive optical element to manipulate the phase of the input light, which affects the spectra in the diffraction process. Then, to recover spectra modeled by the complex diffraction process, powerful deep-learning techniques are required.
Researchers in computer graphics are also seeking to optimize spectral reconstruction, because spectra are better than RGB triplets for rendering scene illumination or displaying a virtual object on a monitor. Early works [17][18][19] obtain a spectrum from an RGB triplet, but this is an ill-posed problem that can have non-unique solutions and negative spectrum values. Later works involved more effective methods such as basis function fitting 20 and dictionary learning 21 . The latter is based on a hyperspectral dataset, yet still suffers from a lengthy weight-fitting procedure. As demonstrated in a statistical study of hyperspectral images 22 , spectra within an image patch are correlated. Nevertheless, these pixel-wise methods fail to exploit the correlation information in a spectral data cube, hence effective patch feature extraction algorithms are needed. The pursuit of accurate and fast RGB-to-spectra approaches has pushed the development of wavelength-coded methods. Researchers extended the RGB filters to multiple self-designed broadband filters for delicate wavelength encoding, which calls for a reliable decoding algorithm. Completing such complex computing tasks is the mission of deep learning.
To alleviate the high computation costs of the aforementioned methods, deep-learning algorithms have been proposed as an alternative for learning spatial-spectral priors and for spectral reconstruction. Deep-learning techniques can perform faster and more accurate reconstruction than iterative approaches, and are thus well suited to spectral recovery tasks. In recent years, many works have employed deep-learning models (such as convolutional neural networks, CNNs) in their spectral imaging frameworks and showed improved reconstruction speed and quality 15,16,[23][24][25] .
In this review, we look back at the development of spectral imaging with deep-learning tools and look forward to future directions for computational spectral imaging systems with deep-learning technology. In the following sections, we first discuss the deep-learning-empowered compressive spectral imaging methods that perform amplitude encoding using coded apertures in "Amplitude-coded Spectral Imaging". We then introduce phase-coded methods that use a diffractive optical element (DOE, or diffuser) in "Phase-coded Spectral Imaging". In "Wavelength-coded Spectral Imaging", we introduce wavelength-coded methods that use RGB or broadband optical filters for wavelength encoding and adopt deep neural networks for spectral reconstruction. To support future research on learned spectral imaging, we have organized existing spectral datasets and the evaluation metrics (in "Spectral Imaging Datasets"). Finally, we summarize the deep-learning-empowered spectral imaging methods in "Conclusions and Future Directions" and share our thoughts on the future.

Amplitude-coded spectral imaging
Amplitude-coded methods use a coded aperture and dispersive elements for compressive spectral imaging. The classical system is the coded aperture snapshot spectral imager (CASSI). To date, there are four CASSI architectures based on different spatial-spectral modulation styles, as shown in Fig. 2. The first proposed architecture is dual-disperser CASSI (DD-CASSI) 26 , which consists of two dispersive elements for spectral shearing with a coded aperture in between. Single-dispersive CASSI (SD-CASSI) 27 is a later work that uses one dispersive element placed behind the coded aperture. The snapshot colored compressive spectral imager (SCCSI) 28 also uses a coded aperture and a dispersive element, but places the coded aperture behind the dispersive element. In comparison to SCCSI, which attaches a colored coded aperture (or color filter array) to the camera sensor, the spatial-spectral CASSI (SS-CASSI) architecture 29 allows the coded aperture to be positioned anywhere between the spectral plane and the sensor plane. This increases the complexity of the coded-aperture model, which may play a role in improving the system performance. Some deep-learning-based compressive spectral imaging methods have found better results with SS-CASSI 30,31 .

Coded-aperture model
Since most works were based on SD-CASSI system, we will give a detailed derivation of the image construction process of SD-CASSI. The image formation procedure is different for other CASSI architectures in Fig. 2, but the key processes (vectorization, discretization, etc.) are the same. We refer readers to refs. 26,28,29 for a detailed modeling of the DD-CASSI, SCCSI, and SS-CASSI, respectively.
At the time when SD-CASSI was proposed, the coded aperture had a block-unblock pattern, which was extended to a colored pattern in ref. 32 . We use a colored coded aperture in the derivation for generality. Consider a target scene with spectral density f(x, y, λ) and track its route through an SD-CASSI system: it first encounters a coded aperture with transmittance T(x, y, λ), is then sheared by a dispersive element (assumed to act along the x-axis), and finally impinges on the detector array. Figure 3 illustrates the whole process.
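The spectral density before the detector is formulated as
$$f'(x, y, \lambda) = \iint T(x', y', \lambda)\, f(x', y', \lambda)\, \delta\big(x' - [x + \alpha(\lambda - \lambda_c)]\big)\, \delta(y' - y)\, \mathrm{d}x'\, \mathrm{d}y' = T\big(x + \alpha(\lambda - \lambda_c), y, \lambda\big)\, f\big(x + \alpha(\lambda - \lambda_c), y, \lambda\big) \tag{1}$$
where the delta function represents the spectral dispersion introduced by the dispersive element, such as a prism or grating, α is a calibration factor, and $\lambda_c$ is the center wavelength of dispersion. Since we can only measure the intensity on the detector, the measurement is the integral along the wavelength:
$$g(x, y) = \int_{\Lambda} T\big(x + \alpha(\lambda - \lambda_c), y, \lambda\big)\, f\big(x + \alpha(\lambda - \lambda_c), y, \lambda\big)\, \mathrm{d}\lambda \tag{2}$$
where Λ is the spectrum range.
Next, we discretize Eq. (2). Denote Δ as the pixel size (in the x and y dimensions) of the detector, and assume the coded aperture has square pixels of size $\Delta_{code} = q\Delta$, q ≥ 1. The code pattern is then represented as a spatial array of its pixels:
$$T(x, y, \lambda) = \sum_{m, n} T(m, n, \lambda)\, \mathrm{rect}\!\left(\frac{x}{\Delta_{code}} - m,\ \frac{y}{\Delta_{code}} - n\right) \tag{3}$$
Finally, signals within the region of a pixel are accumulated in the sampling process:
$$g(m, n) = \iint g(x, y)\, \mathrm{rect}\!\left(\frac{x}{\Delta} - m,\ \frac{y}{\Delta} - n\right) \mathrm{d}x\, \mathrm{d}y \tag{4}$$
To further simplify Eq. (4), we discretize f and T using their central pixel intensity. Take the spectral resolution $\Delta_\lambda$ as the spectral interval. We use the intensity f(m, n, l) (m, n, l ∈ ℕ) to represent a pixel of the spectral density f, where l indexes the spectral bands. Adjust the calibration factor α so that the dispersion distance satisfies $\alpha\Delta_\lambda = k\Delta$, k ∈ ℕ, so that band l is displaced by kl detector pixels. Then, up to a constant factor, Eq. (4) becomes
$$g(m, n) = \sum_{l} T\!\left(\left\lfloor \frac{m + kl}{q} \right\rfloor, \left\lfloor \frac{n}{q} \right\rfloor, l\right) f(m + kl, n, l) \tag{5}$$
To adopt reconstruction algorithms, we need to rewrite Eq. (5) in a matrix form. This procedure is illustrated in Fig. 4.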
First, we vectorize the measurement and spectral cube as in Fig. 4a:
$$y = \mathrm{vec}(g), \qquad x = \mathrm{vec}(f) \tag{6}$$
where the measurement term $g \in \mathbb{R}^{M \times N}$ and the spectral cube $f \in \mathbb{R}^{M \times N \times L}$, with spatial dimension M × N and spectral dimension L. After vectorization, we have the vectorized terms $y \in \mathbb{R}^{MN}$ and $x \in \mathbb{R}^{MNL}$.
Fig. 2 In the DD-CASSI architecture, the spectral scene experiences a shearing-coding-unshearing procedure. In SD-CASSI, the diffraction grating before the coded aperture is removed, so it becomes a coding-shearing process. In SCCSI, the coded aperture is placed behind the dispersive element and the spectral data experience a shearing-coding procedure. In SS-CASSI, the coded aperture position becomes flexible between the spectral plane and the camera sensor, where the ratio (d 1 + d 2 )/d 2 determines the extent of spectral encoding.
Fig. 3 Spectral imaging process within the SD-CASSI architecture (panel labels: spatially coded data, spectrally sheared data). The spectral data cube first passes a coded aperture for spatial encoding, then its spectral arrangement is shifted by a dispersive element. Finally, a detector captures the spatially and spectrally encoded data image.
Next, the coded aperture and dispersion shift are modeled as a sensing matrix $\Phi \in \mathbb{R}^{MV \times MNL}$, where V = N + k(L − 1) accounts for the dispersion shift (the shift distance is $\alpha\Delta_\lambda = k\Delta$, k ∈ ℕ). A sensing matrix (for k = 1) produced from a colored coded aperture is shown in Fig. 4b.
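As a rough illustration of the discrete SD-CASSI model and its matrix form, the following NumPy sketch codes a small random cube with a colored aperture, displaces band l by k·l pixels, and sums over bands. It then builds the sensing matrix explicitly and checks that both routes give the same measurement. The function names, the band-major vectorization order, the choice of the length-N array axis as the dispersion axis (so that the measurement width is V = N + k(L − 1)), the shift direction, and the toy sizes are all assumptions made for this example rather than details taken from the cited works.

```python
import numpy as np


def sd_cassi_forward(cube, mask, k=1):
    """Code, shear, and integrate a spectral cube (toy SD-CASSI model).

    cube, mask: arrays of shape (M, N, L); returns an (M, V) measurement
    with V = N + k * (L - 1).
    """
    M, N, L = cube.shape
    coded = cube * mask                      # amplitude coding, band by band
    g = np.zeros((M, N + k * (L - 1)))
    for l in range(L):
        # band l lands k*l detector pixels further along the dispersion axis
        g[:, k * l:k * l + N] += coded[:, :, l]
    return g


def sd_cassi_matrix(mask, k=1):
    """Explicit sensing matrix Phi of shape (M*V, M*N*L), band-major columns."""
    M, N, L = mask.shape
    V = N + k * (L - 1)
    Phi = np.zeros((M * V, M * N * L))
    for l in range(L):
        for m in range(M):
            for n in range(N):
                row = m * V + (n + k * l)    # detector pixel (m, n + k*l)
                col = l * M * N + m * N + n  # cube voxel (m, n, l)
                Phi[row, col] = mask[m, n, l]
    return Phi


rng = np.random.default_rng(0)
M, N, L, k = 4, 5, 3, 1
cube = rng.random((M, N, L))
mask = rng.integers(0, 2, size=(M, N, L)).astype(float)  # random binary code per band

g = sd_cassi_forward(cube, mask, k)
Phi = sd_cassi_matrix(mask, k)
x = cube.transpose(2, 0, 1).ravel()          # band-major vectorization of the cube
y = Phi @ x                                  # matrix-form measurement
```

With band-major ordering, each band contributes one shifted diagonal block to Phi, so the matrix is very sparse; practical solvers usually apply the forward operator directly instead of storing Phi.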
Finally, the reconstruction problem is formulated as
$$\hat{x} = \arg\min_{x}\ \| y - \Phi x \|_2^2 + \eta R(x) \tag{7}$$
where Φ is the sensing matrix and y is the measurement. The term R stands for the prior: it is a regularizer determined by the prior knowledge of the input scene x (e.g., sparsity), and η is the weight of the prior term.
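To make the role of the regularizer concrete, here is a minimal sketch of one classical iterative solver for a problem of this form: ISTA with a sparsity prior, that is, R(x) = ||x||_1. This is a generic textbook scheme and not GAP-TV or ADMM. The factor 1/2 on the data term, the random sensing matrix, and all sizes and parameter values are assumptions chosen for the example.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def objective(Phi, y, x, eta):
    """0.5 * ||y - Phi x||^2 + eta * ||x||_1."""
    return 0.5 * np.sum((y - Phi @ x) ** 2) + eta * np.sum(np.abs(x))


def ista(Phi, y, eta=0.01, n_iter=300):
    """Minimize the objective above by proximal gradient steps."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ x - y)           # gradient of the data-fidelity term
        x = soft_threshold(x - step * grad, step * eta)
    return x


rng = np.random.default_rng(0)
Phi = rng.standard_normal((40, 100)) / np.sqrt(40)   # toy sensing matrix
x_true = np.zeros(100)
x_true[rng.choice(100, size=5, replace=False)] = 1.0  # sparse toy signal
y = Phi @ x_true
x_hat = ista(Phi, y)
```

Every iteration needs one application of Phi and one of its transpose, which is why iterative solvers of this kind become slow for full-size spectral cubes.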

Deep compressive reconstruction
Traditional methods for spectral image reconstruction usually utilize iterative optimization algorithms, such as GAP-TV 33 , ADMM 34 , etc. These methods suffer from long reconstruction times because of the iterations. Besides, the spatial and spectral reconstruction accuracy obtained with hand-crafted priors is limited. For example, the total variation (TV) prior is commonly used in reconstruction algorithms, but it sometimes over-smooths the result. Deep-learning techniques can be applied to each step in amplitude-coded spectral imaging methods, from the design of the amplitude encoding strategy (coded aperture optimization) to finding a representative regularizer (term R in Eq. (7)), and the whole reconstruction process can be substituted with a neural network. Adopting deep-learning methods can improve the reconstruction speed by hundreds of times. Moreover, learning priors from large amounts of spectral data with neural networks can improve the reconstruction accuracy in both the spatial and spectral domains. We have summarized recent works on deep-learning-based coded aperture spectral imaging in Table 1 for comparison.
Fig. 4 Each diagonal pattern is generated from the vectorized coded aperture pattern of a band. The block-unblock coded aperture is similar, with the color bands replaced by black and white.
Based on different places deep learning is used, we divide the deep-learning-based compressive reconstruction methods into four categories: (i) end-to-end reconstruction that uses deep neural networks for direct reconstruction; (ii) joint mask learning that simultaneously learns the coded aperture pattern and the subsequent reconstruction network; (iii) unrolled network that unfolds the iterative optimization procedure into a deep network with many stage blocks; (iv) untrained network that uses the broad range of the neural network as a prior and performs iterative reconstruction. The main ideas of these four categories are illustrated in Fig. 5.

E2E reconstruction
End-to-end (E2E) reconstruction sends the measurement into a deep neural network that directly outputs the reconstruction result. Among E2E methods, deep external-internal learning 35 proposed a novel learning strategy. First, external learning from a large dataset was performed to improve the general capability of the network. Then, for a specific application, internal learning from a single spectral image was used for further improvement. In addition, fusion with a panchromatic image was shown to improve spatial resolution. λ-Net 36 is an alternative architecture based on a conditional generative adversarial network (cGAN). It also adopted a self-attention technique and a hierarchical reconstruction strategy to improve performance.
Dataset, network design, and loss function are three key factors of the E2E methods. For future improvement, various techniques from RGB patch-wise spectral reconstruction can be employed (see section "RGB Patch-wise Spectral Reconstruction"). For example, residual blocks, dense structures, and attention modules are expected to be adopted. As for the choice of loss function, a back-projection pixel loss is recommended, since it benefits data fidelity. It simulates the measurement using the known coded aperture pattern and the reconstructed spectral image, and compares the simulated back-projected measurement with the ground truth. Novel losses such as feature and style losses can also be tried.
Table 1 note: Evaluation results are collected from the original works.
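The back-projection loss described above can be sketched in a few lines. The example below re-simulates the snapshot from a reconstructed cube with a known block-unblock aperture and compares it with the captured measurement. The forward model (one-pixel shear per band), the mean-squared-error form, the weighting factor, and all names are assumptions for illustration only; in practice the loss would be written in an automatic-differentiation framework so that it can be used for training.

```python
import numpy as np


def forward(cube, mask, k=1):
    """Toy coded-aperture forward model: code with a 2D mask, shear, sum over bands."""
    M, N, L = cube.shape
    g = np.zeros((M, N + k * (L - 1)))
    for l in range(L):
        g[:, k * l:k * l + N] += cube[:, :, l] * mask
    return g


def back_projection_loss(recon, measurement, mask, k=1):
    """Mean squared error between the re-simulated and the captured measurement."""
    return np.mean((forward(recon, mask, k) - measurement) ** 2)


def total_loss(recon, target, measurement, mask, weight=0.1):
    """Supervised pixel loss plus a weighted back-projection term."""
    pixel = np.mean((recon - target) ** 2)
    return pixel + weight * back_projection_loss(recon, measurement, mask)


rng = np.random.default_rng(0)
target = rng.random((6, 6, 4))                         # ground-truth cube
mask = rng.integers(0, 2, size=(6, 6)).astype(float)   # known block-unblock aperture
measurement = forward(target, mask)                    # simulated snapshot
recon = target + 0.05 * rng.standard_normal(target.shape)  # stand-in for a network output

loss_perfect = back_projection_loss(target, measurement, mask)
loss_noisy = total_loss(recon, target, measurement, mask)
```

The back-projection term needs no ground-truth cube, only the measurement and the aperture pattern, which is what ties the network output to the physical acquisition.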

Joint mask learning
The coded aperture determines the sensing matrix Φ involved in the spectral image acquisition process. Conventional methods based on CASSI often adopt random coded apertures, since a random code preserves the properties needed for reconstruction (e.g., the restricted isometry property, RIP 37 ) with high probability. As demonstrated in ref. 38 , there are approaches for optimizing coded apertures using the RIP as the criterion. However, such optimization does not yield a significant improvement over random coded masks.
In a deep compressive reconstruction architecture, the coded aperture is seen as an encoder that embeds the spectral signatures. Therefore, it should be optimized together with the decoder, i.e., the reconstruction network. HyperReconNet 39 jointly learns the coded aperture and the corresponding CNN for reconstruction. The coded aperture was appended to the network as a layer, and the BinaryConnect method 40 was adopted to map floating-point values to binary coded aperture entries. However, most works that used deep learning did not carefully optimize the coded aperture, so this direction deserves further research.
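The idea of keeping real-valued mask weights while using a binary aperture in the forward pass can be sketched as follows. This is only a BinaryConnect-style illustration, not the HyperReconNet implementation: the thresholding at zero, the {0, 1} transmittance levels, the learning rate, and the random placeholder gradient are assumptions made for the example.

```python
import numpy as np


def binarize(w_real):
    """Deterministic binarization: the aperture is open (1) where the weight is non-negative."""
    return (w_real >= 0).astype(float)


def update_mask_weights(w_real, grad_wrt_binary, lr=0.1):
    """One BinaryConnect-style step.

    The gradient computed with the binary mask is applied to the
    real-valued weights, which are then clipped to [-1, 1].
    """
    return np.clip(w_real - lr * grad_wrt_binary, -1.0, 1.0)


rng = np.random.default_rng(0)
w_real = rng.uniform(-1, 1, size=(8, 8))      # trainable real-valued weights
mask = binarize(w_real)                       # binary aperture used in the forward pass
grad = rng.standard_normal((8, 8))            # placeholder for dLoss/dmask from backpropagation
w_new = update_mask_weights(w_real, grad)
new_mask = binarize(w_new)                    # aperture after the update
```

Because small gradient steps accumulate in the real-valued weights, an aperture element flips only after the evidence for flipping has built up over several updates.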

Unrolled network
An unrolled network unfolds the iterative optimization-based reconstruction procedure into a neural network. In detail, a block of the unrolled network learns the solution of one iteration in the optimization algorithm.
Wang et al. 24 proposed a hyperspectral image prior network that is adapted from the iterative reconstruction problem. Based on half quadratic splitting (HQS) 41 , they obtained an iterative optimization formula. By using network layers to learn the solution, they unfolded the K-iteration reconstruction procedure into a K-stage neural network. As a later work, Deep Non-local Unrolling  (DNU) 42 further simplified the formula derived in ref. 24 and rearranged the sequential structure in ref. 24 into a parallel one. Sogabe et al. proposed an ADMM-inspired network for compressive spectral imaging 43 . They unrolled the adaptive ADMM process into a multi-staged neural network and showed a performance improvement compared to HQS-inspired method 24 . Unrolled network can boost the reconstruction speed by freezing the parameters of iteration into neural network layers. Each stage has the mission to solve an iteration equation, which makes the neural network explainable.
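The stage structure of an unrolled network can be illustrated with half quadratic splitting on a toy problem. In the sketch below each stage performs a closed-form data step followed by a prior step; soft-thresholding stands in for the learned network block, and the per-stage parameters, which would be learned in an actual unrolled network, are fixed by hand. The names, sizes, and parameter values are assumptions for this example and do not reproduce the networks of refs. 24, 42, or 43.

```python
import numpy as np


def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def unrolled_hqs(Phi, y, mus, thresholds):
    """K-stage unrolled half quadratic splitting (K = len(mus) >= 1).

    Stage i:
      x = argmin_x ||y - Phi x||^2 + mu_i * ||x - z||^2   (closed-form data step)
      z = prior step on x (soft-thresholding stands in for a learned block)
    """
    n = Phi.shape[1]
    z = np.zeros(n)
    gram, back = Phi.T @ Phi, Phi.T @ y
    for mu, t in zip(mus, thresholds):
        x = np.linalg.solve(gram + mu * np.eye(n), back + mu * z)  # data step
        z = soft_threshold(x, t)                                    # prior step
    return x, z


rng = np.random.default_rng(0)
Phi = rng.standard_normal((30, 60)) / np.sqrt(30)
x_true = np.zeros(60)
x_true[rng.choice(60, size=5, replace=False)] = 1.0
y = Phi @ x_true

K = 5
x_hat, z_hat = unrolled_hqs(Phi, y, mus=[0.05] * K, thresholds=[0.01] * K)
residual = np.linalg.norm(Phi @ x_hat - y)
```

Since the number of stages K is fixed in advance, the inference cost is known beforehand, in contrast to an iterative solver that runs until convergence.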

Untrained network
Deep image prior, as proposed in ref. 44 , states that the structure of a generative network is sufficient to capture image priors for reconstruction. To be more specific, the range of a deep neural network can be large enough to include all common spectral images that we are going to recover. Therefore, a carefully designed untrained network is capable of performing spectral image reconstruction. Though the iterative gradient descent procedure takes time, such an approach requires no pre-training and generalizes well.
The methods labeled untrained in Table 1 adopted untrained networks for compressive spectral reconstruction. HCS²-Net 31 took the random code of the coded aperture and the snapshot measurement as the network input, and used unsupervised network learning for spectral reconstruction. It adopted many deep-learning techniques, such as residual blocks and attention modules, to enhance the network capability. In ref. 45 , the spectral data cube was considered as a 3D tensor and tensor Tucker decomposition 46 was performed in a learned way. The authors designed network layers based on Tucker decomposition and used a low-rank prior on the core tensor, which may help capture the spectral data structure.
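For readers unfamiliar with Tucker decomposition, the sketch below computes a classical truncated higher-order SVD of a small cube: a core tensor plus one factor matrix per mode. This is the standard non-learned decomposition, shown only to make the low-rank structure concrete; it is not the learned variant of ref. 45, and the function names and toy sizes are assumptions.

```python
import numpy as np


def unfold(t, mode):
    """Mode-n unfolding: the chosen axis becomes the rows."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)


def hosvd(t, ranks):
    """Truncated higher-order SVD of a 3D tensor: returns (core, factors)."""
    factors = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(t, mode), full_matrices=False)
        factors.append(u[:, :r])             # leading left singular vectors of this mode
    core = np.einsum('mnl,ma,nb,lc->abc', t, *factors)
    return core, factors


def tucker_reconstruct(core, factors):
    """Multiply the core by each factor matrix along its mode."""
    return np.einsum('abc,ma,nb,lc->mnl', core, *factors)


rng = np.random.default_rng(0)
# Build a cube with multilinear rank (2, 2, 2): two spatial modes and one spectral mode.
true_core = rng.standard_normal((2, 2, 2))
true_factors = [rng.standard_normal((6, 2)), rng.standard_normal((5, 2)), rng.standard_normal((4, 2))]
cube = tucker_reconstruct(true_core, true_factors)

core, factors = hosvd(cube, ranks=(2, 2, 2))
approx = tucker_reconstruct(core, factors)
```

Here a 6 × 5 × 4 cube (120 values) is represented by a 2 × 2 × 2 core and three thin factor matrices (38 values), which is the kind of compact structure a low-rank prior exploits.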

Phase-coded spectral imaging
Phase-coded spectral imaging formulates image generation as a convolution between a wavelength-specific point spread function (PSF) and the monochrome object image at each wavelength. The phase encoding manipulates the phase term of the PSF, which distinguishes the spectral signature as light propagates. Compared with amplitude-coded spectral imaging, the phase-coded approach can greatly increase the light throughput (hence the signal-to-noise ratio). Since the phase encoding is mainly performed by a thin DOE, which is easy to attach to a camera, the phase-coded spectral imaging system can be very compact.
One can recover the spectral signature by designing algorithms for the corresponding DOE (also called a diffuser in some works 16,[47][48][49]). With the aid of deep learning, these methods displayed comparable performance. Furthermore, benefitting from the depth dependence of the diffraction model, they can also obtain depth information in addition to the spectral signature of a scene 50 . The phase-coded approach to spectral imaging consists of two parts: (i) the phase encoding strategy, often related to the design of the DOE; (ii) the reconstruction algorithm. In this section, we first describe the phase encoding diffraction model, then introduce deep-learning-empowered works using different phase encoding strategies and systems.

Diffraction model
The phase-coded spectral imaging system is based on previous works on diffractive imaging 51,52 . The system often consists of a DOE (transmissive or reflective) and a bare camera sensor, separated by a distance z. As illustrated in Fig. 6, there are two kinds of phase-coded spectral imaging systems, namely the DOE-Fresnel diffraction system (left) and the DOE-lens system (right), which differ in whether a lens is present.

PSF construction
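We use the transmissive DOE for model derivation. The PSF $p_\lambda(x, y)$ is the system response to a point source at the image plane. Suppose the incident wave field at position (x′, y′) of the DOE coordinate at wavelength λ is
$$u_0(x', y') = A(x', y')\, e^{i\phi_0(x', y')} \tag{8}$$
The wave field first experiences a phase shift $\phi_h$ determined by the height profile h(x′, y′) of the DOE:
$$u_1(x', y') = A(x', y')\, e^{i[\phi_0(x', y') + \phi_h(x', y')]}, \qquad \phi_h(x', y') = k\, \Delta n\, h(x', y') \tag{9}$$
where Δn is the refractive index difference between the DOE (n(λ)) and air, and k = 2π/λ is the wave number. For the DOE-lens system, the PSF is 16 :
$$p_\lambda(x, y) \propto \left| \mathcal{F}^{-1}\{ u_1(x', y') \} \right|^2 \tag{10}$$
where $\mathcal{F}^{-1}$ is the inverse 2D Fourier transform due to the Fourier characteristics of the lens. For the DOE-Fresnel diffraction system, the wave field propagates a distance z, which can be modeled by the Fresnel diffraction law provided that λ ≪ z:
$$u_2(x, y) = \frac{e^{ikz}}{i\lambda z} \iint u_1(x', y')\, \exp\!\left\{ \frac{ik}{2z}\left[ (x - x')^2 + (y - y')^2 \right] \right\} \mathrm{d}x'\, \mathrm{d}y' \tag{11}$$
Finally, for computational convenience, we expand Eq. (11) and represent it with a Fourier transform $\mathcal{F}$. The final PSF is formulated as
$$p_\lambda(x, y) \propto \left| \mathcal{F}\left\{ A(x', y')\, e^{i\left[ \phi_0(x', y') + \phi_h(x', y') + \frac{\pi}{\lambda z}(x'^2 + y'^2) \right]} \right\} \right|^2 \tag{12}$$

Image formation
Considering an incident object distribution $o_\lambda(x', y')$ at the DOE, we can decompose it into an integral of object points:
$$o_\lambda(x', y') = \iint o_\lambda(\xi, \eta)\, \delta(x' - \xi, y' - \eta)\, \mathrm{d}\xi\, \mathrm{d}\eta \tag{13}$$
Before hitting the sensor, the spectral distribution is
$$I_\lambda(x, y) = \iint o_\lambda(\xi, \eta)\, p_\lambda(x - \xi, y - \eta)\, \mathrm{d}\xi\, \mathrm{d}\eta \tag{14}$$
where the PSF $p_\lambda$ denotes the system response to a point source and is shifted by ξ and η along the x and y axes because of the same shift of the point source.
Finally, on the sensor plane (with sensor spectral response D), the intensity is
$$I(x, y) = \int_{\Lambda} D(\lambda)\, I_\lambda(x, y)\, \mathrm{d}\lambda \tag{15}$$
Similar to Fig. 4, by vectorizing $o_\lambda$ to x and writing the convolution with the PSF as a matrix Φ, we can discretize Eq. (15) and form the reconstruction problem as in Eq. (7). Researchers can use similar optimization algorithms or deep-learning tools for DOE design and spectral image recovery.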
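The diffraction model of the DOE-Fresnel system can be simulated with a pair of FFTs. The sketch below computes a wavelength-dependent PSF from a DOE height map (phase delay plus the quadratic Fresnel phase, followed by a Fourier transform) and then forms the sensor image as a sum over bands of object-PSF convolutions weighted by the sensor response. It assumes a unit-amplitude plane wave at the DOE, a wavelength-independent Δn, circular (FFT-based) convolution, and toy sampling values; the function names and all numbers are hypothetical and no attention is paid to sampling conditions.

```python
import numpy as np


def fresnel_psf(height, wavelength, z, pitch, delta_n=0.5):
    """PSF of a DOE followed by free-space propagation over distance z (toy model)."""
    n = height.shape[0]
    coords = (np.arange(n) - n // 2) * pitch
    xx, yy = np.meshgrid(coords, coords, indexing="ij")
    k = 2 * np.pi / wavelength
    # DOE phase delay plus the quadratic Fresnel phase term
    phase = k * delta_n * height + np.pi / (wavelength * z) * (xx ** 2 + yy ** 2)
    field = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(np.exp(1j * phase))))
    psf = np.abs(field) ** 2
    return psf / psf.sum()                    # normalize to unit energy


def render(obj_cube, psfs, response):
    """Sensor image: sum over bands of response * (object convolved with PSF)."""
    img = np.zeros(obj_cube.shape[1:])
    for o, p, d in zip(obj_cube, psfs, response):
        kernel = np.fft.ifftshift(p)          # move the PSF center to index (0, 0)
        img += d * np.real(np.fft.ifft2(np.fft.fft2(o) * np.fft.fft2(kernel)))
    return img


rng = np.random.default_rng(0)
n = 32
height = rng.uniform(0, 1e-6, size=(n, n))    # hypothetical DOE height map in meters
wavelengths = [450e-9, 550e-9, 650e-9]
psfs = np.stack([fresnel_psf(height, w, z=5e-3, pitch=4e-6) for w in wavelengths])
response = np.array([0.8, 1.0, 0.9])          # hypothetical sensor response D per band

obj = np.zeros((3, n, n))
obj[:, 0, 0] = 1.0                            # a point source present in every band
image = render(obj, psfs, response)
```

For a point source the rendered image is simply the response-weighted sum of the per-band PSFs, which is why a PSF that changes strongly with wavelength makes the bands separable in the measurement.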

Phase encoding strategies
A good PSF design contributes to the effective phase encoding, which can bring more precise spectral reconstruction results. Based on the slight difference of the imaging system, we categorize the phase encoding strategies below.

DOE with Fresnel diffraction
Many phase-coded spectral imaging methods are developed from diffractive computational color imaging. Peng et al. 53 proposed an optimization-based DOE design approach to obtain a PSF whose shape is invariant to wavelength. Together with a deconvolution method, they reconstructed high-fidelity color images.
Although the shape-invariant PSF 53 is beneficial for high-quality achromatic imaging, the overlap of the PSFs at different wavelengths makes spectral reconstruction difficult, which hinders its application to spectral imaging. Jeon et al. 15 designed a spectrally varying PSF that rotates regularly with wavelength, thereby encoding the spectral information. The rotational design makes the PSF distinct at different wavelengths, which is well suited to spectral imaging. By feeding the resultant intensity image into an optimization-based unrolled network, they achieved a high peak signal-to-noise ratio (PSNR) and spectral accuracy in the visible wavelength range, within a very compact system.

DOE/diffuser with lens
A similar architecture uses a DOE (or diffuser) with an imaging lens closely behind, as shown in Fig. 6 (right). In 2016, Golub et al. 49 proposed a simple diffuser-lens optical system and used a compressed-sensing-based algorithm for spectral reconstruction. Hauser et al. 16 extended the work to a 2D binary diffuser (for binary phase encoding) and employed a deep neural network (named DD-Net) for spectral reconstruction. They reported high-quality reconstruction in both simulation and lab experiments.
Fig. 6 Schematic diagram of diffractive spectral imaging via a diffractive optical element (DOE). The left is the system using a transmissive DOE and a sensor, where the incident wave passes a DOE and then propagates a distance z before hitting the sensor. The propagation can be modeled by Fresnel diffraction. The right system uses an imaging lens just behind the DOE. After passing the DOE, the incident wave is converged onto the sensor by the lens. The DOE has a height profile that introduces the phase shift.

Combination with other encoding approach
Combining phase encoding with other encoding architectures is also a feasible approach, and deep learning can handle such complicated combined-architecture models. For example, the compressive diffraction spectral imaging method combined a DOE for phase encoding with coded apertures for further amplitude encoding 54 . However, the reconstruction process is difficult, and the light efficiency is not high. Another example is the combination with an optical filter array. Based on previous works on lensless imaging 47,55 , Monakhova et al. proposed a spectral DiffuserCam 48 , using a diffuser to spread the point source and a tiled filter array for further wavelength encoding. As the method has a similar mathematical spectral formation model, it is promising to apply deep learning to the spectral DiffuserCam's complex reconstruction task.

Wavelength-coded spectral imaging
Wavelength-coded spectral imaging uses optical filters to encode the spectral signature along the wavelength axis. Among wavelength-coded methods, the RGB image, which is encoded by RGB narrowband filters, is the most widely used. Reconstructing the spectral image from the RGB one is valuable because RGB images are in common use and the corresponding spectral image is fundamental to rendering scenes on monitors. Over the years, researchers have been pursuing fast and accurate approaches to wavelength-coded spectral imaging. They found that RGB filters may be suboptimal, so different narrowband filters as well as self-designed broadband filters have been explored.

Image formation model
We first introduce the image formation model in the wavelength encoding context. Consider an intensity $I_k(x, y)$ from a pixel at (x, y), where k is the channel index indicating different wavelength modulation. For an RGB image, k ∈ {1, 2, 3}, representing red, green, and blue. The encoded intensity is generated by the scene reflectance spectra S under illumination E:
$$I_k(x, y) = \int_{\Lambda} E(x, y, \lambda)\, S(x, y, \lambda)\, Q_k(\lambda)\, D(\lambda)\, \mathrm{d}\lambda \tag{16}$$
where $Q_k$ is the kth filter transmittance curve, D is the camera sensitivity, and Λ is the wavelength range.
The illumination distribution E and scene spectral reflectance S can be combined as the scene spectral radiance R = E S:
$$I_k(x, y) = \int_{\Lambda} R(x, y, \lambda)\, Q_k(\lambda)\, D(\lambda)\, \mathrm{d}\lambda \tag{17}$$
The imaging process is illustrated in Fig. 7. In practice, we have the encoded object intensities I and filter curves Q, but the camera sensitivity is sometimes inconvenient to measure, so many methods assume it to be ideally flat. Under experimental conditions, we also know the illumination E. Then Eq. (17) (or Eq. (16)) becomes an (underdetermined) matrix inversion problem after discretization.
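The discretized encoding can be written as a small matrix product, which also shows why the inversion is underdetermined. In the sketch below three filter curves sample a 31-band radiance, and the minimum-norm solution reproduces the three measured intensities without recovering the true spectrum. The Gaussian filter shapes, the flat camera sensitivity, the toy radiance, and all variable names are assumptions made for illustration.

```python
import numpy as np

wavelengths = np.linspace(400, 700, 31)            # nm, 10 nm spacing
d_lambda = wavelengths[1] - wavelengths[0]


def gaussian(center, width):
    return np.exp(-0.5 * ((wavelengths - center) / width) ** 2)


# Hypothetical R, G, B filter curves Q_k and a flat camera sensitivity D
Q = np.stack([gaussian(600, 40), gaussian(540, 40), gaussian(460, 40)])
D = np.ones_like(wavelengths)
A = Q * D * d_lambda                               # (3, 31) discretized encoding matrix

radiance = 0.5 + 0.4 * np.sin(wavelengths / 30.0)  # toy scene radiance at one pixel
intensity = A @ radiance                           # the three encoded channel values

# Three equations, 31 unknowns: the minimum-norm solution is only one of
# infinitely many spectra that reproduce the same channel values.
radiance_min_norm = np.linalg.pinv(A) @ intensity
```

The gap between the minimum-norm spectrum and the true one is exactly the part that a learned prior has to fill in.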

RGB pixel-wise spectral reconstruction
Early works on wavelength-coded spectral reconstruction are pixel-wise methods on RGB images. They consider the reduced problem of how to reconstruct a spectrum vector that has more channels from a 3-channel RGB vector, without knowing the camera's RGB filter response. In general, these pixel-wise approaches seek a representation of the single spectrum (either a manifold embedding or basis functions) and develop methods to reconstruct the spectrum from that representation.
There are two families of methods for spectrum representation: (i) spectrum manifold learning, which seeks a hidden manifold embedding space to express the spectrum effectively; (ii) basis function fitting, which expands the spectrum over a set of basis functions and fits a small number of coefficients.

Spectrum manifold learning
This approach assumes that a spectrum y is controlled by a vector x in the low-dimensional manifold $\mathcal{M}$ and tries to find the mapping f that relates y to x:
$$y = f(x), \qquad x \in \mathcal{M},\ y \in \mathcal{D},\ f \in \mathcal{F} \tag{18}$$
where $\mathcal{D}$ is the high-dimensional data space (commonly, $\mathcal{M} = \mathbb{R}^m$ and $\mathcal{D} = \mathbb{R}^n$, where m, n ∈ ℕ are the space dimensions). $\mathcal{F}$ is a functional space that contains functions mapping data from $\mathcal{M}$ to $\mathcal{D}$. Manifold learning assumes a low-dimensional manifold $\mathcal{M}$ embedded in the high-dimensional data space $\mathcal{D}$, and attempts to recover $\mathcal{M}$ from the data drawn in $\mathcal{D}$. Reference 56 proposed a three-step method: (i) find an appropriate dimension of the manifold space through Isometric Feature Mapping (Isomap 57 ); (ii) train a radial basis function (RBF) network to embed the RGB vector in $\mathcal{M}$, which determines the inverse of f in Eq. (18); (iii) use dictionary learning to map the manifold representation in $\mathcal{M}$ back to the spectra space, which determines the function f in Eq. (18). The RBF network and dictionary learning method can be substituted by deep neural networks (such as an AutoEncoder), which may further improve the manifold-based reconstruction.

Basis function fitting
This approach assumes that a spectrum y = y(λ) is expanded over a set of basis functions {ϕ_1(λ), …, ϕ_N(λ)}:
$$y(\lambda) = \sum_{i=1}^{N} \alpha_i\, \phi_i(\lambda) \tag{19}$$
where $\alpha_i$ are the coefficients to fit. In a short note by Glassner 17 , a simple matrix inversion method was developed for RGB-to-spectrum conversion, but the resultant spectrum only has three nonzero components, which is rare in the real world. At the end of the note, the author reported a weighted basis function fitting approach that constructs a spectrum from an RGB triplet using three functions: a constant, a sine, and a cosine. To render light interference, Sun et al. 18 compared different basis functions for deriving spectra from colors and proposed an adaptive method that uses Gaussian functions. Nguyen et al. 20 further developed the basis function approach, proposing a data-driven method that learns an RBF to map an illumination-normalized RGB image to a spectral image.
In ref. 21 , an over-complete hyperspectral dictionary was constructed from the proposed dataset using the K-SVD algorithm; it contains a set of nearly orthogonal vectors that can be seen as learned basis functions. Similar to the dictionary learning approach, deep-learning tools can be used for learning basis functions. In ref. 58 , basis functions are generated during training, and coefficients are predicted by a U-Net at test time. The method is computationally efficient since it only needs to fit a small number of coefficients at test time. Although its spectral reconstruction accuracy is not as high as that of other CNN-based methods (which sufficiently extract spectral patch correlation), it is the fastest method in NTIRE 2020, with a reconstruction time of only 34 ms per image.
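Basis function fitting with three functions reduces to a 3 × 3 linear solve per pixel. The sketch below expands the spectrum over a constant, a sine, and a cosine, in the spirit of the three-function approach mentioned above, and chooses the coefficients so that the spectrum reproduces a given RGB triplet. The Gaussian camera sensitivities, the single sine/cosine period over 400 to 700 nm, and all names are assumptions for illustration and are not taken from the cited note.

```python
import numpy as np

wavelengths = np.linspace(400, 700, 31)                  # nm
phase = 2 * np.pi * (wavelengths - 400) / 300            # one period over the range (assumed)
# Basis matrix B: columns are the constant, sine, and cosine functions
B = np.stack([np.ones_like(wavelengths), np.sin(phase), np.cos(phase)], axis=1)  # (31, 3)


def gaussian(center, width=40.0):
    return np.exp(-0.5 * ((wavelengths - center) / width) ** 2)


A = np.stack([gaussian(600), gaussian(540), gaussian(460)])  # hypothetical RGB sensitivities, (3, 31)


def rgb_to_spectrum(rgb):
    """Fit three basis coefficients so that the spectrum reproduces the RGB triplet."""
    alpha = np.linalg.solve(A @ B, rgb)    # 3 equations, 3 coefficients
    return B @ alpha, alpha


rgb = np.array([5.0, 7.0, 4.0])
spectrum, alpha = rgb_to_spectrum(rgb)
```

The fitted spectrum is smooth everywhere, but nothing in the fit prevents it from dipping below zero for some triplets, which is one reason data-driven bases were explored later.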

RGB Patch-wise spectral reconstruction
As reported in ref. 22 , spectra within an image patch are correlated. However, pixel-wise approaches cannot exploit such correlation, which may lead to poorer reconstruction accuracy than patch-wise approaches. In ref. 59 , a handcrafted patch feature based on the convolution operation was proposed, which extracts the neighborhood feature of an RGB pixel from the training spectral dataset. This work gave a practical idea of how to utilize such patch features in a spectral image, an idea that maps naturally onto convolutional neural networks (CNNs).
CNNs can perform more complex feature extraction through multiple convolution operators. In 2017, Xiong et al. proposed HSCNN 23 , which applies a CNN to up-sampled RGB and amplitude-coded measurements for spectral reconstruction. In the same year, Galliani et al. proposed learned spectral super-resolution 60 , using a CNN for end-to-end RGB-to-spectral image reconstruction. These works obtained good spectral reconstruction accuracy on many open spectral datasets, encouraging later works on CNN-based spectral reconstruction. The number of similar works grew rapidly as the New Trends in Image Restoration and Enhancement (NTIRE) challenge was hosted in 2018 61 and 2020 25 , where many deep-learning groups joined in and contributed to the exploration of various network structures for spectral reconstruction.
Neural network-based methods take advantage of deep learning and can better grasp the patch spectra correlation. Diverse network structures as well as advanced deep-learning techniques are exploited by different works, which are arranged in Table 2.

Leveraging advanced deep-learning techniques
Several observations can be drawn from Table 2. First, most works are CNN-based, perhaps because CNNs can extract patch spectral information better than generative adversarial networks (GANs). One work was based on a conditional GAN (cGAN) 62 , which takes the RGB image as the conditional input. Following ref. 63 , the authors also used an L1 distance loss (mean absolute error loss) to encourage less blurring, but the reconstruction accuracy was no better than that of HSCNN 23 (ref. 62 reports a relative root-mean-square error (RMSE) of 0.0401 on the ICVL dataset, whereas HSCNN reports 0.0388).
Moreover, many advanced deep-learning techniques have been introduced and shown to be effective. For instance, residual blocks 64 and dense structures 65 have become increasingly common. This is because residual connections can broaden the network's receptive field and dense structures can enhance feature passing, resulting in better extraction of spectral patch correlation. The attention mechanism 66 is a popular deep-learning technique that has also been introduced in spectral imaging works. For spectral reconstruction, there are two kinds of attention: spatial attention (e.g., the self-attention layer 67,68 ) and spectral attention (channel attention 69 ). An attention module learns a spatial or spectral weight, helping the network focus on the informative parts of the spectral image. Feature fusion is the concatenation of multiple parallel layers, which was studied in ref. 70 . It was adopted in refs. 71-73 and showed a positive influence on spectral reconstruction. Finally, ensemble techniques are encouraged to further improve network performance. Model ensemble and self ensemble are the two kinds of ensemble strategy. Model ensemble averages the outputs of networks that are retrained with different parameters, while self ensemble averages the results obtained by feeding transformed versions of the input to the same network. A single network may fall into a local minimum, which leads to poor generalization performance. By applying an ensemble technique, one can fuse the knowledge of multiple networks or of different viewpoints of the same input. HRNet 73 adopted model ensemble, which improved its reconstruction results.
Fig. 7 Illustration of the wavelength-encoding spectral imaging process. The scene S is illuminated by the light source E and is wavelength-coded through filters Q. The encoded scene spectral radiance is then captured by the imaging lens on a sensor with spectral response D.
Notes to Table 2: In the network architecture column, level means parallel CNN layers for data flow. In the deep-learning techniques column, we highlight the techniques that may play an important role in the method's performance. Performance evaluations are collected from the results reported in the original article, in corresponding articles, or in the NTIRE competition. Results from the original article are considered first, then those from the NTIRE competition, and finally those from corresponding articles. If a result appears in both the original article and the NTIRE competition report, we use [] to denote the result from the NTIRE report. Values labeled "s" in the table are from scaled datasets (datasets that are linearly scaled to the [0,1] range).
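The self ensemble and model ensemble strategies discussed in this section can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a trained reconstruction network, only horizontal and vertical flips are used as input transforms, and the function names are chosen for this example rather than taken from any cited work.

```python
import numpy as np

def self_ensemble(model, rgb):
    """Average one model's predictions over flipped copies of the same input.

    rgb has shape (H, W, 3); the model returns an array of shape (H, W, C).
    Each prediction is flipped back before averaging so the pixels align.
    """
    outputs = []
    for flip_v in (False, True):
        for flip_h in (False, True):
            x = rgb
            if flip_v:
                x = x[::-1, :, :]
            if flip_h:
                x = x[:, ::-1, :]
            y = model(x)
            if flip_h:
                y = y[:, ::-1, :]
            if flip_v:
                y = y[::-1, :, :]
            outputs.append(y)
    return np.mean(outputs, axis=0)

def model_ensemble(models, rgb):
    """Average the predictions of several separately trained models."""
    return np.mean([m(rgb) for m in models], axis=0)

def toy_model(rgb):
    """Hypothetical stand-in network: a fixed per-pixel map from 3 to 31 channels."""
    weights = np.linspace(0.0, 1.0, 3 * 31).reshape(3, 31)
    return rgb @ weights

rgb = np.random.default_rng(0).uniform(size=(4, 5, 3))
spectral = self_ensemble(toy_model, rgb)   # shape (4, 5, 31)
```

For a purely per-pixel model such as `toy_model`, the flips cancel and self ensemble returns the plain prediction; the averaging only matters for networks whose output depends on spatial context.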
Since spectral reconstruction is a kind of image-to-image task, many works borrow effective deep-learning techniques from other image-to-image tasks, such as the U-Net architecture 74 from segmentation tasks, the sub-pixel convolution layer 75 and channel attention 69 from image super-resolution tasks, and feature loss and style loss from image style transfer tasks 76,77 . This is another way to introduce advanced deep-learning techniques into spectral reconstruction.

Towards illumination invariance
The object reflectance spectrum, free of illumination, is a desired objective of spectral reconstruction, since it faithfully reflects the scene components and properties. To recover object reflectance, one needs to strip the environment illumination E from the scene spectral radiance R, but it is inconvenient to measure the illumination spectrum. Researchers therefore often use the illumination-invariant property of the object spectrum to remove the illumination from the scene radiance.
Reference 20 proposed an approach that exploits illumination invariance, using RGB white-balancing to normalize the scene illumination. As a by-product, the environment illumination can be estimated by comparing the reconstructed scene with the original scene. In ref. 78 , a denoising autoencoder (DAE) was used to obtain a robust spectrum from noisy input that contains the original spectrum under different illumination conditions. Through this many-to-one mapping, the spectral reconstruction becomes invariant to illumination.

Utilizing RGB-filter response
RGB-filter response is the wavelength encoding function Q in Eq. (16). In many works 79,80 , the RGB-filter response is termed camera spectral sensitivity (CSS) prior. To avoid semantic ambiguity of CSS and camera response D in Eq. (16), we substitute it with RGB-filter response.
The RGB-filter response is not always accessible in practical applications, which is a notable problem. A common way to tackle this is to use the CIE color matching functions for simulation 81 . Reference 79 proposed another solution: a classification neural network estimates a suitable RGB-filter response from a given set of camera sensitivities, and the estimated filter response function is then used with a second network to recover the spectral signature. The two networks were trained together via a united loss function.
When the RGB-filter response is known, the RGB image can be re-rendered from the reconstructed spectral image, so a back-projection (or perceptual) loss can be used. Experiments have shown that adding the filter response prior benefits reconstruction. For example, AWAN 80 , which ranked first in the NTIRE 2020 Clean track, adopted filter response curves in its loss function and obtained a slight improvement in the MRAE metric.
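The back-projection idea can be illustrated with a short sketch. It assumes a known filter response matrix `Q` with one column per color channel and uses a mean absolute error; the exact loss terms and weighting used in AWAN or other cited works may differ, and the function name is chosen for this example.

```python
import numpy as np

def back_projection_loss(spectral, rgb, Q):
    """Mean absolute error between the input RGB and the RGB re-rendered from a spectrum.

    spectral: reconstructed spectral image, shape (H, W, n)
    rgb:      measured RGB image, shape (H, W, 3)
    Q:        RGB-filter response, shape (n, 3), assumed known
    """
    rendered = spectral @ Q   # project every pixel spectrum onto the three filters
    return float(np.mean(np.abs(rendered - rgb)))

# Toy data: random filters and a random "ground-truth" spectral image.
rng = np.random.default_rng(0)
Q = rng.uniform(size=(31, 3))
truth = rng.uniform(size=(4, 4, 31))
rgb = truth @ Q

loss_exact = back_projection_loss(truth, rgb, Q)          # spectrum consistent with the RGB
loss_biased = back_projection_loss(truth + 0.1, rgb, Q)   # spectrum with a constant offset
```

In training, such a term would typically be added to a spectral-domain loss, since many different spectra re-render to the same RGB value and the back-projection term alone cannot distinguish them.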
In ref. 82 , the RGB-filter response $Q$ is carefully exploited. The authors demonstrated that the reconstructed spectrum should follow the color fidelity property $Q^{T}\psi(I) = I$, where $\psi$ is the RGB-to-spectrum mapping and $I$ is the RGB pixel intensity.
They defined the set of spectra that satisfy color fidelity as the plausible set:
$$\mathcal{P}(I) = \{\, r \in \mathbb{R}^{n} : Q^{T} r = I \,\},$$
where $r$ is a spectrum. The concept of physical plausibility is illustrated in Fig. 8. They suggest that the reconstructed spectrum should contain two parts: one from the space spanned by the three column filter response vectors in $Q$, and the other from the orthogonal complement of that space. Formally, there exists an orthogonal basis $B \in \mathbb{R}^{n \times (n-3)}$ such that $B^{T} Q = 0$. Therefore, the spectrum to be reconstructed can be expanded as
$$r = P^{Q} r + P^{B} r,$$
where $P^{Q}$ and $P^{B}$ are the projection operators onto the spaces spanned by the columns of $Q$ and $B$, respectively. Note that $P^{Q}$ can be precisely calculated in advance from $Q$, so the component $P^{Q} r$ follows directly from the RGB measurement, which reduces the reconstruction problem by three dimensions. The remaining task is to estimate the spectrum component in the orthogonal complement of the filter response vectors, which can be done by training a deep neural network.
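The decomposition into a filter-space part and an orthogonal-complement part can be checked numerically. The sketch below assumes random filter responses `Q` with n = 31 bands, and a random vector stands in for the output of the trained network; the names `P_Q`, `P_B`, and `plausible` are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 31
Q = rng.uniform(0.0, 1.0, size=(n, 3))    # hypothetical filter responses, one column per channel

P_Q = Q @ np.linalg.inv(Q.T @ Q) @ Q.T    # projector onto the span of the columns of Q
P_B = np.eye(n) - P_Q                     # projector onto the orthogonal complement

r_true = rng.uniform(0.0, 1.0, size=n)    # a ground-truth spectrum
rgb = Q.T @ r_true                        # its RGB measurement

# The filter-space component is fixed by the RGB value alone.
r_fixed = Q @ np.linalg.solve(Q.T @ Q, rgb)

def plausible(estimate):
    """Keep only the complement part of an estimate and add the RGB-determined part."""
    return r_fixed + P_B @ estimate

network_output = rng.uniform(0.0, 1.0, size=n)   # stand-in for a learned prediction
r_hat = plausible(network_output)
```

Whatever the stand-in prediction is, `Q.T @ r_hat` equals `rgb`, so the result stays inside the plausible set; the learning problem is confined to the remaining n - 3 dimensions.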

Beyond RGB filters
Since an RGB image carries limited information, researchers tend to add more information at the encoding stage, before reconstruction. There are two ways to realize this: (i) using self-designed broadband wavelength encoding to expand the modulation range; (ii) increasing the number of encoding filters. Works in this area mainly use deep-learning tools to design the filter response curves and to perform spectral reconstruction 83-85 , since both the modulation design and the reconstruction process are computationally complicated.

Using broadband filters
Based on the idea that the spectral response function of a traditional RGB camera is suboptimal for spectrum reconstruction, Nie et al. 83 employed CNNs to design filter response functions and jointly reconstruct the spectral image. They observed the similarity between a camera filter array and a convolutional layer kernel (the Bayer filter mosaic is similar to a 2 × 2 convolution kernel) and used the camera filters as a hardware-implemented layer of the network. Their results showed an improvement over traditional RGB-filter-based methods. However, limited by filter manufacturing technology, they only considered filters that were commercially available.
With the maturing of modern filter manufacturing technology, flexibly designed filters with specific response spectra have become realizable. Song et al. presented a joint learning framework for broadband filter design, named parameter constrained spectral encoder and decoder (PCSED) 84 , as illustrated in Fig. 9.
They jointly trained the filter response curves (as the spectral encoder) and the decoder network for spectral reconstruction. Benefiting from the development of the thin-film filter manufacturing industry, they can design various filter response functions that are favored by the decoder. They extended the work in ref. 85 and obtained impressive results. The developed hardware, the broadband encoding stochastic (BEST) camera, demonstrated great improvements in noise tolerance, reconstruction speed, and spectral resolution (301 channels). As a future direction, noise-robust optical filters produced from metasurfaces are promising, given the development of metasurface theory and industry 86 .

Increasing filter number
Increasing the number of filters is a straightforward way to enhance reconstruction accuracy, since it provides more encoding information. However, it inevitably leads to a bulky system. An alternative way to perform wavelength modulation is to use liquid crystal (LC). Changing the applied voltage switches the LC to a different modulation, so multiple modulations are conveniently obtained by applying different voltages. By rapidly changing the voltage on the LC, multiple wavelength encoding operators can be obtained, which is equivalent to increasing the number of filters. Based on the different responses of the LC phase retarder to different wavelengths, the Compressive Sensing Miniature Ultra-Spectral Imager (CS-MUSI) architecture can modulate the spectra like multiple optical filters. Oiknine et al. reviewed spectral reconstruction with the CS-MUSI instrument in ref. 87 . They also proposed DeepCubeNet 88 , which adopted the CS-MUSI system to perform 32 different wavelength modulations and used a CNN for spectral image reconstruction.
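The benefit of additional encoding filters can be illustrated with a toy experiment. The sketch makes strong simplifying assumptions that are not taken from the cited works: spectra are synthetic mixtures of six Gaussian basis curves, the filters are random broadband responses, and a linear least-squares decoder is swapped in for the CNN decoders used in the literature.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bands, n_basis = 31, 6
wl = np.linspace(400.0, 700.0, n_bands)
centers = np.linspace(420.0, 680.0, n_basis)
# Six smooth Gaussian basis spectra, shape (n_basis, n_bands).
basis = np.exp(-0.5 * ((wl[None, :] - centers[:, None]) / 30.0) ** 2)

def make_spectra(count):
    """Synthetic spectra: random non-negative mixtures of the basis curves."""
    return rng.uniform(0.0, 1.0, size=(count, n_basis)) @ basis

def reconstruction_error(n_filters):
    """RMSE of a linear decoder fitted on measurements from random broadband filters."""
    filters = rng.uniform(0.0, 1.0, size=(n_filters, n_bands))
    train, test = make_spectra(200), make_spectra(50)
    # Least-squares decoder mapping filter measurements back to spectra.
    decoder, *_ = np.linalg.lstsq(train @ filters.T, train, rcond=None)
    estimate = (test @ filters.T) @ decoder
    return float(np.sqrt(np.mean((estimate - test) ** 2)))

err_3 = reconstruction_error(3)   # as many measurements as an RGB camera
err_6 = reconstruction_error(6)   # as many measurements as basis curves
```

Under these assumptions, three measurements cannot determine six mixture coefficients, whereas six generic filters can, so the error drops sharply. Real spectra are not exactly low-dimensional and real measurements are noisy, so the gain in practice is less clear-cut.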

Spectral imaging datasets
Spectral datasets that contain realistic spectral-RGB image pairs are important for data-driven spectral imaging methods, especially those using deep learning. Existing datasets include CAVE 89 , NUS 20 , and ICVL 21 , and datasets were also provided with the NTIRE 2018 challenge 61 and, larger than ever, the NTIRE 2020 challenge 25 . We summarize the publicly available spectral image datasets in the following tables. Table 3 gives an overview of the spectral datasets and Table 4 provides a detailed description of the data.
Two problems still exist for these datasets: (i) insufficient capacity for extracting high-complexity spatial-spectral features; (ii) no fixed train-test split. Some datasets do not provide a fixed train-test split, which causes unfair comparison among methods that use different split strategies. Therefore, it is important to have a large, standard database. We hope for a database with unparalleled scale, accuracy, and diversity to boost future research.
At present, while such a giant standard dataset is not available, we think the popular datasets ICVL 21 , CAVE 89 , NUS 20 , and KAIST 30 are sufficient for analyzing reconstruction accuracy in both the spatial and spectral domains.

Spectral image quality metrics
There are numerous metrics used for performance evaluation in spectral reconstruction, and we refer to ref. 93 for their definition and comparison.
In general, PSNR, the structural similarity (SSIM) index, and the spectral angle mapper (SAM) are mostly used for amplitude-coded methods, while different metrics such as root-mean-square error (RMSE) and mean relative absolute error (MRAE) are applied to wavelength-coded methods. As a consequence, it is inconvenient to compare the performance of wavelength-coded and amplitude-coded methods. For the convenience of the community, it is therefore necessary to agree on a set of common metrics. For example, SSIM, RMSE, and MRAE can be employed by both kinds of methods at evaluation. Furthermore, metrics are needed to compare reconstruction speed. Different works perform spectral reconstruction on images of different resolutions and on various computing devices. We think pixel reconstruction speed is a reliable metric for comparing reconstruction speed. It is the average reconstruction time on the test dataset divided by the 3D resolution of the data (i.e., the total number of pixels of the spectral data used for testing).
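For concreteness, the spectral metrics named above and the proposed per-pixel speed measure can be written as follows. These are commonly used forms; individual papers and challenge reports may differ in details such as data scaling, the small constant `eps`, or the averaging order, so the sketch should be read as an assumption rather than as the definitions of ref. 93 .

```python
import numpy as np

def rmse(pred, gt):
    """Root-mean-square error over all pixels and bands."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def mrae(pred, gt, eps=1e-8):
    """Mean relative absolute error; eps guards against division by zero."""
    return float(np.mean(np.abs(pred - gt) / (np.abs(gt) + eps)))

def sam(pred, gt, eps=1e-8):
    """Mean spectral angle in radians; the last axis is the spectral axis."""
    dot = np.sum(pred * gt, axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(gt, axis=-1)
    cos = np.clip(dot / np.maximum(norms, eps), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))

def time_per_pixel(seconds_per_image, height, width, bands):
    """Average reconstruction time divided by the 3D resolution of the data."""
    return seconds_per_image / (height * width * bands)

# Toy check: a prediction that is uniformly 10% too bright.
gt = np.ones((2, 2, 4))
pred = 1.1 * gt
scores = (rmse(pred, gt), mrae(pred, gt), sam(pred, gt))
```

A uniformly scaled prediction changes RMSE and MRAE but leaves the spectral angle at essentially zero, which shows why SAM measures spectral shape rather than intensity and why the metrics are not interchangeable.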

Conclusions and future directions
We have summarized different computational spectral reconstruction methods that adopt deep neural networks, detailing their working principles and deep-learning techniques, under three encoding-decoding modalities: (i) Amplitude-coded. This modality uses a coded aperture for amplitude encoding and is a compressive spectral imaging approach, which exploits compressive sensing theory and an iterative optimization process for spectral reconstruction. Accordingly, some learned reconstruction algorithms are designed to reduce the time consumed by optimization (e.g., unrolled networks), while others use deep neural networks to improve the optimization accuracy (e.g., untrained networks). (ii) Phase-coded. This modality uses a DOE to modulate the phase of the input light at each wavelength, and relies physically on Fresnel propagation to transfer this phase modulation onto the resultant image. By leveraging creative DOE designs, it offers a compact system and improved light throughput. (iii) Wavelength-coded. A common case of wavelength encoding is the RGB image. RGB-to-spectrum conversion is essential in computer graphics, because it eases the tuning of scenes rendered with spectra on monitors. To extract spectral features from RGB data, deep-learning algorithms either map the data to a manifold space or explore the inherent spatial-spectral correlation. As an extension of the RGB-based approaches, wavelength encoding with multiple self-designed broadband filters has been developed in recent years. It offers higher precision in the reconstructed spectra, but the results are also sensitive to filter fabrication error and imaging noise.
As a future direction, extra scene information is expected to improve the reconstruction performance in specific applications. In C2H-Net 92 , object category and position were used as a prior, similar to the well-known object detection framework YOLO 94 . Based on the observation that pixel patches containing object information are often more important than the background environment, the authors introduced object category and position into the reconstruction process. Using additional information can also benefit functional applications of spectral imaging. As a later work following C2H-Net, ref. 95 contributes to object detection using spectral imaging with additional object information.
Additionally, joint encoder-decoder training is an important direction. The encoder is the hardware layer before the reconstruction algorithm, such as a coded aperture, DOE, or optical filter. Training the encoder and decoder simultaneously provides the decoder with the coding information, thereby improving performance 39,84 . However, two problems remain to be addressed. (i) Finding more efficient encoding hardware and modeling it as a network layer, such as using a DOE to improve light throughput. The CS-MUSI architecture, which can replace multiple filters 88 , is also worth exploring. (ii) Overcoming vanishing gradients. Since the hardware layer is the first layer of the whole deep neural network, the gradient that propagates back to it tends to be very small, which in turn limits how much the hardware layer can change. If these two problems are elegantly solved, we believe deep-learning-empowered computational spectral imaging can advance further.
The past decade has witnessed a rapid expansion of deep neural networks in spectral imaging. Despite the success of deep learning, there is still much room for further optimization. Reinforcement learning (RL) is a promising technique for improving performance. To date, RL has proved useful in finding optimal reconstruction network architectures (i.e., neural architecture search, NAS 96 ). With increasing computing power, such techniques are promising for improving the performance of learned spectral imaging methods. Finally, we think transformer-based large-scale deep-learning models have great potential in the spectral reconstruction task. The transformer, first applied in natural language processing, is a type of deep neural network based mainly on the self-attention mechanism 66 . It shows strong representation capabilities and has been widely applied in vision tasks 97 . However, such large-scale deep neural networks require huge amounts of training data, hence larger-than-ever spectral datasets are demanded, as suggested in section "Spectral imaging datasets".