Unveiling the potential of diffusion model-based framework with transformer for hyperspectral image classification

Hyperspectral imaging has gained popularity for analysing remotely sensed images in various fields such as agriculture and medical. However, existing models face challenges in dealing with the complex relationships and characteristics of spectral–spatial data due to the multi-band nature and data redundancy of hyperspectral data. To address this limitation, we propose a novel approach called DiffSpectralNet, which combines diffusion and transformer techniques. The diffusion method is able extract diverse and meaningful spectral–spatial features, leading to improvement in HSI classification. Our approach involves training an unsupervised learning framework based on the diffusion model to extract high-level and low-level spectral–spatial features, followed by the extraction of intermediate hierarchical features from different timestamps for classification using a pre-trained denoising U-Net. Finally, we employ a supervised transformer-based classifier to perform the HSI classification. We conduct comprehensive experiments on three publicly available datasets to validate our approach. The results demonstrate that our framework significantly outperforms existing approaches, achieving state-of-the-art performance. The stability and reliability of our approach are demonstrated across various classes in all datasets.

Advanced architectures such as autoencoders were later developed to extract a compressed representation of HSI data for classification purposes 18 .Attention mechanisms were integrated into CNN architectures to enhance the accuracy of classification by weighing the importance of different spectral bands 19 .Furthermore, advancements in the CNNs led to the introduction of novel pooling and unpooling mechanisms that better preserve spatial information during classification 20 .In recent years, the CNNs have been shown to be effective in HSI classification; however, there are still several limitations.For instance, the convolutional operations handle a local neighborhood.Hence, the number of layers and kernel size restrict the CNNs' receptive field, making it less effective at capturing long-range dependencies in input data 21 .As a result, learning the long-range dependencies of the HSI, often consisting of hundreds of spectral bands, is challenging.
Recurrent neural network (RNNs) 22 are capable of capturing the spatial-spectral relationship from long-range sequence data, they face challenges such as vanishing gradients and dependency on the order of spectral bands.Transformers 23 , originally designed for natural language processing (NLP), have shown promising results when integrated into HSI classification.They effectively capture long-range dependencies in hyperspectral data 24,25 .Here, CNN is a vector-based method that considers the inputs as collection of pixel vectors 26 , and thus it can lead to information loss when processing with hyperspectral pixel vectors 27 .In the work 28 , a multispectral image classification framework was introduced to overcome the limitations of the CNNs in pixel-wise remote sensing classification and spectral sequence representation and, integrates fully connected (FC) layers, CNNs, and transformers.Unlike the classic transformers that focus on band-wise representations, SpectralFormer 24 is an example of such a framework that captures spectrally local sequence information, creates group-wise spectral embeddings, and introduces cross-layer skip connections to retain crucial information across layers through adaptive residual fusion.Another novel model, namely SS1DSwin 29 , is based on transformers and implements the network architecture of swin transformer 30 .It was shown to effectively capture reliable spatial and spectral dependencies for HSI classification.
Effectively learning rich representations and addressing the complexities of spectral-spatial relations in high-dimensional data are crucial for achieving optimal HSI classification results.However, transformer-based methods face challenges in directly capturing reliable and informative spatial-spectral representations available in HSI.They generally do not fully leverage spatial information 31 and have limitations in extracting fine-grained local feature patterns 32 .Recently, the denoising diffusion probabilistic model (DDPM) 33 has emerged as a groundbreaking class of generative models, adept at modeling complex relationships and effectively learning high-level and low-level visual features.SpectralDiff 34 leveraged a diffusion model to extract potent features.However, it employed a pixel-wise classification approach, which limits the ability to effectively capture and identify distinct spatial-spectral relationships in HSI.
To overcome these challenges, we have thoroughly re-evaluated the process of extracting features of the HSI data from different perspectives.Consequently, we have developed a novel HSI classification method that incorporates diffusion and transformer techniques leveraging their respective advantages.The features' representation learned from the diffusion models have been demonstrated to be highly effective in various discriminating tasks with impressive performance like semantic segmentation 35 , object detection 36 , and face generation 37 .
This paper presents a novel classification framework called DiffSpectralNet, combining a diffusion-based spectral-spatial network with transformers.This diffusion model, a type of generative models, excels in capturing the relationships between spectral and spatial information in HSI data.Deep features are extracted both effectively and efficiently to make the most of the spectral-spatial information present in the data.The main stages of the framework are summarized as follows: first, we ultilise forward and reverse diffusion processes to learn high-level and low-level features from HSI.Second, to make effective use of the extensive timestampswise features, we extract intermediate hierarchical features from the denoising U-Net at different timestamps.Subsequently, we employ a proposed supervised transformer-based classifier for performing HSI classification.
We examine the effectiveness of the proposed method conducted on three widely known datasets that their download link can be found in the Data availability section.Our results clearly demonstrate that the proposed method significantly improves classification results and outperforms other advanced HSI classification methods.Moreover, this study also opens the way for further investigations into the potential of diffusion models in learning high-and low-level spectral-spatial features with significant flexibility in HSI.Ongoing research will likely enhance the application of diffusion models in processing complex, high-dimensional hyperspectral data, opening up promising prospects for diverse applications.

Results
In this section, we begin by providing an introduction to three different experimental datasets for HSI.After that, we delve into the details of the experimental results that have been produced by our proposed model.In addition, we conduct a thorough analysis of parameters of the framework to gain a better understanding of their significance and implications.

Dataset
Three well-known available datasets, Indian Pines, Pavia University and Salinas Scene, were used to examine the classification performance.Number of categories and their correspondent samples were shown in Table 1.First, the Indian Pines dataset collected in 1992 using the Airborne Visible Infrared Imaging Spectrometer (AVIRIS) Sensor, covering the northwestern region of Indiana in the United States.It consists of 145 × 145 pixels with each pixel having a spatial resolution of 20 metres (m) and 220 spectral bands in the wavelength range of 400-2500 nm.The dataset contains labeled pixels with 16 categories.We use 10% of the labeled samples for training and the rest for testing.The second HSI dataset, Pavia University, was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor.The ROSIS sensor acquired 103 bands covering the spectral range from 430 to 860 nm, and the dataset consists of 610 × 340 pixels at GSD of 1.3 m.Moreover, there are 9 land cover classes in the dataset.We use 5% of the labeled samples for training and the rest for testing.Lastly, Salinas Scene dataset was collected using the AVIRIS sensor and is situated in Salinas Valley, California.The spatial resolution is set at 3.7 m. and the dataset includes 16 crop types and has been widely utilized in classification.After the exclusion of 20 bands associated with water vapor and noise, a total of 204 bands remained, resulting in a data size of 512 × 217 .We use 5% of the labeled samples for training and the rest for testing.

Training process
We used the PyTorch framework to implement and train the DiffSpectralNet model.The training was done on a basic hardware setup, which consists of a POWER8NVL production-grade CPU with 128 CPU threads spread across 2 sockets for efficient processing.Additionally, four NVIDIA Tesla P100 GPUs were used for enhanced graphical computations, each offering a memory of approximately 16 GB.
The diffusion model was optimised using the Adam optimizer and trained for 30, 000 epochs for all datasets.We set the learning rate to 1 × 10 −4 , with a batch size of 128 and a patch size of 32 × 32 .Due to hardware limitations, we use batch size 64 for the Salinas scene dataset.To determine the amount of spectral information preserved in the compressed data, we employed PCA.Given that each dataset presents a distinct number of features post-pre-training with the diffusion model, the range of PCA components varies among three datasets.The classification model was trained using the Adam optimizer, maintaining the same learning rate of 1 × 10 −4 and a batch size of 128 for Indian Pines, Pavia University, and 64 for Salinas Scene.The size of feature patch is empirically set as 7 × 7. The number of epochs was set to 300 for Indian Pines and 600 for Pavia University and Salinas Scene datasets.

Performance evaluation
We evaluate the performance using three prominent metrics: overall accuracy (OA), average accuracy (AA), and Kappa coefficient ( κ ).OA gives a direct insight into general model performance, and AA ensures each class has a balanced contribution, especially in imbalanced datasets.On the other hand, κ measures the reliability between the ground truth and model predictions.
To demonstrate the effectiveness of our proposed DiffSpectralNet, we compare its classification performance with various state-of-the-art approaches, and the following methods were chosen: DMVL 38 , similar to our proposed model, follows the two-stage algorithms.It performs unsupervised feature extraction followed by classification using an SVM classifier.3DCAE 39 is an unsupervised method to learn spectral-spatial features.It uses the encoder-decoder backbone with 3D convolution operations, GSSCRC 40 algorithm incorporates the cooperative representation classification model and introduces the geodesic distance calculation method to select spectral nearest-neighbour information, thereby effectively utilising the neighbour information in HSI.This approach facilitates the exploration and utilization of the spatial-spectral neighbourhood structure of HSI data for classification.SS1DSwin 29 design reveals local and hierarchical spatial-spectral links through two modules: the Groupwise Feature Tokenization Module (GFTM) and the 1DSwin Transformer with Cross-Block Normalized Connection Module (TCNCM).GFTM processes overlapping cubes and uses multihead selfattention for spatial-spectral relationships.Meanwhile, TCNCM utilises window-based strategies for spectral relationships and cross-block feature fusion.SpectralFormer 24 uses transformers from a sequential perspective for classification, learns spectrally local sequence information from neighbouring bands of HSI, yielding group-wise spectral embeddings.Also, to reduce the possibility of losing valuable information in the layer-wise propagation process, a cross-layer skip connection is devised from shallow to deep layers by adaptively learning across layers.
We conducted experiments on the Salinas dataset, not covered in the SpectralFormer 24 , using the same traintest split ratio employed in our experiments for consistent comparison.SpectralDiff 34 employs an unsupervised feature extraction using a spectral-spatial diffusion module.These features are then processed per pixel by the supervised attention-based classification module.
It is worth mentioning that we directly used the outcomes reported in the papers of each of these methods.Both CNN-based and transformer-based methods produced good classification results.
Based on the analysis of classification results obtained for the Indian Pines, Pavia University, and Salinas Scene datasets presented in Table 2, the DiffSpectralNet algorithm proposed in this study shows improved classification accuracy for most ground objects when compared to other classification methods.The proposed method achieves the best OA, AA, and κ values, with OA reaching 99.06%, 99.74 %, and 99.87% on the Indian Pines, Pavia University and Salinas Scene datasets, respectively.Visualisation in the Fig. 1 clearly shows DiffSpectralNet outperformed others.Moreover, we conducted additional statistical analyses using Analysis of Variance (ANOVA) and Mann-Whitney U Test, both with a confidence score of 95% .The p value from these tests were lower than 0.05, indicating that DiffSpectralNet's performance is significantly different across three measurement metrics.All of these results proves that the DiffSpectralNet algorithm efficiently and effectively learns low and high-level features using the diffusion model.Additionally, the DiffSpectralNet algorithm leverages the combination of spectral and spatial information, enabling it to extract a greater amount of information for classification.Therefore, the DiffSpectralNet algorithm proposed in this study demonstrates promising potential for improving the accurate classification of ground objects.
In addition to the above quantitative metrics, classification maps in the proposed method have been produced, as shown in Figs. 2, 3 and 4. Compared with ground truth, the proposed method obtains more accurate classification results, which further proves the effectiveness of the proposed method in the classification of hyperspectral data.
Figure 2 illustrates the classification results obtained using the DiffSpectralNet and the comparison algorithms on the Indian Pines dataset.The map highlights that the algorithm proposed in this study exhibits classification performance that closely resembles the actual terrain map of the Indian Pines dataset.The misclassification of terrain pixels is observed to be relatively minimal, resulting in a smoother overall effect.Notably, the algorithm demonstrates superior performance in classifying Grass-pasture-mowed, Oats, Wheat, and Woods features.Figure 1.The boxplots to visualise the performance of each model using three prominent metrics.Note that, there is no published result of 3DCAE on these datasets, therefore it was not included in the visualisation.Figure 4 presents the classification effect maps of the proposed model and the comparison algorithms on the Salinas Scene dataset.By observing the classification effect map of the model, it can be concluded that in the Brocoli_green_weeds_1, Brocoli_green_weeds_2, Fallow, Soil_vinyard_develop, Lettuce_romaine_4wk, Lettuce_ romaine_5wk, Lettuce_romaine_6wk and Vinyard_vertical_trellis regions, there are fewer misclassified pixels of ground features compared with the comparison algorithms, resulting in a smoother overall effect map.This  demonstrates that the DiffSpectralNet proposed in this paper can effectively reveal the intrinsic features hidden behind a HSI by learning low and high-level features.
For a comprehensive examination of the detailed performance metrics of each class for all three datasets, readers are directed to the Supplementary materials provided.In supplementary sections, we thoroughly compare our classification performance across various classes against a range of state-of-the-art methodologies to demonstrate the stability and reliability of our approach.

Discussion
In this section, we explore further experiments and discussions on the following three aspects to explore the optimal classification performance and the application of the proposed model in practical remote sensing classification.First, we conduct experiments to discuss how to extract features from the pre-trained diffusion model to achieve optimal performance at various Timestamp and Feature index values.Second, we analyse the impact of the number of training samples directly affecting the network's performance.Finally, we examine the influence of the quantity of PCA components on the spectral information in HSI datasets.
• Sensitivity analysis of Timestamps and Feature index: In order to analyse the features obtained from the dif- fusion pre-trained model, we have conducted classification experiments on various Timestamp and Feature index values and then recorded the change in the classification performance.Using the DDPM, we monitored classification efficacy alterations when Timestamp and Feature Index varied, and the optimal combination of Timestamp and Feature Index is essential to ensure accurate outcomes.Table 3 showcases the performance is sensative to Timestep and FeatureIndex.For the Indian Pines and Pavia University datasets, there is a certain correlation between Timestamp and FeatureIndex.When considering the Timestamp dimension, a decreasing trend in classification performance is observed when using features with larger Timestamps, and the optimal affects the performance of the network.To verify this with the proposed DiffSpectralNet, We evaluated the training dataset using random proportions ranging from 10 to 100% with increments of 10% , and depict the comparative results in Fig. 5a.As expected, the classification accuracy gradually improves with an increase in training samples.It is worth noting that OA tends to be stable when the percentage of training samples is greater than 50% .However, when the percentage of training samples in the Indian Pines dataset is less than 50% , the performance is unsatisfactory may be due to the insufficient number of samples for a proper training.Therefore, it is reasonable to extrapolate that DiffSpectralNet is reliable and stable for this task.• Effect of PCA components on diffusion feature: We investigate the impact of the number of PCA components on the compressed spectral data.The data retain more spectral details with more PCA components but at the cost of increased computational demand and redundancy.The number of diffusion features varies across datasets, influencing the range of PCA components, which varies from D/6 to D/15, where D represents the diffusion features in a dataset.The results in Fig. 5b suggest optimal performance with D/8 PCA components.

Methods
In this section, we describe a novel method called DiffSpectralNet that consists of two stages: an unsupervised diffusion process and a supervised classification.The unsupervised diffusion process is derived from the DDPM with the purpose to learn spectral-spatial representations effectively.In this process, we extract plenty of spectral-spatial features from various time steps t during the reverse diffusion process of DDPM to capture the characteristics of different objects in HSI data.Finally, these features are inputted into the supervised classification model for classification.

Diffusion-based unsupervised spectral-spatial feature learning
In order to capture complex spectral-spatial relations and label-agnostic information of HSI data effectively, the first step of our proposed approach is to train a diffusion model in an unsupervised manner, as shown in Fig. 6a.We introduce the detailed formulation of our unsupervised feature learning procedure, which involves diffusion-based forward and backward processes with the HSI data.
• Forward diffusion process: DDPM represents a category of models based on likelihood estimations.In the forward process, Gaussian noise is added to the original training data.In our proposed model, we aim to learn spectral-spatial features effectively in an unsupervised manner.We start by training our DDPM using unlabeled patches randomly cropped from the HSI dataset.To prepare the data for training, the data is pre-processed by patch cropping operation.Next, patches are randomly sampled from HSI for DDPM training.Formally, given an unlabeled patch x 0 ∈ R P×P×B , where P denote the height and width of patch x 0 , B represents the number of spectral channels, respectively.During the forward diffusion process, Gaussian noise is gradually added to the HSI patch according to the variance schedule {β t } T t=0 in the diffusion process where T is the total number of the timestep.The process follows the Markov chain 33 process: where N is a Gaussian distribution.The above formulation leads to the probability distribution of the HSI at a given time t + 1 is obtained by its state at time t.During the first diffusion, the spectral-spatial instance with noise is expressed as follows: At the t th step, the spectral-spatial instance incorporated with noise is expressed as follows: where α t = 1 − β t and, α t represents the product of α 1 to α t .Given these inputs, the hyperspectral instance at timestep t can be straightforwardly produced by Eq. ( 3).
• Reverse diffusion process: In the reverse diffusion process, a spectral-spatial U-Net 41 denoising network is employed is trained to predict the noise added on x t−1 , taking noisy patch x t and timestep t as inputs.And x t−1 is calculated by subtracting the predicted noise from x t .DDPM uses a Markov chain process to remove the noisy sample x T to x 0 step by step.Under large T and small β t , the probability of reverse transitions is approximated as a Gaussian distribution and is predicted by a U-Net as follows: where the reverse process can be re-parameterized by estimating µ θ (x t , t) and σ θ (x t , t) .σ θ (x t , t) is set to σ 2 t I , where σ 2 t is not learned.To obtain the mean of the conditional distribution p θ (x t−1 |x t ) , we need to train the network to predict the added noise.The mean of µ θ (x t , t) is derived as follows: where ε θ (•, •) denote the spectral-spatial denoising network whose input is the timestep t and the noisy hyperspectral instance x t at timestep t.The denoising network takes in the noisy hyperspectral instance along with the timestep to produce the predicted noise.The U-Net denoising model ε θ (x t , t) is optimised by minimising the loss function of the spectral-spatial diffusion process can be expressed as follows:

Supervised classification using spectral-spatial diffusion feature
After training the network using unsupervised spectral-spatial methods, we start extracting useful diffusion features from the pre-trained DDPM.Next, we employ a transformer-based classifier for classification.
During the feature extraction step, we utilize the U-Net denoising network to extract a spectral-spatial timestep-wise feature.The pre-training of DDPM enables it to capture rich and divers information from the input data during the reverse process.As a result, we extract features from the intermediate hierarchies of DDPM at various timesteps to create robust representations that encapsulate the salient features of the input HSI.The parameters of the pre-trained DDPM remain constant, as shown in Fig. 6b.We gradually add Gaussian noise to the input patch x 0 ∈ R P×P through the diffusion process.For a noisy input patch x t at timestep t , the noisy version x t can be directly determined using Eq.(3).Subsequently, x t is fed into the pre-trained spectral-spatial denoising U-Net to derive hierarchical features from the U-Net decoder.Diffusion features from various decoder layers are collectively upsampled to P × P and then merged to form the feature f t in R P×P×L at timestep t , where P represents the height and width of the patch and L denotes the feature channel.For each feature f ti ∈ R P×P×L , we retain only the vector associated with the center pixel, indexed as C i ∈ R p×p×L .This approach significantly reduces the computational cost due to a decrease in parameters.We input the extracted diffusion features (C(f ti ) patch-wise to learn group-wise spectral embeddings.By proposing to learn group-wise spectral embeddings, we aim to precisely identify and classify the diverse features based on their distinct spectral properties.The groupwise spectral embedding features use a linear projection layer for mapping features to a token sequence for the transformer.Positional embedding is added to the input token sequence before feeding it to the transformer.This provides the transformer with information about the relative positions of the patches.Therefore, the abundant features contain diverse and multi-level information of the input HSI data, which we use for classification.
After mapping the patch representation, a network is needed to predict the classification label.Transformerbased classifiers are trained based on the inspiration from 24 , as shown in Fig. 6b.The classification module combines the CNN and transformer structures to form an effective classifier.These classifiers take positionally embedded feature patches as inputs and use an MLP head to predict the final classification scores.Inspired by the success of skip connection in U-Net 42 , and ResNet 16 for image segmentation and recognition, respectively.A cross-layer skip connection is introduced in the classifier to minimise the possibility of losing valuable information in the layer-wise propagation process and enhance the information transitivity between layers.The classifier model utilises skip connection, multi-head attention mechanisms, feed-forward neural networks to spectral-spatial feature mapping, and a transformer structure for deep feature extraction, resulting in outstanding classification performance.

Conclusion
HSI contains rich spectral-spatial information and complex relations, which are critical for classification tasks.
The proposed method provides a unique viewpoint for the spectral-spatial diffusion process, which is capable of modeling complex relationships for understanding inputs and learning both high-level and low-level features.
In conclusion, most current methods for HSI classification rely on CNN or Transformer models, which may not efficiently extract patterns and information.In contrast, our proposed method, employing the diffusion model, effectively and efficiently learns discriminative spectral-spatial features.This approach allows us to explore and utilise the spatial-spectral neighborhood structure of hyperspectral data, resulting in the effective extraction of deep features.Instead of processing on a pixel-by-pixel basis, the diffusion features are introduced in patches to improve the ability to capture details for more accurate classification.We employed a transformer-based model with a cross-layer skip connection, which reduces the possibility of losing valuable information in the layer-wise propagation process.We demonstrated the superiority of our proposed DiffSpectralNet approach by achieving state-of-the-art results in HSI classification based on quantitative trials conducted on three HSI datasets.In future studies, we aim to validate and enhance the performance of our proposed model on additional hyperspectral datasets across various domains, such as the medical field.Our model can be generalised and shows promise in HSI classification due to its ability to capture complex relationships between bands. Vol.:(0123456789) https://doi.org/10.1038/s41598-024-58125-4www.nature.com/scientificreports/Moving forward, Fig. 3 provides a visual representation of the classification performance of the proposed model on the Pavia University dataset.The algorithm exhibited fewer misclassifications in the dataset, resulting in a smoother overall effect.Notably, in the classification of the Meadows, Metal sheets, and Bare soil features, the performance of the proposed algorithm is superior.This observation highlights the capability of the DiffSpec-tralNet to extract spectral and spatial information more comprehensively with the usage of the diffusion model.

Figure 2 .
Figure 2. Classification results of on the Indian Pines dataset (a) Original HSI (b) ground truth (c) proposed method.

Figure 3 .
Figure 3. Classification results of on the Pavia University dataset (a) original HSI (b) ground truth (c) proposed method.

Figure 4 .
Figure 4. Classification results of on the Salinas Scene dataset (a) original HSI (b) ground truth (c) proposed method.

Figure 5 .
Figure 5. Classification accuracy (OA) achieved by the proposed DiffSpectralNet with (a) varying percentages of training samples (b) different PCA components on three benchmark datasets.

Figure 6 .
Figure 6.Overview of our proposed DiffSpectralNet (a) unsupervised spectral-spatial feature learning network.x 0 and x T represent HSI patches of timestep 0 and timestep T .q(x t | x t−1 ) and p(x t−1 | x t ) represent forward and reverse spectral-spatial diffusion processes, respectively.(b) Supervised classification (1) extracting hierarchical features from the pretrained denoising U-Net decoder in terms of different timestep t. (2) Using the patch-wise feature vectors to train an cross-layer transformer for HSI classification.

Table 1 .
Details of Indian Pines, Pavia University, and Salinas Scene Datasets.

Table 2 .
Classification results of different HSIs, and the best result is bolded.

Table 3 .
The performance of different layer indices and timestamps in the Indian Pines, Pavia University, and Salinas Scene.Significant values are in bold.