Unsupervised blind image quality assessment via joint spatial and transform features

A novel unsupervised blind image quality assessment (BIQA) method, which requires no mean opinion scores for model training, is presented in this paper. The method employs joint spatial and transform features as quality degradation metrics. Specifically, phase congruency, gradient magnitude (GM), the joint GM and Laplacian of Gaussian response, and local normalized coefficients are extracted as spatial features, while Karhunen–Loève transform coefficients and discrete cosine transform coefficients are modeled as transform features. Both spatial and transform features are analyzed to remove redundancy and then fitted to a multivariate Gaussian model for no-reference image quality assessment. Extensive experiments conducted on seven IQA databases demonstrate the superiority of the proposed method over state-of-the-art supervised and unsupervised BIQA methods.

With the development of multimedia and transmission technologies, image quality plays an increasingly important role, and so do efforts to assess it accurately at low complexity. As humans are the ultimate receivers of images, subjective quality assessment has the highest accuracy but is time-consuming and expensive, making it impractical for most applications. In contrast, objective quality assessment attempts to assess image quality without human involvement, which is more practical for real-world use.
Objective image quality assessment (IQA) can be classified into three categories based on the usage of reference information, namely full-reference (FR) IQA 1-4, reduced-reference (RR) IQA 5-7, and no-reference IQA (NR-IQA)/blind IQA (BIQA) 8-11. FR-IQA requires complete access to the reference image and compares it with the distorted image using a distortion measure for quality assessment. RR-IQA utilizes a subset of pre-determined features from both the reference and distorted images for quality assessment and requires less reference information than FR-IQA but more than BIQA. BIQA predicts the perceived quality of an image by extracting and analyzing its internal features, requiring no reference. This makes BIQA highly practical in real-world applications where reference images may be limited or unavailable.
BIQA methods can be further classified into two categories based on the usage of training labels. Supervised BIQA methods usually use subjective scores as labels to train the quality assessment models, and they differ mainly in the features and regression models used. For example, Mittal et al. 8 extracted mean subtracted contrast normalized (MSCN) features, Yang et al. 12,13 extracted and enhanced naturalness and structural features via the Karhunen–Loève transform (KLT), while Zhang et al. 14 proposed to extract quality-aware features from joint generalized local binary pattern statistics. These features were then mapped to subjective scores via support vector regression (SVR) to perform BIQA. Min et al. 15 proposed to utilize multiple pseudo reference images, created through various types and levels of distortion aggravation, with an FR-IQA method to generate similarity scores for BIQA.
With the advent of deep learning technologies, several approaches have been proposed for BIQA that utilize convolutional neural networks (CNNs) for end-to-end joint feature extraction and regression. Ma et al. 16 and Zhu et al. 17 employed CNNs for this purpose. Wang et al. 18 proposed a dual-perception network (DPNet) that uses end-to-end multi-task learning with knowledge distillation, while Lan et al. 19 developed a framework that combines two feature extraction networks and a multilevel feature fusion (MFF) network to obtain multilevel degradation features for BIQA. Additionally, Wang et al. 20 reformulated BIQA as an ordinal regression problem and achieved improved prediction accuracy by using deep CNNs and Transformers. Madhusudana et al. 21 performed the prediction of distortion type and degree as an auxiliary task to learn features for BIQA. Pan et al. 22 introduced a distortion-aware module in a CNN to perform BIQA on different distortions. Chen et al. 23 proposed an NR-IQA method via feature-level pseudo-reference hallucination. Pan et al. 24 proposed a multi-branch

• We propose an unsupervised BIQA method. To evaluate image quality from multiple aspects, we extract phase congruency (PC), gradient magnitude (GM), and the joint GM and Laplacian of Gaussian response (GM-LOG) to measure image structure, and local normalized coefficients to measure image naturalness. Additionally, we extract KLT and discrete cosine transform (DCT) coefficients to measure perceptual image quality in the transform domain.
• Since we utilize multiple features to reflect the characteristics of the human visual system (HVS), ablation studies are conducted to analyze the contributions of the different features, and comprehensive analyses are performed to select the most significant features and remove the redundancy among them.

Method
The framework of the proposed method is shown in Fig. 1. To better measure the perceived quality of an image, we extract spatial and transform features from non-overlapping image patches to form the feature matrix. To avoid the impact of non-texture patches, we only utilize high-contrast patches of pristine images for model training. During testing, we consider all patches of the test image. We then fit the multivariate Gaussian (MVG) model with the mean vector $\nu$ and covariance matrix $\Sigma$ of the feature matrix. We calculate the local and global qualities of a distorted image as distances between the pristine and distorted MVG model parameters, where $n$ is the number of image patches in the distorted image, $f_i$ is the feature vector of the $i$-th patch, and $\nu_p$, $\nu_d$ and $\Sigma_p$, $\Sigma_d$ are the mean vectors and covariance matrices of the pristine image MVG model and the distorted image MVG model, respectively.
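One common way to realize such a distance between two MVG models is the NIQE-style form sqrt((ν_p − ν_d)^T ((Σ_p + Σ_d)/2)^(−1) (ν_p − ν_d)). The sketch below assumes that form for the global quality and, for the local quality, averages the same distance evaluated at each patch feature vector f_i. Both choices, and all function names, are assumptions made for illustration and are not taken from the method description.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a feature matrix (one row per patch)."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def covariance(rows):
    """Sample covariance matrix of the rows (divides by n - 1)."""
    n = len(rows)
    mu = mean_vector(rows)
    d = len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        c = [r[j] - mu[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += c[a] * c[b] / (n - 1)
    return cov


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-12:
            raise ValueError("pooled covariance matrix is singular")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def mvg_distance(nu_p, cov_p, nu_d, cov_d):
    """Assumed NIQE-style distance between two MVG models."""
    d = len(nu_p)
    diff = [nu_p[i] - nu_d[i] for i in range(d)]
    pooled = [[(cov_p[i][j] + cov_d[i][j]) / 2 for j in range(d)] for i in range(d)]
    y = solve(pooled, diff)
    return math.sqrt(max(sum(diff[i] * y[i] for i in range(d)), 0.0))


def local_quality(features, nu_p, cov_p, cov_d):
    """Assumed local quality: mean per-patch distance to the pristine model."""
    return sum(mvg_distance(nu_p, cov_p, f, cov_d) for f in features) / len(features)
```

In this sketch a larger distance means a stronger departure from the pristine model, so the value behaves as a distortion score rather than a quality score.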
Finally, the quality of the distorted image is measured as the weighted average of the local and global qualities with weight α, where we set α = 0.5 in this paper based on experiments on seven IQA databases, as demonstrated in "Parameters optimization and ablation studies".

Spatial features. An image is a two-dimensional (2D) spatial signal; therefore, we extract spatial features from images to measure their perceptual quality. Specifically, since the perceived quality of an image is highly dependent on its structure and naturalness, we extract structure-related features, namely PC, GM, and GM-LOG, along with the naturalness-related MSCN feature.
PC is utilized as an indicator of the edge strength of an image, which is highly related to image structure. We adopt the method in 35 to compute the PC value of image $I$ at position $p$, where $H_{\theta_f}(p) = \sqrt{E_{n,\theta_f}(p)^2 + O_{n,\theta_f}(p)^2}$, $E_{n,\theta_f}(p) = \sum_n e_{n,\theta_f}(p)$, $O_{n,\theta_f}(p) = \sum_n o_{n,\theta_f}(p)$, $A_{n,\theta_f}(p) = \sqrt{e_{n,\theta_f}(p)^2 + o_{n,\theta_f}(p)^2}$, and $e_{n,\theta_f}(p)$ and $o_{n,\theta_f}(p)$ are the responses of the even- and odd-symmetric filters at scale $n$ and orientation $\theta_f$.

GM is computed as

$$\mathrm{GM} = \sqrt{(I * D_h)^2 + (I * D_v)^2}, \qquad (6)$$

where $*$ is the convolution operation and $D_h$ and $D_v$ are derivative filters in the horizontal and vertical directions. We compute the GM in the luminance channel. The distribution of GM coefficients is modeled with the Weibull distribution to form a feature vector.

Besides, we also extract finely selected GM-LOG features as structure features. We utilize Eq. (6) to extract the GM feature, with $D_h$ and $D_v$ being the Gaussian partial derivative filters in the horizontal and vertical directions, respectively. The LOG feature is calculated as

$$L = I * h_{\mathrm{LOG}},$$

where $h_{\mathrm{LOG}}$ is the two-dimensional Laplacian of Gaussian operator. Then, the normalized GM and $L$ are quantized into 10 levels, and the normalized bivariate histogram is calculated to get the GM-LOG feature vector. We extract the GM-LOG features from the pristine images in 29 and select the indices whose values are all higher than 0.05 to remove the redundancy.
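The quantization and joint histogram step can be sketched as follows. This is a minimal illustration that assumes uniform min-max quantization of each map into 10 bins; the function name `joint_histogram` and the binning rule are assumptions, and the two inputs stand for already normalized GM and LOG responses flattened into lists of equal length.

```python
def joint_histogram(gm, log, levels=10):
    """Normalized bivariate histogram of two equally long response lists."""
    if not gm or len(gm) != len(log):
        raise ValueError("inputs must be non-empty and of equal length")

    def quantize(values):
        # Assumed rule: uniform bins between the minimum and maximum value.
        lo, hi = min(values), max(values)
        span = hi - lo
        if span == 0:
            return [0] * len(values)
        return [min(int((v - lo) / span * levels), levels - 1) for v in values]

    q_gm, q_log = quantize(gm), quantize(log)
    hist = [[0.0] * levels for _ in range(levels)]
    for a, b in zip(q_gm, q_log):
        hist[a][b] += 1.0 / len(gm)
    return hist


def flatten(hist):
    """Row-major flattening of the histogram into a feature vector."""
    return [v for row in hist for v in row]
```

With 10 levels the flattened vector has 100 entries, many of which are typically empty, which is consistent with keeping only a subset of indices.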
Image naturalness generally refers to the degree to which an image appears natural or realistic, and it affects the perceived quality of an image; it is therefore often considered in IQA. The MSCN coefficient distribution of distorted natural scene images differs from that of pristine ones, so we employ it to measure the naturalness of an image. The MSCN coefficients are computed by subtracting the local mean and dividing by the local contrast, where the local statistics are weighted by ω, a unit-volume Gaussian window of size 7 × 7. The generalized Gaussian distribution (GGD) in 37 is utilized to model the MSCN coefficient distribution as features. Besides, the asymmetric generalized Gaussian distribution (AGGD) is employed to model adjacent MSCN coefficients in four directions, namely horizontal, vertical, main diagonal, and sub-diagonal 8, as feature representation.
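For illustration, the widely used MSCN definition divides the mean-subtracted luminance by the local standard deviation plus a small constant, (I − μ) / (σ + C), with μ and σ computed under the Gaussian window ω. The sketch below assumes C = 1, a window standard deviation of 7/6, and edge replication at the borders; these three values are assumptions and are not stated in the text.

```python
import math


def gaussian_window(size=7, sigma=7 / 6):
    """Unit-volume 2D Gaussian window (weights sum to 1)."""
    half = size // 2
    win = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma))
            for x in range(-half, half + 1)]
           for y in range(-half, half + 1)]
    total = sum(map(sum, win))
    return [[v / total for v in row] for row in win]


def mscn(image, size=7, sigma=7 / 6, c=1.0):
    """MSCN coefficients of a 2D list of luminance values."""
    h, w = len(image), len(image[0])
    win = gaussian_window(size, sigma)
    half = size // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            mu = 0.0
            second = 0.0
            for dy in range(-half, half + 1):
                yy = min(max(i + dy, 0), h - 1)  # replicate border rows
                for dx in range(-half, half + 1):
                    xx = min(max(j + dx, 0), w - 1)  # replicate border columns
                    wt = win[dy + half][dx + half]
                    v = image[yy][xx]
                    mu += wt * v
                    second += wt * v * v
            # Weighted variance; clamp tiny negatives caused by rounding.
            var = max(second - mu * mu, 0.0)
            out[i][j] = (image[i][j] - mu) / (math.sqrt(var) + c)
    return out
```

A flat image yields coefficients close to zero everywhere, while an isolated bright pixel yields a positive coefficient at its location and negative ones around it.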
Transform features. Considering that the transform technologies are widely utilized in image processing, we extract transform features from KLT and DCT coefficients to measure image perceptual quality from different aspects.
KLT is a data-driven transform and can extract quality-aware features 12,38. Non-overlapping patches of size $\sqrt{k} \times \sqrt{k}$ from the MSCN-normalized pristine image are vectorized and used to calculate the covariance matrix. The transform kernel $P$ of size $k \times k$ consists of the eigenvectors of the covariance matrix, arranged in descending order of their eigenvalues. We set k to 4 in this paper, and GGD is adopted to fit the KLT coefficient distribution in each frequency band as the transform feature.
According to Benford's law, in a numeral system with base $b$, the probability that a number starts with digit $n$ is $p(n) = \log_b(n + 1) - \log_b(n)$. Ou et al. 39 found that the distance between the distribution of the pristine image, $p(n)$, and that of the distorted image, $p_d(n)$, in the DCT domain is highly correlated with the subjective score of the distorted image. Therefore, we calculate the Euclidean distance between these two distributions in the color relevant space in Eq. (5) as features. An example of Benford's law for a distorted image is shown in Fig. 2: the distribution of the DCT coefficients of the reference image is very close to the Benford's law distribution, whereas the distributions of the distorted images are quite different.
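The Benford-based distance can be illustrated with a short sketch. It computes the base-10 Benford distribution from the formula above, the empirical leading-digit distribution of a list of values, and the Euclidean distance between the two. The input list stands in for DCT coefficients, which are not computed here, and the function names are hypothetical.

```python
import math


def benford_distribution(base=10):
    """p(n) = log_b(n + 1) - log_b(n) for leading digits n = 1 .. base - 1."""
    return [math.log(n + 1, base) - math.log(n, base) for n in range(1, base)]


def first_digit_distribution(values):
    """Empirical distribution of the leading decimal digit (zeros are skipped)."""
    counts = [0] * 9
    for v in values:
        if v == 0:
            continue
        # Scientific notation puts the leading digit first, e.g. '3.21...e-03'.
        digit = int(f"{abs(v):.15e}"[0])
        counts[digit - 1] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no non-zero values")
    return [c / total for c in counts]


def euclidean_distance(p, q):
    """Euclidean distance between two distributions of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

For example, `euclidean_distance(benford_distribution(), first_digit_distribution(coeffs))` gives one scalar per coefficient set, where `coeffs` is any list of non-zero transform coefficients.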
Significant feature selection and model training. We extract all the above-mentioned spatial and transform features from two scales, i.e., the original scale, and the 1/2 downsampled scale, except for the DCT feature, which is extracted only from the downsampled scale (for lower computational complexity). The original patch size is 96 × 96. To reduce the redundancy among different types of features, we select the columns from the feature matrix extracted from 125 pristine natural scene images in 29 whose average values are higher than 0.01 as significant features. These pristine images are also utilized to train the benchmark MVG model parameters and KLT kernels.
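The column selection described above can be sketched as follows, assuming the feature matrix is stored as a list of rows with one row per pristine sample. The function names are hypothetical; the threshold of 0.01 is the value given in the text.

```python
def significant_columns(feature_matrix, threshold=0.01):
    """Indices of columns whose average value exceeds the threshold."""
    n = len(feature_matrix)
    d = len(feature_matrix[0])
    means = [sum(row[j] for row in feature_matrix) / n for j in range(d)]
    return [j for j, m in enumerate(means) if m > threshold]


def select_columns(feature_matrix, indices):
    """Keep only the given columns, in the given order."""
    return [[row[j] for j in indices] for row in feature_matrix]
```

The indices are computed once on the pristine feature matrix and then reused unchanged for the features of every test image, so that both MVG models live in the same feature space.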
Comparison with state-of-the-art methods. Table 1 shows the Spearman rank-order correlation coefficient (SROCC) of the proposed method as well as supervised BIQA methods 8,16,48,49 and unsupervised methods 29-31 on common distortion types (JPEG compression, JPEG2000 compression, white noise, and Gaussian blur) and real-world distortions. JPEG and JPEG2000 compression refer to lossy image compression using the JPEG and JPEG2000 codecs, which are very common in practice. White noise refers to additive white Gaussian noise, which is commonly encountered in image acquisition and transmission. Gaussian blur refers to blurring an image with a Gaussian filter, which is common in image acquisition. LIVE, MICT, CSIQ_sub, TID2013_sub, and KADID-10k_sub contain the common distortions, while CID2013 and LIVE-C contain real-world distortions. The SROCC values in each row are calculated between the objective scores predicted by the method and the subjective scores given in the databases. We utilize LIVE to train the supervised models and test on the remaining six IQA databases. "W.A." refers to the weighted average result over the seven IQA databases, with the number of distorted images in each database as the weight. The best results of both supervised and unsupervised methods are boldfaced. Our method achieves the highest results on both common and authentic distortions, as well as the highest weighted average result, compared with the reference methods.
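As a brief illustration of the evaluation protocol, SROCC can be computed as the Pearson correlation of the rank vectors (with tied values sharing their average rank), and the "W.A." entry as an average weighted by the number of distorted images per database. The sketch below uses only the standard library, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation (undefined for constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def srocc(objective, subjective):
    """Spearman rank-order correlation coefficient."""
    return pearson(average_ranks(objective), average_ranks(subjective))


def weighted_average(scores, weights):
    """Average of per-database scores weighted by database size."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
```

Because SROCC depends only on ranks, it is unaffected by any monotonic rescaling of the objective scores.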
We also perform statistical analysis by applying a t-test on the prediction residuals. The results are also tabulated in Table 1, wherein (1), (0), and (−1) indicate that our method is statistically superior, comparable, and inferior to the reference method, respectively, with 95% confidence. According to 31, the residuals are obtained by calculating the differences between the subjective scores and the converted objective scores using the nonlinear mapping in Eq. (11), where q and s(q) are the objective and the converted scores, respectively, and β1-β5 are the curve fitting parameters. This conversion is necessary because objective and subjective scores have different scales. The proposed method achieves comparable results with NIQE and SNP-NIQE, and better results than ILNIQE on common distortions.

Table 2 shows the feature dimension and average extraction time of different unsupervised BIQA methods. The tested methods are all implemented in MATLAB and run on a Windows system with an Intel Core i7-3770 3.40 GHz dual-core CPU and 8 GB RAM. ILNIQE has good generalization performance, but its computational complexity is too high. LPSI has very low computational complexity, but its performance is limited. The proposed method achieves the highest average results on the seven databases, with competitive generalization performance and running time.
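For the score conversion used to obtain the residuals, a five-parameter logistic function is the usual choice in the IQA literature. The sketch below assumes that form, s(q) = β1 (1/2 − 1/(1 + exp(β2 (q − β3)))) + β4 q + β5, and assumes the five parameters have already been fitted by a nonlinear least-squares routine, which is not shown. Whether this matches Eq. (11) exactly is an assumption.

```python
import math


def logistic5(q, b1, b2, b3, b4, b5):
    """Assumed five-parameter logistic mapping from objective to converted score."""
    return b1 * (0.5 - 1.0 / (1.0 + math.exp(b2 * (q - b3)))) + b4 * q + b5


def residuals(subjective, objective, params):
    """Differences between subjective scores and converted objective scores."""
    return [s - logistic5(q, *params) for s, q in zip(subjective, objective)]
```

The residual lists of two methods on the same images are what a t-test would then compare.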
The SROCC results for all distortions in TID2013 and KADID-10k are tabulated in Tables 3 and 4, respectively. For better visualization, the best and second-best results among the unsupervised methods are boldfaced and underlined, respectively. The proposed method achieves average performance comparable to state-of-the-art unsupervised BIQA methods on both databases, which demonstrates its good performance on uncommon distortions. However, the proposed method fails on some specific distortions, such as non-eccentricity pattern noise (NERP) and mean shift (MS) in Table 3, which are luminance-related distortions, and change of color saturation (CCS) in Table 3 and color diffusion (CD), denoise (DEN), and quantization (QN) in Table 4, which are color-related distortions. Since the features utilized in our method mainly measure image structure and naturalness, they cannot accurately quantify luminance- and color-related distortions, which is a shortcoming of our method to be addressed in future work.
From these tables, we can see that, with the joint utilization of finely selected spatial and transform features, the proposed method can assess perceived image quality from different aspects, achieving the highest quality assessment accuracy on most distortion types with relatively low computational complexity. The results show the superiority and good generalization capability of our method as an unsupervised BIQA method.

Parameters optimization and ablation studies. The original GM-LOG features have many zeros and outliers, which make them less efficient; we therefore remove these redundant features to improve the efficiency of GM-LOG. We perform the GM-LOG feature selection by keeping the features whose minimum values are higher than TH_GM-LOG. The indices of the selected features are obtained from the pristine images 29 and then applied to the features extracted from distorted images. The results of GM-LOG features selected with different TH_GM-LOG values are tabulated in Table 5, with the best results boldfaced. When setting TH_GM-LOG = 0.05, the proposed method achieves the highest results on CSIQ, KADID-10k, CID2013, and LIVE-C, and comparable results on the remaining three databases. Therefore, in consideration of both efficiency and feature dimension, we set TH_GM-LOG = 0.05 to select efficient GM-LOG features in this paper.
We conduct experiments on seven IQA databases to study the optimal KLT kernel size for transform feature extraction, and employ SROCC as well as the Kendall rank-order correlation coefficient (KROCC) and Pearson linear correlation coefficient (PLCC) to evaluate the performance. The objective scores are mapped to the subjective scores via the nonlinear mapping in Eq. (11) before calculating PLCC 50. Weighted average results on the seven IQA databases are shown in Fig. 3a. As the KLT kernel size increases (from 4 × 4 to 16 × 16), the three metrics decrease; therefore, we set k = 4 in this paper. We also conduct experiments on the seven IQA databases to study the impact of α in Eq. (3) and plot the weighted average SROCC results in Fig. 3b. When setting α = 0.5, the proposed method achieves the highest result on the seven databases.
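For reference, KROCC counts concordant and discordant pairs of scores. The sketch below implements the simple tau-a variant, which makes no correction for ties; which Kendall variant was used in the experiments is not stated, so this choice is an assumption.

```python
def krocc(objective, subjective):
    """Kendall rank correlation (tau-a): (concordant - discordant) / total pairs."""
    n = len(objective)
    if n < 2:
        raise ValueError("at least two samples are required")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (objective[i] - objective[j]) * (subjective[i] - subjective[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Like SROCC, this measure uses only the ordering of the scores, so it can be computed without the nonlinear mapping that PLCC requires.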
As we extract multiple spatial and transform features, some features may contribute less than others. Therefore, to further improve the performance and reduce the feature dimension, we perform significant feature selection to remove the insignificant features. The experiment is conducted on seven IQA databases, and we remove the insignificant features by keeping the features whose average values are higher than TH_sf. We mark these feature indices based on features extracted from the pristine images 29 and then test on the IQA databases. The results are tabulated in Table 6, with the best results boldfaced. When setting TH_sf = 0.01, the proposed method achieves the highest results on MICT, TID2013, and LIVE-C, and comparable results on the remaining four databases.

Table 4. SROCC results with all distortion types on the full KADID-10k database. The best and second-best results among the unsupervised methods are boldfaced and underlined, respectively.

To verify the effectiveness of the two types of features, i.e., spatial features and transform features, we report the ablation test results in Table 7. The spatial features usually take the leading role, the transform features work as supplements, and the combination of the two achieves the optimal results. However, on the full TID2013 and KADID-10k databases, the performance of the combined features is inferior to that of the transform features alone but better than that of the spatial features alone, which means that the transform features take the leading role on uncommon artificial distortion types.

Performance on screen content images.
Although the proposed method is designed for natural scene images, we also conduct an experiment on the screen image quality assessment database (SIQAD) 47 to further verify its generalization ability. SROCC, KROCC, PLCC, and root mean square error (RMSE) of different unsupervised BIQA methods are tabulated in Table 8, with the best results boldfaced. Similarly, the objective scores are mapped to the subjective scores via the nonlinear mapping in Eq. (11) before calculating RMSE 50. The proposed method achieves the best performance among the competing unsupervised BIQA methods, but the result is still not good enough and can be further improved in the future.

Conclusion
In this paper, we proposed an unsupervised BIQA method utilizing joint spatial and transform features. Specifically, we utilized PC, GM, GM-LOG, and MSCN as spatial features, and KLT and DCT coefficients as transform features. These features were analyzed to remove redundancy and then fitted to an MVG model for unsupervised BIQA. Experiments on multiple IQA databases indicated that the proposed method achieved state-of-the-art results with low complexity on both artificial and authentic distortions. Future work will focus on improving the performance on authentic distortions and illustrated images.

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request. No humans or animals were directly involved in this study.