CUI-Net: a correcting uneven illumination net for low-light image enhancement

Uneven lighting conditions often occur during real-life photography, such as images taken at night that may have both low-light dark areas and high-light overexposed areas. Traditional algorithms for enhancing low-light areas also increase the brightness of overexposed areas, affecting the overall visual effect of the image. Therefore, it is important to achieve differentiated enhancement of low-light and high-light areas. In this paper, we propose a network called correcting uneven illumination network (CUI-Net) with sparse attention transformer and convolutional neural network (CNN) to better extract low-light features by constraining high-light features. Specifically, CUI-Net consists of two main modules: a low-light enhancement module and an auxiliary module. The enhancement module is a hybrid network that combines the advantages of CNN and Transformer network, which can alleviate uneven lighting problems and enhance local details better. The auxiliary module is used to converge the enhancement results of multiple enhancement modules during the training phase, so that only one enhancement module is needed during the testing phase to speed up inference. Furthermore, zero-shot learning is used in this paper to adapt to complex uneven lighting environments without requiring paired or unpaired training data. Finally, to validate the effectiveness of the algorithm, we tested it on multiple datasets of different types, and the algorithm showed stable performance, demonstrating its good robustness. Additionally, by applying this algorithm to practical visual tasks such as object detection, face detection, and semantic segmentation, and comparing it with other state-of-the-art low-light image enhancement algorithms, we have demonstrated its practicality and advantages.

details and reduce the impact of noise, thereby improving the quality of the image 22. Transformer-based methods have made important progress in low-level vision tasks such as image super-resolution 23,24, image denoising 25, and image dehazing 26. Related Transformer methods 27,28 have also been applied to low-light image enhancement and have achieved good performance, as they model non-local information well and thus achieve high-quality image reconstruction. However, these methods do not enhance the local features of the image well, which is what CNNs excel at. Recent research 29-31 has therefore attempted to combine CNN and Transformer networks to exploit the advantages of both and improve performance on the corresponding tasks. For low-light enhancement, the network architecture needs to be adapted to the fact that low-light images contain more low-light features than high-light features. At the same time, real scenes often lack paired datasets, so zero-shot learning 32 methods are needed for low-light enhancement to support high-level vision tasks in such scenes. Here, zero-shot learning means that no paired or unpaired data is needed during training.
The contributions of this study are designed to address uneven illumination. Transformers, with their global attention mechanism, can model long-range pixel relations across an input image. However, the traditional self-attention mechanism demands substantial computational resources, and its large number of parameters can lead to overfitting. CNNs, on the other hand, are well regarded for enhancing local features and for their robustness, but they struggle to capture global context. If the two networks are combined without a thoughtful design of the CNN, the global features produced by the Transformer may not be learned effectively.
To unite the local feature extraction of CNNs with the global modeling of Transformers, the network introduced in this study makes specific improvements. The complexity of the Transformer module grows linearly, not quadratically, with image resolution, which allows contextual information to be acquired efficiently. The CNN module adopts a Transformer-like structure so that it concentrates better on the features extracted by the Transformer, compensating for the CNN's difficulty in acquiring global information and thus enhancing the model's efficiency 33. Ablation experiments were conducted during development, and multiple combinations were tested before finalizing the network architecture presented in this paper.
In particular, the channel attention mechanism of the auxiliary module and the Multi-Dconv head Sparse Attention (MDSA) module designed in this research address, to some extent, the high time and space complexity inherent to traditional Transformers. The sparse attention mechanism also improves the handling of local features in the image. In low-light enhancement tasks, overly bright local features may hinder the model's ability to capture other critical low-light features. To mitigate this problem, the MDSA module is adopted to represent local features more precisely and to strengthen their enhancement, marking the first application of the improved sparse attention mechanism in low-light enhancement tasks.
Figure 1 illustrates that in unevenly lit low-light environments, conventional self-attention mechanisms and ordinary sparse self-attention mechanisms tend to place the primary focus and weight on the highlight features, which is not ideal for low-light enhancement tasks. The sparse self-attention mechanism applied in this study biases the main weight towards low-light features while reducing the weight of highlight features, significantly improving the model's performance in low-light enhancement tasks. This strategy has not been explored in previous methods.
Of the two inputs to the Cross-Gated Feed-Forward Network (CGFN) module, one is processed through the MDSA module and the other bypasses it. The MDSA module implements the sparse attention mechanism on the channel dimension. Therefore, the proposed CGFN calculates weights in the spatial dimension, addressing the lack of spatial information after the feature passes through the MDSA module. Additionally, the gating mechanism can better suppress the further propagation of features that are unfavorable to model convergence. In low-light enhancement tasks, the feature information in the highlight area can severely hamper the enhancement quality. The CGFN module can further alleviate this problem, introducing a method not previously seen in other methodologies.

Figure 1. Handling strategies of different attention mechanisms in unevenly lit low-light environments. The traditional self-attention mechanism generally places its main focus on highlight features, and the conventional sparse self-attention mechanism also concentrates a significant portion of the weights on them. Such an approach is not ideal for low-light enhancement tasks because it tends to overexpose highlight areas while inhibiting sufficient enhancement of details in low-light areas. Our proposed sparse self-attention mechanism instead shifts the majority of the weights towards low-light features while reducing the weights of highlight features, which facilitates a more balanced extraction and processing of features.
Therefore, considering the characteristics of low-light images under uneven lighting, this article proposes a more effective zero-shot learning low-light enhancement network structure. The main contributions are summarized as follows:
• A zero-shot learning low-light enhancement network named CUI-Net is designed. The entire network comprises enhancement modules and auxiliary modules. The enhancement module merges the global attention mechanism of the Transformer with the ability of the CNN to process local features. It is computationally efficient and has powerful modeling capabilities. This structure enables better handling of uneven lighting, richer feature extraction, and image enhancement in low-light environments. The CNN in the auxiliary module augments the convergence ability of the enhancement module and indirectly rectifies the influence of lighting.

Related work
Traditional enhancement methods. Traditional low-light enhancement methods can be primarily divided into two types: methods based on histogram equalization (HE) and methods based on the Retinex model. Methods based on HE 2,3 redistribute pixel values based on the cumulative distribution function of the input image to expand the dynamic range. However, these methods are prone to color fidelity loss and the generation of noise, resulting in image distortion 4. The Retinex theory 5 decomposes low-light images into a reflectance part and an illumination part based on prior knowledge or regularization, as in the Single Scale Retinex model (SSR) 6 and the Multi-Scale Retinex model (MSR) 7. MSR is considered a weighted sum of several different SSR outputs. These methods may change the relative proportions of the three enhanced color channels compared with the original image, which can lead to color distortion 4. Fu et al. 34 proposed a fusion method that combines the advantages of the sigmoid function and histogram equalization, which has improved performance compared to 2,3. Guo et al. 35 initialized the illumination map of the image by finding the maximum value in the RGB channels and then optimized the initial illumination map by adding a structural prior to achieve image enhancement.

Deep learning-based methods. [...] an unfolding architecture search to handle low-light image enhancement. Self-Calibrated Illumination (SCI) 13 proposes a simplified network that fits physical principles to achieve low-light enhancement and introduces a calibration process in the training stage to improve the model's ability, thereby further improving the enhancement effect.
Methods combining CNN and transformer. CNN operations provide efficiency and universality, but their receptive fields are limited and cannot fully capture long-range pixel relationships in input images, which can affect image enhancement performance. In contrast, the self-attention mechanism in Transformers models long-range dependencies, enabling it to capture global information well. However, it lacks focus on the most relevant information 37 and its complexity grows quadratically with spatial resolution 14, leading to poor performance in some tasks. Thus, combining the two effectively to improve image enhancement quality is the focus of this paper. Conformer 29 uses a CNN branch and a Transformer branch and combines them through Feature Coupling Units to fuse local convolution blocks, self-attention modules, and MLP units, adjusting feature resolution and channel numbers while continually eliminating semantic differences between the CNN and Transformer branches. HNCT 30 integrates CNN and Transformer, using local and non-local priors to extract features beneficial for super-resolution and an enhanced spatial attention module to further improve performance. ECFAN 31 proposes a new hybrid super-resolution method, called ACT, that combines CNN and Vision Transformer 19 to aggregate local and non-local features and introduces cross-scale token attention modules to utilize multi-scale token representations.
After careful consideration and experimental comparison, our method uses three Transformer blocks as the encoder to preserve the most useful self-attention values. This avoids the further propagation of aggregated highlight features, allows useful global features to be fully utilized, and transmits useful local features so that the enhanced low-light images retain sufficient detail. Two CNN blocks serve as the decoder and further exploit the features obtained from the Transformer blocks to enhance the details and texture of low-light images, leveraging the advantages of CNNs.

Sparse attention.
Images captured in real-world scenarios often suffer from uneven illumination 32. For example, images taken at night may contain both dark areas and bright or overexposed regions, such as areas around light sources. Existing methods often enhance the dark and bright regions of the image simultaneously, which degrades the visual quality of the result, and current low-light image enhancement methods have not fully addressed this open problem. Zhao et al. 38 proposed a sparse Transformer to select the attention degree of the model. Fu et al. 37 proposed a target focus network and a sparse Transformer technique for visual object tracking. The target focus network focuses on the target of interest in the search region and highlights the features of the most relevant information to better estimate the state of the target. Inspired by SparseTT 37, we adapt the sparse Transformer to the low-light enhancement task. For low-light images with uneven illumination, the Transformer is susceptible to high-light features when computing self-attention, because these features receive higher attention values. This naturally biases the model towards enhancing high-light features rather than low-light features with low attention values when modeling global feature dependencies. Therefore, we propose a sparse attention operation that differs from the usual one: it sets the attention of high-light features to lower values to suppress high-light information and focus on the information most relevant to low-light enhancement.

Proposed method
In this section, the framework of CUI-Net and its two main modules, the enhancement module and the auxiliary module, are introduced. Finally, we explain the unsupervised training losses used in our neural network model.
Overall procedure. The proposed CUI-Net is a cascaded two-stage image enhancement network (Fig. 2). In the first stage, a Transformer network is introduced to obtain global information, which can better enhance the details of low-light images. In the second stage, an auxiliary network based on multiple convolutional network blocks is constructed, and the original input image is used as a constraint to control the output detail features of the first stage. Unlike traditional methods, the training part of CUI-Net requires multiple enhancement modules and auxiliary modules, while the testing part contains only an enhancement module.
Here, assume that the low-light input image is $I \in \mathbb{R}^{H \times W \times C}$, where H is the height, W is the width, and C is the number of channels. For RGB images, C is equal to 3. According to the Retinex theory, the low-light image I can be obtained from the clear image R and the illumination image L as follows 5:

$$I = R \otimes L \qquad (1)$$

where $\otimes$ denotes element-wise multiplication. In the cascade, $\mathrm{EM}_t$ is the t-th image enhancement module network with learnable parameters ϑ, and $\mathrm{AM}_t$ is the t-th auxiliary module network with learnable parameters µ. When t = 1, $\mathrm{EM}_1$ takes only the original low-light image I as input, i.e. $\mathrm{EM}_1(I; \vartheta)$, without any additional term.
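As a small illustration of the Retinex relation between I, R, and L stated above, the following sketch composes a low-light image from a reflectance image and an illumination map and inverts the relation by element-wise division. The function names, the toy values, and the `eps` guard are assumptions made for illustration; this is not the CUI-Net enhancement module itself.

```python
import numpy as np

def compose_low_light(reflectance, illumination):
    """Retinex forward model: low-light image = reflectance * illumination (element-wise)."""
    return reflectance * illumination

def recover_reflectance(low_light, illumination, eps=1e-4):
    """Invert the model by element-wise division.

    eps is an assumed guard against division by zero in very dark pixels.
    """
    return np.clip(low_light / np.maximum(illumination, eps), 0.0, 1.0)

# Toy 2x2 RGB example with values in [0, 1].
R = np.full((2, 2, 3), 0.8)                         # clear image (reflectance)
L = np.array([[0.1, 0.2], [0.5, 1.0]])[..., None]   # one illumination value per pixel
I = compose_low_light(R, L)                         # simulated low-light input
R_hat = recover_reflectance(I, L)                   # equals R when L is known exactly
```

In practice L is unknown and has to be estimated, which is the role of the learned modules; the sketch only shows why a good illumination estimate is sufficient to recover the clear image.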
Unlike the training part, the auxiliary module is not needed in the testing part, and only one enhancement module is used to obtain the clear image.

Image enhancement module. The image enhancement module consists of an efficient Transformer block and a CNN block, serving as the encoder and decoder, respectively. The Transformer model enhances low-light images by filtering out information from uneven lighting channels and local details, and then transferring the useful features to the next part of the network. The core of the Transformer block lies in the Multi-Dconv head Sparse Attention (MDSA) mechanism and the Cross-Gated Feed-Forward Network (CGFN). MDSA can reduce redundant features and raise the weights of important features, thus enhancing the network's robustness and generalization ability. The cross-gated mechanism can compensate for the lack of information in the spatial dimension, allowing useful information to propagate further and improving the integrity of the entire feature representation. The CNN block replaces the attention block of the traditional Transformer network with depth-wise convolutions and the feed-forward layer with a simplified CNN structure, which keeps it lightweight. This Transformer-like structure can further process feature information while retaining the generality and efficiency of a convolutional neural network.
In summary, the channel-wise sparse attention and cross-gated Transformer are used as the encoder in the image enhancement module. As the number of layers increases, the extracted features become increasingly abstract and semantically rich. The CNN block is used as the decoder to extract and enhance features at a higher level, making it more suitable for image enhancement under uneven lighting conditions. Pixel-level information transfer and context association through convolution can further improve the performance and efficiency of the model.
The specific process of the image enhancement module is shown in Fig. 3. The network structure diagrams of the Transformer and CNN modules in the enhancement module are shown in Fig. 4. First, the input low-light image I undergoes a 3 × 3 convolutional operation to extract low-level features and increase the number of channels.

Multi-Dconv head sparse attention. In traditional Transformer modules, the multi-head self-attention mechanism computes global information in the spatial dimension, so its complexity grows quadratically with resolution. The main purpose of sparse attention mechanisms is to reduce the time and space complexity of traditional Transformers 39. In this paper, the channel attention mechanism used in the MDSA module not only reduces model complexity and improves efficiency, but also helps the model better understand local features in the image. In low-light enhancement tasks, too many high-brightness local features may interfere with the model's ability to capture other low-light features. Therefore, this paper uses sparse attention to help the model represent local features better and improve its enhancement ability.

The specific structure of MDSA is shown in Fig. 5. The input tensor is denoted as $I \in \mathbb{R}^{\hat{H} \times \hat{W} \times 3}$. Q, K, and V represent query, key, and value. A 1 × 1 point-wise convolution is applied to aggregate pixel-level cross-channel context, followed by a 3 × 3 depth-wise convolution to encode channel-level spatial context. The operation marked in the figure stands for reshape. IsInMap is used to find the weights in the attention map matrix that are the same as the weights in the TopK matrix and to set the corresponding weights in the attention map to 0.01.
Different from the Vision Transformer model 19 , MDSA uses self-attention mechanism to calculate the similarity between each channel, i.e., attention calculation is performed on the channel dimension rather than on the spatial dimension.This enables MDSA to better capture the relationships between feature channels, thereby improving the model's representation ability and robustness.
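To make the difference between spatial and channel attention concrete, the following back-of-the-envelope sketch counts the multiply-add operations needed to form the attention map in each case. The function name and the example sizes are illustrative assumptions; projections, softmax, and multiple heads are ignored.

```python
def attention_cost(height, width, channels):
    """Return (spatial, channel) multiply-add counts for forming the Q-K product.

    Spatial attention builds an (HW x HW) map; channel attention builds a (C x C) map.
    """
    n = height * width
    spatial = n * n * channels          # (N x C) @ (C x N) -> N x N map
    channel = channels * channels * n   # (C x N) @ (N x C) -> C x C map
    return spatial, channel

small = attention_cost(64, 64, 48)
large = attention_cost(128, 128, 48)   # same channels, 4x as many pixels
spatial_growth = large[0] // small[0]  # quadratic in the number of pixels
channel_growth = large[1] // small[1]  # linear in the number of pixels
```

With four times as many pixels, the spatial cost grows by a factor of 16 while the channel cost grows by a factor of 4, which is the linear-versus-quadratic behavior the text refers to.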
Specifically, the TopK operation is performed on the attention map to select the top K attention values, followed by further operations.It should be noted that, unlike general sparse attention calculation, in the low-light task under uneven illumination, the channel information of the high-light area in the attention map is more likely to receive higher attention scores.These K attentions need to be set to 0.01 to allow the low-light channel features to be sent to CGFN for obtaining the required local information.
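The TopK suppression described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the row-wise TopK, applying the suppression after the softmax, the `scale` argument, and the toy shapes are illustrative choices. Only the channel-wise attention map and the replacement value 0.01 come from the text.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def suppress_topk(attn, k, value=0.01):
    """Return a copy of attn with the k largest entries of each row set to `value`."""
    out = attn.copy()
    # Indices of the k largest entries in every row (order within the k is irrelevant).
    top_idx = np.argpartition(attn, -k, axis=-1)[..., -k:]
    np.put_along_axis(out, top_idx, value, axis=-1)
    return out

def sparse_channel_attention(q, k_mat, v, topk, scale=1.0):
    """Channel attention with TopK suppression.

    q, k_mat, v have shape (C, H*W); the attention map has shape (C, C),
    so its size does not depend on the spatial resolution.
    """
    attn = softmax(q @ k_mat.T / scale, axis=-1)
    attn = suppress_topk(attn, topk)   # down-weight the dominant (assumed high-light) channels
    return attn @ v

# Toy usage: 4 channels on a 3x3 feature map.
rng = np.random.default_rng(0)
C, H, W = 4, 3, 3
q, k_mat, v = (rng.standard_normal((C, H * W)) for _ in range(3))
out = sparse_channel_attention(q, k_mat, v, topk=1)
```

Note the contrast with common sparse attention, which keeps the TopK entries and discards the rest; here the TopK entries are the ones being suppressed.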
are obtained by reshaping tensors of the original size $\mathbb{R}^{\hat{H} \times \hat{W} \times \hat{C}}$. SpAttention denotes sparse attention, and $W_p$ represents a 1 × 1 point-wise convolution. A learnable scaling parameter controls the magnitude of the dot product of K and Q.
Cross-gated feed-forward network. The two inputs of the Cross-Gated Feed-Forward Network (CGFN) are the input and the output of MDSA. The cross-gating part is equivalent to calculating weights on the spatial dimension and weighting specific positions, in order to compensate for the lack of spatial information in the features that have not passed through MDSA. The specific structure of the CGFN is shown in Fig. 6. Each single path of the CGFN module has two branches. One branch is a gating unit used to obtain the activation state of each pixel: a 1 × 1 convolutional layer expands the channel number, followed by a 3 × 3 depth-wise convolutional layer and StarReLU to generate the gate map. The other branch does not pass through the StarReLU activation function. The two branches are then multiplied element-wise. The cross-gating is computed crosswise on the two paths to compensate for the lack of spatial information. With $X \in \mathbb{R}^{\hat{H} \times \hat{W} \times \hat{C}}$ denoting the input of CGFN from MDSA and $Y \in \mathbb{R}^{\hat{H} \times \hat{W} \times \hat{C}}$ the input from the previous module without MDSA, the CGFN is given by Eq. (5).

Auxiliary module. The auxiliary module is necessary for unsupervised image enhancement methods, as they may have limitations such as over-enhancement and color bias 8. Therefore, a CNN with high efficiency and generalization ability is chosen as the auxiliary module to make the outputs of multiple enhancement modules converge to one enhancement effect, so that a single enhancement module in the testing phase achieves the same enhancement effect as the multiple enhancement modules used during training.
As shown in Fig. 2 and Formulas (2) and (3), the purpose of the auxiliary module is to correct the input of the enhancement module, indirectly affecting the output of the enhancement module. The input of the auxiliary module can be obtained by element-wise addition of the output of the previous enhancement module and the output of the auxiliary module, followed by division with the original low-light image. Thus, the auxiliary module can obtain the features of the enhancement module and correct the uneven illumination through the original low-light image.
The auxiliary module uses depth-wise convolution multiple times, which reduces the number of parameters and the computation cost, as shown in Fig. 7. First, the input image is passed through a 3 × 3 convolution layer to increase the channel number, and then through three CNN blocks. Finally, a 3 × 3 convolution layer is used to reduce the channel dimension. As shown in Fig. 4, the CNN block enhances the local details by passing the input features through depth-wise convolutions of 3 × 3 and 5 × 5, followed by the StarReLU activation function and multiple 1 × 1 convolutions to minimize the number of parameters. The corrected illumination information is then fed to the enhancement module, improving its enhancement effect.
Training loss. To account for color preservation, artifact removal, and gradient backpropagation, the loss function needs to be designed accordingly. The loss function used by CUI-Net is as follows:

$$L = \alpha L_c + \beta L_s$$

Here, L represents the total loss, $L_c$ and $L_s$ represent the correction loss and the smoothness loss respectively, and α and β are two positive balancing parameters. In the experiments, the balancing parameters are set to α = 1.5 and β = 1. The correction loss $L_c$ ensures the consistency between the estimated illumination and the adjusted result. In its definition, $\mathrm{EM}_x$ is the x-th enhancement module, $\mathrm{AM}_x$ is the x-th auxiliary module, and $\mathrm{AM}_0$ is the original input I.
As an unsupervised loss, this loss function only constrains the output through the auxiliary module. The smoothness loss 40 is then used. In its definition, N is the total number of pixels, i indexes the i-th pixel, and N(i) denotes the neighboring pixels of i in its 5 × 5 window. $w_{i,j}$ is the weight, which is specified in Eq. (14), where c represents the image channel in the YUV color space and σ = 0.1 is the standard deviation of the Gaussian kernel.
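A smoothness term of this kind can be sketched as below. The sketch is hedged heavily: only the 5 × 5 window, the YUV guidance channels, and σ = 0.1 are taken from the text, while the absolute-difference penalty, the exact Gaussian form of the weight, the normalization by the number of pixels, and all names are assumptions made for illustration.

```python
import numpy as np

def smoothness_loss(illum, guide_yuv, sigma=0.1, radius=2):
    """Edge-aware smoothness of an illumination map (assumed form).

    illum:     (H, W) illumination map.
    guide_yuv: (H, W, 3) guidance image in YUV; similar neighbors get weights near 1,
               neighbors across an edge get weights near 0.
    radius=2 gives the 5 x 5 neighborhood.
    """
    h, w = illum.shape
    total = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            # Region where both a pixel and its (dy, dx) neighbor lie inside the image.
            ys, ye = max(0, dy), min(h, h + dy)
            xs, xe = max(0, dx), min(w, w + dx)
            a = illum[ys:ye, xs:xe]
            b = illum[ys - dy:ye - dy, xs - dx:xe - dx]
            ga = guide_yuv[ys:ye, xs:xe]
            gb = guide_yuv[ys - dy:ye - dy, xs - dx:xe - dx]
            # Gaussian weight on the summed squared channel differences.
            weight = np.exp(-np.sum((ga - gb) ** 2, axis=-1) / (2 * sigma ** 2))
            total += np.sum(weight * np.abs(a - b))
    return total / (h * w)

# Toy usage: a vertical step edge in the illumination map.
illum = np.zeros((4, 4))
illum[:, 2:] = 1.0
guide_flat = np.zeros((4, 4, 3))                       # guidance without any edge
guide_edge = np.stack([illum, illum * 0, illum * 0], axis=-1)  # edge present in Y channel
loss_flat = smoothness_loss(illum, guide_flat)         # step is penalized everywhere
loss_aligned = smoothness_loss(illum, guide_edge)      # step coincides with a guidance edge
```

The comparison between the two toy losses shows the intended behavior: illumination changes are penalized in flat regions but tolerated where the guidance image itself has an edge.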

Experiment
To test the effectiveness of the algorithm, this paper verifies it on multiple datasets and tasks. First, the experimental settings are given, and tests are conducted on public datasets to demonstrate the effectiveness of the algorithm through quantitative comparison and qualitative analysis with existing methods. Then, high-level tasks, including low-light object detection, dark face detection, and nighttime semantic segmentation, are tested and compared with existing algorithms to further validate the effectiveness of the algorithm. Finally, ablation experiments are conducted to verify the effectiveness of each module.

Experimental settings. StarReLU performs well in both algorithm performance and computational efficiency because it reduces the computational cost of the activation function 43. Adan can complete the training of ViT 19 with only half the computational cost. Compared with the popular optimizer Adam 44, Adan has an additional hyperparameter β 2 for adjustment. β 2 is set to 0.08 in the experiments 42.
To verify the effectiveness and superiority of the proposed algorithm, CUI-Net is compared with state-of-the-art (SOTA) methods, including EnlightenGAN 9, KinD 45, ZeroDCE 11, ZeroDCE++ 46, RUAS 12, SCI 13, and Uretinex-Net 47. Additionally, comparisons are made in high-level vision tasks such as face detection, object detection, and semantic segmentation.

Benchmark description and evaluation metrics.
For image enhancement testing, 100 random images from the MIT dataset 48 and 50 random images from the LSRW dataset 49 are used. To quantitatively measure the algorithm's performance, three full-reference metrics (PSNR, SSIM, and LPIPS 50) and four no-reference metrics (NIQE 51, ILNIQE 52, NIMA 53, and MUSIQ 54) are used as evaluation metrics.
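Of the full-reference metrics listed above, PSNR is the simplest to state explicitly. The sketch below assumes images scaled to [0, 1] (the `max_val` default) and uses an illustrative function name; the evaluation code actually used for the reported tables is not specified in the text.

```python
import numpy as np

def psnr(reference, test, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    diff = reference.astype(np.float64) - test.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy usage: a uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01.
reference = np.zeros((8, 8, 3))
enhanced = np.full((8, 8, 3), 0.1)
score = psnr(reference, enhanced)
```

Higher PSNR means the enhanced image is closer to the reference, so a gain of about 1 dB corresponds to roughly a 20% reduction in mean squared error.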
For dark face detection tasks, the DARK FACE dataset 55, consisting of 1000 challenging test images, is used. A random selection of 500 images forms the training set, and 50 images are used for testing, with the average precision (AP) used as the evaluation metric.
For low-light object detection tasks, the ExDark dataset 56, specifically designed for low-light object detection, is used. A total of 1051 images are selected as the training set, and 406 images are used for testing, with evaluation metrics including mAP 0.5:0.95 and mAP 0.5.
For nighttime semantic segmentation tasks, the ACDC dataset 57 is used. The ACDC dataset is a self-driving dataset released at ICCV 2021. A total of 400 dark-condition images are used for training, and the remaining 106 images are used as the test set. The evaluation metrics include IoU and mIoU.
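The IoU and mIoU metrics used for segmentation can be computed from predicted and ground-truth label maps as in the following sketch. The function names are illustrative, and skipping classes that appear in neither map is an assumed convention rather than something stated in the text.

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    """IoU of each class between two integer label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        # A class absent from both maps has no defined IoU.
        ious.append(inter / union if union > 0 else float("nan"))
    return np.array(ious)

def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in at least one of the two maps."""
    return float(np.nanmean(per_class_iou(pred, target, num_classes)))

# Toy usage with two classes on a 2x2 label map.
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
ious = per_class_iou(pred, target, num_classes=2)   # class 0: 1/2, class 1: 2/3
miou = mean_iou(pred, target, num_classes=2)
```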
Quantitative and qualitative metrics. The quantitative results on the MIT dataset are shown in Table 1. CUI-Net achieved the best performance in SSIM, PSNR, LPIPS, and ILNIQE among the seven evaluation metrics. Specifically, CUI-Net achieved a PSNR of 19.3328 dB, which is 1.0259 dB higher than the best existing algorithm's 12 score of 18.3201 dB, and an ILNIQE score of 31.9151, which is 1.5756 lower than the score of the best existing algorithm.
The enhancement results on the MIT dataset are shown in Fig. 8. Compared with the ground truth (Fig. 8GT) for the input low-light original image (Fig. 8LL), the EnlightenGAN (Fig. 8a), KinD (Fig. 8b), ZeroDCE (Fig. 8d), SCI (Fig. 8f), and Uretinex (Fig. 8g) methods show inadequate enhancement, while ZeroDCE++ (Fig. 8e) shows over-enhancement. RUAS (Fig. 8c) turns the white petals in the upper part of the image pinkish, and its overall saturation is too high. In contrast, CUI-Net (Fig. 8h) shows better color restoration while maintaining realistic lighting conditions.
The quantitative results on the LSRW dataset are shown in Table 2. Among the seven evaluation metrics, CUI-Net achieved the best result in NIMA and the third-best results in PSNR, NIQE, and MUSIQ. Uretinex achieved good results on the LSRW dataset, which may be because the data augmentation method of the LSRW dataset is similar to that of the LOL dataset used in the supervised training of Uretinex. Our unsupervised method, however, may be less sensitive to artificially augmented datasets.
The enhancement results on the LSRW dataset are shown in Fig. 9. Except for ZeroDCE++ (Fig. 9e), which shows over-enhancement, the overall enhancement effect of the EnlightenGAN (Fig. 9a), KinD (Fig. 9b), RUAS (Fig. 9c), ZeroDCE (Fig. 9d), SCI (Fig. 9f), Uretinex (Fig. 9g), and CUI-Net (Fig. 9h) methods is similar. By enlarging selected local areas for detailed comparison, the outdoor and indoor parts of the scene are examined separately. RUAS (Fig. 10c), ZeroDCE++ (Fig. 10e), and SCI (Fig. 10f) showed over-exposure in the outdoor scenes. Uretinex (Fig. 10g), which achieved better quantitative results, also showed over-exposure. It is worth noting that even the ground truth (Fig. 10GT) shows over-enhancement in the outdoor scenes compared to the low-light original image (Fig. 10LL). Since CUI-Net (Fig. 10h) suppresses highlight areas under uneven lighting conditions, its better enhancement of outdoor scenes may not always be reflected in some evaluation metrics. For indoor scenes, EnlightenGAN (Fig. 10a), KinD (Fig. 10b), and ZeroDCE (Fig. 10d) resulted in blurred text and less realistic surface reflections, while CUI-Net not only enhances the details and contours of low-light areas but also restores the realistic lighting conditions of the scene. In addition, CUI-Net renders the text on the white paper and the paper box on the desk more clearly, which may have practical applications in low-light image text extraction tasks.
Although CUI-Net has some shortcomings in quantitative metrics on the LSRW dataset, the qualitative analysis of the enhancement results shows some discrepancies between the relevant metrics and subjective observations in practical applications.
We conducted training and testing on the unpaired low-light enhancement datasets MEF 58, VV, DICM 59, and LIME 35, with the qualitative results illustrated in Figs. 11, 12, 13, and 14, respectively. As can be observed, our method prevents overexposure across all four datasets, achieves a satisfactory enhancement of details, and restores realistic shadows and lighting. This can be seen, for instance, in the details of the tabletop, the facial features, the flower cluster and door numbers, and the cliff and buildings.
The quantitative results are shown in Tables 3, 4, 5, and 6.
From the tables, it can be observed that our method outperforms others in terms of quantitative results on unpaired low-light datasets, further demonstrating the robustness of our approach.

Dark face detection.
The DSFD 60 face detection framework, which adopts the SSD 61 network structure and was trained on the WIDER FACE 62 dataset, was used for the experiment. In the face detection experiment, results from different low-light enhancement methods were used as inputs to DSFD, and the AP (average precision) at different IoU thresholds was compared. The test results are shown in Table 7, where CUI-Net achieved the highest AP values at IoU thresholds of 0.5 and 0.6 and the second-highest AP value at an IoU threshold of 0.7.
Figure 15 shows the detection results of the different methods, together with the low-light input image (Fig. 15LL) and its face detection result (Fig. 15LD) for comparison. The lower right corner of each result image shows the corresponding magnified detail. At an IoU threshold of 0.5, only RUAS (Fig. 15c) and CUI-Net (Fig. 15h) detect the face in the area indicated by the arrow. EnlightenGAN (Fig. 15a), KinD (Fig. 15b), ZeroDCE (Fig. 15d), ZeroDCE++ (Fig. 15e), SCI (Fig. 15f), and Uretinex (Fig. 15g) fail to detect it. However, RUAS has serious overexposure, and the details on the ground cannot be seen clearly. CUI-Net not only detects more faces but also produces realistic enhancement effects, with better quantitative indicators than other SOTA methods.
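The IoU thresholds behind the AP values above decide whether a detected face counts as a match to a ground-truth face. A minimal sketch, assuming axis-aligned boxes given as (x1, y1, x2, y2); the function names are illustrative and are not taken from DSFD.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # zero if the boxes do not overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_match(pred_box, gt_box, threshold=0.5):
    """A prediction matches a ground-truth box if their IoU reaches the threshold."""
    return box_iou(pred_box, gt_box) >= threshold

# Toy usage: two 2x2 boxes shifted by one unit overlap with IoU 1/3.
iou = box_iou((0, 0, 2, 2), (1, 0, 3, 2))
matched = is_match((0, 0, 2, 2), (1, 0, 3, 2), threshold=0.5)
```

Raising the threshold from 0.5 to 0.7 demands tighter localization, which is why AP values typically drop as the threshold increases.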
Low-light object detection. We trained the YOLOv3 63 model on the ExDark object detection dataset and tested it on the ExDark validation dataset. YOLOv3 is a series of object detection frameworks and models pre-trained on the COCO dataset 64. Unlike the face detection experiments, we fine-tuned the pre-trained YOLOv3 model for object detection, i.e., we retrained the object detection model to evaluate the enhancement effects of all methods. Table 8 shows the quantitative results for the different methods. CUI-Net achieved the best values for both mAP 0.5:0.95 and mAP 0.5.
The experimental results were obtained by performing object detection on low-light images enhanced by the various SOTA algorithms. The baseline is object detection performed directly on the unenhanced low-light images. The detection results for the low-light image (Fig. 16LL) are shown in Fig. 16. Only RUAS (Fig. 16c), ZeroDCE++ (Fig. 16e), Uretinex (Fig. 16g), and CUI-Net (Fig. 16h) recognize the most targets. EnlightenGAN (Fig. 16a), KinD (Fig. 16b), ZeroDCE (Fig. 16d), SCI (Fig. 16f), and the baseline (Fig. 16LD) did not detect the targets completely. The overall average confidence values of RUAS, ZeroDCE++, and Uretinex are lower than that of CUI-Net. In addition, the main reason why RUAS and ZeroDCE++ have lower mAP values in Table 8 is overexposure. CUI-Net, by contrast, strikes a good balance and avoids the lower mAP scores caused by overexposure.
Low-light semantic segmentation. We evaluated all methods on the ACDC low-light semantic segmentation dataset using the DeepLab-V3+ 65 model in pre-training and fine-tuning mode. The pre-trained model was trained on the Cityscapes dataset 66. Table 9 shows the mIoU values for multiple categories and the overall average for the different low-light enhancement methods. CUI-Net achieved the best mIoU score for six segmentation targets and the second-best score for seven segmentation targets. It outperformed the second-best method by 4.5 in the wall category, 1.9 in the traffic light category, and 6.6 in the motorcycle category. Its overall average mIoU was 2.8 higher than that of the second-best method.
Table 10 shows the mAcc values for multiple categories and the overall average for the different low-light enhancement methods. CUI-Net achieved the highest mAcc values for five segmentation targets, exceeding the second-best method by 12.7 in the motor category and by 22.9 in the rider category. CUI-Net also obtained the second-highest mAcc value for four segmentation targets, with an overall mAcc value 5 higher than that of the second-best method.
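As a reminder of how the two segmentation metrics are commonly defined, the sketch below computes per-class IoU and per-class accuracy from a confusion matrix and averages them into mIoU and mAcc. The function name `segmentation_scores` and the two-class confusion matrix are hypothetical, and the sketch assumes every class appears at least once in the ground truth; the values in Tables 9 and 10 appear to be reported on a 0 to 100 scale, whereas this example returns fractions.

```python
import numpy as np


def segmentation_scores(conf):
    """Return (mIoU, mAcc, per-class IoU, per-class accuracy).

    conf[i, j] counts pixels whose ground-truth class is i and predicted class is j.
    Assumes every class occurs in the ground truth (no zero rows).
    """
    conf = np.asarray(conf, dtype=np.float64)
    tp = np.diag(conf)                    # correctly labelled pixels per class
    gt_pixels = conf.sum(axis=1)          # ground-truth pixels per class
    pred_pixels = conf.sum(axis=0)        # predicted pixels per class
    iou = tp / (gt_pixels + pred_pixels - tp)
    acc = tp / gt_pixels
    return iou.mean(), acc.mean(), iou, acc


# Hypothetical 2-class confusion matrix, for illustration only.
example_conf = [[8, 2],
                [1, 9]]
miou, macc, per_class_iou, per_class_acc = segmentation_scores(example_conf)
print(f"mIoU = {miou:.4f}, mAcc = {macc:.4f}")
```

Because IoU penalises false positives as well as false negatives, a method can gain more in mAcc than in mIoU for the same category, which is consistent with the larger per-category margins reported for mAcc.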
Figure 17 shows the overlaid results of semantic segmentation masks and enhanced images on the ACDC dataset. Overall, RUAS (Fig. 17c) and SCI (Fig. 17f) exhibited overexposure. EnlightenGAN (Fig. 17a), KinD (Fig. 17b), ZeroDCE (Fig. 17d), ZeroDCE++ (Fig. 17e), Uretinex (Fig. 17g), and CUI-Net (Fig. 17h) showed no significant differences overall. For nighttime semantic segmentation applications, however, attention to detail is particularly important, such as timely segmentation of pedestrians and traffic signs on the road to avoid serious accidents during nighttime autonomous driving. The local detailed semantic segmentation results for each method corresponding to the red boxes in Fig. 17 are shown in Fig. 18. Comparing with the ground truth in Fig. 19, for the first red box region, which contains two traffic signs, EnlightenGAN (Fig. 18a), KinD (Fig. 18b), RUAS (Fig. 18c), ZeroDCE++ (Fig. 18e), and Uretinex (Fig. 18g) failed to segment both traffic signs, while ZeroDCE (Fig. 18d) and SCI (Fig. 18f) only recognized the left traffic sign. CUI-Net (Fig. 18h), however, was able to recognize both traffic signs. For the middle red box region, which contains two pedestrians and two traffic signs, only ZeroDCE++ (Fig. 18e) and Uretinex (Fig. 18g) recognized both traffic signs, while our CUI-Net (Fig. 18h) recognized an additional pedestrian. For the right red box region, which contains two pedestrians, only KinD (Fig. 18b), SCI (Fig. 18f), and CUI-Net (Fig. 18h) were able to segment both pedestrians well. In addition, for the pedestrian crossing category, which does not exist in the ACDC dataset, Fig. 17 shows that CUI-Net has the most obvious enhancement effect, which may be useful for safe nighttime autonomous driving. Overall, CUI-Net shows potential for nighttime semantic segmentation tasks.
Secondly, to verify whether the network structure design of the enhancement module is effective, we replaced the five modules in the overall network with a full CNN module, a full Transformer module, and the three
Finally, an ablation study was conducted on the sparse attention operation on channels in the MDSA module of CUI-Net. The results are shown in Table 14. The Topk_normal operation is the usual sparse attention operation, where all attention weights except for the TopK are set to zero. In contrast, the Top_CUI operation used in CUI-Net reduces the attention weights of the channels obtained by TopK to a very low value. The results of the ablation study indicate that the sparse attention on channels used in CUI-Net contributes to achieving better enhancement results.
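The difference between the two sparse attention variants can be sketched as follows. This is only one possible reading of the description above, not the MDSA implementation: the function names, the scaling factor `low`, the choice to apply the suppression after the softmax, and the absence of renormalisation are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; entries equal to -inf receive zero weight."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def topk_mask(scores, k):
    """Boolean mask marking the k largest entries in each row."""
    idx = np.argpartition(scores, -k, axis=-1)[..., -k:]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask


def topk_normal(scores, k):
    """Usual top-k sparse attention: everything outside the top-k gets zero weight."""
    masked = np.where(topk_mask(scores, k), scores, -np.inf)
    return softmax(masked)


def topk_suppress(scores, k, low=1e-4):
    """Assumed Top_CUI-style variant: the top-k weights are scaled to a very low value."""
    weights = softmax(scores)
    return np.where(topk_mask(scores, k), weights * low, weights)


# Hypothetical channel-attention scores for a single query channel.
scores = np.array([[1.0, 3.0, 2.0, 0.5]])
print("Topk_normal-style :", topk_normal(scores, k=2))
print("Top_CUI-style     :", topk_suppress(scores, k=2))
```

Under this reading, the usual operation keeps only the strongest channel responses, while the suppressing variant damps them, which matches the stated goal of constraining high-light features so that low-light features contribute more.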

Conclusion
In this paper, we propose a CUI-Net framework consisting of an enhancement module and an auxiliary module, which can achieve differential enhancement of low-light and high-light regions in low-light environments. In the enhancement module, an efficient low-light enhancement Transformer and CNN network are introduced to enhance low-light images by acquiring global pixel information. In the auxiliary module, a lightweight CNN network is designed to assist the enhancement module to converge better and correct lighting effects. Quantitative analysis and qualitative comparison of CUI-Net with other state-of-the-art low-light image enhancement methods were conducted on two public low-light datasets, demonstrating the effectiveness of the proposed method. Furthermore, the practicality of the method was further verified through high-level vision tasks, namely low-light object detection, dark face detection, and nighttime semantic segmentation.

Figure 2. Overall framework of the CUI-Net. Only one enhancement module is used to obtain results during the testing phase.

Figure 3. Network architecture of the enhancement module.

Figure 4. The network structure diagrams of the Transformer module used in the enhancement module and the CNN module used in both the enhancement and auxiliary modules.

Figure 7. Overall architecture diagram of the auxiliary module.

Figure 18. Enlarged details of the red boxes in Fig. 17: (a) EnlightenGAN; (b) KinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net.

Figure 19.

• A Multi-Dconv head Sparse Attention (MDSA) module was designed. The MDSA module constrains high-light features at the channel level and increases the weight of important local features. This design helps suppress the interference of overly bright features, allowing the model to focus on and extract low-light features better, thereby enhancing the model's performance in low-light enhancement tasks.
• A novel Cross Gating Feedforward Network (CGFN) was proposed. CGFN can not only effectively suppress the further spread of information features that are not conducive to model convergence but also compensate for the information loss in the spatial dimension through information exchange, thereby further boosting the efficiency and effect of the model. For low-light enhancement tasks, the feature information in the high-light area can seriously degrade the enhancement quality. The CGFN module can further mitigate this problem.
• Extensive experiments were conducted on nine challenging datasets. Most of the experimental results indicate that CUI-Net surpasses current state-of-the-art methods in terms of image quality enhancement effects and various evaluation indicators. More importantly, CUI-Net's superior performance in high-level visual tasks (such as object detection, face detection, and semantic segmentation) in real-world low-light scenarios further validates its practical value and effectiveness.

Table 1. Quantitative results of three supervised metrics (PSNR, SSIM, and LPIPS) and four no-reference metrics (NIQE, NIMA, MUSIQ, and ILNIQE) on the MIT dataset. The best and second best results are highlighted in italic and bold, respectively.

Table 2. Quantitative results of three supervised metrics (PSNR, SSIM, and LPIPS) and four non-reference metrics (NIQE, NIMA, MUSIQ, and ILNIQE) on the LSRW dataset. The best result is highlighted in italic, the second-best is highlighted in bold, and the third-best is highlighted in bold italic.

Table 3. Quantitative test results on the MEF dataset. The best results are highlighted in italic, and the second-best in bold.

Table 4. Quantitative test results on the VV dataset. The best results are highlighted in italic, and the second-best in bold.

Table 5. Quantitative test results on the DICM dataset. The best results are highlighted in italic, and the second-best in bold.

Table 6. Quantitative test results on the LIME dataset. The best results are highlighted in italic, and the second-best in bold.

Table 7. AP (average precision) at IoU thresholds of 0.5, 0.6, and 0.7. The best result is marked in italic, and the second-best result is marked in bold.

Table 8. Quantitative results of object detection on the ExDark dataset. The best results are marked in italic, and the second-best results are marked in bold.

To verify whether the network structure of the enhancement module in CUI-Net can improve the model's enhancement ability, we conducted four ablation experiments, trained and tested on the LSRW dataset, and evaluated the quality of the enhanced images using SSIM, PSNR, and LPIPS. Firstly, to verify whether Adan and StarReLU can accelerate the convergence of the model, we trained for 50 epochs. The results are shown in Table 11, where it can be observed that replacing GELU with StarReLU and Adam with Adan leads to better results in a smaller number of epochs.
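For readers unfamiliar with the two activation functions compared in this ablation, the sketch below implements them following their commonly published definitions: GELU as x·Φ(x), and StarReLU as s·ReLU(x)² + b with scalar scale s and bias b. The default values of `scale` and `bias` are assumptions chosen so that the output has roughly zero mean and unit variance for standard normal input; in practice both are learnable, and the exact configuration used in CUI-Net is not stated here.

```python
import math

import numpy as np


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))


def star_relu(x, scale=0.8944, bias=-0.4472):
    """StarReLU: scale * relu(x)**2 + bias (scale and bias are learnable scalars in practice)."""
    return scale * np.maximum(x, 0.0) ** 2 + bias


# Compare the two activations on a few sample inputs.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print("GELU    :", np.round(gelu(x), 4))
print("StarReLU:", np.round(star_relu(x), 4))
```

StarReLU needs only a comparison, a square, and an affine map per element, so it is cheaper to evaluate than GELU, which requires the error function.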

Table 11. Ablation experiment of replacing GELU and Adam with StarReLU and Adan.

Table 12. Replacing the five modules used in the original CUI-Net with different ones.

Transformer modules and two CNN modules of CUI-Net for experimental analysis. The results obtained are shown in Table 12, and the network structure of CUI-Net achieves better performance. Thirdly, to verify whether MDSA and CGFN can improve the model's enhancement ability, we compared them against MDTA and GDFN from Restormer in an ablation study. The results are shown in Table 13; both MDSA and CGFN improve the performance of the model.

Table 13. Ablation experiments comparing the network modules used in the Transformer block of CUI-Net with MDTA and GDFN.

Table 14. Ablation experiment comparing the usual sparse attention mechanism with the sparse attention mechanism used in CUI-Net.