Deep guided transformer dehazing network

Single image dehazing has received considerable attention and achieved great success with the help of deep-learning models. Yet, performance is limited by the local nature of convolution. To address this limitation, we design a novel deep-learning dehazing model that combines a transformer and a guided filter, called the Deep Guided Transformer Dehazing Network. Specifically, we address the limitation of convolution via a transformer-based sub-network, which can capture long-range dependencies. Haze density depends on scene depth, so global information is needed to estimate it and to remove haze from the input images correctly. To restore the details of the dehazed result, we propose a CNN sub-network to capture local information. To overcome the slow speed of the transformer-based sub-network, we improve dehazing speed via a guided filter. Extensive experimental results show consistent improvement over state-of-the-art dehazing methods on natural and simulated hazy images.

With the development of CNNs, many works [10][11][12][13][14][15]17,[31][32][33][34][35][36] attempt to solve dehazing using deep-learning models. These dehazing methods often attempt to compute the key factor of the physical model or the corresponding haze-free image directly. The works 10,11 employ CNN models to compute the transmission map. However, these methods may amplify errors in the transmission map and produce poor dehazing results. To deal with this problem, end-to-end dehazing methods [12][13][14][15]17,[31][32][33][34][37][38][39][40] have been proposed. For example, Zhang et al. design a CNN model that incorporates the physical model. Li et al. propose an all-in-one dehazing model 13, which fuses the transmission map and airlight into a new parameter. Liu et al. design a novel dehazing model 20 based on attention and a multi-scale network. However, all these dehazing methods are based on CNNs, which are limited by the local property of convolution. To capture long-range dependencies in hazy images, Guo et al. propose a transformer-based dehazing method 26, which employs a transformer-based encoder to capture the density of haze. Different from the above-mentioned methods, we overcome the problem of CNNs by introducing the transformer block into the dehazing model, which can capture long-range dependencies. Some works note the difference between simulated and real hazy images, which causes a drop in dehazing performance on real hazy images when the model is trained with simulated ones. To address this issue, PSD 17 proposes to combine traditional priors to improve the dehazing quality of real hazy images. The domain adaptation dehazing method (DA) 19 improves dehazing quality on real hazy images by converting simulated hazy images into real hazy images. We note that these methods are hard to train. Furthermore, the proposed method focuses on improving the learning ability on simulated hazy images, which is a different goal from PSD and DA.
Prior-based dehazing methods. To address the ill-posed nature of single image dehazing, many prior-based dehazing methods [1][2][3][4][5][6] or methods using additional information [7][8][9] have been proposed. These methods discover priors based on statistical analysis of clean or hazy images. The most famous work is the Dark Channel Prior (DCP), which is derived from the observation that a clean image patch contains at least one pixel with a channel value close to zero. Zhu et al. discover a color attenuation prior 5, which states that the divergence between intensity and saturation is positively correlated with depth. Fattal et al. 2 use a color-line prior to remove haze. Berman et al. find a haze-line prior 4 based on the observation that a haze-free image can be represented by a small number of color clusters. However, all these priors are simple and do not hold in complex real-world scenes.
Transformer for vision tasks. Natural language processing (NLP) has applied the Transformer 41 to capture long-range dependencies and improve the performance of learned models. The Transformer shows its effectiveness in NLP, and the image classification task 25 also employs the Transformer to improve performance. With the success of the Vision Transformer (ViT) 25 and its follow-ups 42,43, researchers have shown the potential of transformers for image segmentation 43 and object detection 42. Although visual transformers have shown their success in visual tasks, it is hard to directly apply them to single image dehazing. First, transformers often depend on large-scale datasets; however, there is no existing large-scale dataset to train a transformer-based model for image dehazing. Second, it is hard for transformers to capture local representations, which may result in the loss of image details. To overcome this issue, we propose combining the advantages of CNNs and transformers to capture local texture and global structure jointly and boost dehazing quality.

BaseNet. The BaseNet consists of an encoder that extracts features and a decoder that restores the haze-free image. The encoder contains four stages, and the decoder also contains four stages. Specifically, each encoder stage contains one transformer block followed by one down-sampling layer. Similarly, each decoder stage contains one transformer block followed by one up-sampling layer. The down-sampling layer is designed to downscale the feature maps and is implemented by a 3 × 3 convolution with stride 2. The up-sampling layer is designed to enlarge the feature maps and is implemented by a 2 × 2 transposed convolution with stride 2.
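As a quick sanity check of these layer choices, the output sizes of the stride-2 convolution and the 2 × 2 stride-2 transposed convolution follow the standard shape formulas. The sketch below is our own illustration; the padding of 1 for the 3 × 3 convolution is an assumption, chosen so that each stage exactly halves the spatial size:

```python
def conv_out(n, k, s, p):
    # standard convolution output size: floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

def tconv_out(n, k, s, p):
    # transposed convolution output size: (n - 1) * s - 2p + k
    return (n - 1) * s - 2 * p + k

# a 240-pixel side (the training patch size) through one encoder stage
down = conv_out(240, k=3, s=2, p=1)   # 3 x 3 conv, stride 2, padding 1
# and back through one decoder stage
up = tconv_out(down, k=2, s=2, p=0)   # 2 x 2 transposed conv, stride 2
```

With these settings each encoder stage halves the side (240 → 120) and each decoder stage restores it (120 → 240), so a four-stage encoder/decoder pair is shape-symmetric.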
The input of the BaseNet is a low-resolution version of a hazy image. The low-resolution hazy image is generated by bilinear downsampling, which produces a hazy image with half the size of the original input. We define the output of BaseNet as follows:

B = BaseNet(I_l), (2)

where BaseNet denotes the BaseNet, I_l is the low-resolution input hazy image, and B is the base layer of the dehazed result.
DetailNet. The DetailNet is designed to restore missing details. It contains four Residual Dilation Blocks (RDBs), whose structure is shown in Fig. 2. Each RDB contains two common convolution layers and two dilated convolution layers. We pass the low-resolution hazy image into the DetailNet and obtain the detail layer:
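To illustrate how a dilated convolution enlarges the receptive field without adding parameters, here is a minimal 1-D sketch (our own illustration, not the paper's code; the RDBs use 2-D convolutions):

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """'Same'-size 1-D dilated cross-correlation with zero padding."""
    k = len(w)
    span = (k - 1) * dilation              # receptive field is span + 1 samples
    pad = (span // 2, span - span // 2)
    xp = np.pad(x, pad)
    return np.array([sum(w[j] * xp[i + j * dilation] for j in range(k))
                     for i in range(len(x))])

x = np.arange(8.0)
center = np.array([0.0, 1.0, 0.0])         # identity kernel
# dilation 1 covers 3 samples; dilation 2 covers 5 samples with the same 3 weights
y1 = dilated_conv1d(x, center, dilation=1)
y2 = dilated_conv1d(x, center, dilation=2)
```

Stacking layers with growing dilation lets the CNN branch see a wider context while the plain convolution layers keep fine, local detail.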
D = DetailNet(I_l), (3)

where DetailNet denotes the DetailNet and D is the detail layer of the dehazed result.
After obtaining the structure layer of the dehazed result and the image detail layer, we can obtain the dehazed result as follows:

Ĥ_l = B + D, (4)

where Ĥ_l represents the predicted low-resolution haze-free image.
GuidedFilterNet. GuidedFilterNet is based on the guided filter, which relies on a local linear model. We can express the local linear model as:

q_{o,l} = A_ω I_{g,l} + B_ω, ∀ l ∈ ω, (5)

where q_o is the output, I_g is the guidance image, l is a location in I_g, ω is a local window in I_g with radius r, and (A_ω, B_ω) are the constant linear coefficients in the local window. This model can preserve the edges of I_g in q_o, since ∇q_o = A_ω ∇I_g within the window. To obtain (A_ω, B_ω), we solve for the coefficients of model (5) by reducing the difference between the output q_o and the filtering input p, i.e., we minimize the error:

E(A_ω, B_ω) = Σ_{l ∈ ω} ((A_ω I_{g,l} + B_ω − p_l)² + ε A_ω²), (6)

where ε is used to penalize large A_ω and p is the filtering input.
We employ the guided filter to perform joint upsampling: it receives a low-resolution hazy image, the corresponding low-resolution dehazed result, and the original hazy image as input, and produces the final high-resolution dehazed result. Based on the local linear model, the relation between a low-resolution hazy image and the corresponding low-resolution haze-free image can be expressed as:

H_{l,i} = A^l_ω I_{l,i} + B^l_ω, ∀ i ∈ ω, (7)

where H_l is the low-resolution dehazed result, I_l is the low-resolution hazy image, and i is the index into I_l. To obtain A^l_ω and B^l_ω, we reduce the error between Ĥ_l and H_l:

E(A^l_ω, B^l_ω) = Σ_{i ∈ ω} ((A^l_ω I_{l,i} + B^l_ω − Ĥ_{l,i})² + ε (A^l_ω)²). (8)

After obtaining A^l_ω and B^l_ω, we simplify Eq. (7) to:

H_l = A^l .* I_l + B^l, (9)

where .* is element-wise multiplication. Based on the local linear model, we can also express the relation between a high-resolution hazy image and the corresponding haze-free image as:

H_h = A^h .* I_h + B^h. (10)

Based on Eqs. (10) and (9), we can construct the relation between the high-resolution and low-resolution hazy images. According to 46, we can obtain the high-resolution A^h and B^h by bilinearly upsampling A^l and B^l. Algorithm 1 lists the main steps of the guided filter in DGTDN. U is the bilinear upsampling operation and Box represents box filtering. As shown in Fig. 1, GuidedFilterNet receives the output of the haze removal network as input and enlarges the low-resolution dehazed result according to the original hazy image. In the proposed model, GuidedFilterNet interacts with the haze removal network and the bilinear downsampling, performing a joint upsampling function. GuidedFilterNet is designed to enlarge the dehazed result and reduce the dehazing time of the proposed model.
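The per-window minimization above has the usual closed form of the guided filter (a = cov(I, p)/(var(I) + ε), b = mean(p) − a·mean(I)), so the whole filter reduces to a handful of box filters. The sketch below is a minimal single-channel NumPy version for illustration only; the radius and ε values are placeholders, not the paper's settings:

```python
import numpy as np

def box(x, r):
    """Mean over a (2r+1) x (2r+1) window via integral images, edge-padded."""
    xp = np.pad(x, r, mode="edge")
    c = np.cumsum(np.cumsum(xp, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))
    h, w = x.shape
    k = 2 * r + 1
    return (c[k:k + h, k:k + w] - c[k:k + h, :w]
            - c[:h, k:k + w] + c[:h, :w]) / (k * k)

def guided_filter(I, p, r=2, eps=1e-3):
    """Filter p using guidance I (both 2-D float arrays)."""
    mI, mp = box(I, r), box(p, r)
    cov_Ip = box(I * p, r) - mI * mp      # per-window covariance of I and p
    var_I = box(I * I, r) - mI * mI       # per-window variance of I
    a = cov_Ip / (var_I + eps)            # linear coefficient per window
    b = mp - a * mI                       # offset per window
    # average coefficients over overlapping windows, then apply the linear model
    return box(a, r) * I + box(b, r)
```

On a flat region (var_I ≈ 0) the coefficients collapse to a ≈ 0, b ≈ mean(p), so the output is smoothed; near strong guidance edges, a grows and those edges transfer to the output. In DGTDN the coefficient maps are computed at low resolution and bilinearly upsampled before the linear model is applied at full resolution.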

Loss functions.
Loss functions are critical to obtaining high-quality dehazing results. The proposed method produces dehazed results at two scales. To utilize this information, we propose a multi-scale content loss function:

L_con = (1/N) Σ_{n=1}^{N} ( ‖Ĥ_h^n − H_h^n‖_1 + ‖Ĥ_l^n − H_l^n‖_1 ), (12)

where N denotes the number of training samples, ‖·‖_1 denotes the L_1 norm, H_h is the ground-truth haze-free image, and H_l is the low-resolution ground truth. To make the predicted base layer similar to the low-resolution ground truth, we employ an L_1 loss between the low-resolution ground truth and the predicted base layer:

L_baseloss = ‖B − H_l‖_1, (13)

where L_baseloss is defined as the base loss. To further boost the quality of the dehazed result, we introduce a perceptual loss to train the proposed model:

L_perc = ‖VGG_j(Ĥ_h) − VGG_j(H_h)‖_1, (14)

where VGG represents the VGG-16 model, a classic model trained on ImageNet, and j indicates which layer is used to estimate the perceptual loss. Finally, we combine the multi-scale content loss, the base loss, and the perceptual loss to train the whole network:

L = L_con + λ_1 L_baseloss + λ_2 L_perc, (15)

where λ_1 determines the contribution of the base loss and λ_2 determines the contribution of the perceptual loss.
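The combined objective can be sketched as below. This is our own illustration, not the training code: `feat` stands in for a fixed VGG-16 layer (the identity is used only so the sketch stays self-contained), and the default weights follow the λ_1 = 1.0, λ_2 = 0.01 setting given in the implementation details:

```python
import numpy as np

def l1(x, y):
    """Mean absolute error, standing in for the L1 norm terms."""
    return float(np.abs(x - y).mean())

def dgtdn_loss(pred_h, gt_h, pred_l, gt_l, base_l,
               feat=lambda x: x, lam1=1.0, lam2=0.01):
    """Weighted sum of multi-scale content, base, and perceptual losses."""
    l_con = l1(pred_h, gt_h) + l1(pred_l, gt_l)   # content loss at both scales
    l_base = l1(base_l, gt_l)                      # base layer vs low-res GT
    l_perc = l1(feat(pred_h), feat(gt_h))          # perceptual term via `feat`
    return l_con + lam1 * l_base + lam2 * l_perc
```

A perfect prediction at both scales drives every term to zero, while an error at only the low resolution is penalized through both the content and base terms.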

Experimental results
In this section, we demonstrate the high performance of the proposed method. First, we introduce the implementation details of the proposed method and the dataset. Second, we compare the proposed method with other dehazing methods on simulated and real hazy images. Third, we show the effectiveness of the proposed modules and loss functions.

Implementation details.
In this subsection, we give the details of the proposed model. The proposed BaseNet is implemented based on the Swin-Transformer block. The configurations of the proposed RDB are listed in Table 1. The proposed DGTDN is implemented in a popular deep-learning framework (PyTorch) using a single GPU (TITAN V) with 12 GB memory. During training, we crop the training dataset into image patches of size 240 × 240. The learning rate is set to 0.001 and decreased by a factor of 0.8 every 10,000 steps. We set the batch size to 16. We employ the Adam optimizer to train the proposed model and initialize β_1 and β_2 to 0.5 and 0.999, respectively. We set λ_1 and λ_2 to 1.0 and 0.01, respectively.
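The step decay above corresponds to lr(t) = 0.001 · 0.8^⌊t/10000⌋, which can be sketched as:

```python
def lr_at(step, base_lr=0.001, gamma=0.8, period=10_000):
    # step decay: multiply by gamma once per full period of training steps
    return base_lr * gamma ** (step // period)
```

For example, the learning rate stays at 0.001 for the first 10,000 steps, drops to 0.0008 for the next 10,000, and so on geometrically.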
Following the strategy adopted in 20,21,36, ITS from RESIDE is chosen to train the proposed model, and indoor hazy images from the SOTS subset are used to evaluate dehazing performance. In addition, we evaluate performance on NH-HAZE.
Experimental results on simulated hazy images. In this part, we show the dehazing performance of the proposed DGTDN and other dehazing methods on simulated indoor hazy images. Because it is hard to find a ground-truth haze-free image for a real hazy image, simulated indoor hazy images are used to evaluate dehazing performance. We show quantitative and visual dehazing results in Table 2 and Fig. 3. As shown in Table 2, traditional dehazing methods obtain low quantitative results. Traditional dehazing methods derive priors from haze-free images, which may not hold for some hazy images; this is the main reason why traditional dehazing cannot achieve a high dehazing performance. Learning-based dehazing methods come in two kinds. The first learns to predict the transmission map, such as MSCNN 10 and DehazeNet 11. The second learns to predict clean images directly, such as DCPDN 15, GFN 12, MSBDN 21, and Dehamer 26. The learning-based methods 10,11 learn the relationship between the transmission map and hazy images. However, the relationship between transmission maps and dehazing quality is not highly correlated, which results in low dehazing performance. End-to-end dehazing methods 12,15,21,26 construct the relationship between hazy images and dehazed results. However, the dehazing ability of these models depends on the model capacity. The transformer-based dehazing method has a high model capacity and achieves the second-best dehazing performance.
To summarize, the proposed method achieves outstanding performance among well-known dehazing methods. As shown in Fig. 3, we note that traditional dehazing methods, such as DCP, NLD, and BCCR, often suffer from color distortion. The learning-based methods [10][11][12]15 have the problem of retaining haze. Other learning-based methods 21,26 can obtain dehazed results that are similar to the ground truth. The proposed method obtains high-quality visual dehazing results, which are more similar to the ground truth. We also test the dehazing performance on NH-HAZE 48, a widely used dataset. NH-HAZE is a famous dehazing dataset containing non-homogeneous haze, which is much harder to remove than traditional homogeneous haze. Dehazing performance on non-homogeneous haze therefore reflects a model's capability well. We list the quantitative performance of dehazing methods in Table 3. As shown in Table 3, DCP, BCCR, and NLD achieve low quantitative dehazing performance. We note that DehazeNet achieves even lower quantitative dehazing performance than DCP, BCCR, and NLD. Learning-based methods 10,12,13,15,21 achieve higher dehazing performance. Dehamer achieves the second-best quantitative dehazing performance. The proposed method demonstrates the best PSNR and SSIM among the listed dehazing methods. These results demonstrate the effectiveness of the proposed method, which benefits from the combination of CNN and transformer. We also show visual dehazed results of the proposed method and other state-of-the-art methods. As shown in Fig. 4, the traditional dehazing methods often over-enhance the dehazed results, which contain obvious color distortion. The learning-based methods tend to retain haze in dehazed results. In contrast, the proposed method often obtains visually pleasing dehazed results with vivid colors and rich image details.

Experimental results on real-world haze images.
To further show the performance, we choose some typical real-world hazy images. The density and distribution of haze in real hazy images are more varied than in synthetic images; hence, real-world image dehazing is a more challenging problem. In this part, we choose three hazy images: a dense haze image, an image with a large haze distribution, and a dark hazy image. These images can show the generalization and dehazing performance of deep-learning-based models.
Firstly, we conduct an experiment on a dense haze image. The dehazed results of state-of-the-art methods and the proposed method are shown in Fig. 5. As shown, the image exhibits dense haze over the whole scene, which is hard for CNN-based dehazing methods. The dehazed results of AOD-Net 13 and DCPDN 15 tend to retain haze. The dehazed result of GFN 12 contains visible color distortion and haze. The dehazed result of cGAN 49 contains less color distortion than GFN and removes haze better than AOD-Net, DCPDN, and GFN. We note that the dehazed results of EPDN, Dehamer, and the proposed method are better than those of other learning-based methods, although the lake area is not well dehazed in the result of EPDN. The proposed method removes haze more completely than EPDN and Dehamer because the transformer can capture long-range dependencies, which boosts dehazing quality. Both the proposed method and Dehamer remove haze from dense haze images, but the proposed method employs a CNN to restore image details, enabling it to restore more image details than Dehamer.
Table 3 .
Evaluation results of dehazed results using average PSNR/SSIM on the NH-HAZE dataset 48.

Secondly, we conduct an experiment on a hazy image with a large haze distribution. This is a typical image that has been widely employed to evaluate dehazing performance. It contains dense haze areas, a middle haze area, and light haze areas, which are marked using black, red, and green circles, respectively. Due to its large haze distribution, learning-based methods often fail to remove haze well. As shown in Fig. 6, we note that the traditional methods 1,4 often show better dehazed results than the learning-based methods 12,13,15. DehazeNet and MSCNN are based on deep learning and Koschmieder's law. We note that the dehazed result of MSCNN is better than that of DehazeNet, as it removes more haze, though MSCNN loses some image details. We note that the dehazed result of PGAN 34 still contains haze. The dehazed results of EPDN and Dehamer remove haze better; however, these methods tend to generate a dark dehazed result and show some haze around the green circle area. The proposed method removes haze more completely and keeps the image details well. Thirdly, we conduct an experiment on a more challenging image, which looks dark. Dehazed results for this image often suffer from loss of image details and retained haze. As shown in Fig.
7, we can see that the dehazed results of DCP, AOD-Net, AECR, AirNet 14, EPDN, and Dehamer tend to show a dark appearance. The dehazed results of DCPDN, FFA, and PGAN look brighter; however, they tend to retain haze. The dehazed results of DA, PSD, and DGTDN are much brighter. However, the dehazed result of PSD tends to retain haze over the whole image, while the result of DA tends to leave haze in the black rectangle and looks blurry. AirNet is based on the assumption that the whole image shares a similar degradation. In contrast, the proposed method removes haze more completely and obtains a sharp dehazed result. To quantitatively assess the quality of the dehazed results obtained by the proposed method and other dehazing methods, we use the metric proposed in 50. As shown in Table 4, the proposed method removes haze better than the other dehazing methods.

Ablation studies.
To verify the effectiveness of the proposed modules in DGTDN, we design a series of experiments. Firstly, we design a model to show the effectiveness of the transformer: we remove the transformer from the proposed model and keep the other parts unchanged, which we term model1. Secondly, we show the effectiveness of the DetailNet: we remove the DetailNet from the proposed model and keep the other parts unchanged, which we term model2. Finally, we show the effectiveness of the GuidedFilterNet, which boosts the dehazing speed of the proposed model: we design a model that removes the GuidedFilterNet and keeps the other parts unchanged, which we term model3. We show the quantitative comparison in Table 5 and a visual example in Fig. 8. As shown in Table 5, model1 achieves the lowest dehazing performance due to the limitation of the receptive field. We can see that the BaseNet boosts the dehazing performance dramatically, which shows that the transformer module is necessary for dehazing; it improves performance by enlarging the receptive field. We note that applying the guided filter reduces the dehazing performance slightly; however, it is necessary to improve the dehazing speed. We show the dehazed results of model1, model2, model3, and the proposed model in Fig. 8. We can see that model1 cannot remove haze in remote areas, where the haze is dense; the transformer module is necessary for removing dense haze. By adding the DetailNet, the model removes haze more completely. The guided filter improves the dehazing quality in remote areas.
To show the influence of the loss functions, we design an ablation in which models are trained with different losses. First, we train the model without L_perc. Second, we train the proposed model without L_baseloss. Third, we train the proposed model without L_con. We show the quantitative results in Table 6. As shown, L_con is critical to obtaining a high quantitative dehazing result, as it is designed to make the dehazed results similar to the ground truths. L_perc is designed to boost the details of the dehazed results. L_baseloss is used to reduce the difficulty of the dehazing problem, which boosts the dehazing quality. We also show dehazed results of models trained with different loss functions in Fig. 9. As shown, the model trained without L_con obtains a dehazed result that loses image details. The models trained without L_baseloss or L_perc generate results with color distortion or over-enhancement. As shown in Fig. 9, the model trained with all losses generates high-quality dehazing results.

Analysis of run states.
We test the dehazing speed of dehazing methods on 500 images with size 256 × 256.
The test hazy images are from the outdoor part of RESIDE, resized to a fixed size (256 × 256). We conduct the experiment on a notebook equipped with an Intel(R) Core i5 CPU @ 2.3 GHz, 8 GB memory, and a 3 GB GTX 1060 GPU. The average running times of state-of-the-art dehazing methods and the proposed method are shown in Table 7. The traditional dehazing methods 1,4 are slower than learning-based methods, as they are executed without parallelization, which increases execution time. The early learning-based method 11 is faster; however, its dehazing performance is poor. The proposed method achieves state-of-the-art dehazing performance while keeping a low execution time. In addition, we show the run states of each method in Table 7. The run states include language, platform, execution time, parameters, and GPU memory consumption. As shown in Table 7, the proposed model has a suitable parameter count and consumes a suitable amount of GPU memory while achieving the highest quantitative performance. We also show the effectiveness of the GuidedFilterNet, which reduces execution time and GPU memory compared with model3. As shown in Table 5, the proposed method shows almost no visible degradation compared with model3. We can conclude that the GuidedFilterNet improves execution speed while avoiding performance degradation.

Extended applications.
Given that the proposed model can capture local and global features jointly, we can apply it to related problems, such as underwater enhancement [51][52][53][54][55], deraining 56, and human image generation 57. Single image underwater enhancement is a challenging problem due to its ill-posed nature. The global information and local details of underwater images are degraded by water, so the degradation of each pixel may differ. Based on this observation, a high-performance model requires global features to capture the degradation. Underwater enhancement also needs to restore fine details, which requires local features. Underwater enhancement is thus similar to dehazing: it needs global and local features jointly and a low compute-resource requirement. The proposed model captures global and local features jointly and can therefore also be applied to underwater enhancement.

Conclusion
Deep Guided Transformer Dehazing Network (DGTDN) is proposed based on the transformer and the guided filter, which boosts the speed of transformer-based dehazing methods and the image quality of dehazed results. The proposed model consists of BaseNet, DetailNet, and GuidedFilterNet. BaseNet and DetailNet are proposed to capture the global and local features jointly. To exploit the advantages of the transformer module and the CNN module, we employ the transformer module to predict the base layer of a clean image and the CNN module to predict the detail layer. To address the dehazing speed problem of the transformer module, we employ the guided filter.

The structure of the proposed model. Based on the motivation in subsection 3.1, we introduce the CNN, transformer, and guided filter into the proposed dehazing network. As shown in Fig. 1, we propose a model containing three parts: BaseNet, DetailNet, and GuidedFilterNet. We enlarge the details of the haze removal network, which consists of BaseNet and DetailNet. As shown, the proposed model processes a hazy image and outputs a high-resolution dehazed result via a series of steps: (1) downsample the input hazy image via bilinear downsampling to obtain a low-resolution hazy image, denoted LI; (2) feed LI into the haze removal network to obtain a low-resolution dehazed result, denoted LO; (3) feed LI, the input hazy image, and LO into the GuidedFilterNet to obtain the final high-resolution dehazed result. Next, we introduce the BaseNet, DetailNet, and GuidedFilterNet in detail.

Figure 1 .
Figure 1. The rough structure of the Deep Guided Transformer Dehazing Network. The proposed network contains three main parts: BaseNet, DetailNet, and GuidedFilterNet. Swin represents the Swin block, which is used to enlarge the receptive field of the proposed model. Bilinear represents the bilinear downsampling. LI is the output of the bilinear downsampling. LO represents the output of the haze removal network, which is a low-resolution dehazing result.

Figure 2 .
Figure 2. The structure of the Residual Dilation Block (RDB).

Figure 3 .
Figure 3. Visual results of some recent dehazing methods and the proposed method. The dehazed results obtained by other dehazing methods often retain haze or show color distortion. The proposed method removes haze more completely and obtains a more natural dehazing result.

Figure 4 .
Figure 4. Visual results of methods on the dense non-homogeneous haze 48. The proposed method restores haze-free images with clearer structures and textures.

Figure 5 .
Figure 5. Visual results of some recent dehazing methods and the proposed method on a lake scene with dense haze. The dehazed results obtained by other learning-based dehazing methods often retain haze. The proposed method removes haze more completely and obtains a more natural dehazing result.

Figure 6 .
Figure 6. Visual results of dehazing methods. The dehazed results obtained by other state-of-the-art methods tend to show a hazy or dark appearance. The dehazed results of MSCNN and AOD-Net lose some details. In contrast, the proposed method often shows a sharp dehazed result and removes haze more completely.

Figure 7 .
Figure 7. Visual results of some recent dehazing methods and the proposed method. The dehazed results obtained by other state-of-the-art methods tend to show a dark or hazy appearance. DA is designed for natural image dehazing with domain adaptation; however, we note that the area marked with a black rectangle retains a lot of haze. In contrast, the proposed method often shows a colorful and sharp dehazed result and removes haze more completely.

Figure 8 .
Figure 8. Visual results of the ablation models and the proposed method. The dehazed results obtained by the ablation models often retain haze or show color distortion. The proposed method removes haze more completely and obtains a more natural dehazing result.

Figure 9 .
Figure 9. Visual results of models trained with different loss functions. The results of models trained without individual losses often retain haze or show color distortion. The model trained with all losses removes haze more completely and obtains a more natural dehazing result.

Table 1 .
Details of the RDB.

Table 2 .
Evaluation results of dehazed results using average PSNR/SSIM on the SOTS dataset from RESIDE 47.

Table 4 .
Density values for a natural hazy image in Fig.7.The best result is marked with bold.

Table 5 .
The quantitative results with different modules on the synthetic hazy dataset.

Table 6 .
The quantitative results with different loss functions on the synthetic hazy dataset.
Metric | w/o L_perc | w/o L_baseloss | w/o L_con