Interactive residual coordinate attention and contrastive learning for infrared and visible image fusion in triple frequency bands

The auto-encoder (AE) based image fusion models have achieved encouraging performance on infrared and visible image fusion. However, meaningful information loss in the encoding stage and a simple, unlearnable fusion strategy are two significant challenges for such models. To address these issues, this paper proposes an infrared and visible image fusion model based on an interactive residual attention fusion strategy and contrastive learning in the frequency domain. Firstly, the source image is transformed into three sub-bands, the high-frequency, low-frequency, and mid-frequency components, for a powerful multiscale representation from the perspective of frequency spectrum analysis. To further cope with the limitations of straightforward fusion strategies, a learnable coordinate attention module in the fusion layer is incorporated to adaptively fuse representative information based on the characteristics of the corresponding feature maps. Moreover, contrastive learning is leveraged to train the multiscale decomposition network to enhance the complementarity of information at different frequency spectra. Finally, the detail-preserving loss, feature-enhancing loss, and contrastive loss are incorporated to jointly train the entire fusion model for good detail maintainability. Qualitative and quantitative comparisons demonstrate the feasibility and validity of our model, which can consistently generate fusion images containing both highlighted targets and legible details, outperforming state-of-the-art fusion methods.


& Guodong Liu 1
The image fusion technique, as an important branch of information fusion, belongs to quality enhancement in the image processing discipline. Theoretically, image fusion attempts to integrate complementary information from multiple modality images, which are captured from the same scene by different sensors. Owing to this endeavor, the fusion image will contain diverse meaningful information from multiple original images for better visual effects 1 . To be specific, infrared imaging by thermal radiation can reflect the difference in thermal information contained in different objects through grayscale intensities. Hence, infrared images can avoid the influences of illumination and occlusion, but their limited detail poses a significant challenge to visual scene understanding. In contrast, visible images are captured by reflected light, making them rich in textures but susceptible to external environments. The purpose of infrared and visible image fusion (IVIF) is to obtain an improved image that simultaneously contains both rich texture information and thermal radiation information of the target. From the perspective of vision-related applications, the resulting fusion image can serve as a reliable input source for subsequent visual tasks, which is crucial for good performance. In general, IVIF algorithms are mainly divided into two categories: traditional methods and deep learning-based methods.
Traditional approaches usually include multi-scale transform-based fusion algorithms, sparse representation-based fusion algorithms, subspace-based fusion algorithms, and hybrid fusion methods. A multi-scale transform model based on target enhancement was proposed in Ref. 2 . This algorithm uses the Laplace transform to decompose the aligned source images into high-frequency and low-frequency components; the fusion stage controls the infrared feature scaling by regularization parameters. Subsequently, the fusion image is reconstructed by an inverse Laplace transform. Briefly, these fusion algorithms decompose the source image into multidimensional features or transform the source image into other spectral spaces and then fuse the decomposed components using corresponding fusion strategies. The main contributions of this work toward improving fusion performance are as follows:
1) A triple-branch feature extraction network in the frequency domain is developed to achieve fine detail preservation from the perspective of multi-scale decomposition.
2) A dense network of residual gradients incorporating pyramidal segmentation attention (PSA) is designed to enhance feature extraction from the medium-frequency information.
3) A learnable interactive residual coordinate attention fusion network (IRCAFN) is proposed to consider the feature correlation between infrared and visible images while building spatial feature attention across channels. The inability to adaptively fuse feature maps is a weakness of current methods [16][17][18] ; IRCAFN overcomes it, whereas many methods evade it by merely enhancing feature extraction capabilities.
4) A contrastive loss function is leveraged to train the multiscale decomposition network to enhance the complementarity of information at different scales. Moreover, a compensatory loss is introduced to train the efficient IRCAFN, aiming to balance the distinguishing information between infrared and visible images in the fused images.
Fusion results show that the fused images can retain rich detail information while focusing on the brightness of infrared targets. Our code for this model is available at https://github.com/shazong0526/IRCAFusion.

Multi-scale decomposition
Multi-scale decomposition (MSD) is one of the classical operations for many IVIF algorithms. The main idea of MSD-based IVIF is to decompose the source image into a set of images according to specific rules, then fuse the decomposed images, and finally implement the inverse MSD to reconstruct the fused image. Traditionally, the decomposition methods usually include the pyramid transform 19 , discrete cosine transform 20 , nonsubsampled contourlet transform 21 , and bilateral filtering 22 , etc. Meanwhile, the autoencoder (AE)-based approach is gradually gaining popularity; its structure consists of an encoder and a decoder. Data-driven AE-based networks offer great flexibility for image reconstruction. A representative fusion model is the dense block-based DenseFuse 17 . This method relies on a large number of training samples from the MS-COCO dataset to train the autoencoder. The encoder corresponds to the MSD to extract effective features from the input image, while the decoder corresponds to the inverse MSD to reconstruct the image based on the encoded features 23,24 . Moreover, the loss of information in the multi-scale decomposition process is a key issue. Therefore, a low-loss multi-scale decomposition method is necessary to facilitate the effective information in the source image being embodied in the fusion image. Based on this finding, we construct a low-loss AE network based on multi-scale decomposition in the frequency domain.

Fusion rules
It is well known that the feature fusion strategy is a crucial influential factor in IVIF 25 . Most existing fusion strategies are simple traditional fusion rules, which cannot adaptively adjust the fusion weights according to specific image features. Briefly, little attention has been paid to the design of efficient fusion rules. Therefore, the choice of fusion strategies remains limited, including addition 18 , max 25 , average 26 , L1-norm 27 , etc. Recently, some deep learning-based fusion strategies have been proposed to solve the problem that the fusion effect is affected by the fusion rules. RFN-Nest develops learnable deep fusion networks that use residual fusion networks to fuse the multi-scale features produced by encoders 12 . Different information should be fused with corresponding fusion strategies to make full use of the information at each scale. Besides, the information interaction between infrared and visible images should be emphasized when designing the fusion strategy, as infrared and visible images have some structural similarities.
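As a concrete reference for these conventional rules, the following NumPy sketch implements the addition, max, average, and L1-norm strategies on a pair of feature maps. The soft-weighting form of the L1-norm rule follows DenseFuse-style practice; the function name and exact formulation are our illustrative choices, not code from the cited methods.

```python
import numpy as np

def fuse(feat_ir, feat_vi, rule="addition"):
    """Apply a conventional (non-learnable) fusion rule to two feature maps.

    feat_ir, feat_vi: arrays of identical shape (C, H, W).
    """
    if rule == "addition":
        return feat_ir + feat_vi
    if rule == "max":
        return np.maximum(feat_ir, feat_vi)
    if rule == "average":
        return (feat_ir + feat_vi) / 2.0
    if rule == "l1_norm":
        # Soft weights from channel-summed L1 activity maps.
        eps = 1e-8
        a_ir = np.abs(feat_ir).sum(axis=0, keepdims=True)
        a_vi = np.abs(feat_vi).sum(axis=0, keepdims=True)
        w_ir = a_ir / (a_ir + a_vi + eps)
        return w_ir * feat_ir + (1.0 - w_ir) * feat_vi
    raise ValueError(rule)
```

Note that none of these rules has trainable parameters, which is exactly the limitation the learnable strategies discussed above aim to remove.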
To cope with these issues, the main motivation of this work is to explore an efficient learnable fusion rule for IVIF. Specifically, our fusion method introduces a new interactive residual coordinate attention fusion network (IRCAFN), which is employed in high-frequency feature fusion to retain more detailed information in the fused images. Concretely, the coordinate attention weights of the two source images are interacted, and then the fusion of the high-frequency features is accomplished by the framework using the residual connection structure. Experimentally, simple conventional fusion strategies are applied to fuse the low- and medium-frequency features in our multi-scale decomposition network to keep a good trade-off between efficiency and computational complexity.

Methods
In this section, we introduce a low-loss frequency-domain decomposition image fusion network. The architecture of the proposed network is elaborated in Section "The framework of the proposed method". The two-stage training strategy is depicted in Section "The two-stage training strategy".

The framework of the proposed method
Our network consists of three parts: the encoder, the interactive residual coordinate attention fusion network (IRCAFN), and the decoder. The detailed framework is shown in Fig. 2.
Our network begins with a multiscale decomposition of the image, where the source image is decomposed in the frequency domain using different filters for a low-loss multiscale representation. The high-frequency detail information and low-frequency base information are extracted using high-pass and low-pass filters, respectively. In particular, we set the cut-off frequencies of the band-pass filter according to the settings of the high-pass and low-pass filters. The band-pass filter is introduced between the high-frequency and low-frequency bands to extract useful information that would otherwise be disregarded.
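The three-band split described above can be sketched with separable Gaussian filters, where the band-pass response is realized as the difference of two low-pass responses. This is an illustrative stand-in: the filter types, the cut-off parameters `sigma_low`/`sigma_high`, and the function names are our assumptions, not the paper's exact filters. By construction the three bands sum back to the input, so the decomposition itself loses no information.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(img, sigma):
    # Separable Gaussian low-pass filter with reflect padding.
    r = int(3 * sigma)
    k = gaussian_kernel(sigma, r)
    pad = np.pad(img, ((r, r), (0, 0)), mode="reflect")
    img = np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, pad)
    pad = np.pad(img, ((0, 0), (r, r)), mode="reflect")
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 1, pad)

def triple_band_decompose(img, sigma_low=4.0, sigma_high=1.0):
    """Split an image into low-, mid- and high-frequency bands.

    The band-pass component is the difference of two Gaussian low-pass
    responses whose cut-offs bracket the mid band; the three bands sum
    back to the input exactly.
    """
    low = blur(img, sigma_low)      # low-frequency base
    smooth = blur(img, sigma_high)  # keeps low + mid content
    mid = smooth - low              # band-pass (mid-frequency)
    high = img - smooth             # high-pass (detail)
    return low, mid, high
```

The exact-reconstruction property (`low + mid + high == img`) is what makes this style of decomposition "low-loss" before any learned processing is applied.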
Then, the AUIF network is embedded to extract informative features. The feature extraction component of the AUIF network consists of N convolutional layers, called base convolutional layers (BCL) or detail convolutional layers (DCL), and each convolutional layer is converted from a traditional optimization algorithm by algorithmic unrolling. The basic structure of AUIF is shown in Fig. 2. The reflection-padding structure prevents artifacts at the image edges. The batch normalization layer and parametric rectified linear unit (P-ReLU) enhance the feature extraction capability 15 .
Especially, for the medium-frequency information, we employ a dense residual gradient network (DRGN) to extract features. Additionally, a PSA attention module is integrated to suppress redundant information.
Finally, this work builds an IRCAFN network to fuse the high-frequency features of the infrared and visible images and designs a conventional module to fuse the low-frequency and medium-frequency features. As for the decoder, its function is to reconstruct the fusion image. The decoder consists of a 3 × 3 convolution unit, a batch normalization layer, and the sigmoid function. In other words, the input of the decoder is the fused features and its output is the reconstructed image. The following parts introduce the medium-frequency feature representation and the interactive residual coordinate attention fusion network.

The dense residual gradient network (DRGN)
The motivation of this work is to build more efficient and effective feature networks. Therefore, a new DRGN incorporating a pyramid squeeze attention (PSA) module is proposed. Since there are no ideal high-pass and low-pass filters, realizing the medium-frequency components is difficult. The specific design of DRGN is illustrated in Fig. 3. DRGN is a targeted variant of the gradient residual dense block (GRDB) 28 , which consists of three feature extraction branches. To extract useful information in the medium-frequency band, this component exploits dense connections in the innermost layer. The residual branch with an integrated gradient operator helps to extract high-frequency detail information from the medium-frequency content. In particular, considering the small amount of useful information in the medium frequency, some informative features may be omitted in multiple convolution operations. Consequently, we introduce an additional branch of raw information in the outermost layer, which constitutes the external dense connection structure. The structure of the PSA module is shown in Fig. 2c. The PSA module can learn diverse multi-scale feature representations and adaptively recalibrate cross-dimensional channel attention weights 29 . Briefly, the DRGN with PSA can capture more informative detail in the medium frequency for subsequent fusion, which contributes to less information loss.
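To make the gradient branch concrete, the sketch below applies a Sobel operator to a 2-D map to expose its residual high-frequency detail, which is what the gradient branch of the DRGN feeds back through its residual connection. This is a minimal illustration of the operator only, not the DRGN implementation.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d(img, kernel):
    # Direct 3x3 'same' convolution with reflect padding.
    p = np.pad(img, 1, mode="reflect")
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * p[i:i + h, j:j + w]
    return out

def sobel_gradient(img):
    """Gradient magnitude: the high-frequency detail the residual
    branch pulls out of the medium-frequency content."""
    gx = conv2d(img, SOBEL_X)
    gy = conv2d(img, SOBEL_Y)
    return np.sqrt(gx**2 + gy**2)
```

On flat regions the response is zero, so only edges and textures survive, which is why such a branch complements the dense-connection branches that carry the full signal.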

Interactive residual coordinate attention fusion network (IRCAFN)
The IRCAFN is based on the prevalent residual structure. To be specific, the coordinate attention module is applied in this residual structure instead of a traditional convolution operation. The detailed structure of IRCAFN is shown in Fig. 4. Motivated by the mechanism in 30 , we obtain the attention weights for the horizontal and vertical directions separately for the high-frequency feature maps of the source images. This method achieves cross-channel information interaction of intra-image features and establishes long-distance correlations of intra-image features. Then, the attention weights of the horizontal and vertical directions of the different source images are adjusted correspondingly. Therefore, it can acquire learnable weight interaction between features and structures across different source images. Finally, the high-frequency information of each source image is multiplied with the corresponding weights and then summed with the source features to obtain the fused high-frequency features.
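The weight interaction described above can be sketched as follows. This toy NumPy version pools per-row and per-column attention weights for each source, exchanges part of the weights between the two sources, and fuses with a residual connection. The learned 1x1 transforms of coordinate attention are omitted for brevity, and the mixing factor `gamma` is our illustrative assumption, not a parameter taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coord_weights(feat):
    # Directional pooling: one attention vector per row and per column.
    w_h = sigmoid(feat.mean(axis=2, keepdims=True))  # shape (C, H, 1)
    w_w = sigmoid(feat.mean(axis=1, keepdims=True))  # shape (C, 1, W)
    return w_h, w_w

def interactive_fuse(hf_ir, hf_vi, gamma=0.5):
    """Interact the coordinate-attention weights of the two sources,
    then fuse the attended features through a residual connection."""
    (ir_h, ir_w) = coord_weights(hf_ir)
    (vi_h, vi_w) = coord_weights(hf_vi)
    # Weight interaction: each image is modulated partly by the
    # other image's directional weights.
    mix_ir_h = (1 - gamma) * ir_h + gamma * vi_h
    mix_ir_w = (1 - gamma) * ir_w + gamma * vi_w
    mix_vi_h = (1 - gamma) * vi_h + gamma * ir_h
    mix_vi_w = (1 - gamma) * vi_w + gamma * ir_w
    # Residual: attended features are added back onto the inputs.
    att_ir = hf_ir * mix_ir_h * mix_ir_w
    att_vi = hf_vi * mix_vi_h * mix_vi_w
    return hf_ir + att_ir + hf_vi + att_vi
```

The key structural points carried over from the description are the separate horizontal/vertical weight vectors, the cross-image exchange of those weights, and the additive residual path.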

The two-stage training strategy
The feature extraction and reconstruction of the codec network play vital roles in image fusion. Consequently, we apply a two-stage training strategy to improve the fusion efficacy.

Training of the auto-encoder network
The encoder-decoder network is shown in Fig. 5a. The input image is decomposed to obtain high-frequency detail features, medium-frequency features, and low-frequency base features. All features are adaptively summed up and sent to the decoder for the final reconstruction.
The encoder network is trained using the loss function L_Total defined as follows:

L_Total = L_pixel + α·L_ssim + β·L_cont, (1)

where L_pixel = ‖I_out − I_in‖²_F is the pixel loss, L_ssim = 1 − SSIM(I_out, I_in) is the structural similarity loss, and L_cont = COS_SIM(I_HF, I_LF) is the feature contrast loss. I_in and I_out denote the input and reconstructed images, and I_HF and I_LF represent the encoded high-frequency and low-frequency features. ‖·‖²_F is the squared Frobenius norm; SSIM(·,·) is the structural similarity index, which quantifies the structural similarity between the output image and the input image 31 ; and COS_SIM(·,·) is the cosine similarity, used to enhance the complementarity of the encoded high-frequency and low-frequency information. α and β are hyperparameters balancing the weights of the different terms.
In the training process, we adopt a mutually compensating loss function that uses the Frobenius norm to bound the luminance loss of the fused image with respect to the visible and infrared images. However, the Frobenius norm only provides a coarse-grained distribution constraint for model learning, which will enlarge the intensity difference. As the Sobel gradient operator can measure the fine-grained texture information of an image and the L1-norm has good sparsity for detail preservation 28 , an additional L1-norm loss on the Sobel gradient compensates for the loss of visible texture. In short, considering the compensation of the different constraints, the joint loss L_IRCAFN can simultaneously gain a good intensity distribution and detail information with the assistance of the Sobel gradient operator. L_IRCAFN is defined as follows:

L_IRCAFN = L_pix_s + δ·L_comp, with L_comp = ‖∇I_f − ∇I_vi‖_1, (4)

where L_pix_s denotes the total pixel loss with respect to the infrared and visible images, and L_comp denotes the compensation loss between the fused image and the visible image. I_f, I_ir, and I_vi denote the fusion image, infrared image, and visible image; ‖·‖²_F is the squared Frobenius norm; ∇ denotes the Sobel gradient operator 28 , which measures the gradient difference between the fused image and the visible image; and ‖·‖_1 is the L1-norm. δ and a second balance weight are the hyperparameters.
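A sketch of the second-stage loss, assuming a simple two-term form consistent with the description above. A finite-difference gradient stands in for the Sobel operator, and `lam`, placed here as a weight between the two pixel terms, is our assumption, since the symbol of the second balance hyperparameter is not recoverable from the text.

```python
import numpy as np

def grad_mag(img):
    # Finite-difference gradient magnitude, a stand-in for the
    # Sobel operator used in the paper.
    gy, gx = np.gradient(img)
    return np.hypot(gx, gy)

def stage2_loss(i_f, i_ir, i_vi, delta=3.4, lam=0.5):
    """L_IRCAFN = L_pix_s + delta * L_comp (illustrative form).

    L_pix_s bounds the fused intensity to both sources with squared
    Frobenius norms; L_comp is an L1 penalty on the gradient
    difference to the visible image, compensating for lost texture.
    """
    l_pix_s = float(((i_f - i_ir) ** 2).sum()
                    + lam * ((i_f - i_vi) ** 2).sum())
    l_comp = float(np.abs(grad_mag(i_f) - grad_mag(i_vi)).sum())
    return l_pix_s + delta * l_comp
```

When the fused image equals both sources the loss vanishes, and any intensity or texture deviation raises it, which matches the "mutually compensating" intent of the two constraints.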

Experiments
In this section, comprehensive experiments are conducted on prominent databases to verify the effectiveness and superiority of the IRCAFN. All experiments are conducted with PyTorch (version 1.7.1) on a computer with the Windows 10 operating system, an Intel Core i5-10400F processor, 16 GB memory, and a GeForce RTX 3070 GPU.

Datasets and metrics
The FLIR dataset is chosen as the training set for both stages. In the test phase, we use the TNO 32 , FLIR 33 , and NIR 34 datasets. The details are shown in Table 1.
Six quality assessments are chosen for performance evaluation: entropy (EN), standard deviation (SD), spatial frequency (SF), visual information fidelity (VIF), average gradient (AG), and the sum of the correlations of differences (SCD). The goal of image fusion is to combine images from several sources to create a single, more informative, and expressive image. The fused image's pixel grayscale difference and gradient distribution are measured using SD and SF, which are directly related to the image's contrast and sharpness. EN and AG quantify the fused image's amount of information and detail information. The complementary information gathered from the various source images is essential to the success of image fusion. Quantifying the amount of information transferred from each source image to the fused image, SCD is the sum of the correlations between each source image and the difference image between the fused image and the other source image. VIF is used to quantify the information fidelity between the fused image and the source images. A higher value of VIF indicates that the fused image is more in line with the human visual perception system. Overall, an increase in the values of the above metrics represents an improvement in image fusion effects. More details on the definitions of these metrics can be found in Ref. 1 .
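Four of these metrics have simple closed forms. The following NumPy sketch implements EN, SD, SF, and AG for a grayscale image in [0, 1]; the definitions follow the common forms in the fusion literature, and exact constants or normalizations may differ from those in Ref. 1.

```python
import numpy as np

def entropy(img, bins=256):
    # EN: Shannon entropy of the grey-level histogram.
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def std_dev(img):
    # SD: grey-level standard deviation (a contrast proxy).
    return float(img.std())

def spatial_frequency(img):
    # SF: combines row-wise and column-wise first differences.
    rf = np.diff(img, axis=1)  # row frequency
    cf = np.diff(img, axis=0)  # column frequency
    return float(np.sqrt((rf**2).mean() + (cf**2).mean()))

def average_gradient(img):
    # AG: mean local gradient magnitude over the image.
    gx = np.diff(img, axis=1)[:-1, :]
    gy = np.diff(img, axis=0)[:, :-1]
    return float(np.sqrt((gx**2 + gy**2) / 2.0).mean())
```

A perfectly flat image scores zero on all four, while richer texture and contrast raise them, which is why higher values are read as better fusion quality.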

Implementation details and network configuration
In the first training stage, the batch size and epoch are set to 30 and 120, respectively. The learning rate is 1e−2 for the first 96 epochs and is decreased to 1e−3 for the remaining epochs. α and β in Eq. (1) are set to 4.8 and 0.09. For the learnable parameters η and θ in the high-frequency and low-frequency feature extraction branches, we set them to 0.1 and 0.03, referring to the AUIF network 15 . As for the number of repeated BCL/DCL units in the high-frequency and low-frequency feature extraction branches, we experimentally verified that H_n and L_n are 7 and 12. After the training is completed, we use the simple addition fusion rule to verify the fusion effect.
In the second training stage, the batch size and epoch are set to 18 and 150. The learning rate is 1e−2. The hyperparameters δ and the second balance weight of Eq. (4) are set to 3.4 and 0.5. We employ the pretrained encoder-decoder model generated in the first training stage, using IRCAFN as the fusion strategy for the high-frequency detail features. As for the low-frequency and medium-frequency feature fusion strategies, we choose a simple addition fusion strategy.

Effectiveness of the scale decomposition
Qualitative analysis. The images in the frequency-domain scale decomposition stage and the corresponding spectrograms are shown in Fig. 6. Figure 6 directly reflects the result of feature decomposition, which decomposes the source image into high-frequency, medium-frequency, and low-frequency bands. There is valuable information in the medium-frequency band, which is easily seen in the resulting images and spectrum plots of the bandpass filter.
Quantitative analysis. To objectively evaluate the effectiveness of the scale decomposition, the results of the AUIF method are used as the experimental baseline. Therefore, an addition fusion strategy is employed on the same TNO dataset; the results are listed in Table 2. L_n, M_n, and H_n represent the numbers of convolution units of the high-frequency, medium-frequency, and low-frequency feature decomposition networks, respectively. DRGN + PSA denotes that the network structure consisting of DRGN and PSA is used for the medium-frequency feature extraction branch. The first row in Table 2 presents the replication results of AUIF. Comparing the first and second rows in Table 2, almost all evaluation metrics are improved to varying degrees after supplementing the medium-frequency information. In particular, the common improvement of the EN, SD, and SF evaluation indexes indicates better definition and richer information in the fusion images. This finding verifies the practicability of our proposed three-scale decomposition network in the frequency domain.
Different scales of information should be assigned to corresponding feature extraction networks. Accordingly, comparison experiments on feature extraction networks are conducted within the framework of the frequency-domain scale decomposition.
First, the number of convolutional units in each feature decomposition network has a variable impact on the fusion results. The results show that the best fusion effect is achieved when the numbers of convolutional units in the high-frequency, low-frequency, and medium-frequency feature extraction branches satisfy L_n < M_n < H_n. As shown in Table 2, the numbers of convolutional units of the high-frequency and low-frequency information extraction networks are set to 7 and 12. The proposed network obtains the largest improvement in all metrics, further confirming the validity of the three-scale decomposition for image fusion.
Subsequently, relevant ablation experiments are conducted to confirm the superiority of DRGN and PSA. The results of the ablation study are given in Table 3. When DRGN is used only in the medium-frequency feature extraction branch, the evaluation indexes of SF, AG, and SCD are improved, which indicates a higher definition of the image. With the addition of the PSA module, nearly all metrics are enhanced, reflecting the PSA module's ability to suppress interfering features in the medium-frequency domain.
A similar conclusion can be drawn from the comparison of qualitative results in the ablation experiments. The results are shown in Fig. 7, where Fig. 7a is the infrared image, Fig. 7b is the visible image, Fig. 7c is the fusion image without the DRGN module, Fig. 7d is the fusion image without the DRGN and PSA modules, Fig. 7e is the fusion image without the PSA module, and Fig. 7f is the fusion image with the DRGN and PSA modules. Compared with the other fusion results, the relative luminance intensity in Fig. 7c,f is enhanced, which enables tiny details under low illumination to be preserved. Therefore, it illustrates the excellent effectiveness of the DRGN network for extracting mid-frequency features. In addition, Fig. 7f has the clearest detail preservation, such as the chimney on the house and the tiny clouds, which shows that the PSA module suppresses the redundant information in the middle-frequency features to strengthen the quality of the fused images. Overall, it can be inferred that the optimal fusion is obtained by combining DRGN and PSA.

Effectiveness of two-stage training strategy
In the first training stage, we use the traditional L_pixel and L_ssim loss functions, which improve the feature extraction and image reconstruction capabilities of the proposed model. Since the filtering operations in the multi-scale decomposition stage cannot be trained to improve the feature decomposition, the proposed model is notably supplemented with a cosine similarity constraint as L_cont to fully describe the complementarity of high-frequency and low-frequency features. In the second training stage, L_pix_s constrains the pixel loss of the fused image. Furthermore, a compensation loss L_comp based on the Sobel gradient operator is integrated to reduce the loss of detail texture in the fused image and achieve a balance between infrared content and visible information.
To present the feasibility of the two-stage training strategy, the quantitative experimental results are given in Table 4, and the qualitative comparison results are shown in Fig. 8. Quantitative analysis. For the test results of the TNO dataset in Table 4, a comparison between the first and second rows illustrates the improvement in meaningful information and fidelity of the fused images with the addition of the contrast loss. More importantly, the fused results for the FLIR and NIR datasets show a considerable improvement in almost all evaluation metrics. The second-stage test comparison results are shown in rows 2 and 3 of the data section for each dataset in Table 4. We can see that almost all evaluation metrics are enhanced for all three datasets, which indicates that the fused images provide richer information and a clearer visual effect. In summary, the results of the ablation experiments on loss functions demonstrate the effectiveness of integrating the contrast loss, which promotes the feature disentanglement of high-frequency and low-frequency information. Overall, the quantitative experimental results demonstrate the reasonability of the two-stage training strategy.
Qualitative analysis. The test results in Fig. 8c,d exhibit missing cloud shapes and missing backpack detail textures. The results illustrate that both the visible and infrared information suffer different degrees of loss after the one-stage training. In contrast, the results of the two-stage training in Fig. 8e retain more detailed information from both the visible and infrared images. On the other hand, the test results of the NIR dataset in Fig. 8e illustrate that the two-stage training can enhance the contrast of the fused images, which makes the textures of the distant mountains clearer. The main reason is that in the multiscale decomposition stage, the introduction of the medium-frequency information brings a certain amount of redundant information while increasing the expected useful information. Notably, some important high-frequency detail features may be overwhelmed by redundant information. The comparison results show that the trained IRCAFN can enhance detailed information while suppressing redundant information. Hence, this work develops a two-stage training strategy. In the first stage, the multi-scale decomposition, feature extraction, and reconstruction capabilities of the AE network are trained. In the second stage, the IRCAFN network is trained as the fusion network for high-frequency detail information. The effectiveness of the two-stage training strategy is validated by the visualized results in Fig. 8.

Experiments on the fusion strategy
As a certain amount of redundant information is introduced when adding the medium-frequency branch, only insufficient enhancement is gained on the NIR/FLIR datasets when using the DRGN and PSA medium-frequency extraction networks. Moreover, the overall image brightness of the NIR and FLIR datasets is higher than that of the TNO dataset, and a direct summation fusion strategy will make the image brightness too high, which affects the fusion effect. This phenomenon illustrates that a suitable fusion strategy plays an essential role in image fusion. The commonly used traditional fusion algorithms, as shown in Table 5, are applied to find the optimal fusion strategy for the different information extraction branches.
Qualitative analysis. The essence of a fusion strategy is to use specific weighting elements to fuse the information from different source images. Clearly, the results of group 1 in Fig. 9 show that only the L1-norm and addition fusion strategies obtain sufficient information from the infrared and visible images. Unfortunately, the L1-norm method loses the detail content of the roof when the same fusion strategy is applied to features of different frequencies. Therefore, further experiments are based on the addition fusion strategy. Group 2 shows that, for low-frequency features, only the addition fusion strategy can balance the fused image with appropriate brightness while retaining more detailed information. The enlarged detail image in group 3 confirms the importance of the addition fusion strategy for the mid-frequency features. The results exhibit optimal image contrast and detail retention when the proposed IRCAFN is used as the high-frequency detail feature fusion strategy. Therefore, the experiments achieve the best visual results when the IRCAFN, addition, and addition fusion strategies are applied to the high-frequency, medium-frequency, and low-frequency features, respectively.
Quantitative analysis. Our experiments are based on the addition strategy, which has been proven to be a suitable fusion strategy for such feature decomposition by the AUIF network. From the first set of results for the three datasets in Table 6, the addition fusion strategy performs best on the three datasets when the same fusion strategy is implemented for all information. It is reasonable that different fusion strategies should be adopted for different scales. A series of experiments are presented and the results are shown in Table 6. Evaluating the test results on the three datasets comprehensively, the FLIR and NIR datasets achieve an overwhelming lead when the addition fusion strategy is applied to the low- and medium-frequency information, while the TNO dataset achieves average indexes. Considering the existing experimental findings above, almost all evaluation metrics reach their best values when the low- and medium-frequency information is fused with the addition strategy and the high-frequency features are fused with the proposed IRCAFN strategy. Besides, the experimental results also illustrate that the proposed IRCAFN network is more suitable for representative high-frequency details.
To verify the generalization performance of IRCAFN, three prevalent datasets are employed for evaluation. The specific details of the three datasets, TNO, FLIR, and NIR, are listed in Table 1. They contain day and night scenes with three types of content: persons, stuff, and scenery. To fully illustrate the comparison results, red and green boxes are used to mark and enlarge the selected areas.
Firstly, the proposed fusion method maintains a balance of meaningful information between the different source images. Fusion results biased toward the visible image contain rich texture information but lose salient infrared information. For example, in the first column of comparison images in Fig. 10, the fusion results of NestFuse, UNFusion, Res2Fusion, CUFD, LRRNet, CDDFuse and PSFusion all lose the cloud detail information present in the infrared image. Conversely, a fusion image may contain excessive infrared information at the expense of visible detail: in the second column of Fig. 10, the results of NestFuse, UNFusion, Res2Fusion, CUFD and CDDFuse fail to preserve the bench objects from the visible image. This comparison illustrates that the proposed fusion method retains the visible detail information while highlighting the salient infrared targets, maintaining a good balance between them. Under extreme exposure conditions, such as the third column of Fig. 10, the proposed method still clearly shows the texture on the figure's pants and the tree branches behind the traffic light.
Furthermore, the proposed fusion method preserves more information. In columns 5 and 6 of Fig. 10, the competitors lose detailed information of the peaks and grasses to some degree. By contrast, these details are evidently preserved in both our fused images and the PSFusion results. In particular, the enlarged view of the distant peaks in column 5 underlines that the proposed method retains more information about tiny, dim textures.
Finally, our fusion images show clearer details with high contrast and are more consistent with human visual perception. In the first column, the DIDFuse result exhibits an unnatural brightness distribution, and the RFN-Nest and GANMcC results are blurred. Although the results of DenseFuse, Dual Branch, AUIF and PSFusion exhibit clear details and salient infrared targets, our method demonstrates a higher contrast ratio, which provides clearer details in the images. In addition, the luminance distribution of our results is more uniform and wider, as visualized in columns 3 and 4 of Fig. 10, so our visual effect is better than that of the other methods. The distant mountain in column 5 and the tree in column 6 of Fig. 10 both demonstrate that our fusion results look more natural and better match the human visual perception system.
In general, the fusion results of the proposed method exhibit more detailed information and higher clarity, and are more compatible with the human visual perception system. Quantitative analysis. We compared the aforementioned methods quantitatively on the three datasets using the above evaluation metrics; the test results are shown in Table 7. In terms of the EN, SD, SF, AG, and SCD metrics, our model achieves nearly the best or second-best values on all three datasets, while for the VIF metric our results rank in the middle. Specifically, the larger EN and SCD values show that our fusion images retain richer details; together with the large SD, this indicates that the fusion images not only contain rich detail information but also have high contrast and good visual quality. The best values of SF and AG indicate images containing more detailed edges and higher clarity. Overall, our approach is well suited to IVIF tasks in variable scenes.
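Two of the metrics used above have simple closed forms. The following is a minimal sketch under the standard definitions of EN (Shannon entropy of the gray-level histogram) and SF (spatial frequency from row and column gradient energy); it is illustrative only, not the paper's exact evaluation code.

```python
import numpy as np

def entropy(img, bins=256):
    """Shannon entropy (EN) of an 8-bit grayscale image:
    EN = -sum_i p_i * log2(p_i); larger values indicate richer information."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                       # drop empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())

def spatial_frequency(img):
    """Spatial frequency (SF): root of combined row- and column-direction
    gradient energy; larger SF suggests more detailed edges."""
    img = img.astype(np.float64)
    rf = np.diff(img, axis=1)          # horizontal first differences
    cf = np.diff(img, axis=0)          # vertical first differences
    return float(np.sqrt((rf ** 2).mean() + (cf ** 2).mean()))
```

A constant image yields EN = 0 and SF = 0, while an image covering all 256 gray levels uniformly attains the maximum EN of 8 bits, matching the intuition that higher EN reflects richer detail.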

Conclusions
To improve the performance of infrared and visible image fusion, this work develops an efficient deep learning fusion model operating in triple frequency bands. In particular, a mid-frequency feature extraction branch, based on the DRGN and PSA modules, is designed to achieve complete information preservation. Considering the specific characteristics of the different spectra, we fuse the features of the different frequency bands adaptively, applying an interactive residual coordinate attention fusion network for high-frequency feature fusion. More importantly, contrastive learning is leveraged to train the multiscale decomposition network, enhancing the complementarity of information across the frequency spectra. With the complete multi-scale decomposition and adaptive fusion strategy, more informative content can be represented in the feature extraction stage and the fusion efficiency can also be enhanced. Both qualitative and quantitative experiments confirm the superiority of our approach over state-of-the-art methods. In the future, we will explore a fully adaptive fusion network based on contrastive learning, which can further boost the effectiveness in different frequency bands for IVIF tasks.

Figure 1. The framework of the proposed fusion network.

Figure 2. Illustration of the IRCAFN model. (a) The architecture of the IRCAFN; the Fusion network denotes the IRCAFN. (b) BCL and DCL with the same structure and different parameters. (c) The architecture of the PSA module. (d) The architecture of the SE weight module. (e) The architecture of the Dense Block.

Figure 4. The structure of the Interactive Residual Coordinate Attention Fusion Network (IRCAFN).

Figure 5. The structure of the two-stage training.

Figure 6. Frequency domain multi-scale decomposition images and corresponding spectrograms.

Figure 7. Test results of the ablation experiment for DRGN. The red and green boxes mark and magnify the selected area.

Figure 8. Test results of two-stage training. Red and green boxes mark and enlarge the selected area.

Figure 9. Comparison of the results of fusion strategies. Red and green boxes mark and enlarge the selected area.

Figure 10. The qualitative comparison results. From top to bottom: infrared images, visible images, results of other methods, and our method.

Table 1. The details of the experimental datasets.

Table 2. Results of the frequency domain multi-scale decomposition network experiment. The optimal and second-best values are emphasized in italics and bold, respectively.

Table 3. Ablation experiments of DRGN; w/o denotes without. The best and second-best values are marked in italics and bold, respectively.

Table 4. Ablation experiment of the two-stage training strategy. Optimal values are marked in bold.

Table 5. Typical fusion algorithms. I_f, I_ir, and I_vi denote the feature maps of the fusion image, infrared image, and visible image, respectively.

Table 6. Comparison of experiments on fusion methods. The best and second-best values are marked in italics and bold, respectively.

Table 7. Quantitative results on the test datasets. The best and second-best values are marked in red and black, respectively, and both are highlighted in bold.