Abstract
Deep hashing is widely applied in image retrieval because of its low storage cost and fast retrieval speed. Existing deep hashing methods, however, extract insufficient features when they use a convolutional neural network (CNN) to obtain the semantic features of images. Some studies propose adding channel-based or spatial-based attention modules, but embedding these modules into the network increases model complexity and can lead to overfitting during training. In this study, a novel deep parameter-free attention hashing (DPFAH) method is proposed to solve these problems; it embeds a parameter-free attention (PFA) module in the ResNet-18 network. PFA is a lightweight module that defines an energy function to measure the importance of each neuron and infers 3D attention weights for the feature map in a layer. Because this energy function has a fast closed-form solution, the PFA module adds no parameters to the network. In addition, this paper designs a novel hashing framework that includes a hash code learning branch and a classification branch to exploit more label information. The binary-like codes are constrained by a regularization term to reduce the quantization error of the continuous relaxation. Experiments on CIFAR-10, NUS-WIDE and ImageNet-100 show that DPFAH achieves better performance than the compared methods.
Introduction
Recently, large amounts of media data have been used extensively in fields such as computer vision and network security^{1,2}. Image retrieval is a current research focus in computer vision, and quickly retrieving similar images from a large dataset is an urgent problem. Owing to its fast query speed and low storage cost, deep hashing^{3,4,5,6,7} is widely applied in image retrieval. The purpose of deep hashing is to convert high-dimensional images into low-dimensional binary codes by using a hash function while preserving the similarity information of the original images.
Among early image retrieval methods, text-based image retrieval (TBIR)^{8,9} follows traditional text annotation technology and implements retrieval by text matching. Content-based image retrieval (CBIR)^{10,11} uses the computer to extract image content features and takes them as clues to find other images with similar features in an image database. However, TBIR and CBIR require a great deal of manual work and computational resources. In contrast, deep hashing methods^{12,13,14} have obvious advantages because they use a CNN as the feature extractor. Existing deep hashing methods are divided into data-independent and data-dependent methods. In data-independent hashing^{15}, the hash codes are obtained by a random mapping matrix, so the accuracy of the hash functions cannot be guaranteed. Data-dependent hashing^{16,17} explores multiple aspects of images, such as shape, texture and color, to generate discriminative hash codes. This study uses the data-dependent approach to learn high-quality hash codes.
Most current hashing methods use a shallow CNN to explore high-dimensional semantic features and map them to hash codes via a hash function. However, the feature learning part of these methods suffers from insufficient and imbalanced feature extraction. Meanwhile, the hash learning part cannot make full use of label information and produces unavoidable quantization errors, which significantly affect the accuracy of the hash codes. Therefore, some scholars suggest adding channel-wise and spatial-wise attention mechanisms to the backbone network^{18,19}. Such attention modules usually cause two problems. First, the flexibility of learning attention weights is limited because they can only extract image features along the channel or the spatial dimensions. Second, their structures are complicated, which increases the complexity of the training model and causes overfitting.
To address these problems, this paper draws on the 3D attention module^{20} and semantic hierarchy preserving deep hashing^{21}. It designs a parameter-free attention (PFA) module that defines an energy function considering the weights of both the channel and spatial dimensions. This module lets the network learn more discriminative neurons without adding parameters, and the high-level semantic features of images can be fully explored by refining those neurons. Specifically, ResNet-18 is chosen as the backbone network. As shown in Fig. 1, the whole process is divided into four steps. First, pairs of images are fed into the convolution layer and the max-pooling layer to generate a feature map. Second, the feature map is processed by the PFA module, which considers both the 1D channel-wise weights and the 2D spatial-wise weights and directly generates 3D weights. Third, the PFA output and the feature map are combined by element-wise summation, and the result is fed into the backbone network to extract image features. Finally, to make efficient use of semantic label information, two branches are designed, one containing a classifier layer and one containing a hashing layer. The pairwise loss and quantization loss generated by the hashing layer are combined with the class-wise loss generated by the classifier to obtain discriminative hash codes.
In short, the contributions are as follows:

1. DPFAH is an end-to-end learning framework that performs feature representation and binary code learning simultaneously. A lightweight module is introduced to extract rich semantic features and avoid overfitting during training.

2. The PFA module is embedded in the ResNet-18 network to improve the feature representation. It uses an energy function to calculate the 3D weights and derives a closed-form solution that speeds up the weight calculation. No parameters are added to the network during the whole process.

3. DPFAH provides a novel deep hashing framework that includes both hash learning and classification. This method uses label information to reduce discrepancy and generate more accurate hash codes. Experimental results on three datasets verify the effectiveness of DPFAH.
The remainder of this paper is organized as follows. “Related work” reviews related work. “Deep parameter-free attention hashing” describes the details of DPFAH. “Experiments” presents the experimental results and analysis. “Conclusions” summarizes the work of this study.
Related work
Deep learning is applied in many fields because of its strong learning ability and good portability. In network security, neural networks are used to detect malware^{22,23} and programs^{24}. In artificial intelligence, deep learning methods support the intelligent estimation of travel time^{25}. This paper focuses on hashing algorithms based on deep learning. Deep hashing is widely applied in image retrieval systems; for example, many shopping applications implement search-by-image through deep hashing. Therefore, how to obtain an accurate hash code for each image has become a research hotspot. This section introduces several existing unsupervised and supervised hashing approaches.
Unsupervised hashing
Unsupervised hashing^{26,27,28,29,30} uses only unlabeled data points to learn a hash function that maps high-dimensional features to compact hash codes. A similarity matrix is usually constructed during feature learning, and many scholars have studied how to construct it. Specifically, Sheng et al.^{28} represented data descriptors by the output of a fully connected layer and used them to design the similarity matrix. The network is optimized by calculating the loss between the similarity matrix and pairwise hash codes. By observing the distribution of features, Yang et al.^{29} proposed that the cosine distance of data pairs can be modeled by Gaussian distributions. They set a distance threshold when constructing the similarity matrix: two data points are defined as similar if their cosine distance is smaller than the threshold, and as dissimilar otherwise. On this basis, Jiang et al.^{30} used the cosine distance directly to guide the construction of the similarity matrix and, inspired by^{31}, chose gradient attention to optimize the network. Although unsupervised hashing retrieval faces great challenges because label information is unavailable, these methods contribute to the development of image retrieval.
Supervised hashing
Compared with unsupervised hashing, supervised hashing methods use data labels as supervised information to calculate the similarity matrix. Early on, Xia et al.^{32} proposed learning semantic features and hash codes separately, with no feedback between the two stages. Recent supervised hashing methods usually design an end-to-end framework that learns features and hash codes simultaneously, such as^{31,32,33,34}. On this basis, Cao et al.^{18} selected a \(tanh\) activation function so that the network outputs continuous hash codes. To avoid the discrete constraint imposed on binary-like codes, Su et al.^{35} proposed a greedy rule that updates the parameters toward the most likely optimal discrete solution. To address the imbalanced distribution of data labels, Jiang et al.^{36} introduced a soft similarity concept that quantifies pairwise similarity as a percentage by using label information. Meanwhile, Cao et al.^{37} proposed weighting the similarity matrix of training pairs and used the Cauchy distribution instead of the \(sigmoid\) function to calculate the loss. These methods improve the loss function, but they ignore the problem of insufficient image feature extraction. Hence, Li et al.^{19} embedded channel attention and spatial attention into the CNN to obtain sufficient semantic features. Yang et al.^{34} improved the feature map in a dual attention module and combined it with the backbone network. However, these modules increase the complexity of the network model and slow down training. Motivated by^{38}, this paper introduces a lightweight attention module based on ResNet-18 and designs a new class-wise loss, which is suitable for learning more accurate hash codes.
Deep parameter-free attention hashing
This section describes the DPFAH method in detail, including the research motivation, the notation and formulas, the network architecture, the PFA module and the network optimization process.
Research motivation
Current deep hashing methods have several defects that need to be dealt with: (1) a shallow network cannot fully extract the semantic feature information of images, while channel-based or spatial-based attention modules increase model complexity and lead to overfitting; (2) relaxing the hash codes produces an unavoidable quantization error.
To solve the problem of insufficient feature extraction, some scholars add attention modules to the network, which increases both the computational complexity of the network and the time complexity of the algorithm. Based on these considerations, the goal of this paper is to design a lightweight module that extracts image features without adding any parameters to the network, and to propose a regularization term that constrains the hash codes in order to reduce the quantization error.
Problem formulation
In similarity retrieval, a dataset of \(n\) images is represented as \(X={\{{x}_{i}\}}_{i=1}^{n}\), where \({x}_{i}\) is the \(i\)-th image. The labels of \(X\) are denoted as \(Y={\{{y}_{i}\}}_{i=1}^{n}\), where \({y}_{i}\) is the label of the \(i\)-th image and \(c\) is the number of classes. The similarity matrix \(S=\{{s}_{ij}\}\) is defined as:
The target of deep hashing is to learn a hash function \(F(\theta ;{x}_{i})\) that projects \({x}_{i}\) to \({b}_{i}\in {\{-1,+1\}}^{l}\), where \(\theta \) represents the parameters of the CNN and \(l\) is the length of the hash codes. Each image \({x}_{i}\) is first mapped by the model \(F\) to an \(l\)-dimensional vector, giving \(U={\left\{{u}_{i}\right\}}_{i=1}^{n}\), where \({u}_{i}\) is the \(l\)-dimensional vector of the \(i\)-th image. To reduce the quantization loss, inspired by^{34}, \({u}_{i}\) is processed by a piecewise function as follows:
Finally, \({b}_{i}=sign(f\left({u}_{i}\right))\) maps the \(l\)-dimensional \({u}_{i}\) to the \(l\)-bit code \({b}_{i}\), where \(sign(\cdot )\) is defined as follows:
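The two definitions used in this subsection can be illustrated with a minimal sketch. It assumes the convention common in pairwise deep hashing, which may differ from the paper's own equations: \({s}_{ij}=1\) when two images share at least one label and \({s}_{ij}=0\) otherwise, and \(sign(\cdot )\) maps non-negative values to \(+1\) and negative values to \(-1\). The function names and the handling of \(sign(0)\) are assumptions made for illustration.

```python
def similarity_matrix(labels):
    """Build S from 0/1 label vectors (one vector per image).

    Assumed convention: s_ij = 1 if images i and j share at least
    one label, otherwise s_ij = 0.
    """
    n = len(labels)
    return [
        [1 if any(a and b for a, b in zip(labels[i], labels[j])) else 0
         for j in range(n)]
        for i in range(n)
    ]


def sign(values):
    """Element-wise sign; mapping 0 to +1 is an assumption."""
    return [1 if v >= 0 else -1 for v in values]


# Three images, three classes; the second image carries two labels.
labels = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
S = similarity_matrix(labels)
b = sign([0.3, -1.1, 0.0])
```

The same routine covers single-label data such as CIFAR-10 (one-hot vectors) and multi-label data such as NUS-WIDE.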
Network architecture
Figure 1 shows the framework of DPFAH, which includes three main parts. DPFAH uses ResNet-18 as the backbone network in order to improve the representation of salient features without increasing the computational complexity of the model. A simple, parameter-free module is introduced into the network; it examines the neurons at each channel and spatial location to learn more discriminative cues. In addition, the last layer of the basic residual network is the classification layer, which assigns each input to a class. On this basis, the hashing branch is designed in parallel with the classification branch. The class-wise loss generated by the classification branch positively affects the hashing branch when the parameters are updated by back propagation.
PFA module
Existing attention modules consider channel-wise attention or spatial-wise attention separately. For channel-wise attention, the importance of each channel is first calculated from the perspective of channels, and channels with high importance are assigned greater 1D weights. For spatial-wise attention, the importance of the features at each location is calculated from a spatial point of view, and locations with higher importance are assigned greater 2D weights. These modules add computational overhead when computing the 1D or 2D attention weights. Hence, this paper introduces a lightweight attention module (PFA) that directly calculates 3D weights. As shown in Fig. 2, the mean \(\overline{X }\) of the feature map \(X\) is first obtained, and the squared difference between \(X\) and \(\overline{X }\) is computed to get the variance of each channel. The squared difference is then normalized by the channel variance, which determines the importance of each channel and each spatial location. Finally, the sigmoid function is used to restrict the result, which is then multiplied with the original feature map \(X\). In addition, the PFA module can focus on the primary regions related to the image label. In Fig. 3, the second row shows the distribution of features extracted by the ResNet-18 network alone, and the third row shows the distribution after the PFA module is added to the network. The label of the first image is dog; when only ResNet-18 is used to extract features, the network attends to much noise outside the labeled object. After adding PFA, the feature activations are mainly distributed around the dog. The same effect is observed on the second image. For the third and fourth images, more feature activations fall on the labeled objects after PFA is added. Hence, the effectiveness of the PFA module is supported by the Grad-CAM^{39} visualization of feature activations.
Because the PFA module introduces an energy function with a closed-form solution, it adds no parameters to the network. According to neuroscience theories^{40}, the most informative neurons are usually those whose firing patterns differ from those of the surrounding neurons, so these important neurons should be given higher priority. The simplest way to find them is to measure the linear separability between a target neuron and the other neurons. Consequently, an energy function for each neuron is defined as follows:
where \(t\) is the target neuron and \({x}_{i}\) are the other neurons in a single channel of the feature map \(X\in {\mathbb{R}}^{C\times H\times W}\), \({w}_{t}\) and \({b}_{t}\) are the weight and bias, \(i\) is the index over the spatial dimension, \(M=H\times W\) is the number of neurons on a channel, and \(\widehat{t}={w}_{t}t+{b}_{t}\) and \({\widehat{x}}_{i}={w}_{t}{x}_{i}+{b}_{t}\) are linear transforms of \(t\) and \({x}_{i}\). \({y}_{t}\) and \({y}_{0}\) are the desired outputs for the target neuron and the surrounding neurons, respectively, with \({y}_{t}\ne {y}_{0}\). Equation (4) attains its minimum when \(\widehat{t}={y}_{t}\) and \({\widehat{x}}_{i}={y}_{0}\). Within a channel, the linear separability between the target neuron and the other neurons can therefore be obtained by minimizing Eq. (4). For simplicity, this paper adopts \({y}_{t}=1\) and \({y}_{0}=-1\) and adds a regularization term to the function. The energy function is transformed as follows:
There are \(M\) energy functions on each channel, and solving all of them iteratively is computationally expensive. Fortunately, Eq. (5) has a fast closed-form solution with respect to \({w}_{t}\) and \({b}_{t}\) as follows:
where \({\mu }_{t}=\frac{1}{M-1}{\sum }_{i=1}^{M-1}{x}_{i}\) and \({\sigma }_{t}^{2}=\frac{1}{M-1}{\sum }_{i=1}^{M-1}{({x}_{i}-{\mu }_{t})}^{2}\) are the mean and variance of the surrounding neurons, respectively. Because the solutions in Eqs. (6) and (7) are computed on a single channel, it is reasonable to assume that all features in a channel follow the same distribution. Under this assumption, the mean and variance can be computed once over all neurons in the channel and reused for every neuron, which considerably reduces the computational cost. Therefore, the minimal energy can be computed as follows:
where \(\widehat{\mu }=\frac{1}{M}{\sum }_{i=1}^{M}{x}_{i}\) and \({\widehat{\sigma }}^{2}=\frac{1}{M}{\sum }_{i=1}^{M}{({x}_{i}-\widehat{\mu })}^{2}\). From Eq. (8), it can be concluded that the greater the difference between the target neuron and the surrounding neurons, the lower the energy and the more stable the model. Although \({e}_{t}^{*}\) can represent the importance of each neuron, this method needs to calculate a large number of covariance matrices. Hence, this paper uses a scaling operator instead of an addition for feature refinement as follows:
where \(E\) groups all \({e}_{t}^{*}\) across the channel and spatial dimensions. A \(sigmoid\) function is applied to prevent the values of \(E\) from becoming too large.
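The PFA computation can be sketched in a few lines of plain Python. The sketch assumes the closed form used by the 3D attention module that the paper cites^{20}, namely \(1/{e}_{t}^{*}={(t-\widehat{\mu })}^{2}/\left(4({\widehat{\sigma }}^{2}+\lambda )\right)+0.5\), followed by \(\widetilde{X}=sigmoid(1/E)\odot X\); the exact form of Eqs. (8) and (9) in this paper may differ. The function name `pfa` and the regularization value `lam=1e-4` are illustrative assumptions.

```python
import math


def pfa(feature_map, lam=1e-4):
    """Parameter-free attention over a C x H x W nested list.

    Assumed form: inverse energy = (t - mu)^2 / (4 * (var + lam)) + 0.5,
    output = sigmoid(inverse energy) * t. `lam` is a hypothetical
    regularization coefficient, not a learned parameter.
    """
    refined_map = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        m = len(values)                      # M = H * W neurons
        mu = sum(values) / m                 # channel mean
        var = sum((v - mu) ** 2 for v in values) / m  # channel variance
        refined_channel = []
        for row in channel:
            new_row = []
            for t in row:
                inv_energy = (t - mu) ** 2 / (4.0 * (var + lam)) + 0.5
                weight = 1.0 / (1.0 + math.exp(-inv_energy))
                new_row.append(weight * t)   # scaling, not addition
            refined_channel.append(new_row)
        refined_map.append(refined_channel)
    return refined_map


# One channel, 2 x 2: the neuron that deviates most gets the largest weight.
x = [[[1.0, 1.0], [1.0, 3.0]]]
y = pfa(x)
```

Only the per-channel mean and variance are computed, so the module introduces no trainable weights; a neuron that deviates from its channel mean receives a weight closer to 1.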
Model formulation
A pair of images \({x}_{i}\) and \({x}_{j}\) is fed into the network to generate hash codes \({b}_{i}\) and \({b}_{j}\). The Hamming distance between \({b}_{i}\) and \({b}_{j}\) is defined as \({D}_{H}=\frac{1}{2}(l-\langle {b}_{i},{b}_{j}\rangle )\), where \(\langle {b}_{i},{b}_{j}\rangle \) is the inner product and \(l\) is the length of the hash codes. The inner product and the Hamming distance therefore change in opposite directions: the larger \({D}_{H}\), the smaller \(\langle {b}_{i},{b}_{j}\rangle \), and vice versa. Hence, the inner product is used instead of the Hamming distance to judge the similarity of image pairs.
Given the set \(B=[{b}_{1},{b}_{2},\dots ,{b}_{n}]\) of hash codes, the maximum likelihood estimation of \(B\) for dataset \(X\) is defined as follows:
where \(P\left(S|B\right)\) represents the likelihood function. For each image pair, \(P\left({s}_{ij}|{b}_{i},{b}_{j}\right)\) is the conditional probability of \({s}_{ij}\) given \({b}_{i}\) and \({b}_{j}\), which is calculated as follows:
where \(\sigma \left(\cdot \right)\) is the \(sigmoid\) function defined as \(\sigma \left(x\right)=\frac{1}{1+{e}^{-x}}\) and \({b}_{i}=sign\left({u}_{i}\right)\). This paper uses \({u}_{i}\) instead of \({b}_{i}\) because \({b}_{i}\) would cause a discrete optimization problem in Eq. (11). \({u}_{i}\) is the continuous binary-like code output by the network, which avoids this problem.
The hash codes are learned by combining Eqs. (10) and (11) as follows:
Equation (12) is the negative log-likelihood loss. It requires the inner product of similar images to be as large as possible and the inner product of dissimilar images to be as small as possible. In other words, similar images have similar hash codes and dissimilar images have dissimilar hash codes. Consequently, the hash codes preserve the similarity relation of the images in the original space.
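The relation between the Hamming distance and the inner product for codes in \({\{-1,+1\}}^{l}\), \({D}_{H}=\frac{1}{2}(l-\langle {b}_{i},{b}_{j}\rangle )\), can be checked directly. The helper names below are illustrative.

```python
def hamming_distance(b_i, b_j):
    """Count the positions where two +/-1 codes differ."""
    return sum(1 for a, b in zip(b_i, b_j) if a != b)


def hamming_from_inner_product(b_i, b_j):
    """Compute D_H = (l - <b_i, b_j>) / 2 for +/-1 codes."""
    inner = sum(a * b for a, b in zip(b_i, b_j))
    # l - inner is always even for +/-1 codes, so // is exact.
    return (len(b_i) - inner) // 2


b_i = [1, -1, 1, 1]
b_j = [1, 1, -1, 1]
d_direct = hamming_distance(b_i, b_j)
d_inner = hamming_from_inner_product(b_i, b_j)
```

Each matching bit contributes \(+1\) to the inner product and each differing bit contributes \(-1\), which is why a larger inner product always means a smaller Hamming distance.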
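The behavior described for Eq. (12) can be illustrated with a small sketch. It assumes the pairwise negative log-likelihood commonly used in deep hashing, \(-{\sum }_{{s}_{ij}}\left({s}_{ij}{\Theta }_{ij}-\mathrm{log}(1+{e}^{{\Theta }_{ij}})\right)\) with \({\Theta }_{ij}=\frac{1}{2}{u}_{i}^{T}{u}_{j}\); the scaling factor \(\frac{1}{2}\) and the function name are assumptions, and the paper's exact Eq. (12) may differ.

```python
import math


def pairwise_loss(U, S):
    """Negative log-likelihood over all pairs (assumed form).

    U: list of continuous codes u_i; S: 0/1 similarity matrix.
    theta_ij = 0.5 * <u_i, u_j> is an assumed scaling.
    """
    total = 0.0
    for i, u_i in enumerate(U):
        for j, u_j in enumerate(U):
            theta = 0.5 * sum(a * b for a, b in zip(u_i, u_j))
            # Numerically stable log(1 + exp(theta)).
            softplus = max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))
            total += softplus - S[i][j] * theta
    return total


S = [[1, 1], [1, 1]]                        # the two images are similar
aligned = pairwise_loss([[1.0, 1.0], [1.0, 1.0]], S)
opposed = pairwise_loss([[1.0, 1.0], [-1.0, -1.0]], S)
```

For a similar pair, aligned codes give a lower loss than opposed codes, which matches the statement that similar images should have a large inner product.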
In addition, an unavoidable quantization error arises when \({u}_{i}\) is quantized to \({b}_{i}\). To reduce this error, inspired by^{9}, this paper applies the following regularization to \({u}_{i}\):
where \(ReLU\left(x\right)=\mathrm{max}(0,x)\) is the rectified linear unit. This paper follows the optimization policy proposed by^{34}, which relaxes \({u}_{i}\) to \([-\delta ,\delta ]\), where \(\delta \) is set to 1.1.
Finally, in the classification layer, the number of output nodes is set to \(c\), the number of categories in the dataset. The loss between the output of the classification layer and the label \({y}_{i}\) is defined as:
where \({o}_{i}\) is the real-valued output of the classification layer for the \(i\)-th image. By minimizing Eq. (14), the hash codes generated by the hashing layer also retain classification information.
Overall, combining Eqs. (12), (13) and (14), the total loss of the framework is expressed as:
Learning
The network parameters are optimized by calculating the gradient of the loss function and performing back propagation. To learn a hash function that maps images to hash codes, let \(\theta \) denote the parameters of all feature layers, \(\varphi ({x}_{i};\theta )\) the output of the network, \({W}^{T}\in {\mathbb{R}}^{512\times l}\) the transpose of the weight matrix and \(v\in {\mathbb{R}}^{l\times 1}\) the bias vector. A fully connected layer connects the feature representation and the hash learning. It is set as:
In the DPFAH model, the parameters to be optimized are \(\theta \), \(W\), \(v\) and \({b}_{i}\). The control variable method is adopted, in which one group of parameters is optimized while the others are held fixed. Among them, \({b}_{i}\) can be directly optimized:
Before optimizing the parameters \(\theta \), \(W\) and \(v\), this paper calculates the derivative of \({L}_{all}\) with respect to \({u}_{i}\) and \({o}_{i}\) by Eq. (15) as:
where,
Then, this paper updates the parameters \(W\) and \(v\) by using back propagation:
When the network parameters are optimized, \({l}_{3}\) also influences the parameters during back propagation. According to Eqs. (18) and (20), the gradient with respect to \(\theta \) is calculated as:
The training process of the DPFAH model is exhibited in Algorithm 1.
Experiments
In this section, the DPFAH model is evaluated on three datasets, and its evaluation metrics are compared with those of recent approaches.
Datasets
(1) CIFAR-10 is a single-label public dataset that includes 60,000 images belonging to 10 classes, with 6000 images per class. In this experiment, the training set is formed by selecting 500 images at random from each class, and the testing set is formed by 100 images from each class. The remaining images are treated as the database. (2) NUS-WIDE is a multi-label public dataset including 269,648 images, from which this experiment selects 195,834 images belonging to 21 categories. Specifically, 100 images from each category form the testing set, and the rest of the images serve as the database. From the database, 500 images per class are randomly selected as the training set. (3) ImageNet-100 is a single-label public dataset with 138,503 images, each belonging to one of 100 classes. In the experiment, the testing set is formed by 5000 randomly selected images, and the rest of the images serve as the database. In addition, 130 images from each class of the database are chosen as the training set. All three datasets are open-source datasets. All procedures were performed in accordance with the relevant guidelines and regulations.
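The per-class sampling described for CIFAR-10 can be sketched as follows. The sketch assumes that the training, testing and database sets are disjoint, which is one reading of the CIFAR-10 protocol above; under that reading, 6000 images per class with 100 test and 500 training images leave 5400 database images per class. The function name and the fixed seed are illustrative assumptions, and the NUS-WIDE and ImageNet-100 protocols, where training images are drawn from the database, would need a variant.

```python
import random


def split_per_class(labels, n_test, n_train, seed=0):
    """Split image indices into train/test/database sets per class.

    labels: one class id per image. The three returned index lists
    are disjoint (an assumption about the protocol).
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test, database = [], [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)                 # random selection within class
        test.extend(indices[:n_test])
        train.extend(indices[n_test:n_test + n_train])
        database.extend(indices[n_test + n_train:])
    return train, test, database


# Toy data: 10 classes with 20 images each.
toy_labels = [c for c in range(10) for _ in range(20)]
train, test, database = split_per_class(toy_labels, n_test=2, n_train=5)
```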
Evaluation metrics and settings
Four evaluation metrics are used to measure the performance of DPFAH: mean average precision (mAP), precision-recall (PR) curves, precision curves within Hamming distance 2 (P@H = 2) and precision curves of the first 1000 retrieval results (P@N). This paper reports mAP@ALL for CIFAR-10, mAP@5000 for NUS-WIDE and mAP@1000 for ImageNet-100. To assess the performance of DPFAH, DBDH^{14}, DSDH^{5}, DHN^{4}, LCDSH^{6}, HashNet^{18}, IDHN^{7}, DFH^{13} and DSH^{3} are selected for comparison.
To make the experimental results objective and impartial, all comparative experiments are carried out with the ResNet-18 network and the PyTorch framework. The per-layer parameters of ResNet-18 are shown in Table 1, where \(p\) is the size of the convolution kernel, \(s\) and \(k\) represent the stride and padding, respectively, and \(l\) is the length of the hash codes.
In the experiments, all comparative approaches use the same training set and testing set. The optimizer is root mean square propagation (RMSProp), the mini-batch size is set to 128, the learning rate is set to \(5\times {10}^{-5}\) and the weight decay is set to \(1\times {10}^{-5}\). The environment configuration is shown in Table 2.
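As an illustration of the ranking metric, the sketch below computes average precision for one query by ranking database codes by Hamming distance; mAP is then the mean over all queries. It assumes single-label relevance (an item is relevant when its label equals the query label) and breaks distance ties by database order; both choices, and the function names, are assumptions rather than the paper's evaluation code.

```python
def hamming_distance(b_i, b_j):
    """Count the positions where two +/-1 codes differ."""
    return sum(1 for a, b in zip(b_i, b_j) if a != b)


def average_precision(query_code, query_label, db_codes, db_labels, top_k=None):
    """AP of one query over a Hamming-ranked database (assumed protocol)."""
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming_distance(query_code, db_codes[i]))
    if top_k is not None:
        order = order[:top_k]                # e.g. 1000 for mAP@1000
    hits = 0
    precision_sum = 0.0
    for rank, index in enumerate(order, start=1):
        if db_labels[index] == query_label:
            hits += 1
            precision_sum += hits / rank     # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries, db_codes, db_labels, top_k=None):
    """queries: list of (code, label) pairs."""
    scores = [average_precision(code, label, db_codes, db_labels, top_k)
              for code, label in queries]
    return sum(scores) / len(scores)


db_codes = [[1, 1, 1], [1, 1, -1], [1, -1, -1], [-1, -1, -1]]
db_labels = [0, 1, 0, 1]
ap = average_precision([1, 1, 1], 0, db_codes, db_labels)
```

In this toy case the relevant items sit at ranks 1 and 3, so the average precision is the mean of 1/1 and 2/3.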
Hyperparameter analysis
In Eq. (15), this paper uses two hyperparameters, \(\eta \) and \(\zeta \), to weight the impact of the quantization loss and the class-wise loss, respectively, on network optimization. The values of \(\eta \) and \(\zeta \) are determined experimentally, as shown in Tables 3 and 4. The single-label dataset CIFAR-10 and the multi-label dataset NUS-WIDE are used for parameter tuning. \(\eta \) is fixed at 10 when \(\zeta \) is adjusted, and \(\zeta \) is fixed at 0.1 when \(\eta \) is adjusted.
As shown in Table 3, mAP is highest on both datasets when \(\zeta =0.1\). When \(\zeta =0.05\), mAP decreases significantly on CIFAR-10 and slightly on NUS-WIDE. Compared with \(\zeta =0.1\), the mAP for \(\zeta =0.5\) decreases by an average of 2.3% on CIFAR-10 and 0.8% on NUS-WIDE. When \(\zeta =1\), mAP decreases by an average of 2.4% and 1.8% on the two datasets, respectively. Therefore, the experimental results are best when the class-wise loss hyperparameter \(\zeta \) is 0.1.
Figure 4 shows mAP for different values of \(\zeta \) more intuitively: the mAP curves peak when \(\zeta \) is 0.1. As \(\zeta \) becomes larger or smaller, mAP decreases slightly. Therefore, this paper chooses \(\zeta =0.1\) to achieve the best experimental results.
As shown in Table 4, mAP reaches its maximum when \(\eta =10\). Compared with \(\eta =10\), the mAP for \(\eta =1\) decreases by an average of 1.9% on CIFAR-10 and 0.3% on NUS-WIDE, the mAP for \(\eta =5\) decreases by an average of 2.1% and 0.1%, and the mAP for \(\eta =15\) decreases by an average of 2.0% and 0.6%, respectively. Therefore, good results are obtained when the quantization loss hyperparameter \(\eta \) is 10.
Similarly, as shown in Fig. 5, mAP on CIFAR-10 is higher for \(\eta =10\) than for the other values, and the mAP curves reach their peak there. On NUS-WIDE, the mAP values at 48 bits and 64 bits are close for \(\eta =5\) and \(\eta =10\), but \(\eta =10\) is significantly better than \(\eta =5\) at 16 bits and 32 bits. Therefore, this paper sets \(\eta \) to 10 to achieve the best experimental results.
Empirical analysis
To fully extract image features without increasing network complexity, this paper adds the PFA module to the ResNet-18 network, which extracts 3D weights for the features. Compared with common attention modules, the structure of PFA is simple and parameter-free. Meanwhile, to improve the discrimination and accuracy of the hash codes, this paper designs a classification branch in the network. Equation (13) is designed to reduce quantization errors, and Eq. (14) is the class-wise loss generated by the classification layer. Equations (13) and (14) are combined into \({L}_{loss}\) in the ablation experiments. As shown in Table 5, DBDH is selected as the baseline, and the hash code length is 48 bits on the CIFAR-10 dataset. DBDH denotes the baseline model with the AlexNet network. DPFAH1 uses ResNet-18 as the backbone instead of AlexNet. On this basis, DPFAH2 adds the PFA module to the network, and DPFAH3 further adds \({L}_{loss}\). The symbol √ indicates that the corresponding module is added.
As shown in Table 5, adding the PFA module to DPFAH1 increases mAP by 2.95%, which shows that the PFA module improves the accuracy of image retrieval. The mAP of DPFAH3 is up to 0.98% higher than that of DPFAH2, showing the effectiveness of \({L}_{loss}\).
Figure 6a shows the PR curves obtained after adding the PFA module and \({L}_{loss}\), which are significantly higher than that of the baseline model. Figure 6b displays the precision of the first 1000 returned images, where DPFAH3 is clearly better than the others. Hence, the ablation experiments verify the effectiveness of the PFA module and \({L}_{loss}\).
Visualization of hash codes by t-SNE
Figure 7 shows the t-SNE visualization of the hash codes learned by DPFAH and the baseline DBDH on the CIFAR-10 dataset. As shown in Fig. 7a, the hash codes generated by DPFAH show clear discriminative structures in which the hash codes of different categories are well separated, while the hash codes generated by DBDH do not show such clear structures. This indicates that, by introducing the PFA module and \({L}_{loss}\), DPFAH generates more discriminative hash codes than DBDH. DPFAH therefore increases the inter-class distance and reduces the intra-class distance, making the generated hash codes compact and enhancing their representation ability.
Results analysis
Table 6 shows the mAP results of all comparative experiments on CIFAR-10, NUS-WIDE and ImageNet-100 for hash code lengths from 16 to 64 bits. On CIFAR-10, the mAP of DPFAH reaches 83.82%, 84.45%, 85.22% and 85.49%, an average improvement of 3.57% over the baseline model. On NUS-WIDE, the mAP of DPFAH at the different code lengths is 82.98%, 84.90%, 85.41% and 85.80%. Compared with the classic method DHN, DPFAH improves by 6.87%, 5.74%, 6.53% and 5.83% on CIFAR-10 and by 1.90%, 4.21%, 6.87% and 6.70% on NUS-WIDE at the respective code lengths. The improvement is most pronounced on ImageNet-100: compared with the baseline model DBDH, DPFAH improves by 30.62%, 37.94%, 15.86% and 14.74% at the respective code lengths. Hence, these experiments show that the model trained by DPFAH is more robust.
The curve of PR is an evaluation index with precision and recall as variables. Recall in the curve is set as the abscissa and precision is set as the ordinate. If the PR curve of one algorithm is completely surrounded by another algorithm, it can be asserted that the performance of the latter is better than that of the former. Therefore, the performance of the algorithm is judged by the area enclosed by the PR curve. Figure 8 shows the PR curves on dataset CIFAR10. As can be seen from the Figure 8a–d, the curves of DPFAH method are significantly higher than all comparative methods. In particular, when the length of hash codes is 16bit, the enclosed area is much larger than DSDH, which has the best performance among all comparative methods.
Because NUSWIDE is a multilabel dataset and the calculation process is relatively complex, the improvement on NUSWIDE is not as obvious as that on CIFAR10, but it is still the best of all methods. As shown in Figure 9, the mAP of DPFAH is the highest compared with the other eight comparison algorithms. In Figure 9a–d, DPFAH is higher than DSDH that has the best performance among all methods.
Figure 10 shows the PR curves for 16, 32, 48 and 64 bits on the Imagenet100 dataset. The PR curve of DPFAH is significantly higher than those of the other comparison methods, especially in Fig. 10b–d. In Fig. 10a, when recall is greater than about 0.6, the precision of DPFAH is lower than that of DHN; when recall is less than 0.6, the precision of DPFAH is much higher than that of DHN. The overall area enclosed by the PR curve is significantly greater for DPFAH than for DHN.
To achieve the aim that Hamming ranking needs only \(O(1)\) time per search, the evaluation indicator P@H = 2 (precision within Hamming radius 2) is important for retrieval with hash codes. Figure 11 shows the P@H = 2 results on the three datasets; DPFAH obtains the highest precision in the experiment. As the hash code length increases, its precision also increases steadily, which shows that the DPFAH model is more stable than DSH, IDHN and DHN on CIFAR10, NUSWIDE and Imagenet100.
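A minimal sketch of the P@H = 2 indicator is given below, assuming {-1, +1} codes and a 0/1 relevance matrix. The function name and the toy data are hypothetical, and the sketch scans the whole database by brute force for clarity; the constant-time behaviour mentioned above comes from hash-table lookups over the codes within the radius, which are not implemented here.

```python
import numpy as np

def precision_within_radius(query_codes, db_codes, relevance, radius=2):
    """Mean precision over queries of the items within a Hamming radius."""
    n_bits = db_codes.shape[1]
    # Pairwise Hamming distances for +/-1 codes, shape (n_queries, n_db).
    dists = (n_bits - query_codes @ db_codes.T) // 2
    within = dists <= radius
    precisions = []
    for mask, rel in zip(within, relevance):
        n_returned = mask.sum()
        # Assumed convention: a query that retrieves nothing gets precision 0.
        precisions.append(rel[mask].sum() / n_returned if n_returned > 0 else 0.0)
    return float(np.mean(precisions))

# Toy example (hypothetical 4-bit codes): two items fall within radius 2.
queries = np.array([[1, 1, 1, 1]])
database = np.array([[1, 1, 1, 1], [1, 1, 1, -1], [-1, -1, -1, -1], [1, -1, -1, -1]])
relevance = np.array([[1, 0, 1, 1]])
p_at_h2 = precision_within_radius(queries, database, relevance, radius=2)
```

Longer codes spread items over more buckets, so fewer items fall within radius 2; this is why the trend of this indicator with code length is informative.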
Another evaluation metric is the P@N curve, the precision of the top N returned images. The first 1000 returned images are considered in this experiment. Figure 12 shows the P@N results on the CIFAR10 dataset, where DPFAH achieves better precision than the other methods. Specifically, in Fig. 12a, the P@N curve of DPFAH is significantly higher than those of DHN and DSDH. In Fig. 12b–d, although the margin of DPFAH is not as obvious as in Fig. 12a, it still obtains the best precision at 32, 48 and 64 bits.
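For reference, P@N for a single query can be sketched as below. The helper name `precision_at_n` and the toy ranking are assumptions for illustration; in the experiment the values would be averaged over all queries for N up to 1000.

```python
import numpy as np

def precision_at_n(relevant_sorted, n_values):
    """Precision of the top-N returns for each N, given a ranked 0/1 list."""
    rel = np.asarray(relevant_sorted, dtype=float)
    hits = np.cumsum(rel)  # number of relevant items among the top k
    # Skip any N larger than the ranked list itself.
    return {n: float(hits[n - 1] / n) for n in n_values if 1 <= n <= len(rel)}

# Hypothetical ranking of four returned images (1 = relevant to the query).
p_at_n = precision_at_n([1, 0, 1, 1], n_values=[1, 2, 4])
```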
Figure 13 shows the P@N curves on NUSWIDE. As can be seen from Fig. 13a,b, the P@N curves of all methods are relatively stable as the number of returned images increases. Compared with the other algorithms, DPFAH still achieves the highest precision.
Figure 14 shows the P@N curves on the Imagenet100 dataset. As can be seen from Fig. 14b–d, when the length of the hash codes is 32, 48 and 64 bits, DPFAH is clearly better than the other methods. As the number of returned images increases, the precision remains stable, and DPFAH still obtains the best results among all comparison algorithms.
Visualization
In Fig. 15, this paper visualizes the top 10 images returned by DPFAH for eight query images on Imagenet100. The first row shows the labels of the query images, the second row shows the query images, and the remaining rows show the retrieval results of DPFAH. Red boxes mark the false retrieval results.
Conclusions
Existing image retrieval methods based on deep hashing extract image features in an imbalanced and insufficient way. Some scholars propose to embed channelwise or spatialwise attention mechanisms into the network, which adds many parameters to the model and increases the computational complexity. Hence, this paper introduces a PFA module and proposes the DPFAH method. The PFA module is based on wellestablished suppression theory and defines an energy function that determines the importance of each neuron. This module does not add any parameters to the network and directly infers the 3D weights of the feature map. In addition, to generate accurate hash codes that retain the similarity information of the original images, this paper designs a classification branch to optimize the network. The effectiveness of DPFAH is demonstrated by extensive experiments. In particular, the mAP increases by 2.95% when the PFA module is added to the network. Hence, DPFAH yields a better image retrieval model.
Data availability
The CIFAR10, NUSWIDE and Imagenet100 datasets are openly available at: http://www.cs.toronto.edu/kriz/cifar.html (accessed on 8 April 2022), http://lms.comp.nus.edu.sg/research/NUSWIDE.html (accessed on 8 April 2022) and https://imagenet.org (accessed on 8 April 2022).
References
Qiao, C., Brown, K., Zhang, F., & Tian, Z.H. Federated adaptive asynchronous clustering algorithm for wireless mesh networks. in IEEE Transactions on Knowledge and Data Engineering. 3119550. (2021).
Lu, H. et al. DeepAutoD: Research on distributed machine learning oriented scalable mobile communication security unpacking system. in IEEE Transactions on Network Science and Engineering. (2021).
Liu, H. & Wang, R. Deep supervised hashing for fast image retrieval. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2064–2072 (2016).
Zhu, H. et al. Deep hashing network for efficient similarity retrieval. Proc. AAAI Conf. Artif. Intell. 30, 1 (2016).
Jiang, Q. Y., Cui, X. & Li, W. J. Deep supervised discrete hashing. IEEE Trans. Image Process. 27, 5996–6009 (2018).
Zhu, H. & Gao, S. Locality constrained deep supervised hashing for image retrieval. in Proceedings of the International Conference on Artificial Intelligence. 3567–3573. (2017).
Zhang, Z. et al. Improved deep hashing with soft pairwise similarity for multilabel image retrieval. IEEE Trans. Multimed. 22, 540–553 (2019).
Yan, X., Zhu, F. & Yu, P. S. Featurebased similarity search in graph structures. ACM Trans. Database Syst. 31, 1418–1453 (2006).
Cheng, H.D. & Shi, X.J. A simple and effective histogram equalization approach to image enhancement. Digital Signal Process. 158–170. (2004).
Liu, D., Shen, J., Xia, Z. & Sun, X. A contentbased image retrieval scheme using an encrypted difference histogram in cloud computing. Information 8, 96 (2017).
Zheng, L. & Yang, Y. A decade survey of instance retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 40, 1224–1244 (2018).
Cheng, S., Wang, L. & Du, A. Deep semanticpreserving reconstruction hashing for unsupervised crossmodal retrieval. Entropy 22, 1266 (2020).
Li, Y. & Pei, W. Push for Quantization: Deep Fisher Hashing. arXiv preprint arXiv:1909.00206 (2019).
Zheng, X., Zhang, Y. & Lu, X. Q. Deep balanced discrete hashing for image retrieval. Neurocomputing 403, 224–236 (2020).
Paulevé, L., Jégou, H. & Amsaleg, L. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognit. Lett. 31, 1348–1358 (2010).
Bai, X. et al. Datadependent hashing based on pstable distribution. IEEE Trans. Image Process. 23, 5033–5046 (2014).
Lv, N. & Wang, Y. Deep hashing for motion capture data retrieval. in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2215–2219. (2021).
Cao, Z. et al. HashNet: Deep learning to hash by continuation. in Proceedings of the IEEE International Conference on Computer Vision. 5608–5617. (2017).
Li, X. et al. Image retrieval using a deep attentionbased hash. IEEE Access. 8, 142229–142242 (2020).
Yang, L., Zhang, R.Y., Li, L. & Xie, X.H. Simam: A simple, parameterfree attention module for convolutional neural networks. in International Conference on Machine Learning. 11863–11874. (2021).
Zhe, X. et al. Semantic Hierarchy Preserving Deep Hashing for LargeScale Image Retrieval. arXiv:1901.11259 (2019).
Chai, Y.H. et al. Dynamic prototype network based on sample adaptation for fewshot malware detection. in IEEE Transactions on Knowledge and Data Engineering. (2022).
Luo, C. C. et al. A novel web attack detection system for internet of things via ensemble classification. IEEE Trans. Indus. Inf. 17, 5810–5818 (2020).
Sun, Y. et al. Honeypot identification in softwarized industrial cyberphysical systems. IEEE Trans. Indus. Inf. 17, 5542–5551 (2021).
Qiu, J. et al. NeiTTE: Intelligent traffic time estimation based on finegrained time derivation of road segments for smart city. IEEE Trans. Indus. Inf. 16, 2659–2666 (2020).
Weiss, Y. & Torralba, A. Spectral hashing. NIPS 1, 4 (2008).
Liu, W. et al. Hashing with graphs. in Proceedings of the 28th International Conference on Machine Learning. (2011).
Jin, S., Yao, H. & Sun, X. Unsupervised semantic deep hashing. Neurocomputing 351, 19–25 (2019).
Yang, E. et al. Semantic structurebased unsupervised deep hashing. in Proceedings of the 27th International Joint Conference on Artificial Intelligence. 1064–1070. (2018).
Jiang, S., Wang, L. & Cheng, S. Unsupervised hashing with gradient attention. Symmetry. 12, 1193 (2020).
Huang, L.K., Chen, J. & Pan, S.J. Accelerate learning of deep hashing with gradient attention. in Proceedings of the IEEE/CVF International Conference on Computer Vision. 5271–5280. (2019).
Xia, R. & Pan, Y. Supervised hashing for image retrieval via image representation learning. in Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 28. (2014).
Li, W.J. & Wang, S. Feature Learning Based Deep Supervised Hashing with Pairwise Labels. arXiv:1511.03855 (2015).
Yang, W. et al. Deep hash with improved dual attention for image retrieval. Information 12, 285 (2021).
Su, S., Zhang, C., Han, K. & Tian, Y.H. Greedy hash: Towards fast optimization for accurate hash coding in CNN. in Proceedings of the 32nd International Conference on Neural Information Processing Systems. 806–815. (2018).
Zhang, Z., Zou, Q. & Wang, Q. Instance Similarity Deep Hashing for MultiLabel Image Retrieval. arXiv:1803.02987 (2018).
Cao, Y. et al. Deep Cauchy hashing for hamming space retrieval. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1229–1237. (2018).
Zhe, X., Chen, S. & Yan, H. Deep classwise hashing: Semanticspreserving hashing via classwise loss. IEEE Trans. Neural Netw. Learn. Syst. 31, 1681–1692 (2019).
Selvaraju, R., Cogswell, M. & Das, A. GradCAM: Visual explanations from deep network via gradientbased localization. in IEEE Conference on Computer Vision and Pattern Recognition. 618–626. (2017).
Webb, B. S., Dhruv, N. T. & Solomon, S. G. Early and late mechanisms of surround suppression in striate cortex of macaque. Neuroscience 25, 11666–11675 (2005).
Funding
This research was funded by the Tianshan Innovation Team of Xinjiang Uygur Autonomous Region under Grant 2020D14044.
Author information
Contributions
Conceptualization, W.Y.; methodology, W.Y.; software, W.Y. and S.C.; validation, S.C. and L.W.; formal analysis, L.W. and S.C.; data curation, W.Y.; writing (original draft preparation), W.Y.; writing (review and editing), L.W. and S.C. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yang, W., Wang, L. & Cheng, S. Deep parameterfree attention hashing for image retrieval. Sci Rep 12, 7082 (2022). https://doi.org/10.1038/s41598022112175