Introduction

With the advancement of medical care1, a variety of advanced medical imaging technologies have been applied in modern clinical analysis. Hospitals generate large numbers of multi-modal neuroimages every moment; it is therefore necessary to establish an effective neuroimage cross-modal approximate nearest neighbor retrieval system to assist clinicians in navigating the data. Neuroimage cross-modal retrieval aims to provide doctors with similar, already-diagnosed neuroimages from different modalities. An effective neuroimage cross-modal retrieval system can reduce the error rate of clinical diagnosis for novice doctors and improve diagnostic efficiency for skilled physicians.

Remarkable achievements have been made in large-scale data processing based on deep neural networks in computer vision2,3,4,5, the Internet of Things (IoT)6,7,8, nearest neighbor retrieval9, 10, and intelligent networks11, 12. Nearest neighbor retrieval methods learn discriminative representations in a common space, and can be roughly classified, by the type of values used in that space, into cross-modal hash retrieval and cross-modal real-value retrieval10, 13. Cross-modal hashing is an efficient method that embeds high-dimensional heterogeneous modal feature descriptors into a low-dimensional Hamming space. Owing to its favorable trade-off between retrieval efficiency and storage cost, learning to hash has been widely used in approximate nearest neighbor retrieval of large-scale multi-media data; in particular, using cross-modal hashing to assist doctors in effective clinical diagnosis has attracted increasing attention from researchers.

Features of different modalities usually follow different data distributions and are generated from different manifold spaces, so a basic challenge of cross-modal retrieval is to bridge the modality gap. Most existing cross-modal hashing methods can bridge this heterogeneous modality gap14,15,16, but two challenges still limit retrieval accuracy: (1) ignoring the continuous similarity of samples on the manifold; (2) a lack of discriminability among hash codes with the same semantics. We argue that (1) is the cause of (2). Therefore, we propose a Deep Consistency-Preserving Hash Auto-encoders model, called DCPHA, based on the multi-manifold property of multi-modal hash codes distributed in Hamming space. In addition, we define the continuous similarity of heterogeneous and homogeneous samples on Riemannian manifolds from the perspective of multiple sub-manifolds, and propose two constraints, i.e., multi-semantic consistency and multi-manifold similarity-preserving. We also prove theoretically that the multi-manifold similarity-preserving constraint has manifold-preserving invariance.

The main contributions of our work can be summarized as follows:

  1. We propose a Deep Consistency-Preserving Hash Auto-encoders model, called DCPHA, based on the multi-manifold property of the feature distribution, for neuroimage cross-modal retrieval. DCPHA is an end-to-end model consisting of asymmetric auto-encoders and two semantics-preserving attention branches.

  2. We propose multi-semantic consistency and multi-manifold similarity-preserving constraints based on the multi-manifold property of multi-modal hash codes, and prove theoretically that the multi-manifold similarity-preserving constraint has manifold-preserving invariance.

  3. Without loss of generality, we comprehensively evaluate DCPHA on four benchmark datasets and conduct detailed ablation experiments to validate its effectiveness. Extensive experiments demonstrate the advantages of the proposed DCPHA over 15 advanced cross-modal retrieval methods.

Deep consistency-preserving hash auto-encoders

In this section, the proposed DCPHA model is described in detail, including its formulation, deep architecture and objective function. The deep architecture of DCPHA is shown in Fig. 1. The model consists of asymmetric auto-encoders and two semantics-preserving attention branches. The encoder extracts features from neuroimages of different modalities, and the decoder maps the features into Hamming space by a non-linear transformation. The semantics-preserving attention branches work in the encoding and decoding stages, respectively, to ensure that both the learned features and the hash codes are semantics-consistent. Two constraints, i.e., multi-semantic consistency and multi-manifold similarity-preserving, are embedded in the learning of hash codes to obtain high-quality, discriminative hash codes.

Figure 1

The proposed DCPHA model consists of asymmetric auto-encoders and two semantics-preserving attention branches. The encoder is used to extract features from neuroimages of different modalities, and the decoder is designed to map the features into Hamming space by a non-linear transformation. Best viewed in color.

Notations and definitions

In this subsection, the notations and definitions used in the following equations are introduced. Without loss of generality, we suppose that there are \({\mathscr {N}}\) multi-modal sample sets in the sample space \(\psi\), \(\psi =\left\{ X_i\right\} , i\in \left[ 1,{\mathscr {N}}\right]\). Each multi-modal sample set \(X_i\) consists of different medical imaging scans from the same subject (e.g. MRI and PET), \(X_i=\left\{ x_i^m\right\} , m\in \left[ 1,{\mathscr {M}}\right]\), where \({\mathscr {M}}\) denotes the number of different medical imaging modalities. \(x_i^m\) denotes the i-th subject of the m-th modality, assumed to have dimension \({\mathscr {Z}}\). Since the samples within the same multi-modal sample set originate from the same subject, they naturally share the same semantics, which is why our method is appropriate for neuroimages. A one-hot vector \(\ell _i\) is assigned to each multi-modal sample set, \(\ell _i=\left[ l_1,l_2,\cdots ,l_c,\cdots ,l_C\right]\), where C denotes the number of categories. When the multi-modal sample set \(X_i\) belongs to the c-th category, \(l_c=1\) and the remaining entries are 0. DCPHA consists of an asymmetric encoder and decoder. The encoder learns the feature \(f_i^m\) of sample \(x_i^m\), where the dimension of \(f_i^m\) is \({\mathscr {D}}\) with \({\mathscr {D}}\ll {\mathscr {Z}}\). The decoder maps the feature \(f_i^m\) into Hamming space by a non-linear transformation. Let the hash code of sample \(x_i^m\) be \(h_i^m\), \(h_i^m\in \left\{ -1,1\right\} ^{\mathscr {K}}\). Our goal is to learn an end-to-end non-linear hash function \({\mathscr {F}}\) that extracts features of multi-modal medical imaging and encodes them into high-quality hash codes with semantics-consistency and similarity-preserving, \(h_i^m={\mathscr {F}}\left( x_i^m;\theta \right)\). The terms, notations, definitions and types involved in this work are comprehensively shown in Table 1.
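As a concrete illustration of this notation, the following minimal sketch builds one-hot labels and sign-quantized hash codes; all array names and sizes are illustrative, not taken from the authors' code.

```python
import numpy as np

# Illustrative sizes only; the paper's actual N, M, C, K differ.
N_SUBJECTS, N_MODALITIES, N_CLASSES, CODE_LEN = 4, 2, 3, 16

def one_hot(c, n_classes):
    """Build the one-hot label vector ell_i for category c (l_c = 1, rest 0)."""
    ell = np.zeros(n_classes)
    ell[c] = 1.0
    return ell

# One label per multi-modal sample set; samples within a set share it.
labels = np.stack([one_hot(i % N_CLASSES, N_CLASSES) for i in range(N_SUBJECTS)])

def binarize(f):
    """Map real-valued decoder outputs to hash codes in {-1, +1}^K."""
    return np.where(f >= 0, 1, -1)

codes = binarize(np.random.randn(N_MODALITIES, N_SUBJECTS, CODE_LEN))
```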

Table 1 The terms, notations, definitions and types involved in this work.
Figure 2

The visualization of multiple sub-manifolds in a local sample space. Points A1, A2 and B1 are drawn from two different manifolds. The green solid line connects two homogeneous manifold samples, i.e., represents the similarity between two homogeneous manifold samples, and the yellow solid line links two heterogeneous manifold samples, i.e., represents the similarity between two heterogeneous manifold samples. Best viewed in color.

Previous works17,18,19,20 have illustrated that multi-modal data contain multiple sub-manifolds. The visualization of multiple sub-manifolds in a local sample space is shown in Fig. 2. We define the sub-manifold similarity and multi-manifold similarity from the local and global perspectives, respectively, as follows.

Definition 1

Heterogeneous manifold similarity. A local manifold similarity definition. Assuming that there are \({\mathscr {M}}\) modal neuroimages in the sample space, and each modality contains \({\mathscr {N}}\) samples, the heterogeneous manifold similarity \({{\textbf {S}}}_{{\textbf {H}}}\) between any two samples of different modalities is defined as Eq. (1):

$$\begin{aligned} {{\textbf {S}}}_{{\textbf {H}}} = \begin{bmatrix} S_H(h_1^m,h_1^n) & \cdots & S_H(h_1^m,h_{\mathscr {N}}^n) \\ \vdots & \ddots & \vdots \\ S_H(h_{\mathscr {N}}^m,h_1^n) & \cdots & S_H(h_{\mathscr {N}}^m,h_{\mathscr {N}}^n) \end{bmatrix}_{{\mathscr {N}}\times {\mathscr {N}}} \end{aligned}$$
(1)

with

$$\begin{aligned} S_H\left( h_i^m,h_j^n\right) = e^{\frac{-D^2\left( h_i^m,h_j^n\right) }{\tau }} \end{aligned}$$
(2)
$$\begin{aligned} D\left( h_i^m,h_j^n\right) = \left\{ \begin{array}{ll} \sqrt{1-e^{-d\left( h_i^m,h_j^n\right) }}, & l_i=l_j \\ \sqrt{e^{-d\left( h_i^m,h_j^n\right) }}, & l_i\ne l_j \end{array} \right. \end{aligned}$$
(3)

where \(S_H\left( h_i^m,h_j^n\right)\) denotes the heterogeneous manifold similarity between the i-th sample of the m-th modality and the j-th sample of the n-th modality; the calculation is given in Eq. (2). \(\tau\) is the heat kernel constant. \(D(\cdot )\) in Eq. (3) is a modified distance metric based on the standard Euclidean distance \(d(\cdot )\).
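Under these definitions, Eqs. (1)–(3) can be sketched as follows; this is a minimal NumPy illustration and the function names are ours, not the authors'.

```python
import numpy as np

def modified_distance(hi, hj, same_label):
    """Eq. (3): modified metric based on the standard Euclidean distance d(., .)."""
    d = np.linalg.norm(hi - hj)
    if same_label:
        return np.sqrt(1.0 - np.exp(-d))   # shrinks as same-class codes get close
    return np.sqrt(np.exp(-d))             # grows as cross-class codes get close

def heterogeneous_similarity(Hm, Hn, lm, ln, tau=1.0):
    """Eqs. (1)-(2): the N x N matrix S_H between modalities m and n."""
    N = Hm.shape[0]
    S = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            D = modified_distance(Hm[i], Hn[j], lm[i] == ln[j])
            S[i, j] = np.exp(-D ** 2 / tau)  # heat kernel of Eq. (2)
    return S
```

With identical codes and equal labels, \(D=0\) and the similarity attains its maximum of 1, so same-class pairs are rewarded for being close in Hamming space.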

Definition 2

Homogeneous manifold similarity. A local manifold similarity definition. In the sample space, the homogeneous manifold similarity \({{\textbf {S}}}_{{\textbf {I}}}\) between samples from the same modality is defined as Eq. (4):

$$\begin{aligned} {{\textbf {S}}}_{{\textbf {I}}} = \begin{bmatrix} S_I(h_1^\cdot ,h_1^\cdot ) & \cdots & S_I(h_1^\cdot ,h_{\mathscr {N}}^\cdot ) \\ \vdots & \ddots & \vdots \\ S_I(h_{\mathscr {N}}^\cdot ,h_1^\cdot ) & \cdots & S_I(h_{\mathscr {N}}^\cdot ,h_{\mathscr {N}}^\cdot ) \end{bmatrix}_{{\mathscr {N}}\times {\mathscr {N}}} \end{aligned}$$
(4)

with

$$\begin{aligned} S_I(h_i^\cdot ,h_j^\cdot )=\frac{{h_i^{\cdot }}^T \cdot h_j^\cdot }{\Vert h_i^{\cdot }\Vert \Vert h_j^{\cdot }\Vert } \end{aligned}$$
(5)

where \(S_I(h_i^\cdot ,h_j^\cdot )\) denotes the homogeneous manifold similarity between the i-th sample and the j-th sample from the same modality. The calculation is the dot product between the \(\ell _2\)-normalized \(h_i^{\cdot }\) and \(h_j^{\cdot }\) (i.e., cosine similarity), as shown in Eq. (5).
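Eqs. (4)–(5) amount to a cosine-similarity Gram matrix; a one-function sketch:

```python
import numpy as np

def homogeneous_similarity(H):
    """Eqs. (4)-(5): cosine similarity between all code pairs of one modality."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)  # l2-normalize each row
    return Hn @ Hn.T
```

The diagonal is identically 1, since every code is maximally similar to itself.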

Definition 3

Multi-manifold similarity. A global manifold similarity definition. Assuming that there are \({\mathscr {M}}\) modal neuroimages in the sample space and each modality contains \({\mathscr {N}}\) samples, the multi-manifold similarity \({{\textbf {S}}}_{{\textbf {M}}}\) is defined as Eq. (6):

$$\begin{aligned} {{\textbf {S}}}_{{\textbf {M}}} = \begin{bmatrix} S_I^1 & \cdots & S_H^{1,{\mathscr {M}}} \\ \vdots & \ddots & \vdots \\ S_H^{{\mathscr {M}},1} & \cdots & S_I^{\mathscr {M}} \end{bmatrix}_{{\mathscr {M}}{\mathscr {N}}\times {\mathscr {M}}{\mathscr {N}}} \end{aligned}$$
(6)

where \(S_I^1\) denotes the homogeneous manifold similarity among the samples of the first modality, and similarly for \(S_I^{{\mathscr {M}}}\). \(S_H^{1,{\mathscr {M}}}\) denotes the heterogeneous manifold similarity between the first-modality samples and the \({\mathscr {M}}\)-th-modality samples, and similarly for \(S_H^{{\mathscr {M}},1}\).
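The block structure of Eq. (6) can be assembled from the two local similarities. The sketch below assumes all modalities share one label vector per subject (the strongly paired setting used here); helper names are ours.

```python
import numpy as np

def s_homo(H):
    """Eq. (5): cosine similarity within one modality."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    return Hn @ Hn.T

def s_hetero(Hm, Hn, lm, ln, tau=1.0):
    """Eqs. (2)-(3), vectorized over all cross-modality pairs."""
    d = np.linalg.norm(Hm[:, None, :] - Hn[None, :, :], axis=-1)
    same = lm[:, None] == ln[None, :]
    D = np.where(same, np.sqrt(1.0 - np.exp(-d)), np.sqrt(np.exp(-d)))
    return np.exp(-D ** 2 / tau)

def multi_manifold_similarity(codes, labels, tau=1.0):
    """Eq. (6): MN x MN matrix with S_I blocks on the diagonal, S_H elsewhere."""
    M = len(codes)
    return np.block([[s_homo(codes[m]) if m == n
                      else s_hetero(codes[m], codes[n], labels, labels, tau)
                      for n in range(M)] for m in range(M)])
```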

Objective functions and theory

In this subsection, the theoretical derivation of the proposed multi-semantic consistency and multi-manifold similarity-preserving constraints is presented. Alexey et al.21 propose that a good feature representation should ensure that the mapping from the input image \(x_i^m\) to the feature \(f_i^m\) satisfies two requirements: (1) there must be at least one feature that is similar for images with the same semantics; (2) there must be at least one feature that is sufficiently different for images with different semantics. However, previous works14,15,16 tend to over-satisfy the first requirement for hash codes, because they ignore the fact that samples with the same semantics have continuous similarity on the manifold. Constructing the similarity matrix directly from semantic labels leads to samples with the same semantics being encoded into the same hash code, causing a lack of discriminability among hash codes with the same semantics. To solve this problem, we propose a multi-semantic consistency loss and a multi-manifold similarity-preserving loss. The multi-semantic consistency loss ensures that hash codes with different semantics are discriminative. On this basis, the multi-manifold similarity-preserving loss defines continuous similarity among samples in terms of multiple sub-manifolds, ensuring that hash codes with the same semantics are discriminative as well.

Multi-semantic consistency

The multi-semantic consistency constraint aligns the intermediate features generated by the encoder and the hash codes generated by the decoder with the high-level semantics of the input samples, guaranteeing that hash codes with different semantics have case-level discriminability. It is calculated as follows.

First, the sample \(x_i^m\) is mapped by the encoder to the feature \(f_i^m\), \(f_i^m=Encoder\left( x_i^m\right)\), and the feature \(f_i^m\) is passed through the Semantic Preserving Attention Branch (SPAB) to obtain the feature prediction label \(y_i^m\), \(y_i^m=SPAB_E(f_i^m)\). The decoder maps the features into Hamming space by a non-linear transformation: \(f_i^m\) is fed into the decoder to obtain the hash code \(h_i^m\), \(h_i^m=Decoder\left( f_i^m\right)\). The hash code \(h_i^m\) is then input into the SPAB working in the decoding stage to obtain the hash code prediction label \(r_i^m\), \(r_i^m=SPAB_D(h_i^m)\). The multi-semantic consistency loss is shown in Eq. (7), where \(\parallel \cdot \parallel _{\mathscr {F}}\) denotes the Frobenius norm.

$$\begin{aligned} {\mathscr {J}}_1=\sum _{m=1}^{\mathscr {M}}\sum _{i=1}^{\mathscr {N}} \parallel y_i^m-\ell _i\parallel _{\mathscr {F}}+ \parallel r_i^m-\ell _i\parallel _{\mathscr {F}}+ \parallel y_i^m-r_i^m\parallel _{\mathscr {F}} \end{aligned}$$
(7)
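A direct reading of Eq. (7) in NumPy; the array shapes (M modalities x N samples x C classes) are our assumption for illustration.

```python
import numpy as np

def multi_semantic_consistency(Y, R, L):
    """Eq. (7): J1 summed over modalities m and samples i.
    Y[m, i]: label predicted from the feature, R[m, i]: label predicted from
    the hash code, L[i]: one-hot ground truth shared by all modalities."""
    j1 = 0.0
    for m in range(Y.shape[0]):
        for i in range(Y.shape[1]):
            j1 += (np.linalg.norm(Y[m, i] - L[i])        # feature vs label
                   + np.linalg.norm(R[m, i] - L[i])      # code vs label
                   + np.linalg.norm(Y[m, i] - R[m, i]))  # feature vs code
    return j1
```

The loss vanishes exactly when both branch predictions match the ground-truth label for every sample of every modality.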

Multi-manifold similarity-preserving

On the basis of multi-semantic consistency, multi-manifold similarity-preserving defines continuous similarity among samples in terms of multiple sub-manifolds, ensuring that hash codes with the same semantics are discriminative as well. According to previous work18,19,20, neuroimages of different modalities are distributed on different sub-manifolds of the sample space, which aggregate into a sophisticated multi-manifold structure. Based on Definitions 1, 2 and 3, we derive the following optimization problem as Eq. (8):

$$\begin{aligned} {\mathscr {J}}_2=\sum _{m,n=1}^{M}\sum _{i,j=1}^{N}\left( \log \left( 1+e^{S_M\left( h_i^m, h_j^n\right) }\right) -I\left( \ell _i, \ell _j\right) \times S_M\left( h_i^m, h_j^n\right) \right) \end{aligned}$$
(8)

where \(S_M\left( \cdot ,\cdot \right)\) denotes the multi-manifold similarity. \(I\left( \cdot ,\cdot \right)\) is an indicator function that takes the value 1 if \(\ell _i=\ell _j\) and 0 otherwise. Other notations and their explanations can be found in Table 1. \({\mathscr {J}}_2\) is the similarity-preserving loss defined on multiple manifolds, which ensures that samples with the same semantics are decoded into discriminative hash codes.
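Eq. (8) can be read as a logistic-style pairwise loss over a precomputed similarity matrix; a minimal sketch (the flattened sample indexing is ours):

```python
import numpy as np

def similarity_preserving_loss(SM, labels):
    """Eq. (8): sum over all pairs of log(1 + e^{S_M}) - I(l_i, l_j) * S_M.
    SM: (MN, MN) multi-manifold similarities; labels: length-MN class ids,
    i.e. the per-subject labels tiled across modalities."""
    same = (labels[:, None] == labels[None, :]).astype(float)  # indicator I
    return float(np.sum(np.log1p(np.exp(SM)) - same * SM))
```

For a same-class pair the term decreases as the pair's similarity grows, while for a cross-class pair it increases, which is the intended similarity-preserving pressure.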

Belkin22 used the correspondence between the graph Laplacian and the Laplace-Beltrami operator on the manifold, together with its connections to the heat equation, to propose a non-linear dimensionality reduction method from Riemannian space to Euclidean space (i.e., Laplacian Eigenmaps). The objective function is given in Eq. (9):

$$\begin{aligned} {\mathscr {L}}_{laplacian}=\sum _{m,n=1}^{M}\sum _{i,j=1}^{N}\frac{1}{2}S_{i,j}\times \parallel H_i^m-H_j^n\parallel ^2_2 \end{aligned}$$
(9)

where \(H_i^m=\frac{h_i^m}{\parallel h_i^m\parallel }\), i.e., the \(\ell _2\)-normalized hash code.
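For reference, Eq. (9) agrees numerically with the standard graph-Laplacian trace form; a small check in the single-modality case (the flattening is our assumption):

```python
import numpy as np

def laplacian_objective(S, H):
    """Eq. (9): (1/2) * sum_ij S_ij * ||H_i - H_j||^2 with l2-normalized rows."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    sq = np.sum((Hn[:, None, :] - Hn[None, :, :]) ** 2, axis=-1)
    return float(0.5 * np.sum(S * sq))

def laplacian_trace_form(S, H):
    """Equivalent tr(H^T L H) with graph Laplacian L = D - S (S symmetric)."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    Lap = np.diag(S.sum(axis=1)) - S
    return float(np.trace(Hn.T @ Lap @ Hn))
```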

Theorem 1

If \(\log \left( 1+e^{S_M\left( h_i^m,h_j^n\right) }\right) =2S_M\left( h_i^m,h_j^n\right)\), then \({\mathscr {J}}_2\) is equivalent to \({\mathscr {L}}_{laplacian}\), i.e., \({\mathscr {J}}_2\) has manifold-preserving invariance.

The theoretical proof of Theorem 1 is provided in the supplementary material. It indicates that minimizing Eq. (8) is equivalent to the standard manifold embedding problem formulated in Eq. (9). The multi-manifold similarity-preserving term essentially provides a measure of sub-manifold similarity preservation. Therefore, Eq. (8) is a reasonable formulation of multi-manifold similarity-preserving.

Combining Eqs. (7) and (8), the objective function of DCPHA is:

$$\begin{aligned} {\mathscr {J}}={\alpha {\mathscr {J}}}_1+{\beta {\mathscr {J}}}_2+\sum _{m,n=1}^M\sum _{i,j=1}^N \parallel h_i^m-{{\textbf {1}}}\parallel _1+\parallel h_j^n-{{\textbf {1}}}\parallel _1 \end{aligned}$$
(10)

where \(\alpha\) and \(\beta\) are the contribution weights of \({\mathscr {J}}_1\) and \({\mathscr {J}}_2\), respectively. The third term is a regularization term used to avoid vanishing gradients23.
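Putting the pieces together, Eq. (10) can be sketched as below. One labeled assumption: Eq. (10) writes \(\parallel h-{{\textbf {1}}}\parallel _1\), and quantization regularizers of this kind23 are commonly applied to the element-wise absolute value of the relaxed codes; that reading is what this sketch implements.

```python
import numpy as np

def total_objective(j1, j2, codes, alpha=0.3, beta=1.0):
    """Eq. (10): J = alpha*J1 + beta*J2 + quantization regularizer.
    codes: (M, N, K) real-valued relaxed hash codes.
    ASSUMPTION: the L1 term is taken as || |h| - 1 ||_1 (a common reading of
    this style of regularizer), pushing every entry toward {-1, +1}."""
    reg = float(np.sum(np.abs(np.abs(codes) - 1.0)))
    return alpha * j1 + beta * j2 + reg
```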

Refinement learning and optimization

The network structure of DCPHA consists of asymmetric auto-encoders and two semantics-preserving attention branches, which work in the feature encoding and hash decoding stages, respectively. The encoder adopts a standard CNN structure, the decoder uses a lightweight fully-connected network, and each semantics-preserving attention branch is a linear multi-layer perceptron. Eq. (10) is therefore a non-convex function with multiple parameters. We use stochastic gradient descent and an iterative learning strategy with the Adam optimizer24 to learn the parameters and update the network. The complete training of DCPHA consists of three steps: (1) pre-training the encoder, (2) pre-training the decoder and (3) fine-tuning DCPHA, which are described in detail below.

Algorithms a, b and c (pre-training the encoder, pre-training the decoder, and fine-tuning DCPHA, respectively).
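The three-step procedure above can be organized as in the skeleton below. The stand-in linear networks and the update rule are ours (the paper uses a CNN encoder, a fully-connected decoder and Adam), so this only illustrates the staging, not the actual optimization.

```python
import numpy as np

rng = np.random.default_rng(0)

class LinearNet:
    """Stand-in for the CNN encoder / fully-connected decoder."""
    def __init__(self, d_in, d_out):
        self.W = rng.standard_normal((d_in, d_out)) * 0.1
    def __call__(self, x):
        return x @ self.W

def train_stage(nets, steps, lr=1e-6):
    """Placeholder update loop; a real run would backpropagate Eq. (10)."""
    for _ in range(steps):
        for net in nets:
            net.W -= lr * np.sign(net.W)   # illustrative step, not Adam
    return nets

encoder = LinearNet(64, 16)   # features f of dimension D = 16
decoder = LinearNet(16, 8)    # hash codes of length K = 8

train_stage([encoder], steps=10)          # (1) pre-train encoder
train_stage([decoder], steps=10)          # (2) pre-train decoder
train_stage([encoder, decoder], steps=5)  # (3) fine-tune DCPHA end-to-end

x = rng.standard_normal((4, 64))
codes = np.where(decoder(encoder(x)) >= 0, 1, -1)
```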

Experiment

Implementation details

All experiments were conducted on a Tesla V100-SXM2 GPU using the same settings. To ensure impartiality and objectivity, all comparison models adopt AlexNet as the backbone network for feature extraction; apart from this shared backbone configuration, all comparison methods use their original code implementations. The batch size is 20, the number of iterations is 500, and the initial learning rate is set to \(10^{-6}\).

Datasets

ADNI225 contains 579 subjects with T1-weighted sMRI and 500 subjects with PET. We adopt a single-slice, strongly paired data preprocessing method. In total, we collected 300 pairs (600 images) of sMRI and PET neuroimages as the dataset.

OASIS326. We collected T1-weighted MRI and PET images of 300 subjects from the OASIS3 dataset, for a total of 600 images. We strongly paired the two modal images of each subject to form a cross-modal paired dataset for training. We divided both datasets into a training set and a test set at a ratio of 8:2. The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.

Compare with 15 advanced methods

In these experiments, we use the mean average precision (mAP) scores of all returned results with cosine similarity as the quantitative metric. The mAP score jointly considers ranking information and precision and is a widely used evaluation criterion in cross-modal hashing. We report the mAP scores of the compared methods on two cross-modal retrieval tasks: (1) retrieving PET samples using T1-weighted MRI queries (M\(\rightarrow\)P) and (2) retrieving T1-weighted MRI samples using PET queries (P\(\rightarrow\)M). The comparison experiments on the ADNI2 and OASIS3 datasets are shown in Tables 2 and 3, respectively. DCPHA achieves state-of-the-art performance on the test set of each dataset. The detailed analysis is as follows.
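For reference, mAP over all returned results with cosine-similarity ranking can be computed as follows; this is a generic sketch, not the authors' evaluation script.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """mAP over all returned results, ranking the database by cosine similarity."""
    qn = query_codes / np.linalg.norm(query_codes, axis=1, keepdims=True)
    dn = db_codes / np.linalg.norm(db_codes, axis=1, keepdims=True)
    aps = []
    for q, ql in zip(qn, query_labels):
        order = np.argsort(-(dn @ q))                  # most similar first
        rel = (db_labels[order] == ql).astype(float)   # relevance at each rank
        if rel.sum() == 0:
            continue                                   # no relevant item for query
        prec_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append(np.sum(prec_at_k * rel) / rel.sum())
    return float(np.mean(aps))
```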

Table 2 The mAP scores of cross-modal retrieval on ADNI2 with different lengths of hash codes.

The mAP scores of neuroimage cross-modal retrieval on ADNI2 are shown in Table 2. As can be seen, the proposed DCPHA outperforms its 15 advanced counterparts. For the average mAP score with 128-bit hash codes on ADNI2, DCPHA outperforms the sub-optimal models DIHN, DPN, and CSQ by \(6.86\%\), \(5.16\%\), and \(3.85\%\), respectively. In other words, our method significantly improves the performance of neuroimage cross-modal retrieval. For further comparison, the precision curves are plotted in Fig. 3. They are consistent with the mAP results in Table 2, with DCPHA performing best.

Table 3 The mAP scores of cross-modal retrieval on OASIS3 with different lengths of hash codes.
Figure 3

The precision curves of DCPHA and comparisons on ADNI2 dataset.

Figure 4

The precision curves of DCPHA and comparisons on OASIS3 dataset.

We also evaluated DCPHA on the OASIS3 dataset for cross-modal retrieval. The mAP scores are shown in Table 3. The results show that DCPHA achieves the highest retrieval mAP scores on most metrics compared to the 15 advanced retrieval methods. The proposed DCPHA improves on the best counterpart, DSH, by \(2.53\%\) in average mAP with 16-bit hash codes. Although the average mAP score of DSH with 32-bit hash codes is higher than that of DCPHA, the model constructed on the multi-manifold property of the data distribution retains a clear advantage on multi-modal tasks, and its performance is more stable across different hash code lengths. We plotted the precision curves of the different methods on OASIS3, as shown in Fig. 4. The visualization shows that the proposed DCPHA also outperforms all compared methods, which is consistent with the mAP results.

The further analysis of DCPHA

In this subsection, we further analyze the proposed DCPHA through ablation experiments, hyper-parameter sensitivity analysis and a comparison on a natural image benchmark dataset.

Ablation experiments

The objective function of DCPHA is mainly composed of a multi-manifold similarity-preserving loss and a multi-semantic consistency loss. To study the contribution of these components in more detail, we developed and evaluated two variants of DCPHA, i.e., DCHA and DPHA. DCHA uses only the multi-semantic consistency loss as the objective function, and DPHA uses only the multi-manifold similarity-preserving loss. Table 4 shows the results of the ablation experiments on ADNI2. Both multi-manifold similarity-preserving and multi-semantic consistency contribute to the final retrieval performance of the model. DCHA performs better when the hash code is shorter, while DPHA achieves higher mAP scores when the hash code is longer; in either case, optimizing the two objectives jointly outperforms optimizing either one alone.

Table 4 The mAP of ablation experiments on ADNI2 with different lengths of hash codes.

Hyper-parameter sensitivity analysis

The objective function of DCPHA contains two hyper-parameters, \(\alpha\) and \(\beta\), which control the weight ratio between the losses in Eq. (10), and we investigate their effect. First, we fix the length of the hash code \({\mathscr {K}}\) to 32. Then, we vary \(\alpha\) and \(\beta\) in the range [0.1, 1] and compute the mAP score. The results are shown in Fig. 5. Different hyper-parameters clearly yield different performance. Considering the average mAP, we finally chose \(\alpha =0.3\) and \(\beta =1\) as the hyper-parameters for the ADNI2 dataset. Using the same scheme, we obtain optimal hyper-parameter values of \(\alpha =0.1\) and \(\beta =1.0\) on the OASIS3 dataset.
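The sweep described above can be sketched as a simple grid search; `evaluate_map` is a hypothetical stand-in for a full train-and-evaluate run of DCPHA.

```python
import numpy as np

def grid_search(evaluate_map, grid=np.arange(0.1, 1.01, 0.1)):
    """Exhaustive search over (alpha, beta) in [0.1, 1] at step 0.1.
    `evaluate_map(alpha, beta)` is assumed to return the average mAP score."""
    best = (None, None, -np.inf)
    for alpha in grid:
        for beta in grid:
            score = evaluate_map(alpha, beta)
            if score > best[2]:
                best = (alpha, beta, score)
    return best
```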

Figure 5

The mAP scores on ADNI2 with hyper-parameters in the range of [0.1, 1].

Experiments on natural image benchmark datasets

To further measure the fitting and generalization ability of DCPHA, we conduct comparative experiments with 10 advanced cross-modal retrieval methods on the natural image benchmark dataset MIRFLICKR25K. In our experiments, we follow the dataset partition and feature extraction strategies of36, 42. We report the mAP scores of the compared methods on two cross-modal retrieval tasks: (1) retrieving text using image queries (I\(\rightarrow\)T) and (2) retrieving images using text queries (T\(\rightarrow\)I). The results for the “I\(\rightarrow\)T” and “T\(\rightarrow\)I” tasks on MIRFLICKR25K are shown in Table 5. Since our proposed multi-semantic consistency and multi-manifold similarity-preserving constraints are based on the multi-manifold property of multi-modal hash codes, DCPHA achieves a significant performance improvement on this multi-label benchmark dataset.

Table 5 The mAP scores of cross-modal retrieval on MIRFLICKR25K with different lengths of hash codes.

Conclusion and future work

In this paper, we proposed a deep consistency-preserving hash auto-encoders model, called DCPHA, based on the multi-manifold property of hash codes distributed in Hamming space, to address the lack of discriminability among hash codes with the same semantics. Specifically, DCPHA consists of a pair of asymmetric auto-encoders and two semantics-preserving attention branches that work in the feature encoding and hash decoding stages, respectively. In addition, two constraints, namely multi-semantic consistency and multi-manifold similarity-preserving, were embedded in the learning of hash codes. We theoretically demonstrated that the proposed multi-manifold similarity-preserving constraint has manifold-preserving invariance. As the experimental results show, DCPHA obtains state-of-the-art performance on medical multi-modal image datasets (i.e., ADNI2) and a multi-label natural image dataset (i.e., MIRFLICKR25K). In future work, we will build a medical multi-modal database, including diagnostic reports and audio, and construct a multi-modal hash method to accomplish mutual retrieval of data from multiple sources. We will also further explore the impact of multiple views on the generation of hash codes for multi-modal samples.