Cross-modal semantic autoencoder with embedding consensus

Cross-modal retrieval has attracted increasing attention, since multi-modal data are heterogeneous and the similarities between different forms of information are worth studying. Traditional single-modal methods reconstruct the original information and fail to consider the semantic similarity between different types of data. In this work, a cross-modal semantic autoencoder with embedding consensus (CSAEC) is proposed, which maps the original data to a low-dimensional shared space so as to retain semantic information. To exploit the similarity between modalities, an autoencoder is used to associate the feature projections with semantic code vectors. In addition, regularization and sparsity constraints are applied to the low-dimensional matrices to balance the reconstruction errors. The high-dimensional data are transformed into semantic code vectors, and the models are constrained by parameters to achieve denoising. Experiments on four multi-modal data sets show that the query results are improved and effective cross-modal retrieval is achieved. Furthermore, CSAEC can also be applied to computer- and network-related fields such as deep and subspace learning. The model overcomes the obstacles of traditional methods by using deep learning to convert multi-modal data into abstract representations, achieving better accuracy and recognition results.

Parameter settings. The dimensions of the shared space for WIKI, TVGraz, NUS-WIDE, and MIRFLICKR are set to 10, 20, 10, and 40, respectively. The parameters are tuned within the range {0.001, 0.01, 0.1, 1, 10} to analyze the performance of CSAEC. For the other compared methods, parameter values are set according to the corresponding data set. Each data set is randomly divided into parts, one of which serves as the test data, with the rest forming the unlabeled pool for active selection. The random data partition is repeated ten times, and the averaged results are reported as the final model evaluation.
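For concreteness, the following is a minimal sketch of this protocol, assuming a caller-supplied `score_fn(pool_idx, test_idx, a, b)` that trains the model with balance parameters (a, b) and returns its retrieval score on the held-out part; this helper and its signature are our stand-ins, not the paper's code.

```python
# Hedged sketch of the evaluation protocol: ten random partitions and a
# grid search over {0.001, 0.01, 0.1, 1, 10}. `score_fn` is a placeholder
# for training CSAEC and scoring retrieval on the test split.
import itertools
import numpy as np

PARAM_GRID = [0.001, 0.01, 0.1, 1, 10]

def repeated_random_eval(n_samples, score_fn, n_repeats=10,
                         test_ratio=0.25, seed=0):
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)          # fresh random partition
        n_test = int(test_ratio * n_samples)
        test_idx, pool_idx = perm[:n_test], perm[n_test:]
        best = max(score_fn(pool_idx, test_idx, a, b)
                   for a, b in itertools.product(PARAM_GRID, repeat=2))
        results.append(best)
    # Final evaluation: average over the repeated partitions.
    return float(np.mean(results))
```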
Complexity analysis. We assume $n \ge d$. The complexity of a full eigenvalue decomposition is $O(n^3)$. When $n$ is large, iterative algorithms can be used to obtain the results while preserving precision, since only the largest $d$ eigenvalues are required and these may differ across datasets. Clearly, the size of the feature dimension influences the complexity; the overall cost is $O(knd^3)$, where $k$ is the number of iterations.
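As an illustration of the iterative alternative (a sketch, not the paper's implementation), SciPy's Lanczos-based solver recovers only the d largest eigenpairs of a symmetric matrix instead of performing the full decomposition:

```python
# Illustrative sketch: when only the largest d eigenpairs are needed,
# an iterative Lanczos solver avoids the full O(n^3) eigendecomposition.
import numpy as np
from scipy.sparse.linalg import eigsh

n, d = 500, 10
rng = np.random.default_rng(0)
M = rng.standard_normal((n, n))
M = (M + M.T) / 2                        # symmetric matrix

full_vals = np.linalg.eigvalsh(M)        # full decomposition: O(n^3)
top_vals, top_vecs = eigsh(M, k=d, which='LA')  # d largest eigenvalues only

assert np.allclose(np.sort(top_vals), full_vals[-d:])
```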

Mean average precision (MAP) results of different methods. Mean average precision (MAP) is used to evaluate the validity of the retrieval results of different methods. For the precision-recall (PR) curves, $R$ denotes the recall threshold: given the positive examples in the dataset, recall values $r$ are obtained, and for each threshold the interpolated precision is the maximum precision attained at any recall $r > R$. To verify the performance of CSAEC, two directional cross-modal retrieval tasks were performed: image-to-text query and text-to-image query. Two data points are considered relevant if they share the same label.
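The following is a hedged sketch of the standard MAP computation under this label-matching notion of relevance; the helper names are ours, not the paper's.

```python
# Standard MAP for cross-modal retrieval: an item is relevant when the
# query and the retrieved item share a label.
import numpy as np

def average_precision(query_label, ranked_labels):
    """AP of one query over a ranked result list."""
    relevant = (np.asarray(ranked_labels) == query_label)
    if not relevant.any():
        return 0.0
    # Precision at each rank where a relevant item occurs, then averaged.
    cum_hits = np.cumsum(relevant)
    ranks = np.arange(1, len(ranked_labels) + 1)
    return float((cum_hits[relevant] / ranks[relevant]).mean())

def mean_average_precision(query_labels, similarity, gallery_labels):
    """MAP over all queries; `similarity` is (n_queries, n_gallery)."""
    order = np.argsort(-similarity, axis=1)          # best match first
    return float(np.mean([
        average_precision(q, np.asarray(gallery_labels)[order[i]])
        for i, q in enumerate(query_labels)]))
```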
The methods are first compared on the WIKI dataset. As Table 1 shows, the proposed CSAEC method performs significantly better than the compared methods.
Average ranks provide a valuable comparison between algorithms. Let $r_i^j$ denote the rank of the $j$-th of $k$ algorithms on the $i$-th of $N$ datasets. The average rank of algorithm $j$ is then $R_j = \frac{1}{N}\sum_i r_i^j$. We establish the null hypothesis that all algorithms behave equivalently, which states that the ranks $R_j$ should be equal. The Friedman test checks whether the measured average ranks deviate significantly from the mean rank expected under this null hypothesis. The Friedman statistic is

$$\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_j R_j^2 - \frac{k(k+1)^2}{4}\right],$$

from which the statistic $F_F$ is derived. With four data sets and six algorithms, $F_F$ is distributed according to the F-distribution with $(6-1)=5$ and $(6-1)(4-1)=15$ degrees of freedom; here $F_F = \frac{3 \times 11.392}{4 \times 6 - 11.392} = 2.71$. The p-value computed from the $F(5,15)$ distribution shows that the null hypothesis can be rejected at a high level of significance. The reason may be that CSAEC uses the embedding matrix while preserving both the original features and the semantic information. Semantic information captures the interaction between modalities as well as the information within each modality, while the original feature information accounts for the similarity between modalities.
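As a hedged sketch, the test can be computed as follows with the Iman-Davenport correction in its standard form (Demšar, 2006); the rank matrix below is a placeholder, not the paper's values.

```python
# Friedman test over a (N datasets x k algorithms) rank matrix, with the
# standard Iman-Davenport F statistic and its p-value.
import numpy as np
from scipy.stats import f as f_dist

def friedman_test(ranks):
    """`ranks` has shape (N datasets, k algorithms)."""
    N, k = ranks.shape
    R = ranks.mean(axis=0)                          # average rank per algorithm
    chi2 = 12 * N / (k * (k + 1)) * (np.sum(R**2) - k * (k + 1)**2 / 4)
    F_F = (N - 1) * chi2 / (N * (k - 1) - chi2)     # Iman-Davenport statistic
    df1, df2 = k - 1, (k - 1) * (N - 1)             # here: 5 and 15
    p = 1 - f_dist.cdf(F_F, df1, df2)
    return F_F, p
```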
As can be seen from Table 2, CSAEC also achieves the best results for both retrieval tasks on the TVGraz dataset; the improvement is larger for image-to-text queries than for text-to-image queries. Compared with the other methods, the query results are clearly improved. Table 3 shows the MAP of each method on the NUS-WIDE dataset. The LGCFL and CSAEC methods perform better than CCA because both consider semantic information. The NUS-WIDE dataset is larger than the WIKI and MIRFLICKR datasets, so the semantic information involves more interaction on NUS-WIDE, and similar information between different modalities can be exploited as much as possible.
On the MIRFLICKR dataset, Table 4 shows that the MAP of CSAEC is better than that of the other methods, with JFSSL second best. CSAEC retains both the original features and the semantic information and learns the feature code vectors of the semantic tag space. This shows that CSAEC and JFSSL are effective for querying labeled spatial information.
Precision-Recall (PR) curves of different methods. As Fig. 1 shows, for the image-to-text query task the overall query performance of CSAEC exceeds that of almost all other methods. On the MIRFLICKR dataset, the minimum precision of every method is comparatively high; on the NUS-WIDE dataset, the performance advantage of CSAEC is more pronounced. Overall, CSAEC improves the performance of image-to-text queries. For text-to-image queries, CSAEC achieves a higher recall rate than the other methods on all four benchmark data sets.
Parameter sensitivity. In Fig. 2, we analyze the impact of the parameters. On the WIKI and NUS-WIDE datasets, the two parameter values are varied within the range {0.001, 0.01, 0.1, 1, 10}, and the resulting changes are shown in Fig. 2. The performance of CSAEC varies as the parameters change, and its query performance is more sensitive to the parameter settings than that of the other methods. When the parameters lie between 0.001 and 1, the method obtains better results.
Loss analysis. Figure 3 shows the convergence curve of the loss for the proposed method. We run CSAEC for 10 iterations on all datasets. On WIKI and NUS-WIDE, the loss value keeps decreasing as the number of iterations increases; after only a few iterations, the loss is already reduced and stabilized, so the method is considered convergent.

Discussion
Research on cross-modal retrieval technology has attracted much attention and is beginning to be put into practice. A major challenge is the semantic gap between the low-level features and the high-level semantics of multi-modal datasets; bridging it is the key factor limiting retrieval accuracy and quality. Researchers have constructed similarity constraints from category labels, but such methods are limited, so studying the correlations among multi-modal data is urgent. Semantic information is essential knowledge to be retained during querying. Different forms of data have different feature spaces but share the same semantic space, and data with the same semantics are related across their various forms. Semantic information can be used not only to indicate the degree of association between modalities, but also to indicate the connections within each modality.
In this work, an effective cross-modal retrieval method, CSAEC, is proposed. By enforcing an embedding-mapping consensus on the multi-modal data, a semantic code vector is obtained while retaining both the original feature information and the semantic information. The paired encoder-decoders are linearly symmetric, mapping the feature projections back to the original data and minimizing the reconstruction error. Parameters with regularization and sparsity constraints are introduced into the objective function. Experiments show that the autoencoder effectively completes the query task and improves retrieval performance.
Cross-modal retrieval technology draws on basic knowledge of mathematics and statistics to meet the needs of applications. CSAEC can also be applied to computer- and network-related fields such as deep and subspace learning, and it can play a significant role in recognition and analysis. As a next step, characteristics of the human body, such as facial expressions and body movements, could be fed to deep neural network models to learn features from multiple modalities simultaneously. Datasets can be unified into the same feature space as a semantic representation through multiple nonlinear transformations, and CSAEC can recover more of the similarities between image and text information for feature extraction. The model takes into account the different modalities and the importance of the tasks for machine learning. It overcomes the obstacles of traditional methods by using deep learning to convert multi-modal data into abstract representations, achieving better accuracy and recognition results.

Methods
Related work. Cross-modal similarity learning has attracted great attention in the academic community.
However, the heterogeneity of the data and the existence of semantic differences make this problem challenging. At present, the two most common measurement approaches are maximizing correlation and minimizing Euclidean distance 25 . Typical correlation-maximizing methods are CCA 23 and its improved variants, which learn a latent space that maximizes the correlation between the projected features of the two modalities. Reference 26 used CCA to obtain a shared latent space for corresponding 2D and 3D facial images. PLS and BLM are methods that minimize Euclidean distance. Sharma and Jacobs 27 used PLS for heterogeneous face recognition across poses, between high- and low-resolution facial images, and between photos and sketches. Bilinear models (BLM) have been used for cross-media retrieval and heterogeneous face recognition 2 .

An autoencoder is an unsupervised neural network model. It learns hidden features of the input data, which is called encoding, and then reconstructs the original input from the learned features, which is called decoding. Autoencoders 28 are trained to learn latent representations of a set of data by copying the input information to the output, so the underlying representation is a valid attribute. Several variants of the autoencoder have been proposed. Reference 15 correlated the latent representations of two single-modal autoencoders. Kodirov et al. 16 learned semantic code vectors in a latent space. Lange et al. 29 combined the training of deep autoencoders (for learning compact feature spaces) with reinforcement learning (RL) algorithms (for learning strategies). Tara et al. 30 applied the AE-BN model to the training set. The traditional autoencoder merely seeks latent representations that reconstruct the original data; our method additionally enforces similarity with the semantic code vectors.

Inspired by this related work, we improve the existing methods and construct a cross-modal semantic autoencoder with embedding consensus (CSAEC); the process is shown in Fig. 4. The paired image-text data are uniformly mapped into a low-dimensional embedding space, the manifold structure is retained, and the original information is converted into corresponding semantic code vectors, with the consensus matrix and the semantic code matrix updated continuously. Further, image and text projection matrices are learned: the encoders associate them with the corresponding semantic codes, and the decoders reproject back to the high-dimensional data. In addition, regularization and sparsity constraints are imposed on the decoders, and balance parameters are used in reconstructing the original features. As a result, the method retrieves multi-modal information effectively.

Mapping consensus mainly deals with the problem of disagreements among multiple mappings. Because representations of the same data point can be mapped into the latent embedding space through different mapping processes, mapping conflicts may occur: the data point is unique, yet the mapping results differ. The aim of mapping consensus is to preserve the validity of the mappings while avoiding such conflicts.
Embedding consensus. Consider a fixed object $(v_i, t_i)$ $(i = 1, 2, \ldots, N)$: for each value of $d$ we require $U_i = \phi_d(v_i, t_i)$, where $U_i$ is the definite representation of this point in the latent embedding space and $\phi_d$ is the latent embedding mapping for the $d$-th dimension. The embedding consensus matrix unifies the mapping results of each image-text pair and is then used to learn the semantic code vectors. Manifold dimension reduction preserves the local geometry of the original data points. To prevent the results from being affected by noisy data, the parameter $\gamma_i^d$ is introduced; summing over all $d = 1, 2, \ldots, D$, the weights $\gamma_i^d$ can be collected into $\mathrm{diag}(\gamma_i)$. We thus obtain $W$, a low-dimensional embedding matrix that retains the manifold structure of the original information.

Figure 4. The process of CSAEC. We map the datasets to an embedding space, learn projections with the multi-modal semantic autoencoder, and reconstruct the original features. $(V\;T)$ is the original data matrix, $U_i$ is a low-dimensional consensus vector of the embedding consensus $\phi_d$, $W$ is a low-dimensional embedding matrix, and $C$ is the corresponding semantic code. Two encoders $P_v$, $P_t$ project the image and text data into the low-dimensional space $A$, and two decoders reproject $A$ back to the high-dimensional data.

The data are transformed into the corresponding semantic code vectors through the embedding consensus matrix. To eliminate the influence of noise, $\gamma_i^d$ tends to 0 when the mapping result of a data pair $(v_i, t_i)$ is abnormal. The corresponding features are extracted from the original image and text information, and $W_i$ can be written as $W_i = W E_i$, where $E_i = \big(e_i^T, \ldots, e_{N+(i-1)D+1}^T, \ldots, e_{N+iD}^T\big)$ is the feature matrix.
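The following toy sketch illustrates our reading of the embedding-consensus idea, not the paper's exact algorithm: several per-dimension mappings propose candidate coordinates for each point, and weights $\gamma$ down-weight outlying proposals so that each point receives a single consensus representation $U_i$.

```python
# Toy consensus embedding: D candidate mappings propose coordinates for
# each point; gamma weights suppress noisy proposals (gamma -> 0 for
# abnormal mappings), yielding one consensus vector per point.
import numpy as np

def consensus_embedding(candidates, n_iter=5):
    """candidates: (D mappings, N points, d dims) -> consensus (N, d)."""
    D, N, d = candidates.shape
    U = candidates.mean(axis=0)                         # initial consensus
    for _ in range(n_iter):
        # Residual of each mapping's proposal against the consensus.
        resid = np.linalg.norm(candidates - U, axis=2)  # shape (D, N)
        gamma = 1.0 / (resid + 1e-8)                    # noisy proposal -> small weight
        gamma /= gamma.sum(axis=0, keepdims=True)
        U = np.einsum('dn,dnk->nk', gamma, candidates)  # weighted consensus
    return U, gamma
```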

Summing the N components of the images and texts in each dimension, we denote the result as follows, where $H$ is the correlation matrix between the mapping points and the original data points, and $D$ is the diagonal matrix. Using the matrix $C$, the image and text information can be converted into the corresponding semantic codes.
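Since the objective below involves graph trace terms of the form $\mathrm{tr}(P^T X L X^T P)$, the following sketch shows the usual construction of the diagonal matrix $D$ and the graph Laplacian $L = D - H$ from an affinity/correlation matrix $H$, assumed here to be symmetric and nonnegative (an illustration, not the paper's code).

```python
# Building the diagonal degree matrix D and the graph Laplacian L = D - H
# from an affinity matrix H between data points.
import numpy as np

def graph_laplacian(H):
    """H: symmetric nonnegative affinity matrix between data points."""
    D = np.diag(H.sum(axis=1))   # diagonal degree matrix
    return D - H

H = np.array([[0., .8, .1],
              [.8, 0., .5],
              [.1, .5, 0.]])
L = graph_laplacian(H)
assert np.allclose(L.sum(axis=1), 0)   # rows of a Laplacian sum to zero
```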

Combining the terms above, the final expression of the overall objective is as follows.
The variables in the objective function are interdependent, so each variable is solved in turn with an alternating iterative update method.
First, fix $C$, $U$ and update $W$. Since $W = (V\;T)\,\phi^T$, the objective function can be transformed into a problem in the single variable $\phi$, and the update follows by setting the partial derivative with respect to $\phi^T$ to zero (a generic sketch of this step is given below).
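The exact objective for this step could not be recovered from the source; as a hedged generic reconstruction, assuming the $\phi$-subproblem is a plain least-squares fit of $W = X\phi^T$ with $X = (V\;T)$ and $C$, $U$ fixed, the derivative condition gives a closed-form update:

$$\min_{\phi}\ \bigl\lVert W - X\phi^{T}\bigr\rVert_F^2 \;\Longrightarrow\; \frac{\partial}{\partial \phi^{T}}\bigl\lVert W - X\phi^{T}\bigr\rVert_F^2 = -2X^{T}\bigl(W - X\phi^{T}\bigr) = 0 \;\Longrightarrow\; \phi^{T} = \bigl(X^{T}X\bigr)^{-1}X^{T}W .$$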

Second, fix $W$, $U$ and update $C$; the expression becomes
The solution of $C$ follows reference 31. Third, fix $W$, $C$ and update $U$. With $\phi = \mathrm{diag}(\phi_1, \ldots, \phi_D)$ and $C_i = -e_{D+1}^T I_{D+1}\,\mathrm{diag}(\gamma_i)\,(e_{D+1} I_{D+1})$, the update is transformed into a problem in the single variable $U$, which is solved by setting the partial derivative with respect to $U$ to zero.

Cross-modal semantic autoencoder. By mapping the images and texts into the embedding consensus space, CSAEC can retain enough of the raw data information. $V \in \mathbb{R}^{d_v \times n}$ and $T \in \mathbb{R}^{d_t \times n}$ denote the visual and textual feature matrices, respectively, where $d_v$ and $d_t$ are the visual and textual feature dimensionalities. The goal is to learn the projection matrices $P_v \in \mathbb{R}^{d \times d_v}$ and $P_t \in \mathbb{R}^{d \times d_t}$ separately: the encoder connects the image and text projections with the semantic code vector $C$, and the decoder is constrained so that the code vector can reconstruct the original image and text features. The encoder and decoder are linearly symmetric. The two encoders $P_v$, $P_t$ project the image and text data into the low-dimensional space $A$, and the two decoders reproject $A$ back to the high-dimensional data. The hidden layer contains both image and text information.
For the image data, the embedding form of the autoencoder is used to represent the information of the original features. The image-text paired representations should be unified, since in the retrieval stage, once the query information is given, results are ranked according to their similarity to the query. We therefore obtain the following, where $A \in \mathbb{R}^{d \times n}$ represents the $n$ training pairs in a $d$-dimensional hidden space. The additional reconstruction task imposes a new constraint on the learning of the projection function, so that the projection must preserve all the information contained in the original textual features. For the image modality, we likewise adopt an autoencoder so that the embeddings contain information from the original visual features, and we expect the representations of image-text pairs in the hidden space to coincide. This form is a binding linear autoencoder 18 with only one hidden layer.
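Coupled linear encoder-decoders of this kind admit a closed-form solution in related formulations. The following is a hedged sketch in the style of the semantic autoencoder of Kodirov et al. (reference 16), not necessarily the exact solver used here: the relaxed objective $\min_P \|X - P^T C\|_F^2 + \lambda \|PX - C\|_F^2$ reduces to a Sylvester equation.

```python
# One linearly symmetric encoder-decoder pair: encoder P maps features X
# to codes C, decoder P^T maps codes back. Zeroing the gradient of
#   ||X - P^T C||^2 + lam ||P X - C||^2
# gives the Sylvester equation  (C C^T) P + P (lam X X^T) = (1 + lam) C X^T.
import numpy as np
from scipy.linalg import solve_sylvester

def fit_linear_autoencoder(X, C, lam=0.1):
    """X: (d_x, n) features; C: (d_c, n) semantic codes; returns P (d_c, d_x)."""
    A = C @ C.T                       # (d_c, d_c)
    B = lam * (X @ X.T)               # (d_x, d_x)
    Q = (1 + lam) * (C @ X.T)         # (d_c, d_x)
    return solve_sylvester(A, B, Q)   # solves A P + P B = Q

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))    # e.g., visual features V
C = rng.standard_normal((10, 200))    # semantic code vectors
P_v = fit_linear_autoencoder(X, C)
print(np.linalg.norm(P_v @ X - C))    # encoding error
```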
For the text data, to ensure that the low-dimensional representation can restore the original information, each data point $v_i$ $(i = 1, 2, \ldots, N)$ is approximated as a linear combination of all the other samples. Based on the mapping consensus proposed above, the datasets are mapped by $\phi_d$, so that the feature matrices satisfy $V \in \mathbb{R}^{d \times n}$, $T \in \mathbb{R}^{d \times n}$ and $P_v \in \mathbb{R}^{d \times d}$, $P_t \in \mathbb{R}^{d \times d}$. By imposing sparsity on the matrix $A$ and on the projection matrix $P_v$ during reconstruction, the optimal sparse combination matrix $A$ and projection matrix $P_v$ can be obtained by solving the corresponding problem, where $a_i$ is the $i$-th column vector of $A$. As in manifold learning methods, $P_v V$ should satisfy the orthogonality constraint. Through the sparsity constraint, the information captured by $A$ can be used to search for relevant features and to eliminate the effect of noisy features, which yields the objective function for structure learning.

From the expressions above, a multi-modal autoencoder is obtained. We also ensure that the hidden layer contains enough semantic information: the hidden representation of the data is associated with the semantic code vector $C$. Considering the similarity between different modalities, tag information is used to regularize the latent representation of the autoencoder. Reference 18 minimized the objective by summing the low-dimensional information of the visual and textual datasets, relaxing the constraints and rewriting the objective of the multi-modal autoencoder, which improved the results. In the retrieval phase, when a query is given, documents are ranked according to their similarity to the query. To guarantee that the projected images and texts contain both semantic information and original feature information, we propose an improved autoencoder: a regularization sparsity constraint on the low-dimensional matrix $A$ is added to obtain the final objective function, where $\beta$ is the weight parameter balancing the two types of data information and a further parameter determines the importance of the semantic code vector.
We also use alternating iterative updates to solve the objective function. First, fix $A$ and update $P_v$, $P_t$; the solutions for the two projection matrices are analogous. For the text modality the subproblem is

$$\min_{P_t}\;\|P_t T - A\|_F^2 + \beta\,\mathrm{tr}\!\left(P_t^T T L_A T^T P_t\right) \quad \text{s.t.}\;\; P_t^T P_t = I,\;\; P_t^T T L_A T^T P_t = I,\;\; A_{ii} = 0, \tag{2}$$

and the image subproblem has the same form with $T$, $P_t$ replaced by $V$, $P_v$, subject to $P_v^T P_v = I$, $P_v^T V L_A V^T P_v = I$, $A_{ii} = 0$. When $P_v$, $P_t$ are fixed and $A$ is updated, the gradient of the objective contains the term $2\beta\left(V^T P_v P_v^T V + V^T P_v P_v^T V A + T^T P_t P_t^T T + T^T P_t P_t^T T A\right)$.
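A simplified stand-in for this alternating scheme is sketched below; the exact updates could not be fully recovered from the source, so the projection step here uses an SVD-based orthonormal alignment and the $A$ step uses soft-thresholding for the sparsity constraint, both assumptions of ours rather than the paper's derivation (it assumes $d \le \min(d_v, d_t)$).

```python
# Alternating scheme (illustrative): fix A and refit row-orthonormal
# projections P_v, P_t; then fix the projections and refit a sparse A.
import numpy as np

def orthonormal_update(X, A):
    """Row-orthonormal P maximizing the alignment tr(P X A^T)."""
    U, _, Vt = np.linalg.svd(X @ A.T, full_matrices=False)  # X A^T = U S Vt
    return Vt.T @ U.T                                       # (d, d_x), P P^T = I

def soft_threshold(M, tau):
    """Proximal step for the l1 sparsity penalty."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def alternate(V, T, d=10, alpha=0.01, n_iter=20, seed=0):
    """V: (d_v, n) image features; T: (d_t, n) text features."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, V.shape[1]))
    for _ in range(n_iter):
        P_v = orthonormal_update(V, A)        # fix A, update P_v
        P_t = orthonormal_update(T, A)        # fix A, update P_t
        A = soft_threshold((P_v @ V + P_t @ T) / 2, alpha)  # fix P, refit A
    return P_v, P_t, A

rng = np.random.default_rng(1)
P_v, P_t, A = alternate(rng.standard_normal((128, 300)),
                        rng.standard_normal((64, 300)))
```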