Multi-view clustering for multi-omics data using unified embedding

In real-world applications, data sets often comprise multiple views, which provide consensus and complementary information about each other. Embedding learning is an effective strategy for nearest-neighbour search and dimensionality reduction in large data sets. This paper attempts to learn a unified probability distribution of the points across different views and generates a unified embedding in a low-dimensional space that optimally preserves neighbourhood identity. The probability distributions generated for each point under each view are combined by the conflation method to create a single unified distribution. The goal is to approximate this unified distribution as closely as possible when a similar operation is performed in the embedded space. As the cost function, the sum of the Kullback-Leibler divergences over the samples is used, which leads to a simple gradient for adjusting the positions of the samples in the embedded space. The proposed methodology can generate embeddings from both complete and incomplete multi-view data sets. Finally, a multi-objective clustering technique (AMOSA) is applied to group the samples in the embedded space. The proposed methodology, Multi-view Neighbourhood Embedding (MvNE), shows an improvement of approximately 2-3% over state-of-the-art models when evaluated on 10 omics data sets.

Two variants of the Latent Multi-View Subspace Clustering (LMSC) algorithm have been proposed: linear LMSC (lLMSC), which assumes a linear correlation between each view and the latent representation, and generalized LMSC (gLMSC), which uses neural networks to capture the generalized relationships between the views.
Clustering has many important applications in the fields of biology and medicine 35. The rapid development of high-throughput technology has made a large number of different omics data available for studying biological problems 36. It is now possible to collect multiple omics data for the same person, and by examining the different omics data, it is possible to distinguish reliably between different categories of cancer 37. Integration of multiple omics data leads to a better understanding of the underlying molecular mechanisms in complex biological processes, and therefore offers more sophisticated ways to address biological or medical questions 38,39. Thus, compared to single data types, multi-omics methods achieve better performance. So far, many approaches for data integration have been suggested in the literature. These data integration methods mainly depend on two strategies: (i) space projection methods 40, and (ii) metric (similarity measure) fusion techniques 41.
Nevertheless, these methods follow very different approaches to obtain the patterns of samples or genes from multiple data domains. Earlier methods utilized highly correlated samples in the data sets to identify multidimensional genomic modules 42,43,44. However, these "co-modules" can only detect the sample structures across data types and may lead to biased clustering 45. Shen et al. proposed iCluster, which obtains the cluster structures from multi-omics data by using a joint latent variable model. Mo et al. developed iClusterPlus 46, an extension of iCluster, which uses linear regression models to learn different properties of omics data. However, the major drawback of this method is that it makes some strong assumptions, which may cause it to miss meaningful biological information. SNF (similarity network fusion) 41 resolves such problems as an almost assumption-free and rapid approach; it uses a local structure preservation method to modify the sample similarity networks for each data type. However, SNF can only characterize pair-wise similarity (e.g., Euclidean distance) between the samples, and is sensitive to local data noise and outliers. Further, pair-wise similarity cannot capture the true underlying structure in different subspaces, leading to inaccurate clustering. Nguyen et al. proposed PINS 47 to identify clusters that are stable in response to repeated perturbation of the data; it integrates clusters by examining their connectivity matrices for the different omics data. Mitra et al. 48 proposed an ensemble-based multi-objective multi-view algorithm for classifying patient data, but this method is computationally very expensive. One drawback common to all these algorithms is that they treat all omics equally, which may not be biologically appropriate. As a result, the clusters discovered are often poorly associated with patient outcomes. Thus, there is a need for more effective integration approaches.
For patient stratification, multiple omics data can unfold a more precise structure in the samples than can be disclosed using a single omic data type. Combined information from multiple omics improves the performance of the clustering algorithm. Some of the advantages of multi-omics clustering are as follows: (i) it reduces the effect of noise in the data; (ii) each omic can reveal structures that are not present in the other omics; (iii) different omics can unfold different cellular aspects.
Motivated by the above requirements, in this paper we propose a probabilistic approach to map high-dimensional multi-omics data to a low-dimensional unified embedding that preserves neighbourhood identity across the views. It is meaningful to obtain an integrated heterogeneous feature set in a probabilistic model because different properties of the data, such as the variance and mean, can be combined effectively in a probability space. Under each view in the higher-dimensional space, a Gaussian is centred on every sample, and the densities under this Gaussian are used to generate a probability distribution over all the potential neighbours of the sample. The different probability distributions of each sample across the views are combined by conflation 49. The aim is to approximate this unified distribution as closely as possible when a similar operation is carried out in the embedded domain. Intuitively, a probabilistic embedding framework is a more principled approach because it circumvents the problems of different representations and incomparable scales. Further, we apply multi-objective clustering to the obtained embedded data sets; the main advantage of this technique is that it can extract differently shaped clusters present in a data set. A general overview of the proposed methodology is shown in Fig. 1.
The proposed model MvNE (Multi-view Neighbourhood Embedding) is evaluated on 10 cancer data sets and results are compared with state-of-the-art methods.
Some of the benefits of the proposed methodology are as follows:
1. MvNE combines the views in the probability space, which preserves various statistical properties of the individual views.
2. The conflation of normal distributions coincides with the classical weighted least squares method, hence yielding best linear unbiased and maximum likelihood estimators. This provides a weighted combination of the views, an important criterion for view combination, and thus removes the overhead of finding optimal weights for each view.
3. The proposed methodology is extended to handle data sets having incomplete views.
4. To the best of our knowledge, the current work is the first attempt at combining multiple omics data in the probability space in the biomedical domain.
5. To the best of our knowledge, the conflation method has never before been used in the literature for combining multiple views.

Methods
In this section, we describe the process of generating a unified embedding to handle multi-view data sets.

Problem statement. Given a data set X = {x_1, x_2, . . . , x_n} with n samples, we use X^v ∈ R^{d_v×n} (v = 1, 2, . . . , m) to represent the v-th view of the data set with d_v feature dimensions. The task is to obtain an embedding in the lower dimension, Y = {y_1, y_2, . . . , y_n} ∈ R^{d_emb×n}, by unifying all m views and categorizing the samples into C classes. Here, Y is optimized so that the sum of the Kullback-Leibler divergences between the two distributions (computed from the higher-dimensional X and the lower-dimensional Y) is minimized.

Scientific Reports | (2020) 10:13654 | https://doi.org/10.1038/s41598-020-70229-1 | www.nature.com/scientificreports

Conflation of probability.
Conflation is defined as the distribution determined by the normalized product of the probability density or probability mass functions 49. It can be calculated easily and minimizes the maximum loss of Shannon information when combining several independent distributions into a single distribution. The conflation of normal distributions produces the classical weighted mean squares and the maximum likelihood estimators for normally distributed unbiased samples 49. The traditional methods of combining probability distributions, viz., averaging the probabilities and averaging the data, are as follows:

Averaging the probabilities. One common method of combining probabilities is to average them over every set of values, P(X) = (P_1(X) + P_2(X))/2. This method has significant disadvantages. Firstly, the mean of the combined distribution is always exactly the average of the means, independent of the relative accuracy or variance of each input; it is unreasonable to weight the two distributions equally. Secondly, it generates a multi-modal distribution, whereas the desired output distribution should have the same form as the input data: normal, or at least unimodal.
Averaging the data. Another common method, which does preserve normality, is to average the data: here P is calculated on (X_1 + X_2)/2. Again, the main disadvantage of this method is that the obtained distribution has a mean exactly equal to the average of the means of the input distributions, irrespective of the relative accuracies or variances of the two inputs (shown in Fig. 2). The variance of P is never larger than the maximum variance of P_1 and P_2. The conflation of probabilities (denoted by the symbol "&") is a method for consolidating uniformly weighted data.
If P_1 and P_2 have probability mass functions f_1 and f_2, respectively, then the conflation P_1 & P_2 has the probability mass function

f(x) = f_1(x) f_2(x) / Σ_y f_1(y) f_2(y)

In Fig. 2, we show the comparison between conflation, averaging the probabilities, and averaging the data. Initially, it may seem counter-intuitive that conflation of the two distributions produces a much narrower curve. However, if the two measurements obtained from different sources are assumed equally valid, then the overlap region between the two distributions contains the real value with relatively high probability.
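Since the conflation of two normal distributions is again normal, with a precision-weighted mean and a harmonically combined variance, the contrast with plain averaging can be sketched in a few lines (a minimal illustration; the means and variances below are arbitrary):

```python
# Sketch: conflation of two normal distributions vs. naive averaging.
# The conflation of N(mu1, var1) and N(mu2, var2) is again normal, with
# precision-weighted mean and variance 1 / (1/var1 + 1/var2).

def conflate_normals(mu1, var1, mu2, var2):
    """Return (mean, variance) of the conflation N(mu1,var1) & N(mu2,var2)."""
    prec = 1.0 / var1 + 1.0 / var2           # precisions add
    mean = (mu1 / var1 + mu2 / var2) / prec  # precision-weighted mean
    return mean, 1.0 / prec

# A precise measurement (small variance) and a noisy one (large variance).
mean, var = conflate_normals(0.0, 1.0, 4.0, 9.0)

# The conflated mean is pulled toward the more accurate input (mean 0.0),
# unlike plain averaging, which would always give (0 + 4) / 2 = 2.0.
print(round(mean, 6))  # 0.4
print(round(var, 6))   # 0.9
```

Note that the conflated variance (0.9) is smaller than either input variance, which matches the "narrower curve" observation above.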
Generation of initial unified data set by combining all views. Initially, we generate a unified data set, X ∈ R^{d×n}, by concatenating the views, X = [X^1; X^2; . . . ; X^m], where d = Σ_v d_v. For the points not appearing in all the views, we replace the missing features with zeros.
After obtaining X, we use a stacked autoencoder (SAE) to obtain a unified representation of the data set, Y_init ∈ R^{d_emb×n}. Here, d_emb represents the feature dimension in the embedded domain. Recent research has shown that SAEs consistently produce well-separated and semantically meaningful representations on real-world data 50. The SAE comprises an encoder part and a decoder part. Suppose we have a five-layer architecture: an input layer, three hidden layers, and an output layer. In Fig. 3, the "feature" layer is the bottleneck, i.e., the output of this layer is the required embedding. In the input layer, we provide the original sample vector as input. For example, suppose we provide a vector of size 200 as input and want the embedding to be of size 30. Then the input and output layers will have size 200 and the bottleneck layer will have size 30. The input vector is first squeezed to size 30, and the 200-sized vector is then reconstructed from it. The reconstructed vector, i.e., the value of the output layer, should be similar to the input vector. For this, we use the mean squared error between the input and output vectors as the loss function. Once the network is trained, the decoder part is discarded and only the encoder part is used to generate the embedding. The details of the network are explained below.
Each layer of SAE is a denoising autoencoder, trained to reconstruct the output of the previous layer after random corruption 50 .
The denoising autoencoder is defined as follows:

x̃ = Dropout(x)
h = g_1(W_1 x̃ + b_1)
h̃ = Dropout(h)
y = g_2(W_2 h̃ + b_2)

Here, Dropout(.) randomly sets a part of the input dimensions to 0, and g_1(.) and g_2(.) are the activation functions for the encoder and decoder, respectively. For training the network, the least-squares loss ||x − y||^2_2 is minimized. As the activation function, we use rectified linear units (ReLUs) 51 for every encoder/decoder pair.
After training, all the encoder and decoder layers are concatenated together to form a deep autoencoder: a multilayer network with a bottleneck coding layer in the middle. The schematic of the deep autoencoder is shown in Fig. 3. The reconstructed network is then fine-tuned to minimize the reconstruction loss. We discard the decoder layers and use the encoder to generate the initial embedding, Y_init.

Figure 2. Example of the conflation technique. The red curves are the two independent distributions, the yellow curve is the probability distribution obtained by averaging the probabilities, the blue curve is the probability distribution obtained by averaging the data, and the green curve denotes the distribution obtained by the conflation technique.
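The encode-reconstruct-backpropagate loop of an autoencoder can be sketched with a toy single-bottleneck linear network (a minimal illustration only; the paper's SAE stacks denoising layers with ReLU activations and dropout, and the toy data below are made up):

```python
# Toy autoencoder: squeeze 2-d inputs through a 1-d bottleneck, reconstruct,
# and minimise the squared error by plain per-sample gradient descent.

def train_autoencoder(data, lr=0.01, epochs=200):
    w_enc = [0.5, 0.5]   # encoder weights: 2-d input -> 1-d bottleneck
    w_dec = [0.5, 0.5]   # decoder weights: 1-d bottleneck -> 2-d output
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            h = w_enc[0] * x[0] + w_enc[1] * x[1]   # encode
            y = [w_dec[0] * h, w_dec[1] * h]        # decode (reconstruct)
            total += (y[0] - x[0]) ** 2 + (y[1] - x[1]) ** 2
            # Backpropagate the squared-error loss.
            dy = [2 * (y[0] - x[0]), 2 * (y[1] - x[1])]
            dh = dy[0] * w_dec[0] + dy[1] * w_dec[1]
            w_dec[0] -= lr * dy[0] * h
            w_dec[1] -= lr * dy[1] * h
            w_enc[0] -= lr * dh * x[0]
            w_enc[1] -= lr * dh * x[1]
        losses.append(total)
    return losses

# Points near the line y = 2x: a 1-d bottleneck suffices to reconstruct them.
data = [(t / 10.0, 2 * t / 10.0) for t in range(-10, 11)]
losses = train_autoencoder(data)
print(losses[0] > losses[-1])  # True: reconstruction error shrinks
```

After training, only the encoder half (`w_enc` here) would be kept to produce the embedding, mirroring the procedure described above.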

Generation of unified probability distribution.
For each sample point i and its potential neighbour j in view v, the symmetric probability p^v_ij that i selects j as its neighbour is given by Eq. 6:

p^v_ij = (p^v_{j|i} + p^v_{i|j}) / 2n    (6)

Here,

p^v_{j|i} = exp(−d^v_ij) / Σ_{k≠i} exp(−d^v_ik)    (7)

The dissimilarity d^v_ij is computed by Eq. 8; it is the scaled squared Euclidean distance between the high-dimensional samples x_i and x_j in view v:

d^v_ij = ||x^v_i − x^v_j||^2 / 2(σ^v_i)^2    (8)
Here, σ^v_i is chosen such that the entropy of the distribution over neighbours equals log k 52, where k is the effective number of nearest neighbours.
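The choice of σ^v_i by matching the entropy of the neighbour distribution to log k can be sketched with a binary search (a minimal illustration; the distance values are made up and `calibrate_sigma` is a hypothetical helper name):

```python
import math

# Sketch: pick sigma so that the entropy of the Gaussian neighbour
# distribution equals log(k). `dists` holds squared Euclidean distances
# from sample i to every other point.

def neighbour_probs(dists, sigma):
    """Conditional probabilities p_{j|i} under a Gaussian of width sigma."""
    w = [math.exp(-d / (2.0 * sigma ** 2)) for d in dists]
    s = sum(w)
    return [x / s for x in w]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def calibrate_sigma(dists, k, tol=1e-5, iters=100):
    """Binary-search sigma so that entropy(p) == log(k)."""
    target = math.log(k)
    lo, hi = 1e-10, 1e10
    for _ in range(iters):
        sigma = (lo + hi) / 2.0
        h = entropy(neighbour_probs(dists, sigma))
        if abs(h - target) < tol:
            break
        if h > target:   # distribution too flat -> shrink sigma
            hi = sigma
        else:            # distribution too peaked -> widen sigma
            lo = sigma
    return sigma

dists = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]   # squared distances to 6 neighbours
sigma = calibrate_sigma(dists, k=3)
h = entropy(neighbour_probs(dists, sigma))
print(abs(h - math.log(3)) < 1e-4)  # True: entropy matches log k
```

The search works because the entropy increases monotonically with sigma, from 0 (one dominant neighbour) up to log of the number of candidates (uniform).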
After obtaining the Gaussian distribution of each sample point in each view, the combined probability for each sample pair is generated by the conflation method 49, shown in Eq. 9:

p_ij = Π_{v=1..m} p^v_ij / Σ_{k≠l} Π_{v=1..m} p^v_kl    (9)
By the basic properties of conflation 53, the obtained unified probability p_ij is the weighted-squared-mean of the p^v_ij and is normal. The obtained P is further symmetrized by Eq. 10.
In the embedded dimension, the induced probability q_ij that the i-th point will pick the j-th point as its neighbour is calculated by a Student t-distribution with one degree of freedom 54, given in Eq. 11:

q_ij = (1 + ||y_i − y_j||^2)^{−1} / Σ_{k≠l} (1 + ||y_k − y_l||^2)^{−1}    (11)

The Student t-distribution is used instead of a Gaussian because the density of a point can be evaluated much faster under the Student t-distribution, since it does not involve an exponential.
To find the optimal embedding, the sum of the Kullback-Leibler divergences between the two distributions is minimized, as given in Eq. 12:

C = KL(P || Q) = Σ_i Σ_j p_ij log (p_ij / q_ij)    (12)

Extension to incomplete multi-view data sets. In this section, we show how our proposed algorithm can be extended to incomplete-view settings. In the case of incomplete-view data, not all samples appear in every view. The only change required is in the generation of the unified probability, p_ij. For generating p_ij under incomplete-view settings, we use Eq. 13; when all the views are complete, Eq. 13 reduces to Eq. 9. From the equation, it can be seen that for sample pairs occurring in more than one view we use conflation, but for pairs occurring in exactly one view, we use the original probability as generated by Eq. 6.
Here, S_in is the set of points occurring in more than one view. When sample j occurs in only one view but i occurs in more than one view, the single-view probability from Eq. 6 is used directly. The rest of the methodology is the same as in the complete-view setting. Finally, to obtain the optimal embedding, Eq. 12 is minimized.
The generation of unified view is explained with examples in the supplementary file.
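Under the assumption that the incomplete-view rule operates pair by pair, a minimal sketch might look as follows (`unify_pair_probability` is a hypothetical helper; the global renormalisation over all pairs is abstracted into a `normaliser` argument, and the probabilities are made up):

```python
# Sketch of the incomplete-view rule: a pair (i, j) observed in several views
# gets the conflation (normalised product) of its per-view probabilities; a
# pair available in only one view keeps that view's probability unchanged.
# `view_probs` maps each view to the probability it assigns the pair, or
# None when a sample of the pair is missing from that view.

def unify_pair_probability(view_probs, normaliser=1.0):
    """Combine per-view probabilities for one (i, j) pair (hypothetical helper)."""
    available = [p for p in view_probs if p is not None]
    if len(available) == 1:
        return available[0]        # single view: keep the original probability
    prod = 1.0
    for p in available:
        prod *= p                  # conflation: product of the densities...
    return prod / normaliser       # ...renormalised over all pairs

# Pair seen in two views -> product of the two probabilities (before the
# global renormalisation over all pairs, abstracted as `normaliser` here).
print(round(unify_pair_probability([0.2, 0.5, None]), 6))  # 0.1
# Pair seen in one view only -> that probability is used unchanged.
print(round(unify_pair_probability([None, 0.3, None]), 6))  # 0.3
```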

Generation of the final embedding.
After obtaining a unified probability distribution P for each sample pair in the data set and a combined data set Y_init (both described in the preceding sections), the unified final embedding Y is generated. Instead of initializing Y randomly, we initialize it with Y_init. We start by calculating q_ij for Y using Eq. 11 and minimize the KL divergence in Eq. 12; the values of Y are optimized using stochastic gradient descent. Finally, the embedding Y is obtained. The working of the proposed algorithm is shown in Algorithms 1 and 2.
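A minimal sketch of this optimisation loop, using the Student-t probabilities of Eq. 11 and the standard t-SNE-style KL gradient, on a toy four-point problem (the P matrix, initial embedding, and learning rate below are made up):

```python
import math

# Sketch: Student-t probabilities q_ij in the low-dimensional space and
# gradient descent on KL(P||Q). P is a toy symmetric joint distribution
# over 4 points; y is a 1-d embedding.

def student_t_q(y):
    """q_ij = (1 + ||y_i - y_j||^2)^-1, normalised over all pairs i != j."""
    n = len(y)
    e = [[0.0] * n for _ in range(n)]
    z = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                e[i][j] = 1.0 / (1.0 + (y[i] - y[j]) ** 2)
                z += e[i][j]
    return [[e[i][j] / z for j in range(n)] for i in range(n)]

def kl(p, q):
    return sum(p[i][j] * math.log(p[i][j] / q[i][j])
               for i in range(len(p)) for j in range(len(p)) if i != j)

def step(y, p, lr=0.1):
    """One step: dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)/(1 + d_ij)."""
    q = student_t_q(y)
    grads = []
    for i in range(len(y)):
        g = 0.0
        for j in range(len(y)):
            if i != j:
                g += (p[i][j] - q[i][j]) * (y[i] - y[j]) / (1.0 + (y[i] - y[j]) ** 2)
        grads.append(4.0 * g)
    return [yi - lr * gi for yi, gi in zip(y, grads)]

# Points 0,1 and 2,3 are strong mutual neighbours in P (entries sum to 1).
p = [[0.0, 0.20, 0.025, 0.025],
     [0.20, 0.0, 0.025, 0.025],
     [0.025, 0.025, 0.0, 0.20],
     [0.025, 0.025, 0.20, 0.0]]
y = [0.0, 1.0, 0.3, 1.3]
before = kl(p, student_t_q(y))
for _ in range(50):
    y = step(y, p)
after = kl(p, student_t_q(y))
print(after < before)  # True: the KL divergence decreases
```

The gradient step pulls together pairs whose p_ij exceeds q_ij and pushes apart pairs where q_ij is too large, exactly as in the derivation below.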
Optimization. The KL divergence between the two joint probability distributions, P and Q, is given by Eq. 12, which can be written as:

C = Σ_{i≠j} p_ij log p_ij − Σ_{i≠j} p_ij log q_ij    (14)

Before performing the derivation, we define the following terms:

d_ij = ||y_i − y_j||^2,  E_ij = (1 + d_ij)^{−1},  Z = Σ_{k≠l} E_kl,  so that q_ij = E_ij / Z.

Now, for a change in y_i, the pairwise distances that may change are d_ij and d_ji, ∀j. Hence, using the symmetry ∂C/∂d_ij = ∂C/∂d_ji, the gradient of the cost function C with respect to y_i is given by:

∂C/∂y_i = Σ_j (∂C/∂d_ij + ∂C/∂d_ji) · 2(y_i − y_j) = 4 Σ_j (∂C/∂d_ij)(y_i − y_j)

where ∂C/∂d_ij is obtained by differentiating the KL divergence in Eq. 14.
For k = i and l = j, the gradient ∂(log E_kl)/∂d_ij is non-zero, and Σ_{k≠l} p_kl = 1. Hence, the gradient ∂C/∂d_ij is given by:

∂C/∂d_ij = (p_ij − q_ij)(1 + d_ij)^{−1}

Substituting this in Eq. 18, we have the final gradient:

∂C/∂y_i = 4 Σ_j (p_ij − q_ij)(1 + ||y_i − y_j||^2)^{−1}(y_i − y_j)

Generation of clusters. Here, we discuss the algorithm used to generate the clusters from the embedded sample points.
Multi-objective clustering technique. Multi-objective optimization (MOO) based clustering algorithms are better at capturing clusters of different shapes 55 and can detect the number of clusters automatically from the data set. For our experiments, we use the Archived Multi-objective Simulated Annealing (AMOSA) algorithm, similar to that used in 56. We use centre-based encoding: the centres of the clusters are encoded in a solution to represent a partitioning. Variable-length encoding is used to automatically identify the number of clusters from a data set, with the number of clusters in different solutions varying over a range. The choice of algorithm is not restricted to AMOSA; any other MOO-based algorithm can be used. The XB index 57 computes the ratio between the cluster compactness and the cluster separation; the minimum value of the XB index represents the optimal partitioning.
XB = (Σ_{q=1..K} Σ_{x_r ∈ cluster q} d^2(x_r, C_q)) / (n · min_{l≠m} d^2(C_l, C_m))

where K = number of clusters, d(x_r, C_q) = distance between the cluster centre and the points within the cluster, and d(C_l, C_m) = distance between cluster centres.
The PBM index 58 is defined as follows:

PBM = ((1/K) · (E_1/E_K) · D_K)^2

where K = number of clusters, E_K = Σ_{k=1..K} Σ_{l=1..n_k} d(x^k_l, C_k), and D_K = max_{k,m} d(C_k, C_m). Here, x^k_l = the l-th point of the k-th cluster, C_k = the centre of the k-th cluster, and n_k = the number of samples in the k-th cluster (E_1 is the value of E_K computed for K = 1). The maximum value of the PBM index corresponds to the optimal number of clusters.
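A minimal sketch of the XB index on one-dimensional toy clusters (the data and centres below are made up; a lower XB value indicates a better partitioning):

```python
# Sketch of the Xie-Beni (XB) cluster validity index: the total within-cluster
# squared distance, divided by n times the minimum squared separation between
# cluster centres. Lower values indicate compact, well-separated clusters.

def xb_index(clusters, centres):
    """clusters: list of lists of 1-d points; centres: one centre per cluster."""
    n = sum(len(c) for c in clusters)
    compactness = sum((x - centres[q]) ** 2
                      for q, cluster in enumerate(clusters) for x in cluster)
    separation = min((centres[l] - centres[m]) ** 2
                     for l in range(len(centres))
                     for m in range(len(centres)) if l != m)
    return compactness / (n * separation)

tight = [[0.0, 0.2], [5.0, 5.2]]   # compact, well-separated clusters
loose = [[0.0, 2.0], [3.0, 5.0]]   # spread-out, closer clusters
print(xb_index(tight, [0.1, 5.1]) < xb_index(loose, [1.0, 4.0]))  # True
```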

Results
This section gives an overview of the data sets used in our experiments, the experimental setup, and the results obtained. The methods compared are as follows:

4. 64: This method uses a latent sample representation to determine the distribution of the features. It optimizes a convex objective and offers a globally optimal solution.
5. PINS 47: To combine clusters of different views, it uses a connectivity matrix. The number of clusters is chosen so as to be robust to perturbation, where perturbation is obtained by adding Gaussian noise to the data.
6. SNF 41: A similarity-based approach that generates a similarity network separately for each view. The networks are then fused together by an iterative process.
7. iClusterBayes 65: It uses a Bayesian-regularization-based joint latent-variable model to detect the clusters from multi-omics data.
8. MVDA 44: In this approach, the information from the various data layers (views) is incorporated at the result stage of each single-view clustering iteration. It works by factorizing the membership matrices in a late-integration manner.
9. MvNE: Our proposed multi-view clustering methodology uses the conflation method to combine the views in the probabilistic domain and generates a unified embedding. We apply the multi-objective optimization algorithm AMOSA 56 on the embedded data sets to obtain the clusters; AMOSA automatically determines the number of clusters from the data set. From the obtained Pareto front, we report the results of the solutions with high NMI values.
10. AvgProb: As a baseline method, we first generate a probability distribution of the samples on each view and then combine the distributions by taking the average of the probabilities over the views. The final embedding is generated by minimizing the KL divergence between the obtained average probability and the probability in the embedded domain.
11. AvgData: As a baseline method, we generate a combined distance matrix by taking the average of the distance matrices from all three views. The probability distribution in the higher dimension is generated from this combined distance matrix. The final embedding is generated by minimizing the KL divergence between the generated probability and the probability in the embedded domain.
Preprocessing of data sets. Most

Experimental settings. For the state-of-the-art methods, we used the code released by the corresponding authors. For our method, we empirically selected the size of the low-dimensional embedding, dim, as 80 for all the data sets. The nearest-neighbour size, k, is set to 30 empirically for all data sets. The total number of gradient descent iterations is set to 2000, the initial momentum to 0.5, and the final momentum to 0.9. The learning rate (η) is initially set to 200 and, after every iteration, is updated by the adaptive learning rate scheme described by Jacobs et al. 66.
Evaluation metrics. Two measurement indices, normalized mutual information (NMI) 67 and adjusted Rand index (ARI) 68, are used to compare MvNE with the other approaches. Both metrics measure the agreement between the real and the predicted partitions; higher values indicate greater agreement.
NMI(C, E) = I(C; E) / sqrt(H(C) · H(E))

Here, C and E are the true class labels and the cluster labels, respectively; I(.) is the mutual information and H(.) is the entropy.
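A minimal sketch of NMI in pure Python, assuming the common square-root normalisation I(C; E)/sqrt(H(C)H(E)) (other normalisations appear in the literature; the label vectors below are made up):

```python
import math

# Sketch: normalised mutual information between true labels and cluster
# labels. NMI is 1 when the two partitions coincide (up to label renaming).

def entropy(labels):
    n = len(labels)
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    return -sum(c / n * math.log(c / n) for c in counts.values())

def mutual_information(a, b):
    n = len(a)
    joint = {}
    for x, y in zip(a, b):
        joint[(x, y)] = joint.get((x, y), 0) + 1
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        px = sum(1 for v in a if v == x) / n
        py = sum(1 for v in b if v == y) / n
        mi += pxy * math.log(pxy / (px * py))
    return mi

def nmi(a, b):
    return mutual_information(a, b) / math.sqrt(entropy(a) * entropy(b))

true_labels = [0, 0, 1, 1, 2, 2]
perfect     = [1, 1, 2, 2, 0, 0]   # same partition, clusters renamed
print(round(nmi(true_labels, perfect), 6))  # 1.0
```

Because NMI compares partitions rather than label values, the renamed-but-identical clustering above still scores 1.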
Parameters study. There are two main parameters in the proposed methodology: the number of nearest neighbours (k) and the unified embedding dimension (dim). In this section, we analyse the performance of MvNE as these parameters change. Results on all ten data sets are reported in Figs. 4 and 5. From Fig. 4 it is evident that when k is too small, the probability distribution carries very little information about the global structure of the clusters and is too focused on the local structure, which causes the clusters to break into several sub-clusters and deteriorates the performance of the algorithm. If k is too large, the distribution fails to capture the structure of the clusters properly, causing clusters to merge, and the algorithm is not stable. Empirically, we set the value of k to 30 for all data sets.

Figure 5 shows that when dim is too small, the unified embedding fails to capture enough information to reflect the structure of the data set. When it is too large, the performance degrades. One reason for this is the use of the Student t-distribution with one degree of freedom: as the dimension increases, it fails to preserve the local structure of the data set, because in higher dimensions the heavy tails comprise a relatively large portion of the probability mass. Empirically, we keep the embedded dimension at 80 for all data sets. In the SAE, the input layer size equals the size of the input vectors; for example, for gene expression values, the input dimension is 400. There are three hidden layers of sizes 500, 80, and 500, respectively. The output layer has the same size as the input layer. The dropout value is set to 5%.
For the multi-objective clustering algorithm, we set the parameters in accordance with 56. We did not tune the settings of AMOSA 69 because the main focus of this paper is on generating the optimal embedding. The parameter settings are as follows: T_max = 100, T_min = 0.001, Iterations = 100, rate of cooling = 0.9, Min clusters = 2, Max clusters = {samples}, SL = 50 and HL = 40. The algorithm is executed 20 times.
Gene marker identification. The BRCA data set includes four groups of patients: LumB, LumA, Her2 and Basal. A binary classification problem is solved to identify the most significant genes for each class: two groups are formed, one with the samples from one class and the other with the samples from all other classes. The signal-to-noise ratio (SNR) 70 is determined for each gene after considering both classes. It is defined as:

SNR = (μ_1 − μ_2) / (σ_1 + σ_2)

where μ is the mean and σ is the standard deviation of the gene's expression in each class.
Genes with high SNR values have high expression for the class to which they belong, and vice versa. Finally, 5 up-regulated (highest SNR) and 5 down-regulated (lowest SNR) genes are selected from the SNR list.
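The SNR ranking can be sketched as follows (the gene names and expression values below are made up for illustration):

```python
import math

# Sketch: rank genes by signal-to-noise ratio for one class versus the rest,
# SNR = (mu_1 - mu_2) / (sigma_1 + sigma_2), computed per gene.

def mean_std(values):
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var)

def snr(class_a, class_b):
    """SNR of one gene given its expression values in the two classes."""
    m1, s1 = mean_std(class_a)
    m2, s2 = mean_std(class_b)
    return (m1 - m2) / (s1 + s2)

# Per gene: (expression in the target class, expression in the other classes)
genes = {
    "geneA": ([5.0, 5.5, 6.0], [1.0, 1.5, 2.0]),   # up-regulated in the class
    "geneB": ([1.0, 1.2, 1.4], [4.0, 4.5, 5.0]),   # down-regulated
    "geneC": ([3.0, 3.1, 2.9], [3.0, 3.2, 2.8]),   # uninformative
}
ranked = sorted(genes, key=lambda g: snr(*genes[g]), reverse=True)
print(ranked[0])    # geneA (highest SNR: up-regulated)
print(ranked[-1])   # geneB (lowest SNR: down-regulated)
```

Taking the head and tail of such a ranked list corresponds to selecting the up-regulated and down-regulated markers described above.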

Discussion
In Tables 4 and 5, we compare the NMI and adjusted Rand index (ARI) values obtained by the different clustering methods over the 10 omics data sets. In terms of NMI and ARI, our proposed methodology, MvNE, shows improvements of 2-3% and 1.8-2.9%, respectively, over all the data sets with respect to the state-of-the-art algorithms. The maximum NMI and ARI values are marked in bold in Tables 4 and 5, respectively. From the results, it can be seen that iClusterBayes 65 performs poorly, as it may get stuck at locally optimal solutions due to its complex structure.
Further, in Table 6, we report the macro F1-score and accuracy obtained by our proposed methodology for all 10 omics data sets.
The baseline method AvgProb performs poorly because the average of the probabilities fails to capture the structure of the probability distribution, as discussed in the "Methods" section. MvNE outperforms both baseline methods, AvgData and AvgProb, showing the superiority of the conflation method.
The obtained results confirm the effectiveness of our hypothesis of generating subspaces by combining different views in the probabilistic domain.
The BRCA data set has 4 classes, so a total of 40 genes, 20 down-regulated and 20 up-regulated, are obtained. Fig. 6 shows the heatmap of these genes; red denotes higher expression values, green lower expression, and black moderate expression. Fig. 6 also indicates that the genes known for a specific tumour class are either down-regulated or up-regulated.
Fig. 7 shows the gene expression profile plot for each group of the BRCA data set. The compactness of the profiles shows that the clustered samples have similar gene expression patterns, i.e., there is strong continuity among the samples within a cluster.
In Table 3, we report the p-values obtained by comparing our proposed model with the other state-of-the-art and baseline models. These values are below the 5% significance level, showing that the performance improvements obtained by our proposed model, MvNE, are statistically significant.

Theoretical analysis. Time complexity. The proposed methodology can be divided into three parts: generation of the initial embedding using the SAE, generation of the final low-dimensional embedding, and the AMOSA clustering algorithm. The time complexity of each part is as follows:

1. Time complexity of the SAE. In our proposed approach, we use 3 hidden layers. The time complexity of multiplying an i×j matrix by a j×k matrix is O(i * j * k). Since we have 4 layers in total (3 hidden layers and 1 output layer), we require 4 weight matrices to compute the output, say W_ji, W_kj, W_lk and W_ml. For t training samples, n iterations, and an input vector of size i, the time complexity of the full forward and backward passes is typically O(n * t * (ij + jk + kl + lm)). However, we have parallelized the SAE using GPUs.

2. Generation of the embedding from the high-dimensional space. The time complexity of generating the low-dimensional embedding for N samples is O(N^2), so for very large data sets this method is slow. In the supplementary file, we show results on large data sets having more than 3000 samples, but the algorithm is best suited for biological data sets, where fewer samples are available.

Convergence analysis. For the convergence analysis of our algorithm, we show the error plots for all the data sets while generating the low-dimensional embedding. From Fig. 8, it can be observed that over 1000 iterations there is a monotonic decrease in the error value for all the data sets, which indicates the convergence of our proposed methodology.
Conclusion
In this paper, we have proposed an unsupervised probabilistic approach to generate a unified neighbourhood embedding for multi-view data sets. The proposed methodology combines the multiple omics data in the probability space and then generates a unified embedding preserving the statistical properties of each view as well as the combined neighbourhood information of the samples. Assigning equal weight to each view is unlikely to be appropriate for patient classification problems; one of the key benefits of our proposed methodology is that it uses a weighted combination of the views, since the conflation method used for combining the different omics data automatically assigns higher weight to the more accurate omics data. Another advantage of the proposed methodology is that it can handle data having incomplete views, i.e., missing samples in some views; the results for incomplete views are shown in the supplementary file. However, one major drawback of the proposed methodology is the time complexity of calculating the embedding in the lower dimension, which is very high.