Abstract
Cancer subtype identification is one of the critical steps toward advancing personalized anticancer therapies. The accumulation of a massive amount of multi-platform omics data measured across the same set of samples provides an opportunity to look into this deadly disease from several views simultaneously. A few integrative clustering approaches have been developed to capture shared information from all the views to identify cancer subtypes. However, they have certain limitations. The challenge here is identifying the most relevant feature space from each omic view and systematically integrating them. Both steps should lead toward a global clustering solution with biological significance. In this respect, a novel multi-omics clustering algorithm named RISynG (Recursive Integration of Synergised Graph-representations) is presented in this study. RISynG represents each omic view through two representation matrices, the Gramian and the Laplacian. A parameterised combination function is defined to obtain a synergy matrix from these representation matrices. Then a recursive multi-kernel approach is applied to integrate the most relevant, shared, and complementary information captured via the respective synergy matrices. At last, clustering is applied to the integrated subspace. RISynG is benchmarked on five multi-omics cancer datasets taken from The Cancer Genome Atlas. The experimental results demonstrate RISynG’s efficiency over the other approaches in this domain.
Introduction
Cancer is a heterogeneous disease with diverse pathogeneses and clinical features that can develop in different tissues and cell types^{1}. A cancer subtype can be defined as a subcategory of a specific cancer; for example, cervical cancer can be further grouped into adenocarcinomas and squamous cell carcinomas. Multiple subtypes are distinguishable based on molecular profiles, histology, or sometimes specific mutations. In personalized medicine practices, patient-specific medicines are provided rather than generic ones. Therefore, for effective treatment of any cancer, it is crucial to identify the appropriate cancer subtype in order to provide an effective prognosis^{2}.
Nowadays, with the advancement of technologies, it has become very easy to generate high-dimensional multi-omics data for an individual. Multi-omics data include miRNA and mRNA expressions, DNA methylation, reverse phase protein arrays, and others. These datasets are publicly available in various databases like The Cancer Genome Atlas (TCGA)^{3}. The accumulation of various omics data opens up the opportunity to develop novel computational methods that integrate the tremendous amount of multi-view information available for cancer subtype identification. The usual practice for identifying cancer subtypes is to cluster cancer patient data. By grouping cancer patients based on their genetic profiles, one can better understand the pathogenic mechanisms behind the disease. This will later help in the development of subtype-specific anticancer treatments. However, several challenges exist in grouping cancer patients and integrating multi-omics data.
The integration of multi-view omics data and the clustering of cancer patients are relatively new research areas. A few algorithms have been developed to address the challenges associated with them. A decade ago, researchers used single-omics data to cluster cancer subtypes. Several studies have been performed using only gene expression data^{4,5,6}, DNA methylation data^{7}, or copy number data^{8} to identify cancer subtypes. These algorithms perform clustering across the samples to capture the homogeneity present within the patients based on the expression levels of a specific biomarker. Since acquiring cancer hallmarks requires multiple molecular alterations at multiple levels, these algorithms fail to establish the causal relationship between molecular signatures. This biological phenomenon indicates the need for algorithms that integrate multi-omics data to identify cancer subtypes. In this regard, integrative clustering-based approaches have been found helpful for capturing the underlying molecular mechanisms at work behind this deadly disease. Further, these algorithms can be categorized into two groups. The first group of algorithms identifies clusters from each omic dataset separately and later combines these clustering results to obtain a global clustering that represents cancer subtypes^{9,10,11,12}. These algorithms are known as Consensus Clustering (CC). Mostly, CC algorithms perform the final clustering on the individual clusters obtained from the different omic datasets using a voting mechanism; different voting mechanisms generate different clustering solutions. The second group of integrative clustering-based approaches first integrates the multi-view omics data and then applies clustering to obtain cancer subtypes^{13,14,15,16}. Sometimes the multi-view data are concatenated or stacked together, and clustering identifies the cancer subtypes. However, data concatenation may lead to information loss and amplify the curse of dimensionality^{16}.
On the other hand, to overcome the above-mentioned limitations, a set of algorithms has been developed that extract an informative subspace from each of the omics datasets and then perform clustering on the integrated dataset^{14,15,16,17,18,19}.
Clustering multi-view genomics data is a challenging task. One of the critical steps is selecting relevant information from all the available information sources and judiciously integrating it to obtain better clustering solutions. The multi-view data from multi-omics studies vary in terms of variance, scale, and unit. If the integration step is not performed correctly, the fused information may be biased towards the most variant omic view. Therefore, it becomes essential to first capture the variations present in each view and then integrate them. Some methods are available that first model the variation of each view with the help of similarity graphs and then integrate them to identify clusters^{13,19,20,21}. The challenge here is finding the best possible way of integration to capture the essence of all the views from the different types of genomic information available for the same set of samples. The research area devoted to this type of problem is multi-view learning^{22,23,24,25,26,27}.
In this study, a novel algorithm named RISynG (Recursive Integration of Synergised Graph-representations) is presented. The proposed approach treats multi-omics data clustering as multi-view clustering, where information from multiple omics platforms is integrated to identify clinically important subgroups within a cancer. In order to judiciously capture the variation present across the multi-omics dataset, the proposed approach works in three steps. First, for each view, two sample-similarity matrices are computed using graph representation matrices, namely, the Gramian matrix and the Laplacian matrix. This step acknowledges the statistical diversity in the multi-view omics data, which directly influences the quantification of similarity between the samples. The representation matrices of the respective omic view are then integrated using a parameterized combination function to generate synergy matrices. In the second step, the variation captured through the synergy matrices of each omic view is fused. The proposed approach first arranges all the synergy matrices based on their relevance. Then, a recursive function is designed to merge each synergy matrix so that a less relevant matrix has only a slight influence on the final cluster structures. At the end of this process, the final accretive basis of the accretive subspace is obtained, whose first k eigenvectors hold the cluster structure. At last, k-means clustering is applied to the rows of the accretive basis matrix to generate cluster labels. The efficacy of the proposed algorithm is extensively studied on five multi-omics cancer datasets and compared with existing multi-view clustering approaches used for cancer subtype identification.
Proposed approach for cancer-subtype identification
This section describes the novel algorithm designed in this study to integrate multi-omics data for cancer subtype identification. The proposed method integrates multi-view data using a recursive multi-kernel integration function. It uses graph representations to harness the best picture of sample similarities from each of the omic views and exploits each view’s statistical properties. The schematic workflow of RISynG is presented in Fig. 1. Before moving to the steps of the proposed algorithm, the required analytical formulations are first discussed.
Gramian matrix and kernel trick
A Gramian matrix \(G=[g_{ij}]_{n\times n}\) is a Hermitian matrix in which each element is the pairwise Hermitian inner product of vectors in a Hausdorff pre-Hilbert space \(V=\{{v_{1},v_{2},v_{3}, \ldots ,v_{n}}\}\).
The Hermitian inner product space is accompanied by the geometric notions associated with vectors, such as length and the angle between two vectors. Since G is a Hermitian matrix, it inherits all the properties of a Hermitian matrix. A few of the relevant properties are listed below^{28}.
Property 1
All the eigenvalues of G are real.
Proof
Eigenvalues of a matrix are the roots of its characteristic equation. The characteristic equation of matrix G is written as:

\[\det (G-\lambda I)=0 \qquad (1)\]

Let a root be some complex number \(\lambda = a+ib\), \(a,b\in {\mathbb {R}}\), \(b\ne 0\), and let I be the identity matrix of the same order. Since, at this value of \(\lambda \), the matrix \(G-\lambda I\) has a non-trivial kernel, there must exist a vector \(u=x+iy\), \(x,y\in {\mathbb {R}}^n\), such that:

\[(G-\lambda I)u=0, \qquad (2)\]

or,

\[G(x+iy)=(a+ib)(x+iy). \qquad (3)\]

Taking the adjoint of this equation, we get

\[G(x-iy)=(a-ib)(x-iy). \qquad (4)\]
If \(x+iy\) and \(x-iy\) were two different eigenvectors of matrix G, then their inner product \(\Vert x\Vert ^2+\Vert y\Vert ^2\) would have to be 0 because of the mutual orthogonality among the eigenvectors. That is not possible unless x and y are 0, in which case (3) and (4) would be identical. This is possible only if the initial assumption is contradicted and b is 0 for all eigenvectors. Hence, it is proved that all the eigenvalues of G are real. \(\square \)
Property 2
G is a symmetric and positive semidefinite matrix.
Proof
Pertaining to the fact that \(v_{i}\in {\mathbb {R}}^d\), the following holds for any vector \(x\in {\mathbb {R}}^n\):

\[x^{\textsf {T}}Gx=\sum _{i=1}^{n}\sum _{j=1}^{n}x_i x_j\langle v_i,v_j\rangle . \qquad (5)\]

According to the elementary property of inner products, \({\displaystyle \langle x+y,x+y\rangle =\langle x,x\rangle +\langle x,y\rangle +\langle y,x\rangle +\langle y,y\rangle \,.}\) It implies that the sum of inner products in (5) can be taken forward as

\[x^{\textsf {T}}Gx=\Big \langle \sum _{i=1}^{n}x_i v_i,\ \sum _{j=1}^{n}x_j v_j\Big \rangle =\Big \Vert \sum _{i=1}^{n}x_i v_i\Big \Vert ^2\ge 0. \qquad (6)\]

Therefore, G is a positive semidefinite matrix. \(\square \)
Property 3
All the eigenvalues of G are nonnegative.
Proof
Property 2 implies \(x^{\textsf {T}}{G} x\ge 0\). Substituting the value of Gx from (2),

\[x^{\textsf {T}}Gx=\lambda x^{\textsf {T}}x\ge 0. \qquad (7)\]

Since \(x^{\textsf {T}}x\) is positive for all eigenvectors, \(\lambda \ge 0\). Hence proved. \(\square \)
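Properties 1–3 can be checked numerically. A minimal sketch with NumPy; the random vectors and tolerances are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((5, 8))   # five vectors v_1..v_5 in R^8, stacked row-wise
G = V @ V.T                       # Gramian: G[i, j] = <v_i, v_j>

# Property 2: G is symmetric.
assert np.allclose(G, G.T)

# Properties 1 and 3: the eigenvalues are real and non-negative.
eigvals = np.linalg.eigvalsh(G)   # eigvalsh assumes symmetry and returns real values
assert np.all(eigvals >= -1e-10)  # non-negative up to floating-point error

# Quadratic form of Property 2: x^T G x = ||sum_i x_i v_i||^2 >= 0.
x = rng.standard_normal(5)
assert np.isclose(x @ G @ x, np.linalg.norm(x @ V) ** 2)
```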
The previously described premise is often used in various methods of dimensionality reduction. Algorithms like Principal Component Analysis and its variants utilize the kernel trick to map the observations into a higher dimension to make the data linearly separable. It is equivalent to projecting the mean-centered data onto a subspace on which its variance is maximum^{29}. It was shown by Bernhard Schölkopf et al.^{30} that algorithms like KPCA use a kernel function \(\varvec{\kappa }\) to essentially learn a mapping function \(\phi \) from the input space \({\mathbb {R}}^n\) into a high-dimensional Hilbert space \(\mathbf{F}\), which can be called the feature space. The process is demonstrated in (8) and (9).

\[\phi :{\mathbb {R}}^n\rightarrow \mathbf{F} \qquad (8)\]

\[\varvec{\kappa }(x_i,x_j)=\langle \phi (x_i),\phi (x_j)\rangle _{\mathbf{F}} \qquad (9)\]
Therefore, for a data point \(v=(x_1,\dots ,x_n)\), \(x_i \in {\mathbb {R}}^d\), the mapping into a feature space \({\mathbb {R}}^{n+k}\) is given by
where the value of \(p_i\) depends upon the kernel that has been used for the mapping. However, kernels do not explicitly project the data into that high-dimensional feature space; rather, they generate a Gramian matrix G of the mapped data in the aforementioned feature space \(\mathbf{F}\). The generated Gramian matrix enables the input data to be operated on in that high-dimensional feature space^{31}. If \(X=(x_1\dots x_n)\), \(x_i\in {\mathbb {R}}^{d}\), represents the input data, the corresponding Gramian matrix is given by

\[G=[\varvec{\kappa }(x_i,x_j)]_{n\times n}=[\langle \phi (x_i),\phi (x_j)\rangle ]_{n\times n}.\]
Let \(G=U\Sigma U^T\) represent the eigendecomposition of G, where U is a matrix containing the eigenvectors of G arranged column-wise in descending order of their corresponding eigenvalues, which appear in the same order in the diagonal matrix \(\Sigma \), as shown in (11) and (12).

\[U=[u_1,u_2,\dots ,u_n] \qquad (11)\]

\[\Sigma =diag(\lambda _1,\lambda _2,\dots ,\lambda _n) \qquad (12)\]
Here, \(\lambda _1\ge \dots \ge \lambda _n\ge 0\) (see Property 3 of the Gramian matrix), \(u_i^Tu_i=1\) for \(i\in \{1,2,\dots ,n\}\), and \(Gu_i=\lambda _i u_i\). Also note that, in the context of PCA, principal components refer to the projection of the input data points onto the principal directions along which the variance of the data is maximum. For PCA, the projection is given by \(y_i=U_k^Tx_i\) for all \(i\in \{1,2,\dots ,n\}\), where \(U_k\) is a matrix of the first k eigenvectors of G. However, in the case of KPCA, the spectrum of G itself gives the projection of X^{32}. Note that when \(\phi (v)=v\), the Gramian matrix transforms into the covariance matrix. Generalising both, if \(U_k\) represents the k principal axes, the algorithm finds a basis of an optimal low-dimensional subspace where the \(L_2\)-norm of the reconstruction error is minimum^{33}. That is, for a test sample x,

\[\min _{U_k}\ \Vert x-U_kU_k^{\textsf {T}}x\Vert _2^2. \qquad (13)\]
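The kernel trick described above can be sketched as follows. An RBF kernel stands in for the generic \(\varvec{\kappa }\) (the discussion does not fix a kernel), the Gramian of the mapped data is double-centred, and its spectrum directly yields the projection, as noted for KPCA; all names are illustrative:

```python
import numpy as np

def kernel_pca_embedding(X, k, gamma=1.0):
    """Project n samples (rows of X) onto the top-k kernel principal
    components. An RBF kernel stands in for the generic kernel kappa;
    the spectrum of the double-centred Gramian itself gives the
    projection, as noted above for KPCA."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))  # Gramian in feature space
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                        # double-centring = mean-centring in feature space
    w, U = np.linalg.eigh(Kc)             # ascending eigenvalues
    order = np.argsort(w)[::-1][:k]       # keep the k largest, as in (11) and (12)
    w, U = w[order], U[:, order]
    return U * np.sqrt(np.maximum(w, 0))  # rows are the embedded samples
```

The columns come out ordered by decreasing eigenvalue; with a linear kernel (\(\phi (v)=v\)) the same spectral route recovers the ordinary PCA scores.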
In addition to dimensionality reduction, principal component analysis can also be used for k-clustering with a heuristic-based k-means algorithm. This is done by performing k-means clustering in the projected space, as shown in the heuristic k-means algorithm described in^{34}.
Graph Laplacian
Any set of observations appears to have an emergent behaviour that evinces the properties of a graph when operated on in a clustering pipeline. Therefore, given a set of data points \(X=(x_1,x_2,\dots ,x_n) \in {\mathbb {R}}^{d\times n}\) and a notion of similarity between any two points \(x_i,x_j\in X\), an undirected similarity graph \(S=(V,E)\) can be constructed such that each vertex \(v_i\in V\) represents a data point \(x_i\), and \((v_i,v_j)\in E\) represents the edge between vertices \(v_i\) and \(v_j\). With each edge, there is an associated edge weight \(e_{ij}\) that represents the similarity between the corresponding data points. Let the similarity matrix be \(W(i,j)=[e_{ij}]_{n\times n}\). The degree \(d(v_i)\) associated with each node \(v_i\) is given by

\[d(v_i)=d_i=\sum _{j=1}^{n}e_{ij}. \qquad (14)\]
The degrees of all the nodes/vertices can be wrapped in matrix form as shown in (15):

\[D=diag(d_1,d_2,\dots ,d_n). \qquad (15)\]
These matrices act as a precursor for constructing a matrix of algebraic importance, called the Laplacian matrix. The data can be composed into a discrete graph form by constructing the graph Laplacian of its continuous representations, such as a vector space or a Riemannian manifold. The Laplacian matrix has many variants, so much so that, depending on the problem and the available data, authors devise their own version of the graph Laplacian matrix^{35}. The simplest graph Laplacian is given by \(D-W\). It is called the unnormalised graph Laplacian matrix. However, in the proposed algorithm, the normalised graph Laplacian matrix has been used. That is,

\[{\mathscr {L}}=D^{-1/2}(D-W)D^{-1/2}=I-D^{-1/2}WD^{-1/2}, \qquad (16)\]

where \(D^{-1/2}=diag(d_1^{-1/2},d_2^{-1/2}, \dots ,d_n^{-1/2})\) and I is the identity matrix of appropriate order. Considering the fact that the similarity matrix is a Gramian matrix, it is apparent that the Gramian and the Laplacian are not much different: the Laplacian can be characterised as the Gramian normalised over the degree matrix. The distinction between the unnormalised and normalised graph Laplacian is better apparent in light of spectral clustering. Consider a strongly connected graph \(S=(V,E)\). The purpose of clustering is to come up with subsets of points according to their similarity, such that similar points lie in the same subset. It is equivalent to finding partitions of the graph such that the edges between different partitions have minimum weight. For two disjoint subsets \(A, B\subset V\) corresponding to two different partitions, the cut size is given by

\[cut(A,B)=\sum _{v_i\in A,\,v_j\in B}e_{ij}. \qquad (17)\]
Let there be k clusters in the data. The aim of clustering is to find k partitions \({\mathbf{A}=(A_1,A_2,\dots ,A_k)}\) such that the size of the cuts, as shown in (17), over all the partitions is minimum. That is,

\[\min _{A_1,\dots ,A_k}\ \frac{1}{2}\sum _{i=1}^{k}cut(A_i,\bar{A_i}), \qquad (18)\]

where \(\bar{A_i}\) is the complement of \(A_i\). This is called the mincut problem. However, solving (18) alone does not achieve reliable clustering results. For example, for \(k=2\), separating one vertex from the rest of the graph can also be a valid solution as per mincut. In clustering, each cluster needs to accommodate a reasonably large partition to be considered credible. Therefore, the objective function is redefined in the following two ways:

\[RatioCut(A_1,\dots ,A_k)=\frac{1}{2}\sum _{i=1}^{k}\frac{cut(A_i,\bar{A_i})}{|A_i|} \qquad (19)\]

\[NCut(A_1,\dots ,A_k)=\frac{1}{2}\sum _{i=1}^{k}\frac{cut(A_i,\bar{A_i})}{vol(A_i)} \qquad (20)\]
where \(|A_i|\) represents the number of vertices in partition \(A_i\) and \(vol(A_i)=\sum _{v_j\in A_i}{d_j}\).
However, solving these minimisation problems is NP-hard. The Laplacian matrix is a utility that can be used to approximate these minimisation problems. The unnormalised Laplacian serves in the approximation of the minimisation of RatioCut, while the normalised Laplacian serves in the approximation of the minimisation of NCut. The approximated objective function using the normalised Laplacian is given by (21):

\[\min _{U_k\in {\mathbb {R}}^{n\times k}}\ Tr(U_k^{\textsf {T}}{\mathscr {L}}U_k)\quad \text {subject to}\quad U_k^{\textsf {T}}U_k=I. \qquad (21)\]
The above expression is minimum when \(U_k\in {\mathbb {R}}^{n\times k}\) is a matrix containing the eigenvectors corresponding to the k smallest nonzero eigenvalues of matrix \({\mathscr {L}}\). This matrix is used to embed the data into a k-dimensional Euclidean space spanned by the vectors in \(U_k\), in which grouping the data points is arguably easy even with simpler techniques like k-means. The described practice is known as Laplacian embedding. The embedded data are then subjected to the k-means clustering algorithm for cluster discovery, as shown in the Normalised Spectral Clustering presented in Ref.^{36}. For a strongly connected graph with a single component, the eigenvector corresponding to the trivial solution (i.e. \(\lambda =0\)) of the eigenvalue problem of matrix \({\mathscr {L}}\) is a column vector of n ones. Therefore, \({\mathscr {L}}{} \mathbf{1}_{n}=0\), where \(\mathbf{1}_{n}=(1,\dots ,1)^T\). If the graph happens to have more than one component, then the multiplicity k of eigenvalue 0 is equal to the number of connected components in the graph. Nonetheless, with respect to clustering, the eigenvector(s) corresponding to eigenvalue 0 should be omitted while performing Laplacian embedding. This can be done by introducing a minor change in the matrix.
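The Laplacian-embedding pipeline described above can be sketched as follows; for simplicity the trivial eigenvector is kept in the embedding, and a deterministic farthest-point rule seeds the k-means centres (both are implementation conveniences, not part of the referenced method):

```python
import numpy as np

def spectral_clusters(W, k, iters=50):
    """Normalised spectral clustering sketch: Laplacian embedding into the
    k smallest eigenvectors of L = I - D^{-1/2} W D^{-1/2}, followed by a
    plain Lloyd-style k-means on the embedded rows."""
    n = W.shape[0]
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, U = np.linalg.eigh(L)            # eigh sorts ascending: smallest eigenvalues first
    Y = U[:, :k]                        # Laplacian embedding, one row per sample
    centres = [Y[0]]
    for _ in range(k - 1):              # deterministic farthest-point initialisation
        d = np.min([((Y - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(Y[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(iters):              # Lloyd iterations
        labels = np.argmin(((Y[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = Y[labels == j].mean(axis=0)
    return labels
```

On a similarity graph made of two disconnected cliques, for example, the two blocks are recovered exactly.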
If the eigenpairs of \({\mathscr {L}}\) are given by
then the eigenpairs of (22) are given by
Hence, the new eigenvalue problem becomes
By modifying the matrix to L, the initial k eigenvectors can be taken right away. This trick works because, for all the pairs in \({\varvec{\Gamma }}({\mathscr {L}})\) except \((\lambda _1,f_1)\), the matrix L reduces to \({\mathscr {L}}\). Hence, the set \({\varvec{\Gamma }}(L)\) will have all the eigenpairs that are in \({\varvec{\Gamma }({\mathscr {L}})}\) except \((\lambda _1,f_1)\). While at \(v=f_1=\mathbf{1}_{n}\),

Therefore, in the new set \(\varvec{\Gamma }(L)\), the rank of all the eigenvalues greater than \(\lambda _1\) is reduced by one, and \(\mathbf{1}_{n}\) becomes the eigenvector corresponding to the largest eigenvalue. The Laplacian matrix has certain properties that are exploited by many clustering techniques like the one shown above. Some of the relevant properties are as follows.
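The exact modification in (22) is not reproduced here, but the effect it describes can be demonstrated with one common choice, adding \(\frac{2}{n}\mathbf{1}_n\mathbf{1}_n^{\mathsf T}\) to \({\mathscr {L}}\) (an assumption of this sketch): every eigenpair orthogonal to \(\mathbf{1}_n\) is untouched, while \(\mathbf{1}_n\) moves from eigenvalue 0 to the top of the spectrum. A regular graph (a cycle) is used so that \(\mathbf{1}_n\) is indeed the trivial eigenvector:

```python
import numpy as np

n = 7
# Cycle graph: 2-regular, so 1_n is the trivial eigenvector of the
# normalised Laplacian with eigenvalue 0.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
Lap = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

ones = np.ones(n)
assert np.allclose(Lap @ ones, 0)            # the trivial eigenpair (0, 1_n)

# Assumed modification: add (2/n) * 1_n 1_n^T to shift only the trivial pair.
Lmod = Lap + (2.0 / n) * np.outer(ones, ones)
assert np.allclose(Lmod @ ones, 2 * ones)    # 1_n now pairs with the top eigenvalue 2

# Any eigenvector orthogonal to 1_n keeps its original eigenvalue.
w, U = np.linalg.eigh(Lap)
v = U[:, 1]
assert np.allclose(Lmod @ v, w[1] * v)
```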
Property 1
For every vector \(f\in {\mathbb {R}}^n\), \({\mathscr {L}}\) satisfies the following condition:

\[f^{\textsf {T}}{\mathscr {L}}f=\frac{1}{2}\sum _{i=1}^{n}\sum _{j=1}^{n}e_{ij}\left( \frac{f_i}{\sqrt{d_i}}-\frac{f_j}{\sqrt{d_j}}\right) ^2.\]
Proof
By the definition of degree, \(d_i=\sum _{j=1}^ne_{ij}\). Therefore,

\[f^{\textsf {T}}{\mathscr {L}}f=f^{\textsf {T}}f-f^{\textsf {T}}D^{-1/2}WD^{-1/2}f=\sum _{i=1}^{n}f_i^2-\sum _{i=1}^{n}\sum _{j=1}^{n}\frac{e_{ij}f_if_j}{\sqrt{d_id_j}}=\frac{1}{2}\sum _{i=1}^{n}\sum _{j=1}^{n}e_{ij}\left( \frac{f_i}{\sqrt{d_i}}-\frac{f_j}{\sqrt{d_j}}\right) ^2.\]

Hence proved. \(\square \)
Property 2
\({\mathscr {L}}\) is a symmetric and positive semidefinite matrix.
Proof
From (16), the symmetry of the matrix is fairly evident. Also, from Property 1, \(f^{\textsf {T}}{\mathscr {L}}f\ge 0\) for all \(f\in {\mathbb {R}}^n\). Hence, it is proved that \({\mathscr {L}}\) is a symmetric and positive semidefinite matrix. \(\square \)
Property 3
All eigenvalues of \({\mathscr {L}}\) are nonnegative.
Proof
Property 1 implies \(f^{\textsf {T}}{\mathscr {L}}f\ge 0\). Substituting \({\mathscr {L}}f=\lambda f\), we get \(f^{\textsf {T}}{\mathscr {L}}f=\lambda f^{\textsf {T}}f\ge 0\). Since \(f^{\textsf {T}}f\) is positive for all eigenvectors, \(\lambda \ge 0\). Hence proved. \(\square \)
RISynG algorithm
For grouping the cancer patients into clusters, each omic view is represented as a graph using two representation matrices, namely, the Gramian matrix and the Laplacian matrix. Each of the representation matrices attributes the similarity network of the samples with a notion of similarity between the samples. Consider a view \(X_m=(x_1,x_2,\dots ,x_n)\), \(x_i\in {\mathbb {R}}^{d_m}\), corresponding to the m-th omic source. If \(\rho (x_i,x_j)\) denotes the distance between \(x_i,x_j\in X_m\), then the similarity \(w(x_i,x_j)\) between them is given by:
where \(\sigma \) is a free parameter adjusted as per the intrinsic properties of the data when subjected to the clustering model. For the cancer data used in this study, \(\sigma \) is given by \(\sigma =\frac{1}{2}\max _{x_i,x_j\in X_m}\rho (x_i,x_j)\). It has been assumed in the proposed method that multiple views may constitute different cluster manifolds when learnt on a particular similarity measure. Therefore, the predicted clusters would be apparent, and in strong concordance with the clinical clusters, if pairwise sample similarity is computed in a data-dependent multi-kernel approach. It was found that in some views the correlation distance prominently reflected a cluster manifold that concurred with the natural clusters, while some showed a proclivity towards the Euclidean distance, and the rest seemed to accommodate parts of both. All things considered, two different graph representation matrices have been formulated, the Gramian matrix and the Laplacian matrix, each with a different measure of similarity. For \(X_m\), let the correlation distance between \(x_i\) and \(x_j\) be given by \(\varphi _m(x_i,x_j)\) and the squared Euclidean distance by \(\varepsilon _m(x_i,x_j)\). If \(\hat{\varphi }_m\) and \(\hat{\varepsilon }_m\) denote the maximum pairwise correlation distance and squared Euclidean distance, respectively, then the Gramian matrix \(G_m\) and similarity matrix \(W_m\) are given by
The matrix articulated in (28) is a crucial precursor for the construction of the Laplacian matrix. The Laplacian matrix is constructed by normalising \(W_m\) by the degree matrix \(D_m\) of its associated graph, as in Eqs. (15) and (16). Hence, the required representation matrices for each view \(X_m\), \(m\in \{1,2,\dots ,M\}\), are given by (27) and (29).
The Laplacian matrix so obtained is then modified as described in Eq. (22).
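A sketch of the two per-view representation matrices follows. The exact kernels of Eqs. (27)–(29) are not reproduced; a Gaussian-type similarity scaled by the maximum pairwise distance is assumed, with the correlation distance feeding the Gramian and the squared Euclidean distance feeding the Laplacian:

```python
import numpy as np

def representation_matrices(X):
    """Sketch of the two per-view representation matrices for a view X
    (n samples in rows). A Gaussian-type kernel normalised by the maximum
    pairwise distance is an assumption standing in for Eqs. (27)-(29)."""
    n = X.shape[0]
    phi = 1.0 - np.corrcoef(X)                   # correlation distance between samples
    sq = np.sum(X ** 2, axis=1)
    eps = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)  # squared Euclidean
    G = np.exp(-phi / phi.max())                 # correlation-based Gramian-style similarity
    Wm = np.exp(-eps / eps.max())                # Euclidean-based similarity for the Laplacian
    d_inv_sqrt = 1.0 / np.sqrt(Wm.sum(axis=1))   # degree normalisation, as in (15) and (16)
    L = np.eye(n) - d_inv_sqrt[:, None] * Wm * d_inv_sqrt[None, :]
    return G, L
```

The returned Laplacian inherits the positive semidefiniteness shown in the Graph Laplacian properties above.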
It is apparent from the discussion presented under the headings Gramian Matrix and Kernel Trick and Graph Laplacian that the matrix \(U_k\) obtained from the Gramian matrix plays the same role as that obtained from the Laplacian matrix. Therefore, for combining the information encoded in these matrices, a parameterised combination function \({\varvec{\Omega }}(\cdot ,\cdot )\) can be used, hence obtaining a synergy matrix of the representation matrices. If \(G_m\) is the Gramian matrix and \(L_m\) is the Laplacian matrix of omic view \(X_m\), then the synergy matrix is given by:
Consequently, the corresponding objective functions, (13) and (21), also combine to optimise over \(U_k\in {\mathbb {R}}^{n\times k}\).
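The combination function \({\varvec{\Omega }}\) is later described as a convex combination governed by \(\beta \in [0,1]\); which matrix receives \(\beta \) versus \(1-\beta \) is an assumption of this sketch:

```python
import numpy as np

def synergy_matrix(G, L, beta):
    """Convex combination of the two representation matrices. Assigning
    beta to the Gramian and (1 - beta) to the Laplacian is an
    illustrative assumption; the paper only fixes beta in [0, 1]."""
    assert 0.0 <= beta <= 1.0
    return beta * G + (1.0 - beta) * L
```

Because a convex combination of symmetric positive semidefinite matrices is again symmetric positive semidefinite, Properties 1–3 of \(H_m\) follow immediately.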
Some of the relevant properties of synergy matrix \(H_m\) are:
Property 1
\(H_m\) is a symmetric and positive semidefinite matrix.
Proof
\(H_m\) is a positive semidefinite matrix if and only if \(v^TH_mv\ge 0\) for all \(v\in {\mathbb {R}}^n\). From the properties of the graph Laplacian and the Gramian, it is evident that both \(G_m\) and \(L_m\) satisfy this condition. Therefore,
In addition, since \(H_m\) is a summation of symmetric matrices, it is also symmetric. Hence, it is proved that \(H_m\) is a symmetric and positive semidefinite matrix. \(\square \)
Given Property 1, the rest of the properties are its direct consequences.
Property 2
All the eigenvalues of \(H_m\) are real.
Property 3
All the eigenvalues of \(H_m\) are nonnegative.
Recursive multikernel integration
After generating synergy matrices for all the views of the dataset, the next step is to integrate the information obtained from each of them. However, before moving to the integration step, the proposed approach needs these matrices to be arranged based on their relative relevance for cluster discovery. It is apparent that the better views encode the cluster structure better; as a consequence, they also exhibit better cluster validity indices. Therefore, the sorting of the synergy matrices has been done based on cluster validity indices such as the silhouette index. Let \(\mathbf{H}=\{H_1,\dots , H_M\}\) be the set of synergy matrices of a dataset with M views, and let the sorted set be \(\mathbf{H}^{\prime }=\{^1H, \dots , ^MH\}\), where the superscript i denotes the relevance of the corresponding synergy matrix \(^iH\), with \(^1H\) being the most relevant. Additionally, let every \(^iU_k\) from the set \(\mathbf{U}=\{^1U_k, \dots , ^MU_k\}\) represent the basis of the eigenspace corresponding to the k smallest eigenvalues of matrix \(^iH\).
Next, a combination method is proposed that distills the cluster information from each of the synergy matrices one by one, in an iterative fashion. While doing that, it subtly takes care of enriching the information coming from the relevant matrices. From the way the synergy matrices have been constructed, it is apparent that it is their basis of the eigenspace that brings out the latent cluster structure in the corresponding view. Therefore, the proposed method uses a recursive function to exploit this fact for the integration as well as the enrichment of the relevant views of the dataset. The recursive formula can be written as:

\[\mathbf{k}_{\eta +1}=\mathbf{k}_{\eta }\otimes {\mathscr {N}}(\mathbf{k}_{\eta },\,^{(\eta +1)}U_k).\]
Here \(\mathbf{k}_{\eta }\) is called the accretive matrix of the \(\eta \)-th recursive step. The non-commutative operator \(\otimes \) signifies the integration operation. That is, for \(A\in {\mathbb {R}}^{n\times n}\) and \(U\in {\mathbb {R}}^{n\times k}\), where A has the eigenvectors corresponding to its k smallest eigenvalues in \(V\in {\mathbb {R}}^{n\times k}\) and U is a basis matrix, the expression \(A\otimes U\) evaluates to an accretive matrix \(A^\prime \in {\mathbb {R}}^{n\times n}\) whose k smallest eigenvectors are given by \(V+U\). The other eigenvectors of A are irrelevant for this discussion. Let the basis of the eigenspace of \(A^\prime \) be known as the accretive basis and the associated subspace as the accretive subspace. Also, let the accretive basis corresponding to the k smallest eigenvectors of \(\mathbf{k}_{\eta }\) be given by \(\mathbf{b}_{\eta }\).
In addition, for enriching the relatively relevant views, the proposed method uses an orthogonalising-normalising function \({{\mathscr {N}}}(\cdot ,\cdot )\). To ensure the accumulation of only the essential cluster information, the proposed approach acquires the basis of that projection of the synergy matrix eigenspace which is orthogonal to the accretive subspace at that recursive step. The idea is similar to the eigenspace update for integrative clustering performed in Ref.^{18}. This function does not normalise the synergy matrix per se; rather, it normalises the basis of the described projection subspace. The computation starts by instantiating \(\mathbf{k}_{1}=\,^1H\), so that \(\mathbf{b}_{1}\) becomes \(^1U_k\). Therefore, at the (\(\eta +1\))-th recursive step (\(\eta \in \{1,\dots ,M-1\}\)), one has the accretive matrix \(\mathbf{k}_{\eta }\) and the eigenspace basis \(^{(\eta +1)}U_k\) of synergy matrix \(^{(\eta +1)}H\). Subsequently, processing within the orthogonalising-normalising function \({{\mathscr {N}}}(\mathbf{k}_{\eta },^{(\eta +1)}U_k)\) renders the final basis matrix in four steps:
First, the basis \({\mathscr {P}}\) of the projection subspace is computed, which is given by:

\[{\mathscr {P}}=\mathbf{b}_{\eta }\mathbf{b}_{\eta }^{\textsf {T}}\,^{(\eta +1)}U_k.\]

Second, the residual component \({\mathscr {Q}}\) of the synergy matrix eigenspace is computed by subtracting the above-mentioned projected component from \(^{(\eta +1)}U_k\):

\[{\mathscr {Q}}=\,^{(\eta +1)}U_k-{\mathscr {P}}.\]
In the third step, \({\mathscr {Q}}\) is subjected to Gram-Schmidt orthogonalisation to yield the final basis \({\mathscr {R}}\). This basis cannot be integrated with the eigenspace of the accretive matrix as is; it first needs to be normalised on the basis of its relevance. So, the fourth step of normalisation is performed as:
Here the notation \([\cdot ]\) denotes that the subsequent operations are done in an element-wise fashion. The resultant matrix V is called the orthogonalised-normalised basis matrix. At the end of the process, the final accretive matrix \(\mathbf{k}_{M}\) is obtained, whose first k eigenvectors, collected in the matrix \(\mathbf{b}_{M}\in {\mathbb {R}}^{n\times k}\), hold the cluster structure. Hence, performing k-means on the rows of the matrix \(\mathbf{b}_{M}\) returns the cluster labels for each sample. The proposed algorithm is described in Algorithm 1.
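The four steps of \({\mathscr {N}}\) and the accretion performed by \(\otimes \) can be sketched directly on the per-view eigenspace bases (ordered most-relevant first). The relevance-based scaling inside the normalisation step is not fully specified above, so plain column normalisation stands in for it (an assumption):

```python
import numpy as np

def integrate_bases(bases):
    """Recursive multi-kernel integration sketch operating directly on the
    per-view eigenspace bases (each n x k, ordered most-relevant first)."""
    b = bases[0]                      # accretive basis starts as the basis of 1H
    for U in bases[1:]:
        P = b @ (b.T @ U)             # projection of U onto the accretive subspace
        Q = U - P                     # residual, orthogonal to the accretive subspace
        R, _ = np.linalg.qr(Q)        # Gram-Schmidt orthogonalisation via reduced QR
        norms = np.linalg.norm(R, axis=0)
        norms[norms == 0] = 1.0       # guard against a zero residual column
        V = R / norms                 # orthogonalised-normalised basis (stand-in)
        b = b + V                     # accretion step of the integration operator
    return b
```

Rows of the returned matrix play the role of the rows of \(\mathbf{b}_{M}\), on which k-means is finally performed.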
Computational complexity
For the proposed algorithm, given M similarity matrices and Gramian matrices with n samples under study, the computation starts with constructing the degree matrix \(D_m\) for each of the M views. The complexity of this step is bounded by \(O(n^2)\) for each view. In the next step, the Laplacian matrix is constructed with a complexity of \(O(n^3)\). Let the number of iterations (regulated through parameter \(\beta \)) needed to learn the synergy matrix’s best composition in steps 12 to 16 be \(t_\beta \). It has been found that for the datasets used in this study, a value of \(t_{\beta }=10\) suffices: iterating \(\beta \) from 0 to 1 with an increment of 0.1 in each iteration can produce an optimal combination ratio for the representation matrices. Here, the increment step is referred to as \(\alpha \) for consistency. Assuming \(t_{max}\) is the maximum number of iterations taken by the k-means clustering algorithm, the complexity of the aforesaid steps becomes \(O(t_{\beta }n^3+t_{\beta }t_{max}nk^2+t_{\beta }n)\), where \(t_{\beta }n^3\) comes from the eigenvalue decomposition of the synergy matrix, \(t_{\beta }t_{max}nk^2\) from the k-means clustering step, and \(t_{\beta }n\) from the f-measure calculation. Therefore, the complexity of steps 12 to 16 turns out to be bounded by \(O(t_{\beta }n^3)\). Steps 17 to 19 perform the same processing as before, just at the optimal value of \(\beta \); hence, they are also bounded by \(O(t_{\beta }n^3)\). Summing up all the steps from 9 to 20 for M views, the complexity of \(O(Mn^2+Mn^3+Mt_{\beta }n^3)\) reduces to \(O(Mt_{\beta }n^3)\). Sorting can be done in \(O(M\log M)\). After that, an accretive basis is constructed as defined in the function INTEGRATE(\(\mathbf{b},\eta \)). Step 5 consists of the construction of \({\mathscr {P}}\), \({\mathscr {Q}}\), and the orthogonalised-normalised matrix V.
In this step, two matrix multiplication operations are bounded by a complexity of \(O(n^2k)\). The Gram-Schmidt orthogonalisation and normalisation steps combined have a complexity of \(O(n^2)\). Therefore, step 5 has a complexity of \(O(n^2k)\). Step 6 is a matrix addition with complexity O(nk), but step 5 dominates it. In addition, since the function runs \((M-1)\) times, the complexity of steps 21 to 23 becomes \(O(M\log M+Mn^2k)=O(Mn^2k)\). After the construction of the accretive basis, k-means is performed, which, as explained previously, has time complexity \(O(t_{max}nk^2)\). Considering everything, the overall complexity of RISynG comes out to be \(O(Mt_{\beta }n^3+Mn^2k+t_{max}nk^2) = O(Mt_{\beta }n^3)\).
Significance of proposed algorithm
There are some aspects of the proposed algorithm that enhance its performance and make it unique among the algorithms designed to identify cancer subtypes. Although each omic view in a cancer dataset has its distinct cluster structure, the knowledge of cancer biology suggests that no single omics source to which a view belongs can dictate the final cancer subtype alone. Instead, all the omics sources collectively manifest the cancer subtype in a sample. Therefore, multi-view integration is critical to a sensible and clinically relevant clustering. The proposed approach can be broken down into three operative steps: (1) construction of representation matrices for each view, (2) construction of a synergy matrix for each view, and (3) construction of an accretive basis through recursive multi-kernel integration of the synergy matrices. These steps make the proposed algorithm more effective in the following manner:

1.
Construction of representation matrices To group the cancer patients into clusters, each omic-view first has to be represented as similarity graphs. These similarity graphs can be interpreted through various representation matrices such as the Gramian, Laplacian, and Adjacency. Each representation matrix endows the samples’ similarity network with a notion of similarity between the samples. The proposed method assumes that multiple information sources may constitute different cluster manifolds when learned with a particular similarity measure. Therefore, the predicted clusters would be apparent and in strong concordance with the clinical clusters if pairwise sample similarity is computed in a data-dependent multi-kernel approach^{37}. In some views, correlation distance prominently reflected a cluster manifold that concurred with the natural clusters; some showed a proclivity towards Euclidean distance, while the rest seemed to accommodate both. All things considered, two different graph representation matrices have been formulated, the Gramian matrix and the Laplacian matrix, each with a different measure of similarity.

2.
Construction of synergy matrices Representation matrices so constructed have two noteworthy aspects: (1) \(G_m\) represents a similarity graph formed using correlation-based distance. In the correlation-based distance, two objects are considered similar if the trends among their elements are highly correlated. That means the correlation distance between two perfectly correlated samples will be 0, even though they are far apart in the Euclidean space of their dimension. It is instinctive to assume that omics data behave like that. (2) The Laplacian, on the other hand, preserves the intrinsic manifold structure of the data cast on a low embedding space. To integrate these representation matrices, a combination function has been devised that takes a convex combination of both matrices. This method of combining matrices rectifies any bias created by the dissimilarity in the distance measures used while constructing the similarity graphs. The combination function defined in (31) utilises the parameter \(\beta \in [0,1]\) to capture the graphs constituted by the Gramian and the Laplacian. Since \(\beta \) lies in [0, 1], the result is a convex combination of the representation matrices. This parameter’s optimal value is learnt by iterating it from 0 to 1 at some incremental step size \(\alpha \in (0,1)\). The datasets used in this study tend to pick up the optimal value of \(\beta \) at a step size of \(\alpha =0.1\). It is crucial to choose the incremental step size wisely, as the number of iterations \(t_{\beta }\) is directly proportional to the algorithm’s time complexity. Because the synergy matrix will ultimately affect the cluster assignment, the best way to evaluate the appropriate value of \(\beta \) is to perform a provisional cluster validity test on the synergy matrix constructed with that \(\beta \), using a cluster validity index such as the silhouette index. 
Steps 15 to 19 of Algorithm 1 formulate the described provisional cluster validity test using the silhouette as the criterion.

3.
Construction of accretive basis After the similarity between the cancer patients is captured in a refined form with the help of the synergy matrices, the next step is to integrate them. Property 1 of the synergy matrix proves that \(H_m\) is a positive semi-definite matrix, which makes the integration of synergy matrices a multi-kernel integration. The proposed algorithm performs a recursive multi-kernel integration by iteratively integrating the relevant subspace of each synergy matrix. Here, the relevant subspace refers to the subspace of the matrix that purely encodes the cluster information, which in the case of the synergy matrix is its eigenspace corresponding to the k largest eigenvalues. Finally, an accretive basis matrix is generated. This accretive matrix is required to accumulate more cluster information from the relevant views. Therefore, the orthogonalizing-normalizing function is designed such that the accretive basis at each recursive step is less influenced by irrelevant matrices.
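As a minimal sketch of step (1) above, the two representation matrices can be built as follows. The kernel choices (an exponential map of correlation-based distance for the Gramian-like matrix, and a Gaussian Euclidean graph for the normalized Laplacian) and the bandwidth parameter are illustrative assumptions, not the exact constructions used by RISynG.

```python
import numpy as np

def gramian_correlation(X):
    """Correlation-based similarity between samples (rows of X).
    The exponential map from distance to similarity is an illustrative
    choice, not necessarily the paper's kernel."""
    C = np.corrcoef(X)           # n x n Pearson correlation between samples
    D = 1.0 - C                  # correlation-based distance
    K = np.exp(-D)               # distance -> similarity
    return (K + K.T) / 2.0       # enforce exact symmetry

def normalized_laplacian(X, sigma=1.0):
    """Symmetric normalized Laplacian of a Gaussian (Euclidean) similarity
    graph over the samples; sigma is an assumed bandwidth."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared distances
    W = np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))
    deg_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(X)) - (deg_inv_sqrt[:, None] * W * deg_inv_sqrt[None, :])
```

Both matrices are symmetric by construction, and the normalized Laplacian of any non-negative similarity graph is positive semi-definite.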
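The provisional cluster-validity test of step (2) can be sketched as below: sweep the convex combination \(H=\beta G+(1-\beta )L\), cluster the top-k eigenspace of each candidate synergy matrix with k-means, and keep the \(\beta \) with the best silhouette. The function and parameter names are illustrative, and the exact scoring in Algorithm 1 may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_beta(G, L, k, alpha=0.1):
    """Sweep beta over [0, 1] in steps of alpha and return the beta whose
    synergy matrix's top-k eigenspace yields the best silhouette
    (an illustrative sketch of the provisional cluster-validity test)."""
    best_beta, best_sil = 0.0, -1.0
    for beta in np.arange(0.0, 1.0 + 1e-9, alpha):
        H = beta * G + (1.0 - beta) * L
        _, vecs = np.linalg.eigh(H)          # eigenvalues in ascending order
        U = vecs[:, -k:]                     # eigenspace of k largest eigenvalues
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
        sil = silhouette_score(U, labels)
        if sil > best_sil:
            best_beta, best_sil = beta, sil
    return best_beta, best_sil
```

With \(\alpha =0.1\), this sweep costs roughly \(t_{\beta }\) eigendecompositions, which is where the \(O(t_{\beta }n^3)\) term of the complexity analysis comes from.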
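Step (3), folding the per-view relevant subspaces into one accretive basis, can be sketched as follows. The additive update with a fixed weight `eta` and the QR-based Gram–Schmidt re-orthonormalization are simplifying assumptions standing in for the paper's INTEGRATE(\(\mathbf{b},\eta \)) function and its relevance-aware weighting.

```python
import numpy as np

def relevant_subspace(H, k):
    """Eigenspace of the k largest eigenvalues of a (PSD) synergy matrix:
    the subspace assumed to encode the cluster structure."""
    _, vecs = np.linalg.eigh(H)
    return vecs[:, -k:]

def integrate(bases, eta=0.5):
    """Recursively accrete each view's relevant subspace into one basis,
    orthogonalizing-normalizing (Gram-Schmidt via QR) at every step.
    The constant eta is an illustrative stand-in for the paper's
    relevance-dependent weighting."""
    V = bases[0]
    for B in bases[1:]:
        V, _ = np.linalg.qr(V + eta * B)
    return V
```

The returned basis is orthonormal, so k-means can be applied to its rows to obtain the final cluster assignment.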
Description of datasets
For analysing the efficiency of the proposed algorithm for identifying cancer subtypes, it is applied to five cancer datasets taken from TCGA (https://cancergenome.nih.gov/). The datasets used are Cervical cancer (CESC), Breast cancer (BRCA), Ovarian cancer (OV), Lower-grade glioma (LGG), and Stomach cancer (STAD). Different studies have identified 4 clinically important subtypes for BRCA^{9} and STAD^{38}, 3 for CESC^{39} and LGG^{40}, and 2 for OV^{41}. The cancer genome is neither simple nor independent but is complicated and dysregulated at multiple levels of the biological system: genomic, epigenomic, transcriptomic, and proteomic^{42}. miRNA, as one of the important regulators of gene expression, can be integrated with gene expression to identify the selective inhibition of translation or selective degradation^{43,44,45}. Furthermore, in terms of epigenetic regulation, histone modification or DNA methylation can serve to regulate gene expression in cancer^{46,47}. Also, protein expression data can be utilized for the diagnostic prognosis of cancer patients^{48}. Therefore, four omic views, namely gene expression (mRNA), microRNA expression (miRNA), DNA methylation (metDNA), and reverse-phase protein assays (RPPA), are utilized for the CESC, BRCA, and LGG datasets. For the STAD and OV datasets, only mRNA and miRNA expression are considered because metDNA and RPPA information is not available for most samples. To avoid involving features with too many missing values, features with more than 5% missing values are removed from all the omic views, and the remaining missing values are replaced with 0. Sequence-based expression data are log-transformed to make the data more or less normally distributed^{49}; therefore, the 0 entries in the miRNA and mRNA expression data are replaced with 1 and then log-transformed with base 10. For the metDNA datasets, beta values are considered. 
At last, variance filtering is applied to the mRNA and metDNA omic views of all the cancer datasets, and only the 2000 most variable genes and CpG locations are considered. Table 1 contains a description of the final processed data used in this study. The datasets selected for benchmarking cover a wide range of sample sizes, from 124 in CESC to 474 in OV. TCGA contains several platforms for individual data types; the platforms with the largest number of matching samples across the omics are selected in the present study. The proposed algorithm can be applied to other large-scale multi-omics datasets if available; the run time will increase with the sample size or the number of omic views, as shown in Fig. 2. With the increase in sample size from 124 to 474, the runtime increases from 0.22 to 0.47 s. Even though the BRCA dataset has fewer samples (398) than the OV dataset (474), the runtime for BRCA (0.56 s) is more than for OV (0.47 s) because of the number of omic-views involved, which is 4 for BRCA and 2 for OV.
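The preprocessing described above can be sketched as follows for an expression matrix with features on the rows and samples on the columns. The function name and the pandas-based implementation are illustrative, not the study's exact pipeline.

```python
import numpy as np
import pandas as pd

def preprocess_expression(df, top=2000):
    """Drop features with >5% missing values, fill remaining missing values
    with 0, shift zeros to 1 before a base-10 log transform, and keep the
    `top` most variable features (rows = features, columns = samples)."""
    keep = df.isna().mean(axis=1) <= 0.05        # fraction missing per feature
    df = df.loc[keep].fillna(0.0)
    df = np.log10(df.replace(0, 1))              # 0 -> 1, then log10 transform
    variances = df.var(axis=1)
    return df.loc[variances.sort_values(ascending=False).index[:top]]
```

For metDNA views the beta values would be used directly, so only the variance-filtering step applies there.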
Experimental results and discussion
The performance of the proposed approach is compared with eleven other algorithms available for cancer subtype identification. Both two-stage clustering approaches and integrative clustering approaches are considered for the comparison. The methods used for comparison are Similarity Network Fusion (SNF)^{13}, Weighted Multi-View Low-Rank Representation (WMLRR)^{50}, Consensus Clustering (CC)^{6,51}, Multi-view clustering approach with enhanced consensus (ECMC)^{52}, SNF.CC (SNF merged with CC)^{53}, Cluster of Cluster Assignment (COCA)^{9,54}, Consensus Non-negative Matrix Factorization (CNMF)^{55}, Selective Update of Relevant Eigenspaces (SURE)^{18}, Convex-combination of Approximate Laplacians (CoALa)^{19}, iCluster^{14}, and Multi-manifold Integrative Clustering (MiMIC)^{56}.
Performance analysis on multiomics cancer datasets
The proposed approach and the above-described methods are applied to five cancer datasets, namely CESC, BRCA, OV, LGG, and STAD, taken from TCGA. The sample clusters identified by these methods are evaluated based on several internal and external cluster evaluation indices. The cancer subtypes identified by these methods are also evaluated for their biological relevance. Next, a detailed comparative analysis of the proposed algorithm is discussed.
Cluster evaluation
The clusters (cancer subtypes) generated by all the methods are evaluated based on several internal and external cluster evaluation indices. These indices give an idea of how well a method can group the samples into homogeneous clusters. Samples belonging to the same cluster should have high similarity, representing a cancer subtype, whereas samples belonging to different clusters should be highly dissimilar. How well an algorithm captures the natural grouping present in the data can be quantified with internal validity indices. The following four internal evaluation indices are calculated in this study. Table 3 presents the internal evaluation indices for every method.

1.
Silhouette Index: It measures the consistency present in the clusters. The value lies in the range \([-1,1]\). A value nearer to \(+1\) indicates a higher distance between the clusters, a value of 0 indicates that the sample is very close to the boundary between two neighboring clusters, and a negative value indicates misclassification^{57}.
$$\begin{aligned} {\mathbb {S}}_c = \frac{1}{c} \sum _{k=1}^{c}S(\Upsilon _k), \end{aligned}$$(38)where \(S(\Upsilon _k)\) is the silhouette width of cluster \(\Upsilon _k\) \((k=1, \ldots ,c)\), calculated as \(S(\Upsilon _k)=\frac{1}{n_k}\sum _{x_i\in \Upsilon _k}s(x_i)\), where \(n_k\) is the cardinality of \(\Upsilon _k\) and \(s(x_i)\) is the silhouette width of sample \(x_i\). For every sample, the silhouette width is estimated as \(s(x_i)=\frac{b(i)-a(i)}{\max \{a(i),b(i)\}}\). Here, \(a(i)\) is the average dissimilarity of the \(i\)th object to all other objects in the same cluster, and \(b(i)\) is the average dissimilarity of the \(i\)th object to all objects in the closest cluster.

2.
Dunn Index: A higher value represents a better clustering solution^{58}. It is defined as:
$$\begin{aligned} DI = \underset{1\le i \le c}{{\text {min}}} \Big \{ \underset{\begin{array}{c} 1\le j \le c \\ j \ne i \end{array}}{{\text {min}}} \Big \{ {\frac{\delta (C_i,C_j)}{\underset{1\le k \le c}{{\text {max}}} \small \{\Delta (C_k)\}}} \Big \}\Big \} \end{aligned}$$(39)Here, \(\delta (C_i,C_j)\) is the distance between clusters \(C_i\) and \(C_j\), and \(\Delta (C_k)\) is the intra-cluster distance within cluster \(C_k\).

3.
Davies–Bouldin Index: It is defined as the ratio of within-cluster dispersion to between-cluster dispersion^{59}. A lower value indicates better clustering.
$$\begin{aligned} DB = \frac{1}{C} \sum _{i=1}^{C} D_i \end{aligned}$$(40)Here, \( D_{i} = \max _{{j \ne i}} R_{{i,j}} \) and \(R_{i,j} = \frac{S_i+S_j}{M_{i,j}}\), where \(M_{i,j}\) is the separation between the \(i\)th and the \(j\)th clusters, \(S_i\) and \(S_j\) are the within-cluster scatters of clusters i and j, and C is the number of clusters.

4.
Xie–Beni Index: The index for crisp clustering is estimated as:
$$\begin{aligned} \text {Xie{-}Beni} = \frac{1}{N} \frac{WGSS}{\underset{k < k'}{{\text {min}}}\, \acute{\delta } (C_k,C_{k'})^2} \end{aligned}$$(41)Here, \(\frac{1}{N} {WGSS}\) represents the average squared distance of all the points with respect to the barycenter of the cluster they belong to, and \(\acute{\delta }\) is a measure of the between-cluster distance^{60}.
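Two of the four internal indices above are available directly in scikit-learn; the snippet below illustrates their use and the direction of each score on a synthetic two-cluster example (Dunn and Xie-Beni require custom code or other packages).

```python
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic example: two well-separated Gaussian clusters in 4 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(5.0, 1.0, (50, 4)),
               rng.normal(-5.0, 1.0, (50, 4))])
labels = np.array([0] * 50 + [1] * 50)

sil = silhouette_score(X, labels)     # in [-1, 1]; nearer +1 is better
db = davies_bouldin_score(X, labels)  # >= 0; lower is better
```

On such well-separated clusters the silhouette is close to 1 and the Davies–Bouldin index close to 0, matching the interpretations given above.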
The class distribution of the cancer datasets used in this study is presented in Table 2. Except for the CESC dataset, all the other cancers have imbalanced classes. When clustering is applied to such datasets, there is a chance that most of the samples get clustered into one group, leading to good values for the internal indices even though, in reality, the clustering is not efficient. If the ground truth is available, the partitions created in such imbalanced data can be efficiently evaluated with external evaluation indices. In this study, five external evaluation indices are calculated to compare the clustering efficiency of the different algorithms. Considering a set of n objects \({{\mathbb {X}}}=\{{{\mathscr {X}}}_1, {{\mathscr {X}}}_2, \ldots ,{{\mathscr {X}}}_n\}\), suppose \({{\mathbb {C}}}=\{{{\mathscr {C}}}_1, {{\mathscr {C}}}_2, \ldots ,{{\mathscr {C}}}_R\}\) represents a partition of \({{\mathbb {X}}}\) obtained by a clustering algorithm and \({{\mathbb {K}}}=\{{{\mathscr {K}}}_1, {{\mathscr {K}}}_2,\ldots ,{{\mathscr {K}}}_C\}\) represents the ground truth or the class information. A contingency table is created to look for the overlap between the clustering result and the ground truth, where \(n_{ij}=|{{\mathbb {C}}}_{i}\cap {{\mathbb {K}}}_{j}|\) is the number of common elements in cluster \({{\mathbb {C}}}_{i}\) and class \({{\mathbb {K}}}_{j}\), \(n_i\) is the number of elements in \( {{\mathbb {C}}}_{i}\), and \(n_{j}\) is the number of elements in \({{\mathbb {K}}}_{j}\). The external indices are defined as:

1.
Fmeasure (FM): The idea of precision and recall from information retrieval is merged to obtain FM. It disregards the unmatched portions of the clusters. It can attain values ranging between 0 and 1. A value nearer to 1 represents better clustering^{61}.
$$\begin{aligned} FM = \sum _{j=1}^{C} \frac{n_j}{n} \, \underset{i=1 \cdot \cdot \cdot R}{{\text {max}}}\, \left[ \frac{2 \times \frac{n_{ij}}{n_i} \times \frac{n_{ij}}{n_j}}{\frac{n_{ij}}{n_i}+\frac{n_{ij}}{n_j}}\right] \end{aligned}$$(42) 
2.
Adjusted Rand Index (ARI): A commonly used variation of the Rand index that accounts for agreements arising by chance under a hypergeometric distribution. The lower bound of ARI depends on the exact data partitioning^{62}. The closer the value of ARI is to 1, the better the clustering.
$$\begin{aligned} ARI = \frac{\sum _{i=1}^{R} \sum _{j=1}^{C} \left( \begin{array}{c} n_{ij} \\ 2 \end{array}\right) - {\left( \begin{array}{c} n \\ 2 \end{array} \right) }^ {-1} \sum _{i=1}^{R} \left( \begin{array}{c} n_{i} \\ 2 \end{array} \right) \sum _{j=1}^{C} \left( \begin{array}{c} n_{j} \\ 2 \end{array} \right) }{\frac{1}{2} \left[ \sum _{i=1}^{R} \left( \begin{array}{c} n_{i} \\ 2 \end{array}\right) + \sum _{j=1}^{C} \left( \begin{array}{c} n_{j} \\ 2 \end{array} \right) \right] - \left( \begin{array}{c} n \\ 2 \end{array} \right) ^{-1} \sum _{i=1}^{R} \left( \begin{array}{c} n_{i} \\ 2 \end{array}\right) \sum _{j=1}^{C} \left( \begin{array}{c} n_{j} \\ 2 \end{array} \right) } \end{aligned}$$(43) 
3.
Normalized Mutual Information (NMI): The interdependencies between cluster number and cluster quality can be quantified by NMI. It is estimated as:
$$\begin{aligned} NMI({\mathbb {C}},{\mathbb {K}})=\frac{{\mathscr {I}}({\mathbb {C}},{\mathbb {K}})}{[{\mathscr {H}}({\mathbb {C}})+{\mathscr {H}}({\mathbb {K}})]/2} \end{aligned}$$(44)Here, \({\mathscr {I}}\) is the mutual information and \({\mathscr {H}}\) is the entropy. The value ranges from 0 to 1; a value nearer to 1 means better clustering^{63}.

4.
Jaccard Index: It measures the similarity between two sets, here the clustering solution and the class information. It is defined as:
$$\begin{aligned} J({\mathbb {C}},{\mathbb {K}})= \frac{|{\mathbb {C}} \cap {\mathbb {K}}|}{|{\mathbb {C}} \cup {\mathbb {K}}|} \end{aligned}$$(45)The higher the value of this index, the better the clustering.

5.
Purity: For estimating Purity, each cluster is first allocated to the class that occurs most frequently in it. The accuracy of this cluster-class allocation is then obtained by dividing the number of correctly assigned objects by the total number of objects^{63}. The equation for calculating Purity is:
$$\begin{aligned} Purity({\mathbb {C}},{\mathbb {K}})=\frac{1}{n}\sum _{i}\max _{j}|{{\mathscr {C}}}_i \cap {{\mathscr {K}}}_j| \end{aligned}$$(46)Purity ranges from 0 to 1; the closer the value is to 1, the better the clustering.
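Most of the external indices above can be computed directly from a clustering result and the ground truth; ARI and NMI are available in scikit-learn, while Purity follows from the contingency table exactly as defined above. The toy label vectors below are purely illustrative.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred):
    """Assign each cluster to its majority class via the contingency
    table, then score the fraction of correctly assigned samples."""
    M = contingency_matrix(labels_true, labels_pred)  # classes x clusters
    return M.max(axis=0).sum() / M.sum()

truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
ari = adjusted_rand_score(truth, pred)           # chance-corrected agreement
nmi = normalized_mutual_info_score(truth, pred)  # in [0, 1]
pur = purity(truth, pred)                        # 5/6 for this example
```

Here one sample of class 0 falls into the majority-class-1 cluster, so Purity is 5/6 while ARI and NMI penalize the disagreement more strongly.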
Based on these five external evaluation indices, it is observed that the proposed algorithm outperforms the others on the CESC, BRCA, LGG, and STAD datasets. OV cancer is the only case where the proposed approach does not work as well. When all the datasets are considered together to rank the clustering efficiency of all the algorithms under study across all the external indices, the proposed method stands first by attaining the maximum value 20 times out of 25. The execution times reported in Table 3 show that RISynG is faster than the other algorithms.
Importance of multiomics data integration
The proposed algorithm RISynG iteratively integrates the relevant subspace of each of the synergy matrices. The relevant subspace corresponds to the k largest eigenvectors of the synergy matrices that hold the cluster structure. To exhibit the significance of this iterative integration and the effectiveness of RISynG, it is compared with spectral clustering performed on the individual omics datasets. The results presented in Table 4 show that the proposed algorithm outperforms the individual omic-views on the CESC, BRCA, LGG, and STAD datasets for all the external cluster validity indices. On the OV dataset, RISynG performs best for F-measure, Jaccard, and Purity, whereas the miRNA view performs better for the ARI and NMI indices. The performance of RISynG is significantly higher than the best individual view in the case of the CESC, BRCA, and LGG datasets, irrespective of the index.
To express the cluster-holding capacity of the integrated subspace obtained by the proposed approach, scatter plots for the best k dimensions are plotted. The colours in the plots indicate the ground truth (cancer subtypes). Comparative plots are also presented in Figs. 3, 4, 5, 6, and 7 to show that the integrated subspace obtained by RISynG is more informative than those of the other subspace-based integrative-clustering approaches (SNF, SURE, CoALa, iCluster, WMLRR, and MiMIC) for most of the datasets. A comparison with the best individual omic-view (CESC: mRNA, BRCA: metDNA, OV: miRNA, LGG: metDNA, and STAD: miRNA) is also presented to establish the significance of the multi-omics data integration performed by the proposed approach. For the proposed approach, the scatter plots show that the clusters are well separated in the case of the CESC (Fig. 3) and LGG (Fig. 6) datasets. There is a slight overlap between two groups in BRCA (Fig. 4), but it is better than with the other methods. For the OV (Fig. 5) and STAD (Fig. 7) datasets, overlap between subtypes is observed in the subspace obtained by all the methods.
Biological analysis
Once the cancer subtypes are obtained, the molecular characteristics of the patient clusters are also evaluated to establish their biological relevance. To understand the varying expression of different biomarkers in different subtypes, differential expression analysis (DEA) of miRNAs and mRNAs is performed between the correctly identified groups of patients. A comparative analysis is performed between the true positives and true negatives obtained by all the algorithms. As there are three subtypes in the LGG and CESC datasets, DEA is performed between three pairs (considering all possible pairs). Similarly, the STAD and BRCA datasets have four subtypes, so DEA is performed for six pairs, and the OV dataset has two subtypes, so DEA is performed for one pair. The R package Limma^{64} is used to perform DEA. miRNAs and mRNAs with a Benjamini–Hochberg false discovery rate adjusted p-value \(< 0.05\) are considered differentially expressed. The numbers of differentially expressed biomarkers obtained from the different groups in the CESC, BRCA, OV, LGG, and STAD datasets are reported in Tables 5, 6, 7, 8, and 9, respectively. To further explore and highlight the biological knowledge and process-specific functioning of the identified sets of differentially expressed biomarkers, different types of enrichment analyses are also performed, considering the hundred most differentially expressed biomarkers in each case.
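The study itself performs DEA with the R package Limma; as a language-neutral illustration of the significance threshold used above, the Benjamini–Hochberg adjustment that converts raw p-values into FDR-adjusted ones can be sketched in a few lines of numpy. This stands in for, and is not, Limma's pipeline.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg FDR adjustment: scale the i-th smallest p-value
    by n/i, then enforce monotonicity from the largest down."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    adj = p[order] * n / np.arange(1, n + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]   # enforce monotone non-decrease
    out = np.empty(n)
    out[order] = np.clip(adj, 0.0, 1.0)
    return out
```

For example, `bh_adjust([0.01, 0.02, 0.03, 0.5])` returns `[0.04, 0.04, 0.04, 0.5]`, so the first three tests remain significant at an adjusted threshold of 0.05.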
Biological enrichment analyses
The first analysis is pathway enrichment analysis (PEA). It explores the mechanistic insight into the set of differentially expressed biomarkers and helps identify those biological pathways enriched in a set of biomarkers more than expected by chance. The second is biological process enrichment analysis (BPEA). It helps characterize the relationship between genes or miRNAs by specifically annotating them to associated biological processes, and it identifies the over-represented biological processes in the list, which can help evaluate the biological significance of the obtained cancer subtypes. The third is disease ontology enrichment analysis (DOEA). Disease Ontology (DO) helps map the cancer subtypes identified from high-throughput data to clinical relevance. In this study, the R package clusterProfiler^{65} and DIANA Tools mirPath v.3^{66} are used for performing PEA and BPEA for genes and miRNAs, respectively, and the R package DOSE^{67} is used to perform DOEA for the genes. The top 100 differentially expressed biomarkers are passed to these tools; if the number of differentially expressed biomarkers is less than 100, then all of them are used. The KEGG database is selected for PEA^{68}. Only the pathway terms associated with the set of biomarkers with a false discovery rate adjusted p-value \(< 0.05\) (significant pathway terms) are considered. If a set of differentially expressed biomarkers is not associated with any significant KEGG pathway term, that set is said to be not biologically relevant with respect to KEGG pathway terms. Similarly, only the biological process (BP) terms associated with the set of biomarkers with a false discovery rate adjusted p-value \(< 0.05\) (significant BP terms) are considered. If a set of differentially expressed biomarkers is not associated with any significant BP term, that set is said to be not biologically relevant with respect to BP terms. 
In DOEA, semantic similarities between DO terms and genes are calculated, which help explore the similarities of diseases and gene functions from a disease perspective. The output of DOEA has associated disease terms. A gene set is said to be enriched with DO terms if the terms obtained by its DOEA have a false discovery rate corrected p-value \(<0.05\).
For the quantification of PEA, BPEA, and DOEA, the respective enrichment scores^{69} and annotation ratios^{69} are calculated. The higher the value of these scores, the better the enrichment; hence, the more biologically significant the differentially expressed biomarkers are, the better the cancer subtyping. These scores are defined in terms of the following quantities: T denotes the number of significant pathway/BP/DO terms associated with a set of differentially expressed genes or miRNAs between two cancer subtypes identified by any clustering approach, G denotes the total number of genes given to clusterProfiler for the enrichment analysis, and g denotes the gene count associated with a pathway/BP/DO term. A comparative analysis of the cancer subtypes obtained by the proposed approach and the other existing algorithms is performed, and the associated quantitative indices are reported in Tables 5, 6, 7, 8, and 9. Some of the differentially expressed miRNA or mRNA sets have no associated significant terms, so the quantitative indices cannot be calculated for them, and in some cases there are no differentially expressed biomarkers at all. All these cases are represented by \(*\) in the tables.
To compare the effectiveness of the proposed approach with the other algorithms in this study, the overall performance of all the methods is also evaluated. When all five cancer datasets are considered together, the proposed approach outperforms with respect to both the cluster evaluation indices and the biological enrichment analyses, as shown in Fig. 8. The analysis is performed by considering the success frequency (the number of times a method scored the highest value for the respective indices across all the cases in all the cancer types). The success frequency shows that the proposed approach outperforms when the cluster validity indices are considered, scoring the maximum value 21 times, followed by SNF.CC (7), SNF (6), CNMF (5), CC (2), COCA (2), and WMLRR (1). Similarly, if the methods are ranked by the success frequency for the quantitative indices calculated for the biological enrichment analyses, the proposed approach again stands first by scoring the maximum value 67 times, followed by SNF (21), SNF.CC (20), CC (12), CoALa (10), CNMF (9), MiMIC (7), SURE (5), WMLRR (5), COCA (4), and iCluster (1). When the cluster validity indices are considered individually, the proposed approach also outperforms with respect to F-measure, ARI, NMI, the Jaccard index, and Purity. Considering the indices for biological enrichment individually, the proposed algorithm again outperforms with respect to all the indices except the annotation ratio of BPEA for mRNA enrichment, where it stands second.
Overlap analysis
The hundred most differentially expressed genes between all the subtype-pairs in cervical cancer identified by RISynG and the other methods are explored further for experimental support. The genes are analyzed based on their degree of overlap with known, experimentally validated cervical cancer genes. The Cervical Cancer Gene Database (CCDB)^{70}, a manually curated catalog of experimentally validated genes involved in the different stages of cervical carcinogenesis, is used for finding the overlap. All the up-regulated and down-regulated genes in cervical cancer with evidence from the published literature available in CCDB are considered for this analysis. 367 genes reported in CCDB are differentially expressed in cervical cancer; this list contains 185 of the 2000 genes used for cancer subtype identification in this study. The statistical significance of the overlap analysis is reported in Table 10. In total, 30 of the 222 genes identified by the proposed approach overlap with cervical cancer-related genes, which is the maximum overlap among the compared methods. Fisher's exact test is used to assess the statistical significance of the contingency tables created from the overlap analysis in Table 10 for the different algorithms. At 95% confidence, it is observed that only the genes identified by the proposed approach have a significant overlap with the experimentally validated genes curated from the literature, with a p-value of 0.026. This indicates that the proposed approach has the potential to identify clinically important subtypes of cancer that have a characteristic molecular signature.
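The overlap test can be reproduced in outline with scipy, using the counts reported above (30 of 222 identified genes overlapping with CCDB, and 185 CCDB genes among the 2000 analysed). The 2x2 table layout and the one-sided alternative are assumptions for illustration; the exact test configuration used in the study may differ.

```python
from scipy.stats import fisher_exact

# 2x2 contingency table (layout assumed for illustration):
# rows = identified / not identified by the method,
# columns = in CCDB / not in CCDB, out of the 2000 genes analysed.
a = 30                  # identified and in CCDB
b = 222 - a             # identified, not in CCDB
c = 185 - a             # in CCDB, not identified
d = 2000 - 222 - c      # neither
odds, p = fisher_exact([[a, b], [c, d]], alternative="greater")
```

An odds ratio above 1 with a small p-value indicates more overlap with the curated cervical cancer genes than expected by chance.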
Conclusion
The present study describes a method named RISynG that efficiently identifies cancer subtypes. Cancer subtype identification can facilitate cancer diagnosis and therapy and is one of the vital components of the precision-medicine framework. The main contributions of this study are: (1) development of an integrative clustering method for multi-view omics data; (2) demonstration of the effectiveness of the proposed method over other methods; and (3) establishment of the biological relevance of the obtained results.
Data availability
The python scripts for RISynG and the preprocessed samplematched datasets are available at http://home.iitj.ac.in/~sushmitapaul/CBL/code/RISynG.zip.
References
Stingl, J. & Caldas, C. Molecular heterogeneity of breast carcinomas and the cancer stem cell hypothesis. Nat. Rev. Cancer 7, 791–799 (2007).
Liang, M., Li, Z., Chen, T. & Zeng, J. Integrative data analysis of multiplatform cancer data with a multimodal deep learning approach. IEEE/ACM Trans. Comput. Biol. Bioinform. 12, 928–937 (2015).
Tomczak, K., Czerwińska, P. & Wiznerowicz, M. The Cancer Genome Atlas (TCGA): An immeasurable source of knowledge. Contemp. Oncol. 19, A68–A77 (2015).
Sørlie, T. et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl. Acad. Sci. U.S.A. 98, 10869–10874 (2001).
Bhattacharjee, A. et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. U.S.A. 98, 13790–13795 (2001).
Monti, S., Tamayo, P., Mesirov, J. & Golub, T. Consensus clustering: A resamplingbased method for class discovery and visualization of gene expression microarray data. Mach. Learn. 52, 91–118 (2003).
Teschendorff, A. E., Miremadi, A., Pinder, S. E., Ellis, I. O. & Caldas, C. An immune response gene expression module identifies a good prognosis subtype in estrogen receptor negative breast cancer. Genome Biol. 8, R157 (2007).
Zhang, W., Feng, H., Wu, H. & Zheng, X. Accounting for tumor purity improves cancer subtype classification from DNA methylation data. Bioinformatics 33, 2651–2657 (2017).
Network, C. G. A. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
Network, C. G. A. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012).
Hoadley, K. A. et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 158, 929–944 (2014).
Gabasova, E., Reid, J. & Wernisch, L. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets. PLoS Comput. Biol. 13, e1005781 (2017).
Wang, B. et al. Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11, 333–337 (2014).
Shen, R., Olshen, A. B. & Ladanyi, M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 25, 2906–2912 (2009).
Shen, R. et al. Integrative subtype discovery in glioblastoma using iCluster. PLoS One 7, e35236 (2012).
Zhang, W. et al. Integrating genomic, epigenomic, and transcriptomic features reveals modular signatures underlying poor prognosis in ovarian cancer. Cell Rep. 4, 542–553 (2013).
Wu, D., Wang, D., Zhang, M. Q. & Gu, J. Fast dimension reduction and integrative clustering of multiomics data using lowrank approximation: Application to cancer molecular classification. BMC Genom. 16, 1–10 (2015).
Khan, A. & Maji, P. Selective update of relevant eigenspaces for integrative clustering of multimodal data. IEEE Trans. Cybern. 1–13 (2020).
Khan, A. & Maji, P. Approximate graph Laplacians for multimodal data clustering. IEEE Trans. Pattern Anal. Mach. Intell. (2019).
Xu, T. et al. Identifying cancer subtypes from miRNATFmRNA regulatory networks and expression data. PLoS One 11, e0152792 (2016).
Jiang, L., Xiao, Y., Ding, Y., Tang, J. & Guo, F. Discovering cancer subtypes via an accurate fusion strategy on multiple profile data. Front. Genet. 10, 20 (2019).
Long, B., Yu, P. S. & Zhang, Z. A General model for multiple view unsupervised learning. In Proceedings of the 2008 SIAM International Conference on Data Mining 822–833 (SIAM, 2008).
Xia, T., Tao, D., Mei, T. & Zhang, Y. Multiview spectral embedding. IEEE Trans. Syst. Man. Cybern. Part B Cybern. 40, 1438–1446 (2010).
Zhou, D. & Burges, C. J. Spectral clustering and transductive learning with multiple views. In Proceedings of the 24th International Conference on Machine Learning 1159–1166 (ACM, 2007).
Zhang, C. et al. Generalized latent multiview subspace clustering. IEEE Trans. Pattern Anal. Mach. Intell. 42, 86–99 (2020).
Li, X., Zhang, H., Wang, R. & Nie, F. Multiview clustering: A scalable and parameter-free bipartite graph fusion method. IEEE Trans. Pattern Anal. Mach. Intell. 44, 330–344 (2022).
Gao, Q. et al. Enhanced tensor RPCA and its application. IEEE Trans. Pattern Anal. Mach. Intell. 43, 2133–2140 (2021).
Jha, V. N. Study on Hermitian, Skew-Hermitian and unitary matrices as a part of normal matrices. Int. J. Open Inf. Technol. 4, 2307–8162 (2016).
Collins, M., Dasgupta, S. & Schapire, R. E. A generalization of principal component analysis to the exponential family. In NIPS’01: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic 617–624 (2001).
Schölkopf, B., Mika, S., Smola, A., Rätsch, G. & Müller, K.-R. Kernel PCA pattern reconstruction via approximate pre-images. In International Conference on Artificial Neural Networks 147–152 (Springer, 1998).
Raykar, V. C. Spectral Clustering and Kernel Principal Component Analysis are Pursuing Good Projections. Project Report (2004).
Schölkopf, B., Smola, A. & Müller, K. R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10, 1299–1319 (1998).
Welling, M. Kernel principal components analysis. Adv. Neural. Inf. Process. Syst. 15, 70–72 (2005).
Xu, M. & Fränti, P. A heuristic k-means clustering algorithm by kernel PCA. In 2004 International Conference on Image Processing (ICIP ’04), vol. 5, 3503–3506 (2004).
von Luxburg, U. A Tutorial on Spectral Clustering (2007). arXiv:0711.0189.
Ng, A. Y., Jordan, M. I. & Weiss, Y. On spectral clustering: Analysis and an algorithm. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, 849–856 (MIT Press, 2001).
Gönen, M. & Alpaydın, E. Multiple kernel learning algorithms. J. Mach. Learn. Res. 12, 2211–2268 (2011).
The Cancer Genome Atlas Research Network. Clinical significance of four molecular subtypes of gastric cancer identified by the Cancer Genome Atlas project. Clin. Cancer Res. (2017).
The Cancer Genome Atlas Research Network. Integrated genomic and molecular characterization of cervical cancers. Nature 543, 378–384 (2017).
The Cancer Genome Atlas Research Network. Comprehensive, integrative genomic analysis of diffuse lower-grade gliomas. N. Engl. J. Med. 372, 2481–2498 (2015).
Matsuno, R. K. et al. Agreement for tumor grade of ovarian carcinoma: Analysis of archival tissues from the surveillance, epidemiology and end results residual tissue repository. Cancer Causes Control 24, 749–757 (2013).
Huang, T., Yang, J. & Cai, Y. D. Novel candidate key drivers in the integrative network of genes, microRNAs, methylations, and copy number variations in squamous cell lung carcinoma. BioMed Res. Int. (2015).
Borel, C. et al. Identification of cis- and trans-regulatory variation modulating microRNA expression levels in human fibroblasts. Genome Res. 21, 68–73 (2011).
Lu, J. & Clark, A. Impact of microRNA regulation on variation in human gene expression. Genome Res. 22, 1243–1254 (2012).
Liu, F., Dong, H., Mei, Z. & Huang, T. Investigation of miRNA and mRNA co-expression network in ependymoma. Front. Bioeng. Biotechnol. 8, 177 (2020).
Dudziec, E., Gogol-Döring, A., Cookson, V., Chen, W. & Catto, J. Integrated epigenome profiling of repressive histone modifications, DNA methylation and gene expression in normal and malignant urothelial cells. PLoS One 7, e32750 (2012).
McMahon, K. W., Karunasena, E. & Ahuja, N. The roles of DNA methylation in the stages of cancer. Cancer J. (Sudbury, Mass.) 23, 257–261 (2017).
Kim, T., Jeong, H. & Sohn, K. Topological integration of RPPA proteomic data with multi-omics data for survival prediction in breast cancer via pathway activity inference. BMC Med. Genom. 12, 1–14 (2019).
Zwiener, I., Frisch, B. & Binder, H. Transforming RNA-seq data to improve the performance of prognostic gene signatures. PLoS One 9, e85150 (2014).
Sun, Y., Ou-Yang, L. & Dai, D.-Q. WMLRR: A weighted multi-view low-rank representation to identify cancer subtypes from multiple types of omics data. IEEE/ACM Trans. Comput. Biol. Bioinf. 18, 2891–2897 (2021).
Wilkerson, M. D. & Hayes, D. N. ConsensusClusterPlus: A class discovery tool with confidence assessments and item tracking. Bioinformatics 26, 1572–1573 (2010).
Cai, M. & Li, L. Subtype identification from heterogeneous TCGA datasets on a genomic scale by multi-view clustering with enhanced consensus. BMC Med. Genom. 10, 65–79 (2017).
Xu, T. et al. CancerSubtypes: An R/Bioconductor package for molecular cancer subtype identification, validation and visualization. Bioinformatics 33, 3131–3133 (2017).
Cabassi, A. & Kirk, P. D. W. Multiple Kernel Learning for Integrative Consensus Clustering of Omic Datasets. arXiv preprint (2019).
Brunet, J. P., Tamayo, P., Golub, T. R. & Mesirov, J. P. Metagenes and molecular pattern discovery using matrix factorization. PNAS 101, 4164–4169 (2004).
Khan, A. & Maji, P. Multi-manifold optimization for multi-view subspace clustering. IEEE Trans. Neural Netw. Learn. Syst. 1–13 (2021).
Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
Bezdek, J. C. & Pal, N. R. Cluster validation with generalized Dunn’s indices. In Proceedings 1995 Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems 190–193 (IEEE, 1995).
Davies, D. L. & Bouldin, D. W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1, 224–227 (1979).
Xie, X. & Beni, G. A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell. 13, 841–846 (1991).
de Souto, M. C. P. et al. A comparison of external clustering evaluation indices in the context of imbalanced data sets. In 2012 Brazilian Symposium on Neural Networks (2012).
Hubert, L. J. & Arabie, P. Comparing partitions. J. Classif. 2, 193–218 (1985).
Wang, Q., Dou, Y., Liu, X., Lv, Q. & Li, S. Multi-view clustering with extreme learning machine. Neurocomputing 214, 483–494 (2016).
Smyth, G. K. Limma: Linear models for microarray data. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor 397–420 (Springer, 2005).
Yu, G., Wang, L., Han, Y. & He, Q. clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS J. Integr. Biol. 16, 284–287 (2012).
Vlachos, I. S. et al. DIANA-miRPath v3.0: Deciphering microRNA function with experimental support. Nucleic Acids Res. 43, W460–W466 (2015).
Yu, G., Wang, L. G., Yan, G. & He, Q. Y. DOSE: An R/Bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics 31, 608–609 (2015).
Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
Paul, S. & Madhumita. RFCM3: Computational method for identification of miRNA–mRNA regulatory modules in cervical cancer. IEEE/ACM Trans. Comput. Biol. Bioinform. 17, 1729–1740 (2020).
Agarwal, S. M., Raghav, D., Singh, H. & Raghava, G. CCDB: A curated database of genes involved in cervix cancer. Nucleic Acids Res. 39, D975–D979 (2011).
Acknowledgements
This work is partially supported by the seed grant program of the Indian Institute of Technology Jodhpur, India (Grant no. I/SEED/SPU/20160010). The authors acknowledge Dr. Sukhendu Ghosh, Department of Mathematics, Indian Institute of Technology Jodhpur, for fruitful discussions.
Author information
Contributions
S.P. conceived and designed the research. M. and A.D. designed the algorithm, performed experiments, analyzed data, and interpreted the results of the experiments. All the authors drafted the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Madhumita, Dwivedi, A. & Paul, S. Recursive integration of synergised graph representations of multi-omics data for cancer subtypes identification. Sci. Rep. 12, 15629 (2022). https://doi.org/10.1038/s41598-022-17585-2