## Introduction

Cancer is a highly heterogeneous disease and is known to differ among patients. This heterogeneity renders one cancer type to be composed of multiple subtypes, which are characterized by different molecular features. Clinical identification of these cancer subtypes is currently one of the major challenges in cancer research. Identifying the subtypes can provide an understanding of the underlying molecular mechanisms and thereby design precise treatment strategies for efficient cancer management. In recent years, advances in high-throughput sequencing technologies have generated large amounts of data on various cancer types. For example, The Cancer Genome Atlas (TCGA) contains multi-omics data, including gene expression, mutation, methylation, and copy number, of over 34 cancer types. These multi-omics data allow improvements in cancer subtyping via computational methods1,2,3. However, most studies do not identify the cancer subtypes based on differences in the molecular systems, but they are based only on the differences in the numerical patterns of the omics data.

Network representation of molecule-to-molecule relationships is a key to understanding a fundamental molecular system, and it plays an important role in understanding each biological process and the molecular mechanisms of cancer4. This idea can also be applied to obtain an understanding of complex human diseases, and is known as network medicine where the diseases are rarely caused by single molecular defects but are more likely driven by combinations of various biological processes4,5,6,7,8,9,10,11. This concept has already been employed in recent cutting-edge research for discovering cancer-related genes12,13,14. Therefore, knowledge of such networks could also be a promising data source for cancer subtyping. Some well-known types of biological networks are gene regulatory networks and protein–protein interaction networks15. Although the importance of molecular systems has been shown in recent years, only a few studies have incorporated the knowledge of molecular networks into their clustering processes16,17,18. However, these methods do not sufficiently express the molecular systems for two reasons. First, the networks used do not contain a large number of genes that are supposed to be expressed in cells. In fact, only those genes that are already known to be involved in certain cancer types have been included in the networks16,17,18. Second, the networks used are constructed from public databases that do not include condition-dependent networks19,20. Recent studies have revealed that biological networks vary between normal and the disease states21,22. Because genetic interactions are condition-specific, the networks of particular types of cancers are different from those found in these databases.

In this study, we propose a novel method to identify cancer subtypes by incorporating differences in the molecular systems of the patients. Because a gene network involves gene–gene regulatory relationships and is a fundamental network among the various molecular networks, it can be an adequate representation of molecular systems for our purpose. Therefore, our proposed method is based on the estimated gene network from the gene expression data of patients. The main scheme of the proposed method is illustrated in Fig. 1. Briefly, our method estimates a gene network from a gene expression dataset in the tumors of patients using a Bayesian network. A numerical value is then calculated for every edge of the estimated gene network with respect to each patient’s sample. This edge value, also known as the edge contribution value (ECv), is derived by evaluating the contribution of the edge to an expression value with respect to a patient in terms of the estimated molecular system23. Therefore, differences in ECvs reflect differences in the molecular systems of particular patients. This calculation of ECvs generates a matrix of numerical values consisting of patient-specific networks. Finally, hierarchical clustering was performed using the matrix to categorize patients into subtypes. This simple clustering allows the identification of various subtypes, incorporating complex patient-specific activities of their molecular systems, which cannot be captured by the existing identification methods using the multi-omics data of patients. We used this method to analyze three cancer types from TCGA datasets, namely stomach adenocarcinoma, lung cancer including lung adenocarcinoma and lung squamous cell carcinoma, and breast cancer. Consequently, all three cancer types were grouped into three major novel subtypes, which defined their differential prognoses and distinct molecular properties. Our method identified system-based cancer subtypes using only transcriptome data, which is more accessible compared to other omics data. Additionally, the proposed method allowed for the extraction of subnetworks to explain the features of the identified subtypes. Collectively, our findings indicate that the proposed method can successfully incorporate cancer-specific gene networks and establish a novel cancer subtype identification that overcomes the limitations of other sophisticated clustering methods, which are based on gene expression data alone.

## Methods

In this chapter, we first introduce the gene network estimation method using a Bayesian network. We then explain the ECvs that allow us to quantify the patient-specific characteristics of the gene networks. Finally, the method for subtype identification of cancer patients based on their ECvs is described.

### Bayesian network with B-spline nonparametric regression

To define the transcriptomic molecular networks, or gene networks, a Bayesian network with B-spline nonparametric regression model is used24. A Bayesian network (BN) is a graphical model that represents the conditional independencies among variables as a directed acyclic graph. It can be considered as a hypothetical cause-and-effect regulatory relationship among genes. By representing the gene expressions as the random variables, we can estimate the system-level regulatory relationships from the transcriptome data. Many successful studies have reported on the use of BN for gene network analysis25,26,27,28. Assuming p genes, the joint density of the gene expressions in a BN is described as

\begin{aligned} f(x_{i1},\ldots ,x_{ip}; \theta _G) = \prod _{j=1}^{p}f(x_{ij}|pa_{ij}^G;\theta _j), \end{aligned}
(1)

where $$x_{ij}$$ represents the gene expression of the j-th gene at the i-th sample, $$\theta _G$$ is the parameter vector of the BN represented by G, $$pa_{ij}^G=(pa_{i1}^{(j)},\ldots ,pa_{i,q_j}^{(j)})$$ denotes the gene expression vector of $$q_j$$ parents of the j-th gene in the i-th sample, and $$\theta _j$$ is the parameter vector for the local density with respect to the j-th gene. The optimal structure of the network is obtained by the maximization of the posterior probability given the observed data as

\begin{aligned} p(G|X) \propto \pi (G)\int \prod _{i=1}^{n}f(x_{i1},\ldots ,x_{ip};\theta _G)\pi (\theta _G|\lambda )d\theta _G, \end{aligned}
(2)

where X is the observed data matrix, $$\pi (G)$$ is the prior probability of G, n is the number of samples in X, $$\pi (\theta _G|\lambda )$$ denotes the prior distribution of $$\theta _G$$, and $$\lambda$$ is the hyperparameter vector. The drawback of the BN is that obtaining the optimal structure for a given dataset is NP-hard. Therefore, the neighbor node sampling and repeat (NNSR) algorithm was used29.

### Identification of patients’ subtypes based on their molecular networks

Tanaka et al.23 proposed the edge contribution value (ECv) to extract subnetworks from Bayesian networks, related to specific differences observed for in vitro experiments. Briefly, the B-spline nonparametric BN assumes that the gene expression is modeled as

\begin{aligned} x_{ij} = m_1^{(j)}(pa_{i1}^{(j)}) + \ldots + m_{q_j}^{(j)}(pa_{i,q_j}^{(j)}) + \epsilon , \end{aligned}
(3)

where $$m_k^{(j)}(pa_{ik}^{(j)})$$ is a regression function using B-spline curves for the k-th parent of the j-th gene, and $$\epsilon$$ is the error term. Because a value of this regression function $$m_k^{(j)}(pa_{ik}^{(j)})$$ can be considered as a contribution of an edge from the k-th parent to the j-th gene with respect to the i-th sample, Tanaka et al.23 defined ECv as

\begin{aligned} \mathrm {ECv}_{(i)}(j_{k} \rightarrow j) = m_k^{(j)}(pa_{ik}^{(j)}), \end{aligned}
(4)

where $$j_k$$ represents the index of the k-th parent of the j-th gene. Tanaka et al.23 considered the differences of ECvs as $$\Delta$$ECv between the control and the TGF$$\beta$$-treated samples, which extract the distinctive edges with a certain threshold for $$\Delta$$ECv. These were defined as the subnetworks characterizing the EMT in lung cancer cell lines. Here, we propose an algorithm that uses ECvs to characterize the patients (samples) and elucidate cancer subtypes. Using ECvs as quantified gene networks, patients with similar molecular systems would have similar ECvs, while patients with different molecular systems would have different quantified networks. In this context, clustering based on the molecular system differences enables us to identify cancer subtypes. This requires the gene network estimation from the gene expression data of patients and the calculation of ECvs values for the estimated edges.

### Identification of cancer subtypes

Assuming E edges in the estimated gene network, the patient’s quantified network is defined as a vector of E elements $$(c_{i\,1},\ldots , c_{i\,E})$$, where $$c_{i\,v}$$ is an ECv of the v-th edge for the i-th patient. The quantified networks of all the patients are collected and used to construct the columns of a matrix, resulting in an ECv matrix whose (iv) element corresponds to an ECv of the v-th edge for the i-th patient. Using clustering, the patients of this ECv matrix are clustered according to the differences and similarities between their gene networks. As discussed by Tanaka et al.23,30, the use of edges in the estimated network is different in different patients, and such differences of usages can be represented as different edge values using our ECvs. They also discussed that our Bayesian network model can express various patterns of usage or can capture the status of the cellular systems of patients23,30. Except for ECv, there are no topological features for every edge for each patient in the network. Therefore ECvs are suitable for ranking edges and our patient-specific networks based on the ECv matrix are sufficiently different from each other to represent their differences, even though these networks are derived from the same basal gene network. The ECv matrix consists of all the edges of the network, including approximately 20,000 genes and 150,000 edges. Since the majority of the edges do not represent differences in terms of ECv, parts of the edges are selected prior to clustering. To select the edges for hierarchical clustering based on the ECv matrix, the variance of each edge among patients is used as the ranking edges to represent the differences across the samples. The top N edges showing large variances will be selected. Therefore, hierarchical clustering is performed for the ECv matrix, consisting of the selected N edges, for the categorizing of patients into the different cancer subtypes.

### Extraction of edges

Although hierarchical clustering categorizes patients into cancer subtypes, the part of the network that is affected by the clustering result is unknown. Therefore, distinctive edges with significant ECv differences need to be extracted, as in Tanaka et al.23. In their study, they extracted the edges by calculating the $$\Delta$$ECv between two conditions. However, since their method cannot be applied for more than two groups, we extended their scheme. The following proposed method allows us to extract distinctive edges with significant ECv differences using ECvs in multiple groups. Suppose that there are M groups of patients $$R_1,\ldots ,R_M$$. We define $${\tilde{\Delta }}\mathrm {ECv}$$ with respect to group $$R_r$$ out of these M groups as

\begin{aligned} {\tilde{\Delta }}\mathrm {ECv}^{R_r} (j_k \rightarrow j) = \left| \frac{1}{|R_r|}\sum _{i\in R_r} \mathrm {ECv}_{(i)}(j_k \rightarrow j) - \frac{1}{\sum _{s\ne r} |R_s|} \sum _{t \in R_s,s\ne r} \mathrm {ECv}_{(t)}(j_k \rightarrow j)\right| . \end{aligned}
(5)

The $${\tilde{\Delta }}\mathrm {ECv}$$ of every single edge is then calculated with respect to each subtype, where significant $${\tilde{\Delta }}\mathrm {ECv}$$ edges are regarded as distinctive edges of specific subtypes.

## Results

### Dataset

In this study, our proposed method was applied to TCGA RNA-seq datasets of four types of cancer: stomach adenocarcinoma (STAD)31; lung cancer, including lung adenocarcinoma (LUAD)32 and lung squamous cell carcinoma (LUSC)33; and breast cancer (BRCA)34,which project is supported by the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI). LUAD and LUSC were regarded as one dataset and referred to LUNG as they are both lung cancers, and we test whether they are split into different subtypes. These datasets were preprocessed as described in Supplementary information (Supplementary S2.1).

### Identification of patients’ subtypes based on ECv matrix

We calculated the ECv of every single edge in the estimated network for each patient, respectively. This ECv calculation generated a matrix of numerical values consisting of patient-specific molecular systems. To select the edges for hierarchical clustering based on the ECv matrix, the variances of edges were calculated as described in Method section. For TCGA datasets, we selected the top $$N=250$$ edges with the highest variances in ECv among the patients. These 250 edges corresponded to approximately 0.01% of the total number of edges. Categorizing the patients by the small number of edges supposed to be a potentially better identification method. To categorize the patients into network-based subtypes, we performed hierarchical clustering for the ECv matrix consisting of the selected edges in three types of cancer. The ECv heatmap revealed a high variance among the intrinsic ECv matrix of the samples (Fig. 2a, Fig. S1, Fig. S8). The clustering results of the ECv heatmap indicate that patients of each cancer type were categorized into three major subtypes, namely subtype 1, subtype 2, and subtype 3, according to the similarities and differences between the patient networks (Fig. 2a, Fig.S1, Fig. S8). In our dataset, $$N=250$$ was approximately the minimum number of edges that produced biologically and clinically meaningful results, as we described in Result section later. For the comparison, we also performed hierarchical clustering for the RNA-seq datasets (Fig. 2b), which is discussed in the later section.

### Extraction subtype-specific edges

The $${\Delta }\mathrm {ECv}$$ value was calculated for every single edge in each subtype across three types of cancer. Edges with a high represent significant differences between subtypes. The distribution of suggested that only limited edges showed significant differences (Fig. 2c–e, Fig. S2). Based on the distributions of $${\tilde{\Delta }}\mathrm {ECv}$$, the top 1.0% of the total edges in the estimated network were found to differ significantly. Therefore, we selected the corresponding edges from each subtype and removed any edges that were also selected for other subtypes (Fig.  2f, Fig. S3). We denoted the extracted edges as subtype-specific edges. Networks consisting of these subtype-specific edges were considered as a subnetwork characterizing the identified subtypes.

### Applications in TCGA stomach cancer datasets

Stomach cancer is one of the most common leading causes of cancer-related death worldwide35. In the original TCGA paper, the authors demonstrated that stomach cancer is a heterogeneous disease with four molecular subtypes—Epstein-Barr virus (EBV), microsatellite instability (MSI), genomically stable (GS), and chromosomal instability (CIN)31. According to the paper, the EBV subtype is enriched for high EBV burden and showed hypermethylation of the DNA, the MSI subtype showed elevated mutation rates and hypermethylation, the CIN subtype showed somatic copy-number aberrations, and the GS subtype did not show any somatic copy-number aberrations31. These subtypes were identified using iCluster and are based on the six platforms of the multi-omics molecular signature: somatic mutation, mRNA expression, miRNA expression, promoter methylation, somatic copy number alteration, and protein expression31. However, as previously mentioned31, no significance was observed between the prognoses of the subtypes (log-rank test p-value = 0.10 > 0.05) (Fig. 3a). This suggests that the multi-omics-based subtypes did not account for the clinical significance, such that subtyping may not provide an opportunity to improve therapeutic treatments. We hypothesized that multi-omics data provide limited information on tumor subtyping. Rather, the differences in molecular systems may explain the differences in patients’ prognoses.

To address this issue, we applied the proposed method to the preprocessed RNA-seq datasets of STAD. As described in Result section above, the clustering results of the ECv heatmap indicate that stomach cancer is grouped into three major subtypes: subtype 1 (113 samples), subtype 2 (76 samples), and subtype 3 (173 samples) (Fig. 2a). To determine the relationship between the existing multi-omics-based subtypes and our identified subtypes, we summarized the number of patients across them (Table 1) and found that our subtyping was different from the multi-omics-based subtypes. These findings suggest that our proposed method, based on the patient-specific molecular systems, can identify novel cancer subtypes that cannot be captured by existing methods using multi-omics data. To investigate the extent to which our proposed method identifies cancer subtypes, we conducted a survival analysis of the three identified subtypes. A better method is key for the identification of cancer subtypes and different prognoses, since patients with different molecular systems require different drug treatments. The Kaplan-Meier survival probability curves in the identified subtypes indicated that each subtype had a significantly different prognosis pattern (log-rank test p-value = 0.00011 < 0.05) (Fig. 3b).

Furthermore, to confirm whether the gene network information improves the identification of the cancer subtypes, hierarchical clustering was performed using RNA-seq data alone, without network information. The top 322 genes showing the highest variances of the RNA-seq data in STAD were selected, as the 250 edges with the ECv matrix were composed of 322 genes. Consequently, we identified three RNA-seq-based subtypes (Fig. 2b). However, these subtypes did not show any significant differences in terms of their prognoses (log-rank test p-value = 0.06 > 0.05) (Fig. 3c). Despite employing two major subtypes in the clustering result, the differences were not significant (Fig. 3d). To determine the relationship between the network-based and the RNA-seq-based subtypes, we summarized the number of patients across them (Table S2) and found that network-based subtypes were different from the RNA-seq-based subtypes. These results further suggest that our network-based method might generate a better cancer subtyping profile. Moreover to confirm whether the gene network information improves the identification of the cancer subtypes, we also applied the iNMF method. We compared our method with the iCluster method through four molecular subtypes identified by the original TCGA paper and applied another clustering method, the iNMF, as it is a successful method for cancer subtyping that can handle multi-omics data36,37. We set three clusters when performing the iNMF as we identified three subtypes in our method. However, although we performed using gene expression data alone and using multi-omics data consisting of gene expression, miRNA expression, copy number, and DNA methylation, in both cases, these subtypes did not show any significant differences in their prognosis (Fig. S4).

To characterize the network-based subtypes obtained from the ECv matrix, we highlighted the subnetworks composed of subtype-specific edges in the estimated basal network (Fig. 4a). The subtype-specific edges constituted a module in the basal network, especially in terms of subtype 3 (Figs. 2f, a). In Fig. 4a, the node layout of the basal network was arranged only using its topological structure without ECv information. These networks indicate that the extracted subnetworks form modules without being relaid. This suggests that the differences in the partial modules of the network might affect the identification of the cancer subtypes. Furthermore, to account for the properties of the identified subtypes, we verified their molecular features using gene ontology analysis. The subtype-specific networks were composed of 250, 1186, and 407 genes in subtype 1, subtype 2, and subtype 3, respectively. The ontology analysis results indicated that, according to the top five biological function terms, each subtype had a characteristic molecular feature (Fig. 5). Although most of the biological functions in the subtypes were related to development, the developmental stages or tissues varied between the subtypes. For example, “cardiovascular system development and function” was found in subtype 1, while “embryonic development” was found in subtype 2 and “cellular development” was found in subtype 3. In particular, the top five of biological functions in subtype 3, which were associated with a moderate prognosis, were completely different from those in the other subtypes. Most of the biological functions observed in subtype 1 and subtype 2 were related to development, while “cellular growth and proliferation” and “cell-to-cell signaling and interaction” were observed exclusively for subtype 3. Moreover, to demonstrate the clinical relevance of the identified subtypes, we summarized the number of patients across several clinical indices, such as sex, age at diagnosis, and tumor stage, in TCGA (Table S3-5). These results suggest that subtype 2 is characteristic of females, although the age at diagnosis and the tumor stage are not relevant to the identified subtypes (Table S3-5). Furthermore, we visualized a network composed of subtype-specific edges (Fig. 2f) and extracted the largest connected component from the visualized network in each subtype (Fig. 4b,c). The results suggested that the subtype-specific edges in subtypes 2 and 3 were composed of a large connected subnetwork. Moreover, we found SALL2, ETNK2, and APBB1 were located as the top hub genes in subtype 2. These genes are implied as cancer-related genes38,39,40, and thus may play an important role in characterizing these subtypes.

### Identification of cancer subtypes in lung cancer

To test the effectiveness of our method in other datasets, we applied it to the lung cancer dataset (LUNG). As shown in Fig. S1, the clustering results of the ECv heatmap indicate that lung cancer is grouped into three subtypes: subtype 1 (227 samples), subtype 2 (121 samples), and subtype 3 (343 samples). Then, survival analysis was conducted, similar to STAD analysis. The Kaplan-Meier survival probability curves in the identified subtypes indicate that each subtype has a significantly different pattern of prognosis (log-rank test p-value = 1.3e-14 < 0.05) (Fig. S5a). Next, the differences in the molecular features between the identified subtypes were determined. The subtype-specific networks were found to be composed of 158, 1049, and 582 genes in subtype 1, subtype 2, and subtype 3, respectively. Gene ontology analysis of the subtypes indicate that all of the top five biological functions varied between them (Fig. S6). These findings suggest that our proposed method might also work for different cancer types. Moreover, hierarchical clustering followed by survival analysis, was performed for RNA-seq data as shown in the STAD section. Consequently, three RNA-seq-based subtypes were identified and the prognoses of these subtypes were significantly different (log-rank test p-value = 1.5e-15 < 0.05) (Fig. S5b and S5c). While the patients in network-based subtype 1 were almost identical with those in RNA-seq-based subtype 1A, those in network-based subtype 2 and subtype 3 were different from RNA-seq-based subtypes (Table S6). Furthermore, the network-based and RNA-seq-based clustering could almost completely categorize LUAD and LUSC (Table S6). This may suggest that patients with various molecular features can be categorized even without network information, as LUAD and LUSC have characteristic molecular features that are significant for categorizing them using transcriptome data41,42. Gene ontology analysis indicates that network-based subtypes could reveal completely different characteristic molecular features among the subtypes. Thus, this suggests that our method can identify novel subtypes that cannot be detected using RNA-seq clustering. We also visualized a network composed of subtype-specific edges (Fig. S7a) and extracted the largest connected component from the visualized network in each subtype (Fig. S7b-d).

### Identification of cancer subtypes in breast cancer

To illustrate the generalization of our method, we applied it to breast cancer datasets (BRCA). Breast cancer is well known as a heterogeneous disease with different molecular subtypes43. As shown in Fig. S8, the clustering results of the ECv heatmap indicate that breast cancer is goruped into three subtypes: subtype 1 (192 samples), subtype 2 (464 samples), and subtype 3 (439 samples). Similar to the STAD and LUNG datasets, we then conducted a survival analysis of BRCA. The Kaplan-Meier survival probability curves for the identified subtypes indicated that subtype 3 had a different prognosis pattern and is almost statistically significant (log-rank test p-value = 0.0504) (Fig. S9). The differences in the molecular features between the identified subtypes were then determined. The subtype-specific networks were found to be composed of 1052, 463, and 953 genes in subtypes 1, 2, and 3, respectively. Gene ontology analysis of the subtypes indicated that all of the top five biological functions varied between them (Fig. S10). The results of the analysis indicated that each subtype, especially subtype 1, had a characteristic molecular feature. Next, hierarchical clustering followed by survival analysis, similar to the STAD and LUNG datasets, was performed for RNA-seq data. Consequently, three RNA-seq-based subtypes were identified and the prognoses of subtype 3A were significantly different (log-rank test p-value = 0.005 < 0.05) (Fig. S9b and S9c). To determine the relationship between the network-based and the RNA-seq-based subtypes, we summarized the number of patients across them (Table S7) and found that the patients in network-based subtype 1 were almost identical to those in RNA-seq-based subtype 1A. Consistent with the analysis of the LUNG datasets, these results suggest that patients with different molecular features can be grouped regardless of with network information or without it, and these subtypes show significantly different prognoses. According to the analysis of the LUNG datasets, network-based subtypes are similar to the RNA-seq-based subtypes in the patients with various molecular feautures; therefore, the results suggest that our method can also be applied to BRCA datasets. Furthermore, we also visualized a network composed of subtype-specific edges (Fig. S11a) and extracted the largest connected component from the visualized network for each subtype (Fig. S11b-d).

## Discussion

In this study, we proposed a novel method for the identification of cancer subtypes based on patient-specific molecular systems. The proposed method is able to identify novel subtypes with different prognoses, as well as the differences in molecular properties between stomach cancer, lung cancer, and breast cancer. Differences in molecular systems are not necessarily associated with the prognoses of patients. However, it is likely to affect the effectiveness and/or medical treatment options available for these patients. For this reason, our novel subtypes may be related to prognosis of patient.

Although many types of omics data are currently available, it remains difficult to integrate multi-omics data in research. Each type of omics data can be used to categorize cancers into various subtypes in terms of prognosis, pathological findings, and others. However, since our proposed method uses only transcriptome data, even though our gene network-based method was successful, it may not be sufficient to obtain an in-depth understanding of the molecular systems. Despite this, changes in the different layers of omics networks influence the transcriptome profile at some level. This could explain why our proposed method, based on the gene network, was able to identify novel cancer subtypes using only the transcriptome data. There, however, remains room for improvement in the method reported in this study, wherein identification using multi-omics data based on estimated systems represents an informative strategy for the identification of cancer subtypes.