Introduction

One of the challenges of cancer treatment is targeting specific therapies to pathogenetically distinct tumour types to maximise treatment efficacy and minimise toxicity. Traditionally, cancer classification has been based on the morphological appearance of the tumour; however, this approach has serious limitations. Tumours with similar histopathological appearances can have different clinical courses and exhibit different responses to therapy. Molecular heterogeneity within individual cancer diagnostic categories is also evident in the variable presence of chromosomal translocations, tumour suppressor gene deletions and numerical chromosomal abnormalities. Cancer classification remains difficult because it has relied on specific biological insights rather than on systematic, comprehensive, global and unbiased methods for identifying tumour subtypes.

Over the past decade, the increased availability of large-scale gene expression profiles has led researchers to propose a number of new approaches for classifying tumour types or subtypes based on gene expression analyses. Golub et al.1 have proposed a neighbourhood analysis to distinguish known types and a “class predictor” that assigns a new sample to a known class purely on the basis of gene expression profiles and have verified their methods using an acute leukaemia dataset. Alizadeh et al.2 have proposed a method based on hierarchical clustering, which divides diffuse large B-cell lymphomas into two subtypes. Ramaswamy et al.3 have proposed a “classifier” based on a support vector machine (SVM) and have analysed the accuracy of true type predictions for both snap-frozen human tumour and normal tissue specimens. Yeoh et al.4 have analysed sets of genes that define certain subtypes. Bhattacharjee et al.5 have classified human lung carcinomas by using hierarchical clustering and have verified the classification by a selected set of gene expression profiles. Su et al.6 have analysed differences between the human and mouse transcriptomes by using a selected set of gene expression profiles. Pomeroy et al.7 have proposed a classification method for tumour types of the central nervous system based on a learning algorithm and have verified their result by a selected set of gene profiles. Monti et al.8 have proposed a learning algorithm to classify tumour types and have verified their results according to the similarity with the true types. Yang and Naiman9 have proposed a learning algorithm to select a small set of genes that distinguish known tumour types. Haferlach et al.10 have proposed a learning algorithm to classify leukaemia subtypes based on a large number of clinical samples.

Theoretical biologists have found that gene expression patterns are important for tumour classification. Ao et al.11 have shown that a 20-node network may generate 32 attractors, implying that gene expression patterns provide a better classification scheme than the dominating scheme based on DNA and mutations. Wang et al.12 have reported the identification and prediction of liver tumour subtypes from an endogenous network. Zhu et al.13 have discovered a hierarchical property of prostate cell types.

Community detection aims to identify the natural communities of a naturally evolving network. It provides a systematic and global approach to understanding the natural structures of real world networking systems and is one of the major topics in network theory14,15. According to Darwin’s theory16, animals from ants to people form social groups in which most individuals work for the common good. Similarly, the individuals of a naturally evolving network may form natural communities. Li et al.17,18 have shown that the natural communities of a network may be detected by an algorithm that follows the principle of the formation of natural communities and that (two-dimensional) structural entropy minimisation is that principle. This progress implies that tumour types and subtypes may be identified by community detection algorithms based on structural entropy minimisation, because natural community detection is a classification that follows the principle of the organisation of the network. In the present paper, we show that structural entropy minimisation is the principle for detecting the natural structures of networks. Specifically, we show that structural entropy minimisation is the principle for identifying cancer cell types and subtypes.

We establish a novel method to define tumour types and subtypes. To establish our method, we have to resolve two challenges. The first is to detect the two- and three-dimensional natural structures of a naturally evolving network and the second is to construct a network from the unstructured gene expression data of cancer cell samples such that the constructed network captures the nature and laws of the gene expression data of the cancers. We resolve the two challenges by using our new notion of high-dimensional structural entropy of graphs.

Given a network G and a natural number K, we define the K-dimensional structural entropy of G, denoted H^K(G), to be the least overall number of bits required to determine the K-dimensional code of the node that is accessible from the random walk with stationary distribution in G. The K-dimensional structural entropy of a graph quantitatively measures the non-determinism or uncertainty of the natural K-dimensional structure of the graph. This notion implies that the community structure of G that realises the two-dimensional structural entropy of G is the natural community structure of G and that the three-dimensional structure of G that realises the three-dimensional structural entropy of G is the natural three-dimensional structure of G. Therefore, K-dimensional structural entropy minimisation is the principle of the natural K-dimensional structure of networks for K > 1. This result demonstrates that although there are many reasons and causes for the formation of natural communities, such as social groups in a society, interests of organisations and games in a competing system, the minimisation of non-determinism or, equivalently, the minimisation of uncertainty is the unified measure of the divergent causes of the formation of the natural structures of a networking system. Furthermore, the structural entropy of graphs implies that one-dimensional structural entropy minimisation is the principle for identifying a natural network from large-scale unstructured data.

Our method consists of an algorithm that detects the K-dimensional structure of a network by minimising the K-dimensional structural entropy of the network for each K > 1, together with algorithms that construct a cell sample network from the gene expression profiles of a cancer by minimising the one-dimensional structural entropy of graphs. We apply our algorithms to classify tumour types and subtypes for cancers and healthy tissues. Experiments show that our method defines meaningful types and subtypes for cancer cell samples.

We also compare the classifications given by our algorithms with those given by two frequently used community detection algorithms: the first is the modularity maximisation algorithm proposed by Clauset, Newman and Moore19 and the second is the description length minimisation algorithm, InfoMap, proposed by Rosvall and Bergstrom20.

To evaluate the classifications of the types and subtypes found by our algorithms and by the existing algorithms, we analyse the similarity between the true tumour types and the modules identified by the algorithms, establish the gene map of the modules and analyse the performance of the found types and subtypes on clinical data. The experimental results demonstrate that our algorithms establish a high-definition, one-to-one three-dimensional gene map of the submodules of the cancers identified by our new algorithms. For diffuse large B-cell lymphoma (DLBCL), our results demonstrate that most of the cell samples within a module or submodule identified by our algorithms share similar survival times, survival indicators and International Prognostic Index (IPI) scores, that the distinct modules or submodules identified using our structural entropy minimisation algorithms noticeably differ in overall survival times, survival ratios and IPI scores, and that distinct modules identified by the modularity maximisation and description length minimisation algorithms exhibit indistinguishable survival times, survival ratios and IPI scores.

High-Dimensional Graph Structure Entropy

In a real world network, various interactions, communications and operations may occur at every time step, and the actions in the network are often unpredictable. Thus, the non-determinism of the various actions in a network is the essence of the complexity of the network; it is a dynamical complexity of the network.

Our notion of the structural entropy of a graph measures the dynamical complexity of the graph. Intuitively, for a graph G and a natural number K, the K-dimensional structural entropy of G is the least overall number of bits required to determine the K-dimensional code of the node that is accessible from the random walk with stationary distribution in G. We will gradually establish the notion.

One-dimensional structural entropy

First, we recall Shannon’s entropy function. For a probability vector p = (p1, p2, ···, pn), with p1 + p2 + ··· + pn = 1, the Shannon entropy function of p is defined as follows:

H(p1, p2, ···, pn) = −(p1 log2 p1 + p2 log2 p2 + ··· + pn log2 pn).

Considering the definition of H(p1, p2, ···, pn), for every i, l = −log2 pi is the length of the binary representation of the number 1/pi, which indicates that i is one of 2^l numbers. Therefore, we interpret −log2 pi as the “self-information of pi”, which also indicates that −log2 pi is the amount of information needed to determine the code of i. Therefore, H(p1, p2, ···, pn) is the average amount of information required to determine the code of i when i is picked according to the probability distribution p = (p1, p2, ···, pn).

For a connected graph G = (V, E) with n nodes and m edges, we define the one-dimensional structural entropy or the positioning entropy of G by using the entropy function H. For each node i ∈ {1, 2, ···, n}, let di be the degree of i in G and let pi = di/(2m). Then, the stationary distribution of the random walk in G is described by the probability vector p = (p1, p2, ···, pn). We define the one-dimensional structural entropy or positioning entropy of G as follows:

H^1(G) = H(p1, p2, ···, pn) = −Σi (di/2m) log2 (di/2m).

By definition, H^1(G) is the average number of bits required to determine the one-dimensional code of the node that is accessible from the random walk with stationary distribution in G.

(Remark: (i) If the degree di of node i is 0 for some i, then the definition of H^1(G) is invalid. (ii) The definition of H^1(G) may be extended to disconnected graphs; in such cases, H^1(G) is the weighted average of H^1(Gi) over all of the connected components Gi of G. Of course, we assume that for the graph of a singleton with no edge, the one-dimensional structural entropy of the graph is 0, because a random walk cannot occur in such a graph. (iii) The definition of H^1(G) differs from the Shannon entropy for the random selection of a node in G as follows: H^1(G) is a dynamical notion measuring the complexity of the random walk in the graph, whereas the Shannon entropy is a static notion for a probability distribution.)
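As a quick illustration of the definition, the following sketch computes H^1(G) directly from a degree sequence. This is our own minimal illustration, not the paper’s code; the function and variable names are ours.

```python
import math

def one_dim_entropy(degrees):
    """One-dimensional structural (positioning) entropy of a connected graph:
    H^1(G) = -sum_i (d_i / 2m) * log2(d_i / 2m), where 2m = sum of degrees."""
    two_m = sum(degrees)  # the sum of all degrees equals twice the edge count
    return -sum((d / two_m) * math.log2(d / two_m) for d in degrees)

# For a triangle (all degrees 2), the stationary distribution is uniform
# over the 3 nodes, so the positioning entropy is log2(3) bits.
```

For a regular graph the stationary distribution is uniform, so H^1(G) = log2 n, the maximum possible value; highly skewed degree sequences give smaller values.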

The one-dimensional structural entropy (or positioning entropy) is interesting for the following reasons: (i) the notion is a dynamical version of Shannon’s entropy for graphs, (ii) positioning is a basic operation for network applications and (iii) the first step in a rigorous study of unstructured big data is perhaps to structure the data, for which one-dimensional structural entropy minimisation could be the fundamental principle. Item (iii) is extremely important because it means that the minimisation of one-dimensional structural entropy could be the principle for identifying the natural network underlying unstructured big data. In the present paper, we propose such an algorithm to construct cell sample networks for cancers from unstructured gene expression profiles.

Two-dimensional structural entropy

For a naturally evolving network G = (V, E), individuals form social groups primarily through self-organising behaviours. Therefore, self-organising behaviours lead to a natural community structure of the network. To detect the natural communities of a network, we must understand the principle of self-organisation of naturally evolving networks.

Suppose that P = {X1, X2, ···, XL} is a partition of V, in which each Xj is called a module or a community. Using P, we encode a node v by a pair (i, j) such that j is the code of the community Xj containing v (referred to as the global code) and i is the code of node v within its own community (referred to as the local code). Now, suppose that v is the node that is accessible from the random walk with stationary distribution in G. We count the number of bits required to determine the pair (i, j) of the codes of v. The following two scenarios may occur:

  • Case 1: v is accessible from node u in the community of v. In this case, only the local code i of v within its own community must be determined because the code of its community is already known before the random walk.

  • Case 2: v is accessible from a node outside v’s own community. In this case, both the local and global codes of v, or the pair (i, j), must be determined.

If P is a well-defined community structure of G, then the probability of Case 2 occurring is small. In this scenario, the number of bits required to determine the two-dimensional code (i, j) of the node that is accessible from the random walk is significantly smaller than the one-dimensional structural entropy or positioning entropy of G.

The ideas presented above motivated us to define the two-dimensional structural entropy, which is also referred to as the module entropy or local positioning entropy of a graph.

For a connected graph G = (V, E), suppose that P = {X1, X2, ···, XL} is a partition of V. We define the structural entropy of G by P as follows:

H^P(G) = −Σ_{j=1}^{L} (Volj/2m) Σ_{i=1}^{nj} (di^(j)/Volj) log2 (di^(j)/Volj) − Σ_{j=1}^{L} (gj/2m) log2 (Volj/2m),

where L is the number of modules in partition P, nj is the number of nodes in module Xj, di^(j) is the degree of the i-th node of Xj, Volj is the volume of module Xj (the sum of the degrees of the nodes in module Xj) and gj is the number of edges with exactly one endpoint in module Xj.

The two-dimensional structural entropy of a graph G is defined as follows:

H^2(G) = min_P { H^P(G) },

where P runs over all of the partitions of V.
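The partition-based entropy above can be computed directly from the formula. The following is a minimal sketch under our own conventions (an adjacency dict `node -> {neighbour: weight}`), not the paper’s implementation:

```python
import math

def partition_entropy(adj, partition):
    """Structural entropy H^P(G) of G under partition P: the local-code cost
    within each module plus the global-code cost paid on module boundaries."""
    deg = {v: sum(adj[v].values()) for v in adj}
    two_m = sum(deg.values())
    h = 0.0
    for module in partition:
        vol = sum(deg[v] for v in module)  # Vol_j: sum of degrees in X_j
        # g_j: total weight of edges with exactly one endpoint in the module
        g = sum(w for v in module for u, w in adj[v].items() if u not in module)
        for v in module:  # local positioning within the module
            h -= (deg[v] / two_m) * math.log2(deg[v] / vol)
        h -= (g / two_m) * math.log2(vol / two_m)  # locating the module itself
    return h
```

On two triangles joined by a single bridge edge, the natural two-module partition yields a strictly lower entropy than the trivial one-module partition, in line with the definition.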

For a network that naturally evolves in nature and society, we propose the following self-organisation hypothesis: the self-organisation of individuals in the network minimises the non-determinism of the structure of the network. Assuming the self-organisation hypothesis, the algorithm minimising the two-dimensional structural entropy of the graph produces the natural community structure of the network.

The definition of H^2(G) clearly reveals that minimising the non-determinism of the community structures is a principle of the natural community structures and network self-organisation in nature and society. As a matter of fact, Li, Li and Pan17 and Li et al.18 have shown that two-dimensional structural entropy minimisation is the principle for detecting the natural community structure of networks.

High-dimensional structural entropy

Real world networks generally have a hierarchical structure such that a module of a network may consist of quite a few submodules, which leads to a natural extension of the two-dimensional structural entropy to high-dimensional cases.

To define high-dimensional structural entropy, we introduce the partitioning tree of a graph. First, we consider the two-dimensional case. For a graph G = (V, E) and a partition P = {X1, X2, ···, XL} of V, we interpret the partition as a partitioning tree T of height 2 as follows: 1) first, we introduce the root node λ and define the set Tλ = V; 2) we introduce L immediate successors of the root node, denoted α1, α2, ···, αL, and associate the set Xi with node αi, that is, we define Tαi = Xi; and 3) for each αi, we introduce |Xi| immediate successors, one for each element of Xi, so that the set associated with each successor is a singleton consisting of one node of Xi.

Therefore, T is a tree of height 2 and all of the leaves of T are associated with singletons. For every node α of T, Tα is the union of the sets Tβ over all immediate successors β of α, and the union of the sets Tα over all nodes α at the same level of the tree is a partition of V.

Thus, the partitioning tree of a graph G = (V, E) is a set of nodes such that each node is associated with a nonempty subset of vertices of graph G and can be defined as follows:

Let G = (V, E) be a network. We define a partitioning tree of G as a tree T with the following properties:

  1. For the root node, denoted λ, we define the set Tλ = V.

  2. For every node α ∈ T, the immediate successors of α are α⟨j⟩ for j from 1 to some natural number N, ordered from left to right as j increases. Therefore, α⟨i⟩ is to the left of α⟨j⟩, written α⟨i⟩ <L α⟨j⟩, if and only if i < j.

  3. For every α ∈ T, there is a subset Tα ⊆ V that is associated with α. For nodes α and β, we write β ⊑ α to denote that β is an initial segment of α. For every node α ∈ T, we use α− to denote the longest proper initial segment of α, that is, the longest β ≠ α such that β ⊑ α.

  4. For every i, {Tα | h(α) = i} is a partition of V, where h(α) is the height of α (note that the height of the root node λ is 0 and, for every node α ∈ T other than λ, h(α) = h(α−) + 1).

  5. For every α, Tα is the union of the sets Tβ over all β such that β− = α.

  6. For every leaf node α of T, Tα is a singleton; thus, Tα contains a single node of V.

We now define the structural entropy of G by a partitioning tree of G.

For a network G = (V, E), suppose that T is a partitioning tree of G. We define the structural entropy of G by T as follows:

  • (1) For every α ∈ T, if α ≠ λ, then define

    H^T(G; α) = −(gα/2m) log2 (Vα/Vα−),

    where gα is the number of edges from nodes in Tα to nodes outside Tα and Vβ is the volume of the set Tβ, namely, the sum of the degrees of all the nodes in Tβ. (Remark: For an edge-weighted graph G = (V, E), gα is the sum of the weights of all the edges between Tα and nodes outside Tα and the degree of a node v ∈ V in G is the sum of the weights of all the edges incident to v. For a non-weighted graph, we regard the weight of every edge as 1.)

  • (2) We define the structural entropy of G by the partitioning tree T as follows:

    H^T(G) = Σ_{α ∈ T, α ≠ λ} H^T(G; α).

Let G = (V, E) be a network. We define the K-dimensional structural entropy of G as follows:

H^K(G) = min_T { H^T(G) },

where T ranges over all of the partitioning trees of G of height K.
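The tree-based definition can likewise be sketched in a few lines. In this illustration (our own encoding, not the paper’s code), each non-root tree node is stored as a pair (parent, T_alpha), a parent of `None` stands for the root λ whose volume is 2m, and each non-root node contributes −(g_alpha/2m) log2(V_alpha/V_alpha−):

```python
import math

def tree_entropy(adj, tree):
    """Structural entropy H^T(G) of G given a partitioning tree:
    the sum over non-root nodes alpha of -(g_alpha/2m)*log2(V_alpha/V_parent).
    `tree` maps each non-root node name to (parent_name, set_of_vertices)."""
    deg = {v: sum(adj[v].values()) for v in adj}
    two_m = sum(deg.values())
    vol = {a: sum(deg[v] for v in members) for a, (_, members) in tree.items()}
    h = 0.0
    for a, (parent, members) in tree.items():
        # g_alpha: total weight of edges leaving T_alpha
        g = sum(w for v in members for u, w in adj[v].items() if u not in members)
        v_parent = two_m if parent is None else vol[parent]  # root volume is 2m
        h -= (g / two_m) * math.log2(vol[a] / v_parent)
    return h
```

For a height-2 partitioning tree whose first-level sets form a partition P and whose leaves are singletons, this sum reproduces the partition-based entropy H^P(G) above.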

Our definition of the structural entropy reflects the following self-organisation principle: minimising the non-determinism of a structure is the principle for the self-organisation of structures within naturally evolving networks.

In particular, the definition of the K-dimensional structural entropy of a graph G implies that the K-dimensional structure of G that minimises the K-dimensional structural entropy of G is the natural K-dimensional structure of G. Precisely, according to this definition, we may define the natural structure of a graph as follows: given a connected graph G, a natural number K > 1, a constant δ ≤ 1 and a K-level partitioning tree T of G, we say that T is a δ-natural K-dimensional structure of G if δ · H^T(G) ≤ H^K(G). This definition allows us to mathematically analyse the natural structures of networks. We leave this analysis as an open question.

Algorithm for Minimising the K-Dimensional Structural Entropy

In this section, we describe an algorithm for finding a partitioning tree T of a graph G of height K that minimises the K-dimensional structural entropy of G.

Two operators, the merging operator and the combining operator, are introduced and a partitioning tree is developed by using the two operators.

First, we define the merging operator. Let T be a partitioning tree and let α and β be nodes of T with α <L β (meaning that α is to the left of β) and α− = β− = γ for some γ. In addition, let α = γ⟨i⟩ and β = γ⟨j⟩ for some i < j.

We define below the partitioning tree obtained from T via the following merging operator.

Let Tα = {x1, x2, ···, xM} and Tβ = {y1, y2, ···, yN}, which are ordered as listed in the sets. Then,

  1. Define Tα = {x1, x2, ···, xM, y1, y2, ···, yN}, ordered as listed.

  2. Keep the height h(α) unchanged.

  3. For each s ∈ {1, 2, ···, M}, define the successor α⟨s⟩ with Tα⟨s⟩ = {xs}.

  4. For every t with M + 1 ≤ t ≤ M + N, define the successor α⟨t⟩ with Tα⟨t⟩ = {yt−M}.

  5. Delete β.

  6. For every j′ > j, if γ⟨j′⟩ is defined, then set γ⟨j′ − 1⟩ ← γ⟨j′⟩, so that the successors of γ remain consecutively indexed.

We refer to the result as the partitioning tree obtained from T by merging α and β via the merging operator above.

We define the difference between the structural entropies of G given by a partitioning tree T and by the tree obtained from T through a merging operator.

For a graph G = (V, E) and a partitioning tree T of G, let α, β ∈ T be such that α− = β− and h(α) < K. Let T′ be the partitioning tree obtained from T by merging α and β. Then, define ΔT(α, β) = H^T(G) − H^T′(G). By definition, we obtain equation (7), which expresses ΔT(α, β) in terms of only the volumes and boundary edge numbers of Tα, Tβ and Tα ∪ Tβ.

In this case, if α and β are such that α <L β, α− = β−, h(α) < K and ΔT(α, β) > 0, then the merging of α and β is defined.

According to equation (7), ΔT(α, β) is locally computable.

Second, we define the combining operator. Let G = (V, E) be a graph and let T be a partitioning tree of G.

For nodes α, β ∈ T, if:

  1. α− = β− = γ for some γ, and

  2. for any δ ∈ T, if either α ⊑ δ or β ⊑ δ, then h(δ) < K,

then define the combining operator as follows:

- create a new node ξ with Tξ = Tα ∪ Tβ and ξ− = γ;

- let the two branches with roots α and β in T be the two branches of ξ, while maintaining the same order as in T.

By definition, the Δ-function associated with the combining operator is

Δ′T(α, β) = H^T(G) − H^T′(G),

where T′ is the tree obtained from T by the combining operator.

If α <L β, α− = β−, every δ ∈ T with α ⊑ δ or β ⊑ δ satisfies h(δ) < K and Δ′T(α, β) > 0, then the combining of α and β is defined.

In this case, it is clear that Δ′T(α, β) is locally computable.

Finally, we introduce our algorithm, which uses both the merging and combining operators. Let G = (V, E) be a graph. Suppose that {v1, v2, ···, vn} is the set of all vertices in V, ordered as listed. The K-dimensional structural entropy minimisation algorithm on G proceeds as follows:

  1. Define the initial partitioning tree T as follows: set Tλ = V with h(λ) = 0 and, for every i ∈ {1, 2, ···, n}, define a leaf of the root associated with the singleton {vi}.

  2. If there are α, β ∈ T whose merging reduces the structural entropy, that is, ΔT(α, β) > 0, then

    (a) choose α and β such that ΔT(α, β) is maximised;

    (b) let T′ be the partitioning tree obtained from T by the merging operation with α and β;

    (c) set T ← T′; and

    (d) go back to step (2).

  3. If there are α, β ∈ T whose combining reduces the structural entropy, that is, Δ′T(α, β) > 0, then

    (a) choose α and β such that Δ′T(α, β) is maximised;

    (b) let T′ be the partitioning tree obtained from T by the combining operation with α and β;

    (c) set T ← T′; and

    (d) go back to step (2).

  4. Otherwise, output the partitioning tree T and terminate the program.

The algorithm outputs a partitioning tree of G. Clearly, the algorithm works naturally on weighted networks.
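For K = 2, the greedy merging step can be illustrated by the plain sketch below. This is our own simplification, not the paper’s algorithm: it recomputes the full partition entropy for every candidate merge instead of evaluating the locally computable Δ, so it is far slower, but it follows the same greedy rule of repeatedly applying the entropy-maximally-reducing merge until no merge reduces the entropy.

```python
import math
from itertools import combinations

def greedy_two_dim(adj):
    """Greedy two-dimensional structural entropy minimisation (naive sketch):
    start from singleton modules and repeatedly apply the merge with the
    largest entropy reduction; stop when no merge reduces the entropy."""
    deg = {v: sum(adj[v].values()) for v in adj}
    two_m = sum(deg.values())

    def h(partition):
        # partition entropy H^P(G), recomputed from scratch for clarity
        total = 0.0
        for module in partition:
            vol = sum(deg[v] for v in module)
            g = sum(w for v in module for u, w in adj[v].items()
                    if u not in module)
            for v in module:
                total -= (deg[v] / two_m) * math.log2(deg[v] / vol)
            total -= (g / two_m) * math.log2(vol / two_m)
        return total

    partition = [{v} for v in adj]
    while True:
        best, best_h = None, h(partition)
        for a, b in combinations(range(len(partition)), 2):
            merged = (partition[:a] + partition[a + 1:b] + partition[b + 1:]
                      + [partition[a] | partition[b]])
            hh = h(merged)
            if hh < best_h - 1e-12:  # strict improvement only
                best, best_h = merged, hh
        if best is None:
            return partition
        partition = best
```

The output is a flat partition, that is, the first level of a height-2 partitioning tree whose leaves are the individual vertices.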

Time complexity of the algorithm

For K = 2, the time complexity of the algorithm is O(n²) for all graphs and nearly linear for sparse networks17, where n is the number of nodes in the graph. The algorithm is thus a nearly linear time algorithm for sparse networks, which easily handles networks that include millions of nodes. For K = 3, for every first-level node α in the partitioning tree, the size of Tα does not decrease during the run of the algorithm. Therefore, |Tα| = M for some M with 1 ≤ M ≤ n. For a fixed M and a fixed Tα of size M, the number of operations associated with the children of α is the time complexity of the two-dimensional algorithm on a graph of M nodes, that is, O(M²) for general graphs and less for sparse networks. This analysis gives a time complexity of O(n³) for all graphs associated with one first-level node α of the partitioning tree. Because there are at most n first-level nodes in the partitioning tree, the time complexity of the three-dimensional algorithm is bounded by O(n⁴) for all graphs, which is significantly larger than that of the two-dimensional algorithm.

The time complexity analysis above clearly indicates that our algorithm is not a hierarchical clustering algorithm with 3 levels. Because of the O(n⁴) time complexity for general graphs, although the three-dimensional algorithm runs in polynomial time, in practice it can only manage graphs that contain thousands of nodes. Therefore, it is difficult to detect the natural K-dimensional structure of large networks for K > 3 (the time complexity generally increases by a factor of n² whenever the dimension increases by 1, for dimensions K ≥ 2), which poses a new issue regarding the design of better algorithms for minimising the K-dimensional structural entropy of networks for each K > 1, including the case of K = 2. We remark that even the time complexity of the two-dimensional algorithm is in fact impractical for n as large as hundreds of millions. For this reason, there is a need to design better algorithms for finding the minimal two-dimensional structural entropy of networks.

Finally, we note that our algorithm is a heuristic for minimising the K-dimensional structural entropy of graphs; computing the K-dimensional structural entropy of graphs exactly is an extremely difficult problem that should be resolved in future computer science studies.

Remark: (i) the merging operator combines two sets X and Y into a single set Z = X ∪ Y such that the nodes in Z are no longer distinguished and are allowed to re-group within Z in the future; (ii) the combining operator combines two sets X and Y into Z = X ∪ Y such that the subtypes X and Y are kept within Z; (iii) the two operators are natural rules in real world clustering, which incorporate the idea of mixing both bottom-up and top-down methods; (iv) our algorithm is a basic greedy strategy for minimising the K-dimensional structural entropy, and new rules are required to design better algorithms for minimising the K-dimensional structural entropy of graphs; (v) we compute the K-dimensional structural entropy only for small values of K because, although hierarchical structures occur in real world networks, the number of levels of a community within a community is indeed small.

Clearly, our algorithm not only seeks to follow the principle of self-organisation of networks, but also uses natural rules of real world clustering as its operators. The algorithms are designed to explore the natural two- and three-dimensional structures of networks rather than to optimise an artificially defined objective function. We use this strategy because natural objects can be identified by following natural rules, and the two-dimensional algorithm has been shown to successfully detect natural communities in social networks17. The algorithm can be regarded as a deep detecting algorithm that seeks to explore the natural hierarchical structure of networks.

We use our algorithms for K = 2, 3 to identify the modules and submodules of cancers and compare our algorithms with the two most frequently used algorithms, namely, the modularity maximisation algorithm19 and the description length minimisation algorithm20.

Constructing Cell Sample Networks Based on Gene Expression Profiles

Suppose that v1, v2, ···, vn are n samples of cells and that g1, g2, ···, gN are N genes. For every pair (i, j), let a(i, j) be the expression profile of gene gi in sample vj. Then, for every j from 1 to n, the vector (a(1, j), a(2, j), ···, a(N, j)) represents the gene expression profiles of the sample vj, denoted Pj. For every pair (j, j′), let Wj,j′ be the Pearson correlation coefficient between Pj and Pj′, the gene expression profiles of samples vj and vj′, respectively.

A cell sample network G = (V, E) is constructed on the basis of the gene expression profiles by the following algorithm.

The algorithm works with a fixed natural number k and proceeds as follows:

  1. The vertices of G are the cell samples v1, v2, ···, vn, that is, let V = {v1, v2, ···, vn}; and

  2. For every j, suppose that u1, u2, ···, uk are the cell samples such that W(vj, u1), W(vj, u2), ···, W(vj, uk) are the highest k weights among the weights W(vj, u) over all samples u, where W(vj, u) is the Pearson correlation coefficient between the gene expression profiles of samples vj and u. For every i from 1 to k, create an edge (vj, ui) with weight W(vj, ui).

This constructs the weighted graph G = (V, E).

In the construction of G, k is a fixed number whose appropriate value depends on the particular cell samples and gene expression profiles.
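The two steps above can be sketched as follows (our own minimal illustration, not the paper’s code; `profiles` is a list of per-sample expression vectors):

```python
import math

def build_knn_graph(profiles, k):
    """Cell sample network: vertices are samples; each sample is joined to
    the k samples whose profiles have the highest Pearson correlation with
    its own, and the edge weight is that correlation."""
    def pearson(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = math.sqrt(sum((a - mx) ** 2 for a in x))
        sy = math.sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    n = len(profiles)
    edges = {}
    for j in range(n):
        ranked = sorted(((pearson(profiles[j], profiles[u]), u)
                         for u in range(n) if u != j), reverse=True)
        for w, u in ranked[:k]:
            edges[(min(j, u), max(j, u))] = w  # undirected: store one copy
    return edges
```

Note that the resulting graph is undirected: an edge is present if either endpoint ranks the other among its top k correlations.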

The choice of k is a challenging problem: k must be chosen so that the nontrivial weights are maintained in the generated graph while the trivial or noisy weights are removed. We realise this idea by the following algorithm.

The algorithm proceeds as follows:

  1. (Noise amplifying) Fix a noise amplifier σ. Let W be the average weight over all the pairs of cell samples and let M = σ · W be the modifier. Let H be the weighted graph of the cell samples such that, for every pair (i, j) of cell samples, there is a weight W′(i, j) = W(i, j) + M. This step amplifies the noise in all the weights. The role of this step is two-fold: if the weight W(i, j) between cell samples i and j is nontrivially high, then the modified weight W′(i, j) = W(i, j) + M is approximately the original weight W(i, j), since the modifier M is small, whereas if the weight W(i, j) is trivial or noisy, then the modified weight W′(i, j) = W(i, j) + M is significantly amplified, which allows our algorithm to better separate the noisy or trivial weights from the highly nontrivial weights. In our experiments, we choose one value of the noise amplifier σ when n < 1000 and another when n > 1000, where n is the number of cell samples. This choice of σ is approximately equivalent to the following operation: every cell sample increases its weight by one unit and uniformly assigns the extra unit weight to all the other cell samples. (The crucial new idea in the algorithm is the introduction of the noise amplifier σ and the modifier M. The motivation is to amplify the noise so that an algorithm can easily identify it. However, it may have more implications and some theoretical background, and the exact form of σ may vary slightly for different networks.)

  2. For every k, let Hk be the weighted graph obtained from H as follows:

    (a) The modifier M is kept in every edge weight.

    (b) For every cell sample i, keep the weighted edges of the top k weights and delete all the other weights.

  3. For each k, let H(k) be the one-dimensional structural entropy of the weighted graph Hk. We say that k is a stable point if both H(k − 1) > H(k) and H(k + 1) > H(k) hold.

  4. (Minimisation of non-determinism or uncertainty) Define k to be the stable point that achieves the least one-dimensional structural entropy among all the stable points; that is, k is a stable point and H(k) is the least among H(k′) over all stable points k′.

This step ensures that the chosen k generates a network structure with minimum uncertainty or non-determinism.
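The four steps above can be sketched as follows. This is a simplified illustration with our own naming; in particular, the concrete value of σ used in the paper’s experiments is not reproduced here and is simply passed in as a parameter, and W is assumed to be a symmetric matrix of pairwise correlations.

```python
import math

def select_k(W, sigma):
    """Choose k for the top-k cell sample graph: amplify all pairwise weights
    by the modifier M = sigma * (average weight), compute the one-dimensional
    structural entropy H(k) of the top-k graph H_k for each k, and return the
    stable point (H(k-1) > H(k) < H(k+1)) with the smallest H(k)."""
    n = len(W)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    M = sigma * sum(W[i][j] for i, j in pairs) / len(pairs)

    def h_one(k):
        kept = set()  # undirected edges, deduplicated (W is symmetric)
        for i in range(n):
            nbrs = sorted(((W[i][j] + M, j) for j in range(n) if j != i),
                          reverse=True)
            for w, j in nbrs[:k]:
                kept.add((min(i, j), max(i, j), w))
        deg = [0.0] * n
        for i, j, w in kept:
            deg[i] += w
            deg[j] += w
        two_m = sum(deg)
        return -sum((d / two_m) * math.log2(d / two_m) for d in deg if d > 0)

    H = {k: h_one(k) for k in range(1, n)}
    stable = [k for k in range(2, n - 1) if H[k - 1] > H[k] < H[k + 1]]
    # fall back to the overall entropy minimiser if no interior stable point
    return min(stable, key=lambda k: H[k]) if stable else min(H, key=H.get)
```

The fallback in the last line is our own addition for tiny examples; on the real datasets the paper reports that stable points exist (see the S-Figures below).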

In this study, we consider a cell sample network for 4 cancers (additional cancer types and healthy tissue are discussed in the supplementary information) using the following data.

  1. 1

    Acute leukaemia from Golub et al.1, which were obtained from acute leukaemia patients at the time of diagnosis. The data contain the expression of 7,129 genes for 38 samples that constitute 3 cell types. The three tumour types obtained from acute leukaemia patients at the time of diagnosis are as follows: 11 acute myeloid leukaemia (AML) samples; 8 T-lineage acute lymphoblastic leukaemia (ALL) samples; and 19 B-lineage ALL samples.

  2. 2

    Lymphoma data from Alizadeh et al.2, which contain the expression of 4,026 genes for 96 samples, which constitute 9 cell types. The 9 types consist of three different types of tumours, diffuse large B cell lymphoma (DLBCL), chronic lymphocytic leukaemia (CLL) and follicular lymphoma (FL), as well as normal B and T cells at different stages of cell differentiation, including germinal centre B, NL.lymph node/tonsil, activated blood B, resting/activated T, transformed cell lines and resting blood B. On the basis of gene expression profiling, Alizadeh et al.2 have suggested dividing the DLBCL type into two subtypes, GC B-like DLBCL and activated B-like DLBCL.

  3. Multi-tissue data from Ramaswamy et al.3, which contain the expression of 5,565 genes for 103 samples constituting 4 cell types. The tissue samples are from four distinct cancer types: 26 breast, 26 prostate, 28 lung and 23 colon samples.

  4. The leukemia test data from Haferlach et al.10, which contain 1,152 samples and 1,480 gene expression profiles. The samples consist of 18 subtypes.

In Figures 1, 2, 3 and 4 of the supplementary information, denoted S-Figures 1, 2, 3 and 4, we depict the curves of the one-dimensional structural entropy function H(k) for the acute leukaemia, lymphoma, the multi-tissues and the new test leukemia data, respectively.

By observing S-Figures 1, 2, 3 and 4, we find that the parameters k chosen by the algorithm for the acute leukaemia, lymphoma, multi-tissue and new test leukemia data are k = 7, k = 6, k = 9 and k = 11, respectively. With the chosen k = 7, 6, 9 and 11, the algorithm constructs the cell sample networks of the acute leukaemia, lymphoma, multi-tissue and test leukemia data, respectively.

Gene Classification Map

For a cancer with cell samples V = {v1, v2, ···, vn}, suppose that g1, g2, ···, gN are all of the genes. For a given gene g = gi, we use g(v) to denote the gene expression profile of g for cell sample v ∈ V. By normalising, we ensure that the average of g(v) over all v is 0 and that the values g(v) for all samples v are real numbers in [−1, 1] (details of the normalisation are provided in the Methods section).
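A minimal sketch of such a normalisation (one plausible reading; the exact procedure is given in the Methods section) is:

```python
def normalise(profile):
    """Centre a gene's expression values to mean 0 and scale them
    into [-1, 1] (an assumed, illustrative normalisation)."""
    n = len(profile)
    mean = sum(profile) / n
    centred = [x - mean for x in profile]
    # scale by the largest magnitude; guard against a constant profile
    peak = max(abs(x) for x in centred) or 1.0
    return [x / peak for x in centred]
```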

Let G = (V, E) be the cell sample graph of a cancer based on the gene expression profiles. Suppose that {X1, X2, ···, XL} is a partition of V identified by a community-detecting algorithm from the graph G. For a set Xi, we define g(Xi) to be the average of g(x) over all x ∈ Xi. For a gene g, we use Xg to denote the set Xi such that g(Xi) = maxj{g(Xj)}.

For every j, we define Bj to be the set of all genes g such that Xg = Xj, which are listed in decreasing order of the gene expression value.

A gene map of G according to the partition {X1, X2, ···, XL} is a matrix of colour codes of L × L blocks. In the matrix, the j-th row is the set Bj ordered as defined above and the j-th column is the set Xj listed in a fixed order. All of the genes are listed in the ordering B1, B2, ···, BL, in which each Bj has its own order by the gene expression profiles.
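The construction of the blocks B1, B2, ···, BL can be sketched as follows; the data layout (a gene-to-profile dictionary and a list of sample sets) is an illustrative assumption rather than the paper's implementation.

```python
def gene_map_order(expr, partition):
    """Group genes by the module X_j on which their average expression
    g(X_j) is largest, then sort each block B_j by that value.

    expr: dict gene -> dict sample -> normalised expression g(v)
    partition: list of sample sets X_1, ..., X_L
    Returns the blocks B_1, ..., B_L as lists of gene names.
    """
    L = len(partition)
    blocks = [[] for _ in range(L)]
    for gene, values in expr.items():
        # average expression of this gene on every module
        avgs = [sum(values[v] for v in X) / len(X) for X in partition]
        j = max(range(L), key=lambda i: avgs[i])  # X_g = argmax_j g(X_j)
        blocks[j].append((avgs[j], gene))
    # inside each block, list genes in decreasing order of expression
    return [[g for _, g in sorted(b, reverse=True)] for b in blocks]
```

Plotting the genes in the order B1, B2, ···, BL against the samples ordered X1, X2, ···, XL then yields the block-diagonal colour map described above.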

The gene map explores the cell types and the subtypes of the cancer tumours and the set of genes that determines the corresponding cell types and subtypes.

Next, we describe how our gene map method is constructed and used to define the cancer types and subtypes.

Before implementing the experiments, we summarise the steps of our approach. Our approach for constructing the gene map consists of the following steps:

  1. Construct a weighted network of cell samples on the basis of the gene expression profiles of the cell samples;

  2. Identify the modules and/or submodules by community identification algorithms for networks;

  3. Identify the set of genes that defines the modules and submodules of the cell sample network identified by the algorithms;

  4. Compare the true tumour types with the modules and submodules identified by the community identification algorithms; and

  5. Analyse the survival times, survival indicators and IPI scores of the modules and submodules of the cancers identified by all of the algorithms.

Our approach differs from the hierarchical clustering methods and various learning algorithms. Our method directly defines types and subtypes of cancers by network algorithms and is completely different from the learning algorithms that find a classifier based on some training samples and assign a new sample to a known type using the classifier.

A cell sample network is constructed on the basis of the minimisation of one-dimensional structural entropy as realised in algorithms and .

We define the types and subtypes of cell samples from all of the graphs above by our algorithms based on the K-dimensional structural entropy of graphs, which is denoted , for K = 2 and 3 and by the and algorithms, respectively.

The details on the modules identified by the algorithms are reported in the supplementary information. All of the experiments and analyses performed for the additional cancer, the lung cancer and healthy tissue are given in the supplementary information.

Similarity of the Modules Identified by an Algorithm to the True Tumour Types

For a network G = (V, E), let X and Y be two subsets of V. We define the similarity of Y to X as follows:

Suppose that and are two partitions of G.

Then, the similarity of to is defined as the function such that for all j ∈ {1, 2, ···, N},
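The similarity formula itself is lost from the text above. Purely as an illustrative stand-in, and not necessarily the measure used in this work, a Jaccard-style overlap can be computed and matched between a detected partition and the true types:

```python
def jaccard(X, Y):
    """Overlap of two sample sets: |X ∩ Y| / |X ∪ Y| (an assumed
    stand-in for the elided similarity measure)."""
    X, Y = set(X), set(Y)
    return len(X & Y) / len(X | Y)

def best_matches(true_types, modules):
    """For each true type, the most similar detected module."""
    return {name: max(modules, key=lambda m: jaccard(X, modules[m]))
            for name, X in true_types.items()}
```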

Acute Leukaemia

Similarity

Table 1 describes the similarity of the modules of acute leukaemia identified by the algorithms , , and .

Table 1 Similarity of the cell types of leukaemia identified by , , and .

According to Table 1, the similarity of the modules identified by is the same as that identified by ; the weighted average similarity of the modules found by is less than that found by the algorithm and larger than that found by ; and the average weighted similarity found by is the same as that found by , whereas it is higher than that found by . According to the similarities listed in Table 1, , and all perform better than in this case.

According to S-Tables 1, 2, 3, 4 and 6, the tables in the supplementary information, we observe the following results: (1) There are 3 true types. (2) Algorithm found 5 modules. (3) Algorithm found 2 modules. (4) Algorithm found 4 modules. (5) Algorithm found 6 modules, in which ALL_21302_B-cell and ALL_7092_B-cell are singletons.

Gene map of true types

Figure 1 shows the colour codes of the gene map of the three acute leukaemia true types: ALL-B, ALL-T and AML. Figure 1 reveals that the type ALL-B may not be a precise or refined tumour type of acute leukaemia, although the three types ALL-B, ALL-T and AML are distinguishable by three different gene expression patterns.

Figure 1

Gene map of the true types of acute leukaemia.

The three types are the ALL-B, ALL-T and AML, respectively.

Gene map of the modules based on modularity maximisation

Figure 2 shows the gene map of the modules of acute leukaemia identified by the modularity maximisation algorithm . Figure 2 and S-Table 2 reveal that ALL-B is principally divided into modules 1 and 2, that modules 3 and 4 are basically ALL-T and AML, respectively, and that each of the modules is defined by a set of genes.

Figure 2

Gene map of the modules of acute leukaemia identified by .

Details of the 5 modules are provided in the supplementary information.

Gene map by Infomap

Figure 3 shows the gene map of the modules of acute leukaemia identified by the Infomap algorithm. Figure 3 and S-Table 3 show that the algorithm identified only two modules and that neither module is clearly defined by a unique set of genes or highly similar to a true type.

Figure 3

Gene map of the modules of acute leukaemia identified by .

Details of the 2 modules are provided in the supplementary information.

Gene map of the modules by structural entropy minimisation

Figure 4 depicts the gene map of the modules of the cell sample network of acute leukaemia identified by our algorithm .

Figure 4

Gene map of the modules of acute leukaemia identified by .

Details of the 4 modules are provided in the supplementary information.

Figure 4 and S-Table 4 show the following results: (1) Module 1 is a subtype of ALL-B (except for AML_13) and it is defined by a set of more than 800 genes. (2) Module 2 is a subtype of ALL-B and it is defined by a set of more than 2,500 genes. (3) Module 3 is exactly the type ALL-T and is defined by a set of more than 1,000 genes. (4) Module 4 is the type AML (missing AML_13) and is defined by a set of more than 1,000 genes.

These results demonstrate that both modules 1 and 2 are biologically meaningful subtypes of the type ALL-B except for AML_13.

According to Figs 1 and 4, the true types and the modules identified by our algorithm are distinguishable. However, the pictures are not highly defined because the gene expression profiles in the diagonal blocks are not extremely high, whereas the profiles outside the diagonal blocks are nontrivially high.

Three-dimensional gene map

Figure 5 depicts the gene map of the refined classification of acute leukaemia provided by our algorithm .

Figure 5

Gene map of the modules of acute leukaemia identified by .

The 6 modules and their submodules are provided in the supplementary information.

Figure 5 and S-Table 6 reveal the following results:

  1. Algorithm identified 6 modules. Each module Xi is either a true type or a subset of a true type consisting of several distinguishable submodules. All of the modules are distinguishable and each submodule Yi,j is uniquely determined by a block of genes Bi,j, which generates a high-definition, one-to-one map from the submodules Yi,j to the gene expression patterns Bi,j for all i and j.

  2. Submodule 2.2 is defined by a set of more than 1,000 genes.

  3. Module 3 is defined by a set of more than 2,000 genes.

  4. Module 4 is defined by a set of more than 400 genes.

  5. Submodule 6.2 is defined by a set of more than 700 genes.

It is conceivable that an analysis of gene set B that defines a module X or a submodule Y may be required to treat the corresponding module X or submodule Y. However, our results show that a module X or a submodule Y usually has a large gene set B that expresses the module X or submodule Y, which could lead to difficulty in treating a tumour type. To address this issue, our three-dimensional gene map analysis suggests that for a tumour type X, we may divide X into submodules Y1, Y2 and Y3. Upon analysis, the gene sets B1, B2 and B3 may express Y1, Y2 and Y3, respectively. This analysis could aid in treating the tumour type X. Nevertheless, we believe that it is fundamental to identify and analyse the large set of genes that express a biologically meaningful module or submodule and our three-dimensional cancer gene map can provide this method.

The results (1) to (5) demonstrate that all of the modules and submodules classified by our algorithm may be biologically meaningful. In particular, according to (3) and (4), ALL_21302_B-cell and ALL_7092_B-cell are remarkable cells that may play essential roles in the classification, diagnosis and therapy of acute leukaemia. According to (2) and (5), submodules 2.2 and 6.2 could be extremely important in the classification, diagnosis and therapy of acute leukaemia.

Remark. All the algorithms , , and fail to correctly assign the cell sample AML_13 to the AML type. This finding implies that AML_13 could be particularly interesting.

Gene Map of Lymphoma

Similarity

Table 2 shows the similarity of the modules of the lymphoma identified by , , and to the true types.

Table 2 Similarity of the modules of the lymphoma found by , , and .

According to S-Tables 8, 9, 10, 11 and 14, we have the following results: (1) There are 9 true types. (2) Algorithm found 4 modules. (3) Algorithm found 9 modules. (4) Algorithm found 11 modules. (5) Algorithm found 13 modules.

Table 2 reveals the following results: (1) exactly identifies 4 true types. (2) exactly identifies 4 true types. (3) exactly identifies 2 true types. (4) exactly identifies only one true type.

The results above demonstrate that the modules found by and may be the best. In fact, dividing DLBCL according to the algorithms and defines the prognostic categories. This example also indicates that weighted similarities are insufficient for evaluating the quality of module-detecting algorithms because the true cancer types may not be absolutely correct (otherwise, these cancers might already have been resolved).

Gene map of true types

Figure 6 shows the gene map of the true types of lymphoma, which consists of 9 types: DLBCL, germinal centre B, NL. lymph node/tonsil, activated blood B, resting/activated T, transformed cell lines, FL, resting blood B and CLL, which are ordered as listed.

Figure 6

Gene map of the true types of lymphoma.

The nine types are: 1 DLBCL, 2 Germinal centre B, 3 Nl. lymph node/tonsil, 4 Activated blood B, 5 Resting/activated T, 6 Transformed cell lines, 7 FL, 8 Resting blood B and 9 CLL.

Figure 6 reveals the following results: (1) All of the types are distinguishable because they are defined by different blocks of genes. (2) All of the types except DLBCL are expressed by different sets of genes. (3) The type DLBCL is a large set; however, it is not well-expressed. (4) Four types (germinal centre B, NL. lymph node/tonsil, resting/activated T and transformed cell lines) are highly expressed by a set of many genes; thus, the blocks of genes expressing these types are large. (5) Except for DLBCL, the 8 remaining types are highly expressed by their corresponding blocks of genes.

These results imply that DLBCL is not a well-defined type, as will be shown in the gene map of the modules of lymphoma identified by the algorithm .

Gene map of the modules by modularity maximisation

Figure 7 depicts the gene map of the modules of lymphoma identified by the modularity maximisation algorithm .

Figure 7

Gene map of the modules of lymphoma identified by .

Details of the 4 modules are provided in the supplementary information.

Figure 7 and S-Table 9 reveal the following properties: (1) Algorithm identified 4 modules. (2) The four modules identified by are distinguishable by different sets of genes.

(1) indicates that the modules identified by are far from the true types. (1) and (2) imply that gene expression patterns alone are insufficient for evaluating the modules identified by a community detection algorithm.

Gene map by Infomap

Figure 8 depicts the gene map of lymphoma classified by algorithm .

Figure 8

Gene map of the modules of lymphoma identified by .

Details of the 9 modules are provided in the supplementary information.

Figure 8 and S-Table 10 reveal the following results: (1) 9 modules are found. (2) Each of the 9 modules is defined by a unique set of genes. (3) The large DLBCL type is divided and the DLBCL samples are assigned to 6 modules.

(3) is interesting. As we mentioned before, the DLBCL type is too large and may not be a well-defined true type. Here, we see that the algorithm assigned the DLBCL cell samples to 6 modules. We will further analyse the divisions of the DLBCL samples using clinical data.

Gene map of the modules by structural entropy minimisation algorithm

Our algorithm identified 11 modules for lymphoma and the results are provided in the supplementary information.

Figure 9 depicts the gene map of lymphoma classified by our algorithm .

Figure 9

Gene map of the modules of lymphoma identified by .

Details of the 11 modules are provided in the supplementary information.

Figure 9 and S-Table 11 reveal the following properties: (1) Modules 1, 2, 3, 7, 8, 9, 10 and 11 essentially correspond to transformed cell lines, activated B-like DLBCL, GC B-like DLBCL, activated blood B, resting/activated T, FL, resting blood B and CLL, respectively. (2) The DLBCL type is essentially divided into modules 2, 3 and 6. (3) Except for module 3, which contains the subtype GC B-like DLBCL, every module is highly expressed by a significantly large set of genes. (4) Module 2 is the subtype activated B-like DLBCL and is highly expressed by a set of more than 300 genes. (5) Module 3 contains the subtype GC B-like DLBCL (except for DLCL-0011) and is large. However, the module is well-expressed only by a set of fewer than 100 genes. This finding could be caused by (i) the expression of the subtype GC B-like by only a small set of genes or (ii) an incomplete current gene expression array. (6) Module 6 contains a subset of DLBCL and its biological and medical classification is unknown. However, our two-dimensional gene map shows that the module is highly expressed by a set of more than 290 genes. (7) Module 4 is a combination of GC B-like DLBCL and germinal centre B. Our gene map shows that the module is highly expressed by a set of more than 300 genes. (8) Module 5 is a combination of activated B-like DLBCL and NL. lymph node/tonsil. Our gene map shows that the module is highly expressed by a set of more than 400 genes, implying that it is a biologically meaningful type. (9) Module 8 is the resting/activated T type and is highly expressed by a set of more than 1,800 genes. (10) Module 11 is the CLL type and is highly expressed by a set of more than 450 genes.

These results imply that modules 2, 3 and 6 could be new subtypes of DLBCL and modules 4 and 5 could be new subtypes of lymphoma. We will verify these results by clinical data analyses.

Three-dimensional gene map

Figure 10 depicts the gene map of the refined classification of lymphoma derived by our algorithm , for which the details of modules and submodules are provided in the supplementary information. Figure 10 and S-Table 12 establish the three-dimensional gene map from the types and subtypes to the gene expression patterns, which shows the types and subtypes of lymphoma and demonstrates that almost all of the subtypes may have a biological meaning related to the classification of tumour types and subtypes of lymphoma. In particular, it predicts some remarkable subtypes of DLBCL and lymphoma, which will be verified by clinical data analyses.

Figure 10

Gene map of the modules of lymphoma identified by .

The 13 modules and their submodules are provided in the supplementary information.

Remark: Interestingly, all of the algorithms isolate DLCL-0009 from the other DLBCL samples, although the reason remains unclear.

Gene Map of Multi-tissues

Similarity

Table 3 describes the similarity of the modules of the multi-tissues identified by the algorithms , , and . Table 3 shows the following results: (1) Algorithm exactly identifies 1 true type and approximates the other three true types with similarities greater than 0.94. (2) Algorithm exactly identifies 1 true type and approximates the other three true types with similarities greater than 0.91. (3) Algorithm identifies 1 true type and approximates the other three true types with similarities greater than 0.85. (4) Algorithm approximates all the true types with similarities greater than 0.76.

Table 3 Similarity of the modules of multi tissue found by , , and .

According to S-Tables 17, 18, 19, 20 and 22, we have the following results: (1) There are 4 true types. (2) found 5 modules. (3) found 6 modules. (4) found 4 modules. (5) found 7 modules, including 3 singletons, BR_U16, BR_UX7 and LU_A17T.

Gene map of true types

Figure 11 depicts the gene map of the true types of the multi-tissues ordered according to the types BR, PR, LU and CO.

Figure 11

Gene map of the true types of multi-tissues.

The types 1, 2, 3 and 4 are the BR, PR, LU and CO, respectively.

As shown in Fig. 11, each of the types is well-expressed by the gene expression map and all of the types are distinguishable.

Gene map of the classification by modularity maximisation

Figure 12 depicts the gene map of the modules of the multi-tissues identified by the modularity maximisation algorithm .

Figure 12

Gene map of the modules of multi-tissues identified by .

Details of the 5 modules are provided in the supplementary information.

Figure 12 and S-Table 18 reveal the following results: (1) identified 5 modules and one is a true type. (2) Except for the true type, the other 4 modules are not well-defined by a distinct set of genes.

Gene map by Infomap

Figure 13 depicts the gene map of the multi-tissue dataset according to algorithm .

Figure 13

Gene map of the modules of multi-tissues identified by .

Details of the 6 modules are provided in the supplementary information.

Figure 13 and S-Table 19 reveal the following results: (1) identified 6 modules and one is a true type. (2) Each module is well-defined by a distinct set of genes. (3) The 6th module contains only one sample, BR_U16.

Gene map of the multi-tissues by structural entropy minimisation algorithm

Figure 14 depicts the gene map of the multi-tissue dataset provided by our algorithm . Figure 14 and S-Table 20 reveal that the modules 1–4 here are the same as or highly similar to BR, CO, LU and PR, respectively.

Figure 14

Gene map of the modules of multi-tissues identified by .

The 4 modules are exactly the same as or highly similar to BR, CO, LU and PR, respectively; details are provided in the supplementary information.

Three-dimensional gene map

Figure 15 depicts the gene map of the refined module and submodule classification of the multi-tissue dataset according to our algorithm .

Figure 15

Gene map of the modules of multi-tissues identified by .

The 7 modules and their submodules are described in the supplementary information.

Figure 15 and S-Table 21 reveal the following results: (1) Module 1 is essentially of the type BR. (2) Module 2 is of the type CO and consists of 3 distinguishable submodules. (3) Modules 3, 4 and 7 are singletons and consist of BR_U16, BR_UX7 and LU_A17T, respectively. (4) Module 5 is of the PR type and consists of 3 distinguishable submodules. (5) Module 6 is of the type LU and consists of 4 distinguishable submodules. (6) Each submodule is defined by a significantly large set of genes. (7) Every gene pattern defining a submodule fails to define any other module or submodule. (8) The three cells, BR_U16, BR_UX7 and LU_A17T, are expressed by sets of more than 1,500, 300 and 300 genes, respectively.

These results demonstrate that our three-dimensional gene map shown in Fig. 15 provides a high-definition, one-to-one map from the submodules of true cell types to the gene expression patterns of the multi-tissues and indicates that BR_U16, BR_UX7 and LU_A17T are remarkable cells that may play special roles in the classification of multi-tissues.

DLBCL Submodules Identified by and Define Prognostic Categories

We used the DLBCL clinical data from Alizadeh et al.2 to analyse the overall survival times, survival ratios and IPI scores of the submodules of the DLBCL type identified by the algorithms , , and .

Submodules identified by

S-Tables 44 and 45 show the statistical survival times, survival ratios and IPI scores of the DLBCL submodules identified by the algorithm.

S-Tables 44 and 45 reveal the following results:

  1. Except for DLCL_0041 and DLCL_0009, the DLBCL samples are divided into modules 2 to 6.

  2. Module 2 essentially represents activated B-like DLBCL samples. The overall survival time is 26.4 months, the overall survival ratio is 20% and the IPI scores are 2 or 3.

  3. Module 3 contains 12 samples, including GC B-like DLBCL samples. The overall survival time is 49.6 months, the survival ratio is 55% and the IPI scores are 3 or 4.

  4. Module 4 contains 4 GC B-like DLBCL samples. The overall survival time is 85.7 months, the survival ratio is 100% and the IPI scores are 0 or 1.

  5. Module 5 contains 4 activated B-like DLBCL samples. The overall survival time is 30 months, the survival ratio is 25% and the IPI scores are 2 or 3.

  6. For modules 2 to 5, most DLBCL samples within the same module share similar survival times, survival indicators and IPI scores.

  7. Module 6 contains 10 samples. The overall survival time is 37 months, the survival ratio is 40% and the IPI scores range from 0 to 4.

  8. The DLBCL samples in distinct modules among modules 2 to 6 have different overall survival times, survival ratios and IPI scores.

These results demonstrate that most of the DLBCL samples in each of the modules 2–5 share similar survival times, survival indicators and IPI scores; indicate that the samples in different modules have significantly different overall survival times, survival ratios and IPI scores; and show that the samples in module 6 have divergent survival times, survival indicators and IPI scores. These results indicate that the classification of the DLBCL samples in modules 2 to 5 is interpretable and distinguishable in clinical practice. However, module 6 is not well-defined in clinical practice.
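Per-module statistics of this kind can be aggregated with a routine of the following shape; the record format and field names are illustrative assumptions, not the clinical dataset's schema.

```python
def module_survival_summary(samples):
    """Aggregate per-module survival statistics of the kind reported
    above (survival times in months).

    samples: list of dicts with keys 'module', 'months', 'alive' (0/1)
    and 'ipi' (the IPI score).
    """
    summary = {}
    for s in samples:
        m = summary.setdefault(s["module"], {"months": [], "alive": 0,
                                             "n": 0, "ipi": set()})
        m["months"].append(s["months"])
        m["alive"] += s["alive"]
        m["n"] += 1
        m["ipi"].add(s["ipi"])
    return {k: {"mean_months": sum(v["months"]) / v["n"],
                "survival_ratio": v["alive"] / v["n"],
                "ipi_scores": sorted(v["ipi"])}
            for k, v in summary.items()}
```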

Submodules identified by

S-Tables 46 and 47 show the statistical survival times, survival ratios and IPI scores of the DLBCL submodules identified by .

S-Tables 11 and 14 show that refines the modules identified by . Therefore, the submodules of lymphoma identified by correspond to the submodules of the modules identified by . In particular, the DLBCL samples are divided into a number of submodules by .

S-Tables 46 and 47 reveal the following results:

  1. The DLBCL samples in each of the submodules 2.2, 3.1, 3.3, 4.1, 4.3, 5.1, 6.1, 6.2 and 8.1 are similar to one another in survival times, survival indicators and IPI scores.

  2. However, the DLBCL samples in submodules 3.2, 7.1, 7.2, 8.2 and 8.3 are divergent in survival times, survival indicators and IPI scores.

  3. The overall survival times, survival ratios and IPI scores of most of the submodules are distinguishable.

Therefore, many of the submodules of the DLBCL samples identified by are interpretable by the similarity of survival times, survival indicators and IPI scores for the cell samples within the same submodule and distinguishable by overall survival times, survival ratios and IPI scores for different submodules.

Submodules identified by and

S-Tables 48 and 49 describe the statistical survival times, survival ratios and IPI scores of the DLBCL submodules identified by .

S-Tables 48 and 49 reveal the following results:

  1. Except for DLCL_0041 and DLCL_0009, the DLBCL samples are divided into modules 1, 7 and 8.

  2. Module 1 contains 32 DLBCL samples. The overall survival time is 47.58 months, the survival ratio is 52% and the IPI scores range from 0 to 4.

  3. Module 7 contains 6 DLBCL samples. The overall survival time is 32.68 months, the survival ratio is 33% and the IPI scores are 2 or 3.

  4. Module 8 contains 3 DLBCL samples. The overall survival time is 20.07 months, the survival ratio is 0% and the IPI scores are 1, 2 and 3.

Therefore, except for module 7, the DLBCL modules identified by are neither interpretable nor distinguishable in clinical practice.

S-Table 50 describes the statistical survival times, survival ratios and IPI scores of the DLBCL submodules identified by .

S-Table 50 shows that the algorithm fails to identify submodules for the DLBCL samples.

Our results can be summarised as follows:

  • Algorithm divides the DLBCL samples into several modules and for almost all of the modules, most of the samples in the same module share similarity in survival times, survival indicators and IPI scores and the samples in different modules exhibit distinct overall survival times, survival ratios and IPI scores.

  • Algorithm further refines the modules identified by into submodules and for most submodules, the samples in the same submodule share similar survival times, survival indicators and IPI scores and the samples in different submodules are distinguishable by the overall survival times, survival ratios and IPI scores.

The results above are better than expected. Surprisingly, our algorithms are derived from pure mathematics and we did not use any information from biology or concepts from learning theory. The results may provide new insights into cancer study and therapies. For example, our results may indicate that a tumour type in different stages may correspond to different subtypes that have different gene expression patterns and are in different prognostic categories. Furthermore, our theory and algorithms have potential applications in a wide range of fields, including computer science, networking and data processing.

However, the algorithms and fail to divide the DLBCL samples into clinically meaningful subtypes.

In summary, in the networks constructed from the gene expression profiles on the basis of the one-dimensional structural entropy minimisation, the algorithms and on the basis of the two- and three-dimensional structural entropy minimisation may identify the subtypes of tumours that are clinically interpretable and distinguishable. The results demonstrate that structural entropy minimisation, including the one-, two- and three-dimensional cases, could be the correct principle for defining the natural modules in nature, such as for defining cell types and subtypes of cancers.

Gene Map of the New Leukemia Data

Similarity

Tables 4 and 5 describe the similarity of the modules of the new leukemia data identified by the algorithms , , and .

Table 4 Similarity of the new leukemia data found by , , and (part 1).
Table 5 Similarity of the new leukemia data found by , , and (part 2).

Tables 4 and 5 show the following results: (1) Algorithm finds 9 modules which approximate 6 true types with similarity greater than 0.8. (2) Algorithm finds 27 modules which approximate 7 true types with similarity greater than 0.8. (3) Algorithm finds 17 modules which approximate 8 true types with similarity greater than 0.8. (4) Algorithm finds 18 modules which approximate 7 true types with similarity greater than 0.8.

According to S-Tables 55–91, we observe the following results: (1) There are 18 true types. (2) Algorithm found 9 modules. (3) Algorithm found 27 modules. (4) Algorithm found 17 modules. (5) Algorithm found 18 modules.

Results (1) to (5) demonstrate that our algorithms and not only approximate the true types well, but also give the correct number of subtypes. However, the algorithms and cannot even estimate the correct number of true types when the number of cell samples is large.

Gene map of true types

Figure 16 depicts the gene map of the true types of the new leukemia data.

Figure 16

Gene map of the true types of new test leukemia data.

There are 18 subtypes of the cell sample network.

Figure 16 shows that all the 18 subtypes are distinguishable, that C1, C3, C8, C13 and C18 are not well-defined by the corresponding gene patterns and that all the other subtypes are well-defined by a unique gene pattern.

Gene map of the classification by modularity maximisation

Figure 17 depicts the gene map of the modules of the new leukemia data identified by the modularity maximisation algorithm .

Figure 17

Gene map of the modules of the new test leukemia data identified by .

Details of the 9 modules are provided in the supplementary information.

Figure 17 reveals the following results: (1) The algorithm found 9 modules. (2) Modules 4, 5, 6, 7 and 8 are well-defined by the corresponding gene patterns and all the other modules are not well-defined by the corresponding gene patterns.

Gene map by Infomap

Figure 18 depicts the gene map of the new leukemia data according to algorithm .

Figure 18

Gene map of the modules of the new test leukemia data identified by .

Details of the 27 modules are provided in the supplementary information.

Figure 18 reveals the following results: (1) The algorithm found 27 modules. (2) Most of the modules are not well-defined by the unique corresponding gene pattern.

Gene map of the new test leukemia data by the structural entropy minimisation algorithm

Figure 19 depicts the gene map of the new leukemia data provided by our algorithm .

Figure 19

Gene map of the modules of the new test leukemia data identified by .

Details of the 17 modules are provided in the supplementary information.

Figure 19 reveals the following results: (1) The algorithm found 17 modules. (2) All the modules are distinguishable by the corresponding gene patterns. (3) Modules 3 and 10 are not well-defined by unique corresponding gene patterns, whereas all the other modules are well-defined by their corresponding unique gene patterns.

Three-dimensional gene map

Figure 20 depicts the gene map of the new test leukaemia data provided by our three-dimensional structural entropy minimisation algorithm.

Figure 20

Gene map of the modules of the new test leukaemia data identified by our three-dimensional structural entropy minimisation algorithm.

The 18 modules and their submodules are described in the supplementary information.

Figure 20 reveals the following results: (1) The algorithm found 18 modules, each of which consists of a few submodules. (2) The 18 modules are distinguishable by the corresponding gene patterns and most of the modules are well-defined by the corresponding unique gene patterns. (3) The submodules of a module are distinguishable by the corresponding gene patterns.

Hypotheses and Criteria of Tumour Classification

Li, Li and Pan17 and Li et al.18 have shown that the minimisation of two-dimensional structural entropy is the principle for discovering natural communities in networks. This result indicates that the minimisation of two-dimensional structural entropy or equivalently, the minimisation of non-determinism of structures, is the principle of network self-organisation.

Our results reveal that the same principle holds true for defining tumour types and subtypes. We thus propose the following hypothesis.

Hypothesis for defining the type and subtype of tumours

  • The construction of a network on the basis of gene expression profiles provides a global approach to defining the types and subtypes of tumours by network algorithms.

  • One-dimensional structural entropy minimisation is the principle for constructing a network from unstructured data such as gene expression profiles.

  • The partitioning of cell sample graphs provides an approach to defining tumour types and subtypes.

  • Two-dimensional structural entropy minimisation is a principle of tumour-type classification.

  • Three-dimensional structural entropy minimisation is a principle for defining tumour subtypes.

According to the definition, structural entropy minimisation minimises the non-determinism of structures within networks. This is a new principle for the self-organisation of many networks in nature and society. Here, we verify that the same principle holds for tumour classification. Furthermore, we conclude that high-dimensional structural entropy minimisation is the principle for structuring and processing big data in general. However, further studies will be required to provide a resolution for this hypothesis.
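As an illustration of the two-dimensional case, the structural entropy of a weighted graph under a candidate partition can be computed directly from its definition (intra-module uncertainty plus the cost of crossing module boundaries). The following Python sketch is our illustration, not the paper's implementation; the function names are ours.

```python
import numpy as np

def two_dim_structural_entropy(adj, partition):
    """Two-dimensional structural entropy of a weighted graph under a partition.

    For each module j with volume V_j and boundary weight g_j (total weight
    of edges leaving the module):
        H = sum_j [ -sum_{v in j} (d_v/vol) * log2(d_v/V_j) ]
          + sum_j [ -(g_j/vol) * log2(V_j/vol) ]
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)          # weighted degrees d_v
    vol = deg.sum()                # volume of the whole graph
    h = 0.0
    for module in partition:
        module = list(module)
        v_j = deg[module].sum()                        # module volume V_j
        internal = adj[np.ix_(module, module)].sum()   # internal edges, counted twice
        g_j = v_j - internal                           # weight of boundary edges
        for v in module:
            if deg[v] > 0:
                h -= (deg[v] / vol) * np.log2(deg[v] / v_j)
        if g_j > 0:
            h -= (g_j / vol) * np.log2(v_j / vol)
    return h
```

On a graph with two triangles joined by a single bridge edge, the two-module partition yields a lower entropy than the trivial single-module partition; the single-module value coincides with the one-dimensional structural entropy of the graph, which is why minimising over partitions measures how much structure the network really has.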

Unlike the high-dimensional structural entropy, the definition of the one-dimensional structural entropy does not by itself imply any principle, because it is determined purely by the distribution of the edges and their weights. However, our results demonstrate that one-dimensional structural entropy minimisation can serve as a principle for detecting the natural, or true, networks that evolve in nature and society. This discovery is interesting because it means that, although many factors affect the evolution of a real-world network, such a network nevertheless follows certain principles, for example, the principle of minimising non-determinism or uncertainty explored in this research. Furthermore, our results demonstrate that a real-world network may not simply be one of the networks generated purely by a random ensemble; it is, in fact, the network that follows the random ensemble towards minimisation of the one-dimensional structural entropy. This discovery may have deep implications in a wide range of disciplines. For example, it implies that there is a general principle controlling the formation and evolution of the natural structure of a real-world complex system from random variations. This understanding is analogous to Darwin's theory of evolution: natural selection is the principle that controls the evolution of species from random variations16. More importantly, our results demonstrate that one-dimensional structural entropy minimisation is the principle that controls the evolution of natural structure from random variations. Therefore, the one-dimensional structural entropy could provide a quantitative measure for understanding the laws of nature. (For this reason, we suggest a future project to investigate the relationship between the structural entropy minimisation principle and Darwin's natural selection.)

Nevertheless, our results indicate that structural entropy minimisation, including the one-, two- and three-dimensional cases, is the principle for a new theory of big data in future computer science.

Considering the evaluation of an identified cancer type or subtype, a single criterion cannot verify the accuracy of a defined type or subtype of a cancer because cancer is a disease that has not been fully elucidated. Our results suggest that the true type or subtype of a cancer must simultaneously satisfy the criteria listed below.

Criteria for verifying a type or subtype of a tumour

A defined type or subtype T is true if it satisfies the following criteria:

  1. (Similarity) T is similar to a type T′ defined in cancer biology.

  2. (High-definition gene mapping) There is a set B of genes that is highly expressed in T but fails to express in any other cell type.

  3. (Interpretability) Most of the cell samples in T share similar survival times, survival indicators and IPI scores.

  4. (Prognostic distinction) Type T exhibits overall survival times, survival ratios and IPI scores that are distinct from those of other types in medical practice.

Theoretically, if an identified type or subtype T is highly expressed by a unique set of genes, then T should have a biological meaning; if it does not, we must explain why T is biologically trivial despite being highly expressed by a unique and significantly large set of genes. According to this understanding, criteria (1–3) above can be useful in identifying tumour types and subtypes. Criterion (4) requires that theoretical types or subtypes be verified in medical practice.

Our results presented here demonstrate that the modules and submodules identified by our algorithms for different cancer types simultaneously satisfy the four criteria (1–4).

Discussion

Construction of a cell sample graph

Our theory depends on the construction of a cell sample graph determined by gene expression profiles. Our method is realised by appropriately choosing the parameter k. The choice of k must ensure that trivial or noisy profile weights are removed and that nontrivial profile weights are maintained. Therefore, k must not be too large or too small. More importantly, the choice of k depends on the data of the gene expression profiles. The principle we proposed here is to choose the k such that the one-dimensional structural entropy of the generated graph is the least among all stable points, that is, the points at which minimal one-dimensional structural entropy is achieved.
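As a sketch of this principle, the following Python code (our illustration, not the paper's implementation; function names are ours) builds a top-k graph from a similarity matrix and selects the k whose graph has the least one-dimensional structural entropy. Note that it simply takes the minimum over the candidate values, whereas the procedure described above selects the least among the stable points of the entropy curve.

```python
import numpy as np

def knn_graph(similarity, k):
    """For each cell sample, keep the edges with the top-k similarity weights;
    an edge survives if either endpoint keeps it."""
    sim = np.array(similarity, dtype=float)
    np.fill_diagonal(sim, 0.0)             # no self-loops
    adj = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        top = np.argsort(sim[i])[::-1][:k]  # indices of the k largest weights
        adj[i, top] = sim[i, top]
    return np.maximum(adj, adj.T)           # symmetrise

def one_dim_structural_entropy(adj):
    """H1(G) = -sum_v (d_v/vol) * log2(d_v/vol) over the weighted degrees d_v."""
    deg = adj.sum(axis=1)
    deg = deg[deg > 0]
    p = deg / deg.sum()
    return float(-(p * np.log2(p)).sum())

def choose_k(similarity, k_values):
    """Pick the k whose top-k graph has the least one-dimensional structural entropy."""
    entropies = {k: one_dim_structural_entropy(knn_graph(similarity, k))
                 for k in k_values}
    return min(entropies, key=entropies.get), entropies
```

In this form, trivial or noisy profile weights are dropped by the top-k cut-off, and the entropy curve over candidate k values makes the trade-off described above explicit.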

Our algorithm for choosing k is an approximate realisation of the general principle that one-dimensional structural entropy minimisation is the correct principle for networking unstructured data. It works very well for the cell sample graph construction in the present paper.

However, the algorithm has some disadvantages; for example, it requires that each cell sample keep the edges with the top k weights. In the real world, different cell samples may need to keep different numbers of edges. For this reason, we believe that there must be better methods of realising the general principle. The general principle allows us to construct interaction graphs for various kinds of data and to sparsify networks. More optimal cell sample graphs may be constructed from the gene expression profiles, and these graphs may allow our algorithms to provide improved cancer classification. We believe this remains a grand challenge for future computer science.

Challenges

The two- and three-dimensional gene maps developed by our algorithms identify biologically and medically meaningful subtypes of tumours such that each subtype is defined by a unique gene expression pattern. The results provide new insights for cancer research.

Our three-dimensional gene map also demonstrates that, for a biologically and medically meaningful tumour subtype X, there is usually a large gene set B that defines X. In this case, the gene expression profiles fail to differentiate the genes within B. For cancer studies, however, it is important to select a small number of genes from B such that this small set of genes determines the subtype X.

Therefore, the three-dimensional gene map also suggests a fundamental challenge: for a subtype X, a small number of genes that essentially determines the subtype X should be identified.

Our three-dimensional gene map indicates that the gene expression profiles do not help to resolve this challenge.

Conclusions

In this study, we propose the high-dimensional structural entropy of graphs as a method for constructing networks from gene expression profiles, and we propose heuristic algorithms that detect the natural K-dimensional structure of networks by minimising the K-dimensional structural entropy, that is, the non-determinism of the K-dimensional structure of the networks. Our two-dimensional algorithm identifies the modules of the cell samples for five cancers and healthy tissue, and our three-dimensional algorithm identifies the corresponding submodules. Almost all of the modules and submodules identified by our algorithms are defined by a unique gene expression pattern. Using currently available clinical data for DLBCL lymphoma, we demonstrate that most samples in a DLBCL module or submodule identified by our algorithms share similar survival times, survival indicators and IPI scores, and that distinct modules and submodules are distinguishable in overall survival times, survival ratios and IPI scores. Our results demonstrate that a tumour type may consist of several subtypes satisfying the following criteria: (i) each subtype is definable by a unique gene expression pattern; (ii) most of the samples of the same subtype share similar survival times, survival indicators and IPI scores; and (iii) different subtypes have distinct overall survival times, survival ratios and IPI scores. Our results demonstrate that tumour subtypes satisfying these criteria can be identified by our algorithms, which minimise the K-dimensional structural entropy of networks. Therefore, high-dimensional structural entropy minimisation is the principle for defining tumour types and subtypes. Our algorithms perform deep searches in networks, indicating that networking is the correct approach to defining tumour subtypes and their corresponding gene expression patterns.
Our three-dimensional gene map of cancers provides the first high-definition, one-to-one map between biologically and medically meaningful subtypes and gene expression patterns and our theory may have potential implications in cancer biology.

Our results demonstrate that one-dimensional structural entropy minimisation is the principle for networking of unstructured data and that K-dimensional structural entropy minimisation is the principle for detecting the natural K-dimensional structures of real world networks for K > 1. The principle may have implications in a wide range of disciplines such as physics, biology, computer science, networking and big data processing.

Methods

Normalisation of gene expression profiles

For a gene g, suppose that (a1, a2, ···, an) is the vector of the gene expression profiles of g for all the samples v1, v2, ···, vn. We normalise the gene expression vector as follows:

  1. Set μ = (1/n)·∑i ai and σ = √((1/n)·∑i (ai − μ)²).

  2. For each i, set bi = (ai − μ)/σ.

  3. Set m = max1≤i≤n |bi|.

  4. For each i, set ci = bi/m.

According to the definition above, (c1, c2, ···, cn) is a coded vector of the gene expression vector (a1, a2, ···, an) such that the average value of the ci is 0 and each ci is in the interval [−1, 1].
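The four steps can be sketched directly in code. This is our reading of the procedure: the exact statistics in steps 1 and 3 are assumptions chosen to be consistent with the stated properties of the coded vector (mean 0, values in [−1, 1]).

```python
import math

def normalise_gene(a):
    """Code a gene's expression vector: centre it, scale by the standard
    deviation, then divide by the largest magnitude so the coded values
    have mean 0 and lie in [-1, 1]."""
    n = len(a)
    mu = sum(a) / n                                        # step 1: mean
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / n)   # step 1: std deviation
    if sigma == 0:                                         # constant gene: code as zeros
        return [0.0] * n
    b = [(x - mu) / sigma for x in a]                      # step 2: b_i = (a_i - mu)/sigma
    m = max(abs(x) for x in b)                             # step 3: largest magnitude
    return [x / m for x in b]                              # step 4: c_i = b_i / m
```

Because the bi already have mean 0, dividing by their largest magnitude preserves the zero mean while forcing every coded value into [−1, 1].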

Data analysis

The module and submodule classifications for acute leukaemia, lymphoma and the multi-tissue data are summarised in the supplementary information. We extract the top 10 genes for each of the modules or submodules identified by our algorithms (supplementary information). We also analyse the gene maps of three additional networks of lung cancer and healthy tissue; the results are provided in the supplementary information.

Additional Information

How to cite this article: Li, A. et al. Three-Dimensional Gene Map of Cancer Cell Types: Structural Entropy Minimisation Principle for Defining Tumour Subtypes. Sci. Rep. 6, 20412; doi: 10.1038/srep20412 (2016).