Random Matrix Analysis for Gene Interaction Networks in Cancer Cells

Investigations of topological uniqueness of gene interaction networks in cancer cells are essential for understanding the disease. Although cancer is considered to originate from the topological alteration of a huge molecular interaction network in cellular systems, the theoretical study to investigate such complex networks is still insufficient. It is necessary to predict the behavior of a huge complex interaction network from the behavior of a finite size network. Based on the random matrix theory, we study the distribution of the nearest neighbor level spacings P(s) of interaction matrices of gene networks in human cancer cells. The interaction matrices are computed using the Cancer Network Galaxy (TCNG) database which is a repository of gene interactions inferred by a Bayesian network model. 256 NCBI GEO entries regarding gene expressions in human cancer cells have been used for the inference. We observe the Wigner distribution of P(s) when the gene networks are dense networks that have more than ~38,000 edges. In the opposite case, when the networks have smaller numbers of edges, the distribution P(s) becomes the Poisson distribution. We investigate relevance of P(s) both to the sparseness of the networks and to edge frequency factor which is the reliance (likelihood) of the inferred gene interactions.


Introduction
Human cells own DNA which codes more than 20,000 genes in their nuclei. They turn on/off transcriptions of the genes in accordance with cellular cycles, circumstances and cell types, and process various kinds of molecules such as proteins. There are always more than 10,000 kinds of proteins in a living cell. Among them, the proteins that stay alone are rare. They interact each other and join huge complex interaction networks. The protein interactions sometimes promote their functions and sometimes they inhibit each other. The proteinprotein interaction networks thus regulate the cell behaviors in accordance with their circumstances. In the nucleus of a cell, the genes that code those interacting proteins should have spatial and temporal correlations in their expression patterns. By observing the gene co-expression patterns with high-throughput experiments such as microarrays or next-generation sequencing technologies, we can study the gene interaction networks that are related to human diseases such as cancers. [25] Recent studies have revealed that there are large regions in DNA that do not code any protein, although they are highly transcribed. These transcripts are called non-coding RNA. [22] The importance of such transcripts as regulators of the gene expressions has become widely known to date from various experiments. [7], [12] It is known that the non-coding RNA bind to other transcripts selectively and thus regulate the gene expressions. [16], [30] Micro RNA, which are about 20nt subsets of the non-coding RNA, have also been observe negatively regulating the gene expressions through interactions with other RNA or even with DNA. [4] These interaction networks of various transcripts have important role in cellular cycles including cell development, proliferation, apoptosis and disease. [5] However, behaviors of such complex networks are totally unknown yet.
Relations between human diseases and modifications of the interacting molecular networks have also been extensively studied. [14], [13], [27], [15], [6] For example, several micro RNA behave as inhibitors for specific interactions in the gene network, and they act as potential oncogenes or tumor genes permitting uncontrolled proliferation of damaged cells. [17] Investigations of topological modifications of such transcripts networks in damaged cells are also important in order to discover new biomarkers or to classify the symptoms in detail. [32], [10] The high-throughput experiments on cancer cells provide huge molecular interaction networks, in which 20,000 to two million elements are involved from a single assay. Such experiments became very popular owing to wide distributions of commercial platforms. Moreover, these expression data are accessible on the internet. For example the NCBI GEO (http://www.ncbi.nlm.nih.gov/geo/) provides a public database of gene expression data. [3] By using such databases, it is even possible to perform a meta-analytical study of gene expressions for example in cancer cells.
Computational inference of gene interactions from the expression data begins with statistical methods such as clustering or principal component analysis. Stochastic procedures are inevitable due to experimental noises. Individual interactions between genes are further calculated using algorithms that are mainly based on the probabilistic graphical models. Markov network or Bayesian network models are the main frameworks in the study of gene network classifications, and there are a lot of studies on gene regulatory networks, protein-protein networks and on molecular pathways in a variety of systems. [8], [20], [28], [9], [29], [33] There are several public gene interaction network databases which are based on, for example, the mutual information (ARACNE) [21], the Bayesian network approach (SiGN-BN) [31] and much more. [24], [6] In such databases, the inferred interactions are usually provided with confidence or likelihood factors which manifest certainties of the interactions. The key point of learning the gene network classification is how to improve the choice of the likelihood factors by integrating related informations of the cells.
Also, network inference algorithms usually involve highly time consuming calculations since they follow huge iterative learning processes within stochastically chosen elements of the network. Thus the inferred network size is rather small. On the other hand, for investigations of a disease such as cancer, a macroscopic view of modifications in the huge complex network topology is required. We have to balance the amount of computing resources and the choice of adequate thresholds of the likelihood factors in various aspects through the computations. Some estimations are necessary in order to discuss whether a certain change in the thresholds is related to a topological modification of the networks.
In this paper, we discuss how sparseness of gene networks, thresholds of likelihood factors of edges and sample sizes in expression data are related to the changes in global topologies of the interacting gene networks by using the method of the random matrix theory. We also discuss possibilities of improving the huge network inference algorithms with this method.
The random matrix theory (RMT) [1] has been applied to a variety of fields not only in physics but also in biology. There are several studies in which RMT is applied in the analysis of complex networks including protein-protein interactions or gene expression correlation patterns. [18], [19] We have also studied protein-protein interaction networks in many organisms such as human, yeast, and Arabidopsis with the random matrix method and obtained a universal (system independent) behavior of P (s): the distribution of the nearest neighbor level (NNL) spacings of corresponding interaction matrices. The NNL spacings s are the spacings between two adjacent eigenvalues of the gene (or Protein-Protein) interaction matrices. The universal behavior of P (s) is called the Wigner distribution in RMT. Moreover, in the previous study for protein-protein interactions that are related to oncogenes or tumor suppressor genes in human cells, we have found that the Wigner distribution changes to the Poisson distribution. From these studies we consider that RMT gives a clue to analysis of the huge complex molecular interaction networks in cells.
In this work, we apply the same method for the Cancer Network Galaxy (TCNG) gene interaction networks. The gene interactions were computationally inferred with the non-parametric Bayesian algorithm which is named SiGN-BN. [31] Gene expression data from various cancer cells are used for the Bayesian network calculations in TCNG. In this database, each inferred gene interactions (directed edges) has a factor called the edge frequency, which indicates the reliability (or the likelihood) of the gene interaction. We study distributions of NNL spacings P (s) for the 256 gene networks in TCNG separately and investigate how the universal behaviors are modified due to such edge attributes.

The Cancer Network Galaxy (TCNG) database
The Cancer Network Galaxy (TCNG) (http://tcng.hgc.jp) is the database of computationally inferred gene interaction networks from the NCBI GEO expression data that are related to human cancer samples. 256 GEO entries are selected for the gene interaction inference calculation based on the Bayesian network model. TCNG (Release 0.14 built on Wed Mar 30 15:00:31 2016) currently stores more than 16 million edges (interactions) between 24,907 nodes (genes). The edges are given with directions and the edge frequency factors as edge attributes. Learning of Bayesian networks are heavily time and memory consuming computations. With the use of the algorithm named NNSR (the neighbor node sampling and repeat), they have obtained considerably large gene interaction networks using the RIKEN AICS K supercomputer. [31] In the Bayesian network model, directed edges between two nodes express causal relationships between them. In the case of gene interaction networks, the directions of edges may infer regulatory relationships between genes. However, as the first step, we study the gene interaction networks with the method of Gaussian orthogonal ensemble (GOE) RMT by neglecting the directions of the edges.

The random matrix theory
Since late 1950s, the random matrix theory was developed in the studies of spectra emissions from heavy nuclei by Wigner, Dyson and Mehta. [23] So far it has been applied to a large variety of fields in physics, mathematics and much more. [1] A lot of experimental studies in real systems also have been done with the RMT, such as in mesoscopic systems and quantum chaotic systems. The RMT has also been applied in various biological systems including protein-protein interaction networks, and the co-expressing gene networks in many organisms. [18] There are thee random matrix groups depending on symmetries, the Gaussian orthogonal ensembles (GOE), the Gaussian unitary ensembles (GUE), and the Gaussian symplectic ensembles (GSE). In the spectral analysis of heavy nuclei, for example, energy levels (eigenvalues) of the Hamiltonian matrices are investigated. So the symmetry of matrices is related to the symmetry of the system.
In the limit of large matrix size : N → ∞ , the distribution of spacings of adjacent eigenvalues (NNL spacings) P (s) becomes a universal function. Here the term "universal" means that the distribution is independent of any detail of the system and is only affected by its symmetry. For the above three symmetry groups, P (s) are written together as, where A = (1 + β)α and D is the average density of the eigenvalues and Γ(x) is the Gamma function. [23], [2] The β is 1 in GOE, 2 in GUE and 4 in GSE case, respectively. In the GOE case (β = 1), It is called the Wigner distribution. The Wigner distribution of NNL spacings infers that the eigenvalues have mutual correlations and repel each other. It is obvious from the small s behavior where P (0) = 0. In the opposite case, when the eigenvalues do not correlate, P (s) becomes the Poisson ( or exponential, in the continuum limit) distribution. In this case β = 0 and from Eq. (1), In many experimental studies including numerical simulations, such universal behaviors have been widely observed. Since the matrix size N is limited in real systems, we have to apply a method called unfolding which is a statistical procedure for rescaling the eigenvalues. For the details of this method, one can consult the text book. [26]

The interaction matrices for gene networks
In this study, we investigate P (s) : the distributions of NNL spacings of gene interaction matrices. The gene interaction matrix : M is evaluated as followings. From TCNG, the gene interaction networks were retrieved. Each gene interaction network is a list of interacting gene pairs. The directions of the inferred edges are omitted. The gene interaction matrix elements M ij is given by There are data on several other platforms from Agilent Technologies and Illumina Inc., etc. The numbers of inferred edges in the networks are widely distributed. The median is about 38, 000, the minimum is about 13, 000, and the maximum is about 64, 000. The edge frequencies (likelihood factors) take values from 0.2 to 1.0. In TCNG, the edges that have less than 0.2 edge frequencies have been omitted from the lists. We categorize the edges by the edge frequencies into four quartiles and P (s) are calculated separately for each of them.
Classifications and extractions of the data have been done using SQLite, and the eigenvalue calculations are done by MATLAB (R2016a) scripts. For the calculations of P (s), the eigenvalues are rescaled with the statistical method called unfolding. [26] This procedure is briefly described as follows.
We first select a part of succeeding eigenvalues and rescale them in order to set the mean density of the eigenvalues to be D = 1. From the 8000 eigenvalues, we selected about 300 succeeding eigenvalues, nearly 5% of the total number of eigenvalues. Then a staircase function N (d) is calculated, where d is the rescaled eigenvalue. By definition N (d) is number of eigenvalues less than d. We further rescale d intod, so N (d) to become a linear function ofd. For the evaluations of linear functions N (d), we used MATLAB built-in function polyfit.

The distributions of NNL spacings depend on gene network sparseness
We first show results of the gene networks where numbers of edges are less than the median 38, 000 of the 256 gene networks from TCNG. Figure 1  In Figure 2, we show distributions P (s) for nine gene interaction networks in the same class of gene network sparseness, where the numbers of the edges are less than 38, 000. For each of these gene interaction networks, the Poisson distribution is observed. It is obviously seen from the inset of Fig. 2 where the P (s) are plotted in a logarithmic scale. We seldom see the Wigner distribution of P (s) in this sparse group of gene interaction networks for all the quartiles of the edge frequencies. We also note that the Poisson distribution is only seen in several sparse networks independent of the sample size for the Bayesian network inference.   In Figure 3, we show the NNL spacings distribution P (s) for a dense gene interaction network (ID:165, NCBI GEO accession number : GSE29013), where the number of edges is 51, 702. This gene interaction network inference is done with the gene expression data of 51 samples from lung cancer cells. The interaction matrix is calculated with edges that have the edge frequency factors in the highest quartile (0.5 − 1.0). The Wigner distribution is observed.
We also show the distributions of NNL spacings for six similar size gene interaction networks in Figure 4 altogether. In all cases, the P (s) show the Wigner distribution. From eq.(2), the Wigner distribution is typically characterized by the linear behavior in the small s region, P (s) ∼ s. In the inset of Fig.4, we show this small s behavior where the linear behavior is nicely confirmed.
The half of the 256 gene interaction networks in TCNG database have more than 38K edges. The Wigner distribution of P (s) for the interaction matrices is widely observed in almost all of these gene interaction networks in the dense network group. We may say that we observe the universal (system independent) behavior of P (s) in the dense gene networks.
The Wigner distribution of P (s) is also independent of the sample size of the retrieved GEO expression data for the Bayesian network calculations. In

Variation of P(s) due to different edge frequencies
As shown in Fig.5, the numbers of samples that are used for the Bayesian network computations of gene interactions varies from 50 to 559 microarray data samples. The number of samples is, for example, the number of different conditions of the cancer cells or the number of patients whose tumor cell is taken in surgery. In SiGN-NNSR algorithm, the number of data samples recommended for the Bayesian network calculation is more than 100. However, the computation time of the Bayesian networks also grows heavily as the number of samples increases. We also might have a lot of overlooked or under-estimated gene interactions due to the experimental noise.
In order to see dependence of P (s) on the experimental (including numerical) bias, we compare P (s) for several edge frequency groups which reflect statistical reliability of the inferred edges.

Crossover from the Poisson distribution to the Wigner distribution
In Figure 6, the distribution P (s) is plotted for two groups of the edge frequencies in the case of ID:92 gene interaction network from TCNG. The total number of inferred edges is 26, 717 and the sample size is 53 of primary stage UICC II colon cancer. This network belongs to the sparse network group in which the number of inferred edges are less than 38K. As shown in Fig. 6, when the edge frequency is small, the Poisson distribution is observed. However, with the large frequency edges (more confident edges), the P (s) behavior changes to the Wigner distribution. Note that the larger frequency group (0.3 − 1.0) contains twice as many edges than the lowest (0.2 − 0.25) frequency group. We consider that this crossover of P (s) between the Poisson and the Wigner distribution is due to loss of some special edges that results in dissociation of some sub-networks. It is also important to note that the Poisson distribution of P (s) is known to be due to vanishing of off-diagonal elements of corresponding matrices in RMT. We thus consider that the Wigner distribution of the P (s) infers degeneracy or couplings of several gene interaction sub-networks. This crossover behavior of P (s) might be used as an effective sign which improves Bayesian network based algorithms for gene network inferences in the sparse case.

The Wigner distribution of P(s) is independent of the edge frequencies compared to the Poisson distribution
In Figure 7, we show P (s) of TCNG ID:236 (the same data used in Fig. 1) gene interaction network for four quartiles of the edge frequency groups separately. This network belongs to the sparse network group. The number of edges are the same in each quartile. The Poisson distribution is observed only in the largest quartile of the edge frequencies 0.5 − 1.0 . In the sparse gene networks, the degeneracy of sub-networks is seldom seen since the Wigner distribution of P (s) is rarely obtained. (Fig. 6 is one of these rare cases, for example.) Furthermore, the Poisson distribution is very sensitive to the edge frequencies (likelihood). We consider that the Poisson behavior of P (s) can be used as a measure of reliance of thresholds of the edge frequencies in sparse gene networks.
In Figure 8, we plot P (s) for the gene interaction network ID:150 for the edge frequency quartiles 0.2−0.25, 0.25−0.3, 0.3−0.5 and 0.5−1.0. In this gene network, the gene expression samples used for the edge inferences was 95 and the number of total edges is 45, 411. It is one of the dense networks. We see that convergence with the Wigner distribution is comparable in all frequency groups compared to the Poisson distribution case in Fig. 7. The agreement between the statistically evaluated P (s) and the universal Wigner distribution is almost independent of the edge frequencies in the dense networks. We consider this convergence is due to the large complexity of the gene interaction network. We might also say that loss of several edges due to a shift of the edge frequency factors does not decouple sub-networks in the dense case compared to the case in Fig. 6. The topological uniqueness of the gene network where the crossover behavior is observed in Fig. 6 (ID:92) is obvious.
We also note that in the gene interaction networks that have less than 15,000 edges in TCNG databse, P (s) show coincidence neither with the Poisson nor with the Wigner distribution for all edge frequency quartiles.

Discussion
We have studied the gene interaction networks which are computationally obtained from gene expression experiments of various human cancer cell. With the method of random matrix theory, the universal Wigner distribution of the nearest neighbor level spacing P (s) is widely observed in the dense gene interaction networks. On the other hand, in the sparse gene networks, the Poisson distribution of P (s) is obtained. The distribution P (s) are obtained for the size N ∼ 8000 interaction matrices, and the threshold of edge numbers is about 38, 000, where the change between the dense and the sparse behaviors of P (s) occurs.
In a previous study of gene expression experiments with random matrix theory, changes of P (s) behaviors have been observed by extracting edges with correlation factors. [19] As the network is decomposed into sub-networks, P (s) tends to become more like the Poisson distribution. We might say that the sparse gene networks where P (s) show the Poisson distributions manifest decoupling nature of gene interactions in cancer cells.
We have also investigated the relation between the distributions P (s) and the edge likelihood factors (or the edge frequencies) in the Bayesian network inferences of gene interactions. When the number of inferred edges is more than 38, 000 , almost all the gene networks show the Wigner distributions of P (s) independent of the edge likelihood factors. The random matrix method cannot detect the edge attributes in detail in this dense networks.
On the other hand, in the sparse gene networks, the behavior of P (s) is much more sensitive to the edge frequencies. In some of the sparse networks, even the crossover between the Poisson and the Wigner distributions in relation with the edge frequencies is observed. From this crossover behavior, we may say that the sparse gene networks have some universal structure embedded in edge attributes such as the edge likelihood factors. We may say that random matrix method is capable to filter out some edge attributes in the computational inference of sparse gene networks. Relations between the P (s) behavior and the sample size which are used in the Bayesian network inference are still unclear.
In this study, we have totally omitted directions of the edges which may infer gene regulatory relations. However, numbers of studies on systems with random matrix theory show that the universality of P (s) is independent of any detail of each interacting system. We are going to check whether this universality is maintained also in directed biological networks in the near future.