A Statistical Test for Differential Network Analysis Based on Inference of Gaussian Graphical Model

Differential network analysis investigates how the network of connected genes changes from one condition to another and has become a prevalent tool to provide a deeper and more comprehensive understanding of the molecular etiology of complex diseases. Based on the asymptotically normal estimation of large Gaussian graphical model (GGM) in the high-dimensional setting, we developed a computationally efficient test for differential network analysis through testing the equality of two precision matrices, which summarize the conditional dependence network structures of the genes. Additionally, we applied a multiple testing procedure to infer the differential network structure with false discovery rate (FDR) control. Through extensive simulation studies with different combinations of parameters including sample size, number of vertices, level of heterogeneity and graph structure, we demonstrated that our method performed much better than the current available methods in terms of accuracy and computational time. In real data analysis on lung adenocarcinoma, we revealed a differential network with 3503 nodes and 2550 edges, which consisted of 50 clusters with an FDR threshold at 0.05. Many of the top gene pairs in the differential network have been reported relevant to human cancers. Our method represents a powerful tool of network analysis for high-dimensional biological data.

It is well-acknowledged that a complex disease is rarely a consequence of an abnormality of a single gene product, but involves various pathological processes that interact in a complex network 1 . The better understanding of the effects of molecular and cellular network in disease etiology has multiple potential biological and clinical applications. It will help identify pivotal disease risk genes and pathways and provide better targets for drug development.
Previous methods for network analysis mainly focused on correlation-based metrics to measure the strength of association between gene pairs in a network [2][3][4] . However, these methods, which only explore marginal correlations, cannot distinguish direct or indirect relationships between genes. Gaussian graphical model (GGM) is a relatively more realistic way to present complex network because of its interpretation with conditional dependence between two variables after removing the effects of all other variables 5 . GGM can filter out all high correlations which are attributed to other genes, and also can lead to genes highly related in terms of partial correlations with other neighboring genes 5 . GGMs are closely linked to precision matrices, which describe the graphical structure of the corresponding Gaussian graph. It is a great challenge to construct biological networks through GGM in high-dimensional setting, in which the number of variables or features is much larger than the sample size. The basic idea behind it is that high-dimensional biological data are sparse in the sense that only a small number of genes will regulate one specific gene of interest 5 . This scenario leads to the construction of an undirected graph of conditional dependencies which is sparser than a correlation network 5 .
During the last decade, many methods for estimating GGM in the high-dimensional settings have been developed based on certain sparseness assumptions. One of the most widely used methods was the graphical Lasso (GLasso) method through the use of L 1 (lasso) regularization 6 . Cai et al. developed a constrained L 1 minimization approach to estimate sparse precision matrix 7 . More recently, Ren et al. proposed a novel method to obtain asymptotically normal and efficient estimation of large GGM under a minimal sparseness condition 8 , which is the first theoretical study to estimate partial correlations as well as p-value and confidence interval for each edge in the graph. In addition, a fast algorithm, named "FastGGM 9 ", as an exact implementation to the asymptotically normal and efficient estimation established by Ren et al. 8 , showed that the inference of partial correlation between genes becomes computationally feasible for whole-genome data sets 9 . All of these methods addressed the problem of estimating and constructing a single Gaussian graphical model. Differential network analysis, which investigates how the network of connected genes changes from one condition to another, has become a prevalent tool to provide a deeper and more comprehensive understanding of complex diseases 10 . Several recent studies have demonstrated the power of differential network analysis for elucidating fundamental and key biological responses, revealing that the architecture of gene network can be rewired during a cellular or adaptive response [10][11][12] . It is of great biological interest in many applications to estimate the precision matrices and the corresponding graphical structures over different groups or conditions. A differential network between two groups can be constructed by the difference between the two precision matrices, which is interpreted as the differences in the partial covariances of each pair of genes between the two groups. It's notable that the gene network in different groups are often similar to each other, the graphical structures would share many common edges. In the differential network, the significant connections are discovered to differentiate from one condition to the other while weak and common ones are removed 10 . A joint graphical lasso (JGL) has been proposed to preserve the common graphical structure while allowing differences across groups 13 . This method is based on maximizing a penalized log likelihood with a fussed Lasso or group Lasso penalty. However, there was no theoretical justification on the statistical convergence rate of the estimators in the method and the results were heavily dependent on the choice of tuning parameters. Recently, a pathway-based differential network analysis model (DINGO: Differential Network Analysis in Genomics) has been developed to jointly estimate the group-specific conditional dependencies by decomposing them into global and group-specific components 14 . However, the computational time involved in model fitting made it impractical to handle more than 2000 genes. Moreover, these approaches assumed that precision matrices in different groups were sparse without considering the structure of real gene network. For example, real regulatory gene network often contains hub nodes, therefore the rows and columns of precision matrix corresponding to hub nodes have many nonzero entries, possibly violating the sparsity condition 15 .
In the present study, through the GGM framework, we developed a computationally efficient test to infer the differential network structure through testing the equality of two precision matrices in the high-dimensional setting and applied a multiple testing procedure with FDR control. We evaluate our method and compare it with other estimation approaches via simulations under different parameter settings. Then we applied our method to a lung cancer dataset. Using both simulated and real data sets, we demonstrate that our method is a powerful tool for differential network analysis in high-dimensional biological data. for ith and jth individual is an independently and identically distributed sample from a Gaussian distribution Σ N(0 , ) P 1 and Σ N(0 , ) P 2 , respectively, where 0 P is a vector of P 0's and ∑ 1 and ∑ 2 is a P × P covariance matrix.

Methods
1 for d = 1, 2 be the precision matrix for X and Y, respectively.
It is known that the precision matrix (inverse covariance matrix) Ω = Σ −1 represents a GGM, where the non-zero (or zero) value for ω k,l in the (k, l)th entry of Ω represents the presence (or absence) of edges between kth and lth variable. A GGM associated with X is a graph, where the node set V = {x 1 , x 2 , …, x p } has p components and the edge set E such that any edge between x k and x l if and only if x k and x l are conditional dependent given all other variables. Similarly, a GGM associated with Y is also a graph. The methodology of the present study is based on GGM and translates the differential network analysis with a binary trait D into the statistical inference and comparison of two high-dimensional precision matrices.  8 . The efficient estimator of the individual ω ij was then used to construct fully data-driven procedures to recover the support of and to make statistical inference about latent variables in the graphical model.
Briefly, consider an index set A = {i,j} with i ≠ j. In the Gaussian setting the precision matrix can be described in terms of regression models. For n 1 × p data matrix X, we regress the ith and the jth columns X A against the remaining columns X A c . Specifically, we may write . Scaled Lasso was used for the regression to obtain the estimator β of β and the residual    . Note that scaled Lasso provides scale-free simultaneous estimation of the regression coefficients and noise level. It is a tuning-free penalized approach so that it can avoid the cross-validation procedures 8 . The estimator ω ij is asymptotically efficient, . F ij is the estimator of F ij , which is the Fisher information for estimating ω ij . The lower bound is established through Le Cam's lemma 8 . Partial correlation, which is used to measure the strength of conditional dependence, is calculated as It is worthwhile to point out that the asymptotic efficiency result is obtained without the need to assume the irrepresentability condition or the L 1 constraint of the precision matrix which are commonly required in the literature.

Hypothesis testing of differential networks. Let
1 for d = 1, 2 be the precision matrix for X andY, respectively. The difference between two precision matrices from X and Y, respectively, is called the differential network and denoted by δ ,2 . We aim to make statistical inference of each edge in the differential network, or equivalently of each δ ij by testing the hypothesis Based on the inference of GGM and asymptotic normality of ω ij , we derive the test statistics as Shown in the section of Inference of GGM, as the network graph is sufficiently sparse, the estimator ω ij is asymptotically efficient in the sense that . F ij is the estimator of F ij , which is the Fisher information for estimating ω ij . Equivalently, the asymptotic normality of the estimator ω ij has mean ω ij and variance x , , x n 1 1 } and { … y , , y n 1 2 } are independent observations from two populations. ω ij,d for d = 1, 2 be the precision matrix for X and Y, respectively. The estimator ω ij,d is asymptotically efficient that . Multiple testing with false discovery rate (FDR) control. For testing p(p − 1)/2 hypotheses, multiple testing procedures using Bonferroni correction or naive false discovery rate corrections may lose power. In order to carry out simultaneous testing on the structure of the differential network Δ with FDR control, we used the following multiple testing procedure 17 .
1. For a given pre-specified level α, Note that t is the threshold level such that H 0,ij is rejected if |W ij | ≥ t. ϕ(t) is a standard normal cumulative distribution function and = ∑ ≥ denotes the total number of rejections 17 . The ideal choice of t would reject as many true positives as possible in the meantime of controlling the FDR at the given pre-specified level α.
Simulation study. To evaluate the performance of our method, we designed realistic simulations. In the setting of n < p, we simulated two group data from multivariate normal distributions with different undirected graph structures, including hub, scale-free, and random graphs, in which some of the edges are common to both groups.
Suppose we had two groups d = 1 and 2, respectively. First, a common undirected graph structure was simulated for the two groups. A precision matrix Ω was generated using huge.generator in the R package Huge 18 . As the common graph, we considered three different graph structures, including hub, scale-free, and random graph. The off-diagonal elements of the precision matrix were set to be 0.5 and a positive number added to the diagonal elements of the precision matrix was 0.2. The number of hubs in the hub graph and the probability that a pair of nodes had an edge in the random graph were default values set in the function huge.generator 18 . The scale-free graph was generated using Barabasi-Albert algorithm implemented in the function huge.generator 18,19 . Second, set Ω 1 = Ω 2 = Ω and replace a randomly chosen entry from the zero entries in Ω 1 and Ω 2 with a uniform random sample. Third, repeat the second step γ × M times, where M is the number of nonzero entries in Ω and γ is the heterogeneity of the graphs, which is used to control the ratio of the number of individual edges to the number of common edges. Fourth, generate x i and y i for ith individual in two groups from the distributions Ω − N(0 , )  Table 1. In each scenario, the datasets were simulated with 100 replications based on four parameters including the sample size n, the number of vertices p, the level of heterogeneity γ and graph structure.

Results
Note that Ren et al. had reported better performance of the tuning-free inference methodology than the existing L 1 penalized methods in the estimation of the precision matrix 8,9 . Additionally, we compared its performance with JGL and DINGO in scenarios of different undirected graph structures, in which some of the edges are common to both groups. We evaluated the estimation of nonzero elements in true precision matrices Ω 1 and Ω 2 , as well as in true Δ. Receiver Operating Characteristic (ROC) curve was generated using the estimated p-values and the true zero/nonzero elements in the precision matrix. And then the Area Under the Curve (AUC) was averaged over 100 replications to measure the performance of the estimation. For JGL, there were two tuning parameters. We first chose optimal tuning parameters by Bayesian information criterion (BIC) and then calculated the averaged AUC for the ROC curves based on a sequence of tuning parameters. For the DINGO, ROC curve was generated based on the cutoffs of the estimated group-specific partial correlations and the differential score of the differential network.
As to the estimation of the underlying precision matrices from the simulated data matrices of Ω 1 and Ω 2 , Table 1 showed the mean AUC averaged over 100 replications for each combination of sample size n, the number of vertices p, the level of heterogeneity γ and graph structure. The AUCs in our method were much larger than those in the DINGO and JGL, suggesting that our method had more accurate estimations of the conditional dependencies than the DINGO and JGL methods. When n, p and γ were fixed, AUC in our method was smallest in scale-free graph among three graph structures, but still large enough to demonstrate the accuracy of the estimations.
For the differential network structure, our method performed much better than the DINGO method. Note that in the paper of DINGO 14 , it only focused on the performance in the estimating the group-specific component and didn't examine the performance of the differential scores proposed for determining the edges in the differential network. Here in our simulation, it showed that the differential scores performed poorly (Table 1). Besides, it was estimated that DINGO would take more than 50 hours as p > 2000 (using a Linux server with a 2.67 GHz Intel processor and 96GB of RAM). The computation time in DINGO increased exponentially as p increases. In step 2 of DINGO the group-specific component was estimated using expectation-maximization (EM) algorithm and in step 3 of DINGO differential scores were calculated from the bootstrap procedure. Both are computationally intensive. Therefore, it was impractical for DINGO to deal with genome-wide data which normally have the number of genes p > 10,000. In contrast, our test statistics, which was directly used to test the differential network structure, performed much faster than DINGO (Supplementary Fig S2). In particular, our method is still feasible even number of genes p = 8000.

Real data analysis.
Lung cancer is the leading cause of cancer death worldwide and adenocarcinoma is its most common histological subtype 20 . Here, we applied our algorithm to perform genome-scale differential network analysis for lung adenocarcinoma, aiming to find important molecular implications for lung cancer treatment. Gene expression profiling of 58 lung adenocarcinoma tumors and their matched histologically normal lung tissue samples were analyzed using Illumina HumanWG-6 v3.0 expression beadchip. We downloaded the normalized data from the NCBI GEO database, GSE32863 21 . Statistical analyses were limited to probes retained after applying the following filters: non-detectable expression in ≥90% of samples using a detection P-value cut-off of 0.01. We averaged the expression values of multiple probes matched to the same genes in each sample. After these data pre-processing, 7827 genes were remained for the subsequent analyses. First, we inferred the gene network by estimating the corresponding GGMs of genes for lung adenocarcinoma tumors and lung normal www.nature.com/scientificreports www.nature.com/scientificreports/ tissues, respectively. Then, we performed the differential network analysis by investigating the difference between two precision matrices. Table 2 listed the top 10 most significant pairs of genes in the differential network analysis. For each pair of genes in Table 2, we showed the corresponding element in the precision matrix and p-value in case and control group, respectively, as well as the corresponding partial correlation and p-values. Note that many genes have been reported previously relevant to human cancers. For example, GRN is a potent mitogen and growth factor implicated in many human cancers 22 . A somatic RRAS mutation (p.Gln87Leu substitution) had previously been reported as a rare somatic event in lung carcinoma 23 . Most gene pairs with very significantly strong partial correlations in control group didn't have significant partial correlation in case group. It indicated that the connection of genes in disease samples was weaker than that in healthy samples. It was interesting to point out that there were two special pairs, LOC284230 ~RPL23 and GDI2~LOC651816. Both pairs had significant partial correlation in both case and control, however, the direction of partial correlation in case was opposite to that in control. RPL23, a tumor metastasis-related gene, was found to induce high invasiveness of a human lung adenocarcinoma cell line 24 . GDI2 overexpression reduced lung metastasis 25 .
As a comparison, we also applied DiffCoEx, a method for identifying marginal correlation-based pattern changes, which builds on the commonly used Weighted Gene Coexpression Network Analysis (WGCNA) framework for coexpression analysis 2 . We selected the default parameters, such as the soft power and minimum module size in network construction and module detection. In total, 19 modules were detected, shown in the Supplementary Fig S3. Based on our proposed procedures, at a false discovery rate level of 0.05, it resulted in a differential network with 3503 nodes and 2550 edges, which consisted of 50 clusters, which had the maximal connected sub-networks with number of nodes ≥10.
From this differential network, we can directly know the network structure. For DiffCoEx method, however, we cannot derive any detail of the network structure within each module. It is known that analyzing network differences of gene networks between the disease and healthy conditions could be helpful for understanding the genetic mechanisms of the disease. In terms of differential network structure, our method is much better for us to understand the mechanism of the disease. By using the partial correlation approach, in contrast, our differential network analysis resulted in much sparser network. As the partial correlation quantified the correlation between two genes after controlling other genes' effects, which provided with useful information to distinguish the causal correlations in the network. Functional analysis for the differential network showed that the top significantly enriched KEGG pathways were "Metabolic pathways" (hypergeometric test, p value = 6.43e-63) and "Pathways in cancer" (p value = 1.59e-24). The full results of functional enrichment analysis of the differential network and clusters from DiffCoEx were shown in the Supplementary Tables 1 and 2, respectively, to illustrate the functional compositions of two methods. For simple illustration and visualization, we showed a large cluster from the differential network in the Fig. 1.

Discussion
Motivated by an important biological question that how the network structure of cellular interactome change from one condition to another, we derived a formal statistical test for the differential network analysis based on the inference of GGM. Our method not only provided statistical inference of the difference of edge strength between graph nodes in the differential network analysis, but also a multiple testing procedure for simultaneously testing the large number of tests with FDR control to infer the structure of the differential network. The source code of the implementation is available at Supplementary File 1.
The hypothesis testing of differential network was directly based on estimator of ω ij,1 − ω ij,2 . First, we implemented the asymptotically normal and efficient estimation of GGM. Then we performed the hypothesis testing of differential network by comparing the difference between two precision matrices from two conditions, respectively. Moreover, we performed a multiple testing procedure with FDR control for simultaneously testing p(p − 1)/2 hypotheses. The procedure for the differential network analysis we present here had the advantage in a global and unbiased manner.
First, our method provided with a rigorous statistical test for the difference of conditional dependence between two different conditions. It represented a major improvement over earlier procedure, which built two  www.nature.com/scientificreports www.nature.com/scientificreports/ global gene networks for the disease and healthy samples respectively under the GGM framework with a FDR threshold for determining the existence of edges, and then compared the topological changes with the unique edges that belonged to only one of the networks 9 . Our method directly resulted in the differential network structure with statistical estimation and inference of the difference of edge strength between graph nodes. Second, we adopted a multiple testing procedure for simultaneously testing the large number of tests with FDR control to infer the structure of the differential network, as the standard Bonferroni or naive FDR corrections would lose power. For the multiple testing problem, we proposed to threshold test statistics directly rather than using p-values as in Benjamini and Hochberg (BH) 26 , mainly because the BH method for controlling FDR required the independence between p-values, while our test statistics may be weakly dependent of each other, which is natural in GGM estimation 17 . Third, through the realistic simulation studies with different combinations parameters of sample size, number of vertices, level of heterogeneity and graph structure, we demonstrated that our method performed much better than the current available methods in terms of accuracy and computational time. Then we applied it on a real data set and successfully constructed the differential network for lung adenocarcinoma. The differential network analysis can help reveal how the architecture of gene network is rewired during a cellular or adaptive response and elucidate fundamental molecular mechanism of biological processes. Especially for cancer research, our method will be very helpful for identifying novel driver genes or pathways. In our real data analysis on the lung adenocarcinoma, we revealed a differential network with 3503 nodes and 2550 edges, which consisted of 50 clusters with a FDR threshold at 0.05. Especially, for the top gene pairs in the differential analysis, many of them have been reported relevant to human cancers. Our method can be a powerful tool of network analysis based on GGM, especially for high-dimensional biological data.
However, there were several limitations of our method. First, the inference of GGM relies on the Gaussian assumption on the data. Nowadays, high-throughput RNA sequencing (RNA-seq) is the standard tool for gene expression analysis. Analyzing RNA-seq data depends on estimates of read count variability, which are statistically modeled as the negative binomial distribution 27 . Second, currently our method can only be applied to the study of differential network analysis between two conditions. We will extend our differential network approach for sequence data as well as multiple conditions in future studies.

Conclusion
In summary, we derived a formal statistical test for the differential network analysis based on the inference of GGM, as well as a multiple testing procedure for simultaneously testing the large number of tests with FDR control to infer the structure of the differential network. Through simulation studies, we demonstrated that our method performed much better than the current available methods in terms of accuracy and computational time. Our method will be very helpful in differential network analysis for identifying novel driver genes or pathways in high-dimensional biological data.