
# An efficient and effective method to identify significantly perturbed subnetworks in cancer

## Abstract

The identification of key functional biological networks from high-dimensional genomics data is pivotal for cancer research. Here, we introduce FDRnet, a method for the detection of molecular subnetworks in cancer, which addresses several challenges in pathway analysis. FDRnet detects key subnetworks by solving a mixed-integer linear programming problem, using a given upper bound of false discovery rate (FDR) as a budget constraint, and minimizing a conductance score to find dense subgraphs around seed genes. A large-scale benchmark study was performed on both simulation and cancer genomics data. FDRnet outperformed other methods in the ability to detect functionally homogeneous subnetworks in a scale-free biological network, to control FDRs of the genes in detected subnetworks, to improve computational efficiency and to integrate multi-omics data. By overcoming the limitations of existing approaches, FDRnet can facilitate the detection of key functional pathways in cancer and other genetic diseases.

## Main

Identifying cancer driver genes and functional pathways is a core mission of current cancer research1,2,3. It would not only provide actionable therapeutic targets, but would also greatly advance our understanding of the molecular mechanisms of carcinogenesis. Mainstay methods for detecting cancer driver genes are prevalence-based approaches, which search for genes that are mutated more frequently than expected by random chance4,5,6,7. However, several recent large-scale studies have demonstrated that most cancers exhibit extensive mutational heterogeneity, with only a few genes being frequently mutated (for example, TP53) in a high percentage of patients2,3.

Although gene-level analysis is an essential first step, it does not provide sufficient statistical power to detect rarely mutated, but functionally important, genes and typically results in a list of putative driver genes without a unified theme of biological processes8. A promising approach to overcoming these issues is to perform de novo pathway analysis by overlaying mutational data onto a protein–protein interaction (PPI) network and detecting functional modules that are significantly disrupted in cancer. With the accumulation of protein interaction information, pathway analysis has become a major research topic in systems biology9,10,11,12,13. Computationally, detecting cancer pathways is essentially a combinatorial optimization problem10, which is a more difficult task than statistical tests of mutational abundance. Although considerable efforts have been made in the past two decades9,10,11,14,15,16,17, several key issues are outstanding, including those related to detecting functionally homogeneous subnetworks in a scale-free biological network, controlling the false discovery rates (FDRs) of identified subnetworks, achieving provably optimal solutions, integrating genetic data from multiple platforms and addressing the computational complexity issue. Most existing methods are heuristic, with no guarantee of optimality, and few methods consider controlling FDRs.

In this Article, we introduce a method, referred to as FDRnet, for the detection of significantly perturbed subnetworks in cancer. We propose a definition of FDR that overcomes the hurdle of assessing the statistical significance of detected subnetworks. We formulate the subnetwork identification problem as a mixed-integer linear programming problem, using a given FDR upper bound as a budget constraint and a conductance score as an objective function, to find densely connected subgraphs around seed genes. We develop algorithmic strategies to address computational issues. To demonstrate the effectiveness of the proposed method, we conduct a large-scale benchmark study on simulation data, on breast cancer mutational and copy number data, and on lymphoma gene expression data. The new method addresses several outstanding challenges in molecular pathway analysis.

## Results

### Overview of the FDRnet method

FDRnet combines the results of gene-level analysis and a biological network to detect subnetworks with significantly perturbed genes in cancer (Fig. 1). Briefly, given a set of p values, we first perform an empirical Bayesian analysis to estimate the probabilities of individual genes being non-cancer genes (that is, local FDRs) and identify a set of seed genes defined as those with local FDRs not greater than a given FDR upper bound (Fig. 1a). For each seed gene, we perform a random walk on a biological network to compute a personalized PageRank vector and extract a local graph around the seed by retaining the nodes with the K largest PageRank scores (Fig. 1b). For each extracted local graph, we solve a mixed-integer linear programming problem to identify a densely connected subnetwork that minimizes the conductance score and has an FDR not larger than the given FDR bound (Fig. 1c). FDRnet is able to handle both weighted and unweighted biological networks. Further details are provided in the Methods, and the software and user manual for the method are available at www.acsu.buffalo.edu/~yijunsun/lab/FDRnet.html.

### Benchmarking and performance analysis using simulation data

Owing to the lack of ground-truth information, it is generally difficult to evaluate the performance of subnetwork identification methods. To overcome this issue, we designed a benchmark study using simulation data to study the algorithmic properties of FDRnet and compare its performance with existing approaches. Given that there is currently no generative model for generating biological networks, we used a real-world PPI network, the iRefIndex network18, which contains 12,128 genes and 91,808 interactions. To specify target subnetworks, we extracted from the CORUM database19 16 protein complexes that were reported to be involved in the development of breast cancer and have sizes ranging from 10 to 50 proteins (Supplementary Table 1). We employed a signal-to-noise decomposition model20 to generate p values for individual genes. Specifically, the p values of target genes (that is, genes in target networks) were randomly sampled from a beta distribution β(α, 1) and the p values of non-target genes were sampled from a uniform distribution U(0, 1). To assess the performance of an algorithm applied to data with different signal strengths, we varied the values of α, ranging from 0.01 to 0.11, where a smaller α corresponds to a larger signal strength. We compared FDRnet with five other methods: HotNet216, Hierarchical HotNet (hHotNet)17, RegMOD21, BioNet/heinz20 and ClustEx22. Supplementary Section 2 provides a brief review of the five methods and Supplementary Section 3.1 the experimental settings.
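The p value generation described above can be sketched in a few lines of Python. This is a minimal illustration that assumes only the sampling rule stated in the text (Beta(α, 1) for target genes, U(0, 1) otherwise); the function name, the gene identifiers, the network size of 1,000 genes and the choice α = 0.05 are hypothetical and are not taken from the benchmark itself.

```python
import random


def simulate_p_values(genes, target_genes, alpha, seed=0):
    """Draw one p value per gene.

    Target genes: p ~ Beta(alpha, 1); a smaller alpha pushes p towards 0.
    Non-target genes: p ~ Uniform(0, 1).
    """
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    targets = set(target_genes)
    return {
        g: rng.betavariate(alpha, 1.0) if g in targets else rng.random()
        for g in genes
    }


# Hypothetical toy setting: 1,000 genes, of which the first 50 are targets.
genes = [f"g{i}" for i in range(1000)]
targets = genes[:50]
p_values = simulate_p_values(genes, targets, alpha=0.05)

mean_target = sum(p_values[g] for g in targets) / len(targets)
mean_other = sum(p_values[g] for g in genes[50:]) / (len(genes) - 50)
```

Because the mean of Beta(α, 1) is α/(α + 1), the target genes receive p values that are, on average, far smaller than those of the non-target genes.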

FDRnet significantly outperformed other methods in terms of the ability to detect target genes and modular structures, as quantified by F and Fsub scores (Fig. 2a,b). Here, the F score is computed as the harmonic mean of the precision and recall of a method in detecting target genes, and the Fsub score is a natural extension of the F score to quantify the performance of a method in detecting target subnetworks (Methods). In terms of F scores, FDRnet and BioNet performed best, HotNet2 and hHotNet second, followed by ClustEx and RegMOD. However, in terms of Fsub scores, FDRnet significantly outperformed all the other methods and BioNet did not perform well. This is because a biological network exhibits a scale-free structure23, and different functional modules can be connected through hub nodes (Supplementary Fig. 1). By searching for maximum-scoring subnetworks, BioNet grouped nearly all the detected genes (~200 genes) into one subnetwork (Fig. 2c), suggesting that BioNet is not able to handle a scale-free structure.
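For concreteness, the gene-level F score can be computed as follows. This is a minimal sketch of the harmonic mean of precision and recall as defined above; the function name and the toy gene sets are hypothetical, and the Fsub score (defined in the Methods) is not reproduced here.

```python
def f_score(detected, targets):
    """Harmonic mean of precision and recall for a set of detected genes."""
    detected, targets = set(detected), set(targets)
    true_positives = len(detected & targets)
    if true_positives == 0:
        return 0.0  # precision and recall are both zero
    precision = true_positives / len(detected)
    recall = true_positives / len(targets)
    return 2 * precision * recall / (precision + recall)


# Toy example: 2 of 3 detected genes are targets, and 2 of 4 targets are found.
score = f_score({"a", "b", "c"}, {"b", "c", "d", "e"})
```

In this toy case the precision is 2/3 and the recall is 1/2, so the F score is 4/7.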

We next compared the ability of the six methods to control the FDRs of detected subnetworks. FDRnet and BioNet are the only methods that can consistently control the FDRs of the detected subnetworks across different signal strengths (Fig. 2c and Supplementary Fig. 4). However, because we adopted the signal-to-noise decomposition model used in BioNet to generate p values for individual genes, the distribution of p values perfectly matched the assumption used in BioNet, which may not be the case for real data (Figs. 3a and 4a). While RegMOD and ClustEx did not show any evidence of controlling FDRs, HotNet2 and hHotNet controlled FDRs only when the signal strength was large (α ≤ 0.06; Supplementary Fig. 4).

Finally, we compared the computational complexities of the six methods (Fig. 5 and Supplementary Table 2). FDRnet and BioNet performed best, RegMOD second, followed by ClustEx, HotNet2 and hHotNet. Although it took FDRnet ~419 s to finish the analysis, it took hHotNet and HotNet2 ~83 h and ~47 h, respectively. FDRnet is about 700 and 400 times faster than hHotNet and HotNet2, respectively. Similar results were observed when the six methods were applied to breast cancer and lymphoma data (Fig. 5 and Supplementary Table 2). We also performed additional comparative analysis by mapping the simulation data onto two other PPI networks, namely BioGRID24 and ReactomeFI25, and observed similar results (Supplementary Section 3.3).

### Detecting significantly mutated subnetworks in breast cancer

Copy number aberrations and somatic mutations play a central role in tumorigenesis26. Here, we applied FDRnet to The Cancer Genome Atlas (TCGA) copy number and somatic mutation data2 to identify significantly mutated subnetworks in breast cancer. The mutation data contain 76,674 non-silent mutations in 18,268 genes from 978 breast cancer patients. They have been analyzed by MutSig2CV4 and each gene was associated with a p value that indicated the statistical significance of its mutation frequency. The copy number data have been processed by GISTIC2.027 and contain the discrete states of the copy numbers in 24,776 genes from 1,080 patients. We performed a series of pre-processing analyses to identify genes whose expression levels were driven by extreme copy-number states, to compute local FDR scores for the copy number data and to combine scores derived from mutation and copy number data (Supplementary Section 4.2). We mapped the obtained scores onto the iRefIndex network for the detection of significantly perturbed subnetworks.

In total, FDRnet detected 163 genes grouped into 40 subnetworks (Fig. 3c, Supplementary Table 3 and Supplementary Data 1). For comparison, we applied the other five methods using experimental settings similar to those used in the simulation study (Supplementary Table 3; details are provided in Supplementary Section 4.3). The overall results were similar to those observed in the simulation study. Specifically, FDRnet is the only method that can effectively control the FDRs of the detected subnetworks (Fig. 3a). By contrast, ClustEx and RegMOD did not perform well. ClustEx detected 1,792 genes grouped into 494 subnetworks, and RegMOD detected 458 genes grouped into 147 subnetworks. None of the subnetworks had an FDR smaller than 0.1. The average FDRs of ClustEx and RegMOD were 0.89 and 0.86, respectively. Similar to the simulation result (Fig. 2c), BioNet found only one subnetwork, consisting of 823 genes (FDR = 0.44). Although HotNet2 and hHotNet performed slightly better than RegMOD and ClustEx, a large number of the detected subnetworks had FDRs larger than 0.1 (Fig. 3a and Supplementary Table 3).

We next performed an in-depth analysis comparing FDRnet with HotNet2 and hHotNet, as the latter two represent recent methods designed specifically to detect molecular subnetworks in cancer (Supplementary Data 2 and 3 and Supplementary Figs. 13 and 14). First, we assessed the genes detected by the three methods by comparing with the COSMIC cancer gene database28. Among the 163 genes identified by FDRnet (Supplementary Data 1), 45 genes were listed in the COSMIC database, including TP53, ERBB2, CDH1, RB1, MYB, PIK3CA and PTEN. By contrast, HotNet2 identified only 21 COSMIC genes, of which 18 were also detected by FDRnet. Notably, HotNet2 missed many well-known breast cancer genes, including ESR1, PTEN, RB1, ARID1A, MDM2, MYB, CTCF, TBX3, SMAD2 and SMAD4 (Fig. 3b). Although hHotNet performed slightly better than HotNet2 and detected 29 COSMIC genes, many well-known cancer genes, including ESR1, PTEN, RB1, MYB and MED12, were absent. Moreover, among the 104 subnetworks identified by hHotNet, 90 (87%) subnetworks consisted of only two or three genes (Supplementary Fig. 14). For example, TP53, which plays a central role in tumorigenesis and interacts with multiple genes29, was grouped with NCOR1 into a two-gene network; it is difficult to declare two-gene subnetworks as functional modules. We next assessed the functional homogeneity of the subnetworks detected by the three methods, using four scoring functions (Supplementary Table 4). FDRnet outperformed HotNet2 and hHotNet, which is consistent with the Fsub scores observed in the simulation study (Fig. 2b).

To demonstrate the biological relevance of the subnetworks identified by FDRnet, a gene-ontology term enrichment analysis30 was performed (Fig. 3c and Supplementary Fig. 16). The first notable network was dominated by TP53, the most prevalently mutated gene in breast cancer2, consistent with a pivotal role in DNA repair and genome stability29. Other frequently mutated genes in breast cancer (PIK3CA, PTEN, MAP3K1, RB1) were all identified within specific subnetworks. The analysis also highlighted networks that contained the estrogen receptor (ER) and ERBB2/HER2. ER is key to maintaining the function of breast cells, and the inhibition of ER using antiestrogens (for example, tamoxifen) is a major breast cancer treatment strategy31. However, treatment resistance can occur through the loss of functional ER by ESR1 deletion or by mutation, or by changes in downstream signaling and intracellular crosstalk32. The ERBB2 oncogene is overexpressed in ~15% of breast cancer cases, typically as a result of gene amplification. As a stimulator of proliferation, ERBB2 is the target of effective monoclonal antibody therapy. Another interesting finding was the presence of NCOR1 in a number of functional groups with considerable crosstalk. As a key regulator of transcriptional activity, NCOR1 plays a central role in human biology and has been considered as a tumor suppressor gene as reduced levels promote tumor proliferation and invasion33. NCOR1 could represent a major node of interaction across several functional systems, and perturbation of such factors could have wide-ranging impact beyond that indicated by mutational prevalence alone. Finally, mutations in genes in the Mediator complex are beginning to be implicated in cancer. Key differentiation genes, including many oncogenes, have been shown to be under the control of super-enhancers that show high dependence on the Mediator complex34. Crosstalk between the Mediator complex and other subnetworks was evident (Fig. 3c), and this is supported by recent findings of a link between Mediator and DNA repair35.

### Detecting significantly perturbed subnetworks in diffuse large-B-cell lymphoma

FDRnet can also be used to analyze gene expression data. To demonstrate this, we applied FDRnet to a dataset obtained from a diffuse large-B-cell lymphoma (DLBCL) study36, which contains the expression levels of 3,583 genes from 112 patients diagnosed with the germinal center B-cell-like (GCB) subtype and 82 patients with the activated B-cell-like (ABC) subtype. It has been shown that the two molecular subtypes are associated with distinct clinical outcomes and oncogenic mechanisms37. Our goal was to identify pathways that are expressed differentially between the two subtypes by integrating the results of gene-level analysis with a PPI network. To identify differentially expressed genes, a Student’s t-test was performed and p values were calculated to measure the statistical significance of individual genes. We extracted a disease-specific network from the Human Protein Reference Database network38 by removing from the network the nodes that had no gene expression data and retaining the largest connected subnetwork. The extracted subnetwork consisted of 2,034 nodes and 8,399 interactions. To incorporate the gene co-expression information, we assigned a weight to each edge, which was set to be the absolute Pearson correlation coefficient between the expression profiles of two interacting genes21.
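The edge-weighting step can be illustrated with a short Python sketch that sets each edge weight to the absolute Pearson correlation between the expression profiles of the two interacting genes. The function names, gene identifiers and expression values below are hypothetical and serve only to show the calculation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def weight_edges(edges, expression):
    """Map each edge (i, j) to |corr(expression[i], expression[j])|."""
    return {
        (i, j): abs(pearson(expression[i], expression[j]))
        for i, j in edges
    }


# Made-up expression profiles across four samples.
expression = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with A
    "C": [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with A
}
weights = weight_edges([("A", "B"), ("A", "C")], expression)
```

Taking the absolute value means that strongly anti-correlated gene pairs receive the same weight as strongly correlated pairs; here both edges obtain a weight of 1.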

FDRnet identified a total of 59 genes grouped into 11 subnetworks (Fig. 4b, Supplementary Table 5 and Supplementary Data 4). As with the simulation and breast cancer studies, FDRnet effectively controlled the FDRs of all the detected subnetworks (Fig. 4a). ClustEx detected 332 genes grouped into 66 subnetworks, and RegMOD detected 19 genes grouped into nine subnetworks, but only 6% and 22% of the identified subnetworks, respectively, had an FDR smaller than 0.1 (Supplementary Table 5). The average FDRs of ClustEx and RegMOD were 0.81 and 0.45, respectively. Moreover, the subnetworks detected by RegMOD contained only two or three genes. Similar to the previous results, BioNet found only two subnetworks, one containing 806 genes and the other containing two genes, and both subnetworks had an FDR larger than 0.5. HotNet2 and hHotNet behaved quite differently for the gene expression data. Specifically, HotNet2 failed to find any significant subnetwork, and hHotNet identified 265 genes grouped into 71 subnetworks (Supplementary Table 5). However, among the 71 subnetworks detected by hHotNet, only two subnetworks had an FDR smaller than 0.1 and the average FDR was 0.77 (Fig. 4a). Similar to the result observed in the breast cancer study, the subnetworks identified by FDRnet were functionally much more homogeneous than those detected by hHotNet (Supplementary Table 6).

We next performed a gene-ontology term enrichment analysis of the subnetworks identified by FDRnet (Fig. 4b and Supplementary Fig. 17). DLBCL is thought to arise from genetic instability during the normal maturation of B cells, causing differentiation perturbation and immortalization. Studies have identified genetic alterations, some specific to DLBCL while others were observed in other cancers. Examples of pan-cancer alterations include activating TP53 mutations, which are negative prognostic factors in DLBCL patients39. Another key pathway in DLBCL is NF-κB signaling, which was found to be activated only in ABC but not GCB DLBCLs40. Notably, FDRnet was the only method that identified the NF-κB pathway. A shift to hyper-activation of NF-κB may be pivotal for initiation and maintenance of lymphoma41. The most prevalent factor across the identified subnetworks was STAT3, a key factor in the JAK-STAT transcriptional regulation pathway42. Activation of STAT3 is present in ~10% of DLBCL patients, but, as our analysis showed, STAT3 may act as a key interactive component between pathways and is therefore a promising candidate for therapeutic targeting. Another common factor in DLBCL subnetworks is B-cell lymphoma 2 (BCL2). Constitutive activation of BCL2 results in avoidance of the cellular apoptosis program and thereby can support lymphoma. Finally, BCL6 is also key to B-cell differentiation, and disruption of this process is a major component of DLBCL. BCL6 is known to involve the recruitment of histone deacetylases (HDACs) to specific promoters for the inhibition of gene expression43. Here, BCL6 was identified as a component of the chromatin binding and lymphocyte differentiation subnetworks, groups that also had strong representation of HDACs. 
In summary, although the identification of specific molecular alterations in DLBCL provides insights into the biology of lymphoma, the next level of understanding for actionable management of patients requires the elucidation of functional relevance and the importance of signaling network interactions. FDRnet identified all of the pathways known to be involved in B-cell lymphoma—lymphocyte differentiation, apoptosis, TP53 functions, STAT cascade and NF-κB pathways—but our method also revealed the connectivity between pathways and identified key factors that may represent multifaceted therapeutic targets.

## Discussion

In this Article, we have developed a method for subnetwork identification that addresses several challenges in pathway analysis. Given that a subnetwork can contain both cancer and non-cancer genes, it cannot be declared as a false or true discovery. Thus, conventional techniques used in multiple testing correction cannot be extended for subnetwork identification. Consequently, it has been an open problem to define FDRs for detected subnetworks. To address the issue, we defined the FDR of a subnetwork as the expected false discovery of the genes in the subnetwork. The proposed definition is intuitive, has a clear physical meaning and can be readily applied by investigators to control subnetwork sizes based on available resources. More importantly, it enables us to solve the subnetwork identification problem analytically as a budget-constrained subgraph search problem. To our knowledge, FDRnet and BioNet are the only two methods that can provide provably optimal solutions, whereas all others are heuristic methods with no guarantee of optimality. However, by searching for maximum-scoring subnetworks, BioNet typically results in a large subnetwork containing genes from multiple functional modules (Figs. 3a and 4a). By contrast, FDRnet searches for densely connected subnetworks by minimizing a conductance score, thereby overcoming the issue of detecting functionally homogeneous subnetworks in a scale-free biological network. Another advantage of FDRnet is that, except for a user-defined FDR upper bound, it does not have free parameters. In the implementation, to reduce computational complexity, we performed a random walk to extract a local graph of size K around a given seed and solved an integer-programming problem on the local graph. Therefore, the computational complexity of FDRnet is only proportional to K, instead of the size of the entire network. Because K is much smaller than the network size, our method is computationally very efficient (Fig. 5). 
Note that a local graph is extracted only to provide a rough solution for conductance minimization. We performed a parameter sensitivity analysis that demonstrated that the performance of FDRnet is largely insensitive to specific choices of K and random-walk parameter γ (Supplementary Section 3.2). Thus, FDRnet can be considered a parameter-free method, which makes its application easy even for researchers outside the bioinformatics community. By contrast, except for BioNet, the other four methods tested have critical parameters. Specifically, for RegMOD and ClustEx, due to the lack of ground-truth information, the parameters can only be determined manually21,22, and for HotNet2 and hHotNet the parameters are estimated through computationally expensive random permutation16,17. Furthermore, FDRnet offers a natural way to integrate multi-omics data. It is now common to characterize individual tumors using different molecular profiling techniques. Because the local FDR of a gene computed by FDRnet is the probability of the gene being a non-cancer gene, the results obtained from different data types can be combined by retaining the minimum local FDR. Conversely, for the methods using p values as input (for example, HotNet2, RegMOD and hHotNet), a simple strategy is to use the minimum p value, but p values derived from different sources are not directly comparable. In summary, FDRnet has many advantages over existing methods in terms of computational efficiency and the ability to detect functionally homogeneous subnetworks, to control FDRs of detected subnetworks, to provide analytical solutions and to integrate data from multiple platforms (Supplementary Table 7).
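The minimum-local-FDR rule for combining data types can be sketched as follows. The function name, the gene identifiers and the local FDR values are hypothetical; genes absent from a data type are simply left out of that source, which is an assumption made for this illustration.

```python
def combine_local_fdrs(*sources):
    """Combine per-gene local FDRs from several data types.

    Each source maps gene -> local FDR. For every gene, the smallest
    local FDR observed across the sources is retained.
    """
    combined = {}
    for source in sources:
        for gene, fdr in source.items():
            combined[gene] = min(fdr, combined.get(gene, 1.0))
    return combined


# Made-up local FDRs from two platforms.
mutation_fdr = {"geneA": 0.01, "geneB": 0.30}
copy_number_fdr = {"geneB": 0.05, "geneC": 0.02}
combined = combine_local_fdrs(mutation_fdr, copy_number_fdr)
```

In this toy case, geneB is retained with its copy number score (0.05) because that is the smaller of its two local FDRs.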

A limitation of the current study is that we applied the method to a single PPI network. There is currently no consensus on which PPI network provides more accurate information about gene interactions. One way to overcome this issue would be to apply the method to multiple PPI sources and extract consensus subnetworks. Also, the method is limited to detecting densely connected subnetworks, but a network contains important, high-order structures (for example, clusters of network motifs44). We could replace the conductance score with a motif-based conductance score45 while maintaining the FDR constraint to detect high-order structures. As the method is applied to larger, more complex datasets, other application-specific limitations may be revealed. Through access to source code and interaction with the developers, feedback from the community will facilitate ongoing refinement and incorporation of new features to the FDRnet approach.

Going forward, FDRnet can be used to conduct a comprehensive search for organ-specific and pan-cancer pathways by analyzing all currently available genomics data (including messenger RNA, somatic mutation, copy number, microRNA and methylation data) of more than 30 cancers available from the TCGA project. As demonstrated, FDRnet can also be applied to gene expression data and can conveniently incorporate gene-interaction-strength information. Thus, the application of our method is not limited to cancer mutational data. Although we focused here on human disease studies, our method can also be applied to other networks (for example, social networks, ecological networks and the internet) to detect significantly disrupted subnetworks, as long as each network node is assigned a quantitative score (not necessarily a p value).

## Methods

### The FDRnet method

We have developed a method, referred to as FDRnet, for the detection of subnetworks with significantly perturbed genes in disease. Briefly, given a set of p values, an empirical Bayesian analysis is first performed to estimate the local FDRs of individual genes and identify a set of seed genes. Then, for each seed, a personalized PageRank vector is computed to explore the local area of the seed, and a local graph is extracted by retaining the nodes with the K largest PageRank scores. Finally, for each extracted local graph, a mixed-integer linear programming problem is solved to identify a candidate subnetwork that minimizes the conductance score while having an FDR not larger than a given FDR upper bound. In the following, we present a detailed description of the proposed method.
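The local-graph extraction step can be illustrated with a minimal Python sketch. It assumes the common restart form of personalized PageRank on an unweighted, undirected graph, in which the walker returns to the seed with probability gamma at each step; the function names, the dictionary-of-sets graph representation, the default gamma = 0.15, the fixed iteration count and the toy graph are all assumptions made for illustration and are not taken from the FDRnet software.

```python
def personalized_pagerank(adj, seed, gamma=0.15, n_iter=100):
    """Power iteration for a random walk with restart at `seed`.

    adj: dict mapping each node to the set of its neighbours
         (undirected graph, every node has at least one neighbour).
    """
    rank = {v: 0.0 for v in adj}
    rank[seed] = 1.0
    for _ in range(n_iter):
        nxt = {v: 0.0 for v in adj}
        for v, mass in rank.items():
            # Spread (1 - gamma) of the mass evenly over the neighbours.
            share = (1.0 - gamma) * mass / len(adj[v])
            for u in adj[v]:
                nxt[u] += share
        nxt[seed] += gamma  # restart mass returns to the seed
        rank = nxt
    return rank


def local_graph(adj, seed, k, gamma=0.15):
    """Induced subgraph on the k nodes with the largest PageRank scores."""
    rank = personalized_pagerank(adj, seed, gamma)
    keep = set(sorted(rank, key=rank.get, reverse=True)[:k])
    return {v: adj[v] & keep for v in keep}


# Toy graph: two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
local = local_graph(adj, seed="a", k=3)
```

With seed a and K = 3, the retained nodes are the seed's own triangle, which is the behaviour the local search relies on: the subsequent optimization only needs to consider a small neighbourhood of the seed.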

#### Notation

A PPI network is denoted as G = (V, E), where j ∈ V is a node and (i, j) ∈ E is an edge representing an interaction between genes i and j. We denote as A and D the adjacency matrix and degree matrix of G, respectively, where Aij = Aji, Aij = 1 if node i is connected with node j and 0 otherwise, Dii = ∑j Aij and Dij = 0 if i ≠ j. Where no confusion arises, we use the terms node and gene interchangeably. We use a bold-faced letter to represent a row vector and a bold-faced capital letter to represent a matrix.

#### Controlling FDRs for detected subnetworks

We start by addressing the issue of controlling FDRs for subnetwork identification, which largely remains open. The difficulty is due partially to the fact that, even for a medium-sized network, an extremely large number of subnetworks of various sizes need to be evaluated, and statistical power vanishes after multiple testing correction. Previous work11,16 attempted to address the issue by treating each subnetwork as a test unit and adopting the FDR definition used for conventional multiple testing correction46. A major problem is that hypotheses thus tested are nested, because a subnetwork can be part of, or overlapped with, other subnetworks. It is difficult, if not impossible, to decouple tested hypotheses. Moreover, the FDRs of a set of detected subnetworks are not well defined. For gene-level analysis, if we select K genes at level α, it is expected that αK selected genes are false discoveries. This is not the case for pathway analysis. A detected subnetwork can contain both cancer and non-cancer genes, and thus cannot be declared a false or true discovery. The root of the problem is that while genes are the smallest units, subnetworks are not.

We propose a definition of FDR for subnetwork identification that addresses the aforementioned issues. Recall that the goal of gene-level analysis is to identify a set of genes with the FDR being controlled at a desirable level. Pathway analysis can be considered a natural extension in the sense that, instead of one set of genes, we search for multiple sets of genes, with an additional constraint that each set of genes form a connected subgraph. Thus, it is natural to require that each set of genes, regardless of its size, has an FDR not larger than a preset value. The proposed definition removes the hurdle in assessing the statistical significance of detected subgraphs and has several advantages over the conventional definition. First, it has a clear physical meaning and can be easily used by biologists to control subnetwork sizes based on available resources. Second, given a detected subnetwork, the FDR can be directly interpreted as the proportion of false discoveries among the genes in the subnetwork. This can greatly facilitate downstream experimental validation, which is typically performed on a single-gene basis. Third, and more importantly, the subnetwork identification problem can be solved analytically as a budget-constrained subgraph search problem, as detailed below.

#### Empirical Bayesian analysis to estimate local FDRs

With the above definition, the problem now becomes how to estimate the FDR for a detected subnetwork. Given a list of p values, our goal is to estimate the probabilities of individual genes being false discoveries so that the FDR of a subnetwork can be computed as the average of the probabilities of the genes in the subnetwork. To this end, the empirical Bayesian technique47 is employed. Let p = [p1, ..., pM] be a set of p values that measure the statistical significance of individual genes. We transform the p values into z values z = [z1, ..., zM], and assume that the z values follow a mixture distribution given by f(z) = π0f0(z) + (1 − π0)f1(z), where π0 is the mixture parameter, and f0 and f1 are the null and alternative density functions, respectively. The posterior probability of a gene being a false discovery given its z value, also called the local FDR, can be estimated as fdr(z) = Pr(null | z) = π0f0(z)/f(z). For notational convenience, we set wj = fdr(zj). Let GS = (VS, ES) be a subnetwork. The FDR of GS can be computed as

$$\mathrm{FDR}(G_{\mathrm{S}})=\frac{1}{|V_{\mathrm{S}}|}\sum_{j\in V_{\mathrm{S}}}\mathrm{fdr}(z_{j})=\frac{1}{|V_{\mathrm{S}}|}\sum_{j\in V_{\mathrm{S}}}w_{j} \tag{1}$$
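Equation (1) amounts to averaging the local FDRs of the genes in a subnetwork, as the following minimal Python sketch shows. The function name, the gene identifiers and the local FDR values are hypothetical.

```python
def subnetwork_fdr(nodes, local_fdr):
    """FDR of a subnetwork: mean local FDR of its genes (equation (1))."""
    nodes = list(nodes)
    return sum(local_fdr[j] for j in nodes) / len(nodes)


# Made-up local FDRs w_j for three genes.
local_fdr = {"a": 0.01, "b": 0.05, "c": 0.30}
fdr = subnetwork_fdr(["a", "b", "c"], local_fdr)
```

In this toy case the subnetwork FDR is 0.12. One consequence of the averaging is that a gene whose local FDR exceeds a given bound (here, gene c under a bound of 0.15) can still belong to a subnetwork that satisfies the bound, provided that its neighbours have sufficiently small local FDRs.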

Following the work of ref. 47, f and f0 can be estimated by fitting a standard Poisson generalized linear model48 to the histogram and fitting a normal distribution to the central part of the histogram of the observed data, respectively. Alternatively, the local FDRs can be estimated from p values directly49, by combining the Grenander-density approach50 with empirical null modeling51. In the case where p values are not available, local FDRs can also be estimated directly from test statistics52 by dividing the entire range of test statistics into a series of bins and then estimating the local FDR in each bin by counting the numbers of test statistics from the observed data and from a null model that fall in the bin. This approach is particularly useful when a null model is constructed through a permutation test, where a parameterized form of the null model f0 is not available.
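The binning idea described above can be sketched as follows. This is a deliberately crude illustration and not the estimator of ref. 52: it assumes equal-width bins, takes the null proportion π0 as a user-supplied value (conservatively 1 by default), assigns a local FDR of 1 to empty bins and caps every estimate at 1. The function name, these choices and the toy data are all assumptions.

```python
def binned_local_fdr(observed, null, n_bins=20, pi0=1.0):
    """Estimate a local FDR for each observed test statistic.

    In each bin: fdr = pi0 * (fraction of null statistics in the bin)
                       / (fraction of observed statistics in the bin).
    """
    lo = min(min(observed), min(null))
    hi = max(max(observed), max(null))
    width = (hi - lo) / n_bins or 1.0  # guard against a zero range

    def bin_of(x):
        # The maximum value is assigned to the last bin.
        return min(int((x - lo) / width), n_bins - 1)

    obs_counts = [0] * n_bins
    null_counts = [0] * n_bins
    for x in observed:
        obs_counts[bin_of(x)] += 1
    for x in null:
        null_counts[bin_of(x)] += 1

    fdr_per_bin = []
    for o, n in zip(obs_counts, null_counts):
        if o == 0:
            fdr_per_bin.append(1.0)  # no observed statistics in this bin
        else:
            ratio = pi0 * (n / len(null)) / (o / len(observed))
            fdr_per_bin.append(min(1.0, ratio))
    return [fdr_per_bin[bin_of(x)] for x in observed]


# Toy data: null statistics in [0, 1); observed = the same values plus
# ten large statistics that the null model never produces.
null_stats = [i / 100 for i in range(100)]
observed_stats = null_stats + [5.0] * 10
fdrs = binned_local_fdr(observed_stats, null_stats)
```

Statistics that fall where the null model places its mass receive a local FDR of 1, whereas the ten extreme statistics fall in a bin with no null counts and receive a local FDR of 0. In practice, a null sample obtained by permutation would be much larger, and some smoothing across bins would be needed.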

#### Subgraph searching problem with budget constraint

Once we define a local FDR for each gene, we formulate the subgraph identification problem as a problem of finding a connected subgraph around a seed with a budget constraint (that is, FDR bound B). Here, a seed refers to a gene with a local FDR smaller than or equal to the budget. Among all the subgraphs satisfying the FDR constraint for a given seed, we can select the one with the largest size. This strategy is in spirit similar to previous work aiming to find maximum-scoring subgraphs9,10,14,15. However, this would result in a large subnetwork containing genes from multiple significantly mutated but functionally heterogeneous modules due to the scale-free structure of a biological network23,53 (Supplementary Section 1.2). To address the issue, we replace the size maximization with an objective function that reflects our preference of finding a densely connected subnetwork, where genes tend to share similar biological functions. For the purpose of this study, we use the conductance score54,55 to quantify the closeness of the functional relationships of a gene set in a subnetwork. Given a subnetwork GS = (VS, ES), the conductance score ϕ(GS) measures the fraction of the edges that point outside the subnetwork:

$$\phi ({G}_{\rm{S}})=\frac{{c}_{\rm{S}}}{2{m}_{\rm{S}}+{c}_{\rm{S}}}$$
(2)

where cS = |{(i, j) ∈ E: i ∈ VS, j ∉ VS}| is the number of edges on the boundary of GS, and mS = |{(i, j) ∈ E: i ∈ VS, j ∈ VS}| is the total number of edges in GS. Thus, a small conductance score suggests a densely connected subnetwork isolated from the rest of the network. Using the concept of conductance, we propose to solve the following problem to handle the scale-free structure of a biological network.
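To make equation (2) concrete, the sketch below computes the conductance score on a small toy graph stored as an adjacency dictionary. The graph, the function name and the convention of returning 1.0 for a vertex set with no incident edges are assumptions made for illustration.

```python
def conductance(adj, subset):
    """Equation (2): phi = c_S / (2 * m_S + c_S) for an unweighted graph."""
    S = set(subset)
    boundary = 0        # c_S: edges with exactly one endpoint in S
    internal_twice = 0  # 2 * m_S: every internal edge is seen from both endpoints
    for i in S:
        for j in adj[i]:
            if j in S:
                internal_twice += 1
            else:
                boundary += 1
    volume = internal_twice + boundary
    return boundary / volume if volume else 1.0  # convention for isolated vertices

# Toy graph: a triangle a-b-c attached to a path c-d-e.
adj = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
phi_triangle = conductance(adj, {"a", "b", "c"})  # m_S = 3, c_S = 1
phi_tail = conductance(adj, {"d", "e"})           # m_S = 1, c_S = 1
```

The triangle scores 1/7 and the two-node tail scores 1/3, which matches the intuition that the denser, better-separated set has the smaller conductance.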

### Problem 1

(Seeded dense subgraph searching problem with budget constraint) Given a vertex-weighted graph G = (V, E, w) with weight w: V → [0, 1], a seed s and a budget B, find a connected subgraph GS = (VS, ES) of G, s ∈ VS ⊆ V, ES ⊆ E, that satisfies budget constraint $$1/| {V}_{\rm{S}}| {\sum }_{j\in {V}_{\rm{S}}}{w}_{j}\le B$$ and minimizes ϕ(GS).

By introducing a binary variable xj for each vertex j ∈ V to indicate whether j is in a solution, the above problem can be formulated as the following optimization problem:

$$\begin{array}{rl}\min &\phi ({G}_{\rm{S}})\\ \,{\text{subject}}\ {\text{to}}\,\ &\frac{1}{{\sum }_{j}{x}_{j}}{\sum }_{j}{x}_{j}{w}_{j}\le B,\\ &{x}_{j}\in \{0,1\},\forall j\in V,\\ &{x}_{s}=1,\\ &{V}_{\rm{S}}=\{j:{x}_{j}=1,j\in V\}\,\,{\text{is}}\ {\text{from}}\ {\text{a}}\ {\text{connected}}\ {\text{subgraph}}\ {\text{of}}\,\,G\end{array}$$
(3)
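For intuition only, problem (3) can be solved by exhaustive enumeration on a very small graph. The sketch below is not the method used in this work (which relies on the MILP reformulation described next); the toy graph, the weights and all function names are hypothetical, and the enumeration is exponential in the number of vertices.

```python
from itertools import combinations

def is_connected(adj, S):
    """Check that the subgraph induced by S is connected (depth-first search)."""
    start = next(iter(S))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in S and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(S)

def conductance(adj, S):
    """Equation (2); returns 1.0 when S has no incident edges (a convention)."""
    boundary = sum(1 for i in S for j in adj[i] if j not in S)
    internal_twice = sum(1 for i in S for j in adj[i] if j in S)
    volume = internal_twice + boundary
    return boundary / volume if volume else 1.0

def solve_problem1_bruteforce(adj, w, seed, budget):
    """Enumerate every vertex set containing the seed; keep the best feasible one."""
    others = [v for v in adj if v != seed]
    best = None
    for k in range(len(others) + 1):
        for extra in combinations(others, k):
            S = {seed, *extra}
            if sum(w[v] for v in S) / len(S) > budget:
                continue  # violates the FDR budget
            if not is_connected(adj, S):
                continue  # violates the connectivity constraint
            phi = conductance(adj, S)
            if best is None or phi < best[0]:
                best = (phi, S)
    return best

# Toy graph: low-weight triangle a-b-c attached to high-weight triangle d-e-f.
adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
w = {"a": 0.01, "b": 0.02, "c": 0.03, "d": 0.9, "e": 0.8, "f": 0.9}
best_phi, best_set = solve_problem1_bruteforce(adj, w, seed="a", budget=0.05)
```

On this toy input the budget excludes every set that reaches into the second triangle, and the minimum-conductance feasible set is the first triangle.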

#### Mixed-integer linear programming problem

Solving the above optimization problem poses a serious computational challenge. To address the issue, we transform the problem into a mixed-integer linear programming (MILP) problem by linearizing both the objective function and constraints. Let A and D be the adjacency and degree matrices of G, respectively. The conductance score ϕ(GS) can be computed as

$$\phi ({G}_{\rm{S}})=\left({\sum }_{i,j}\left({D}_{ij}-{A}_{ij}\right){x}_{i}{x}_{j}\right)/{\sum }_{i}{x}_{i}{D}_{ii}$$
(4)
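Equation (4) agrees with equation (2) because the quadratic form in the numerator counts boundary edges (the total degree of the selected vertices minus twice the number of internal edges) and the denominator is the total degree 2mS + cS. The following sketch checks this numerically on a hypothetical five-node graph, using plain nested lists for A and D.

```python
def conductance_matrix_form(A, x):
    """Equation (4): sum_ij (D_ij - A_ij) x_i x_j / sum_i x_i D_ii."""
    n = len(A)
    degree = [sum(row) for row in A]  # diagonal entries D_ii
    numerator = 0
    for i in range(n):
        for j in range(n):
            d_ij = degree[i] if i == j else 0
            numerator += (d_ij - A[i][j]) * x[i] * x[j]
    denominator = sum(x[i] * degree[i] for i in range(n))
    return numerator / denominator

# Toy graph: triangle 0-1-2 attached to the path 2-3-4.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
A = [[0] * 5 for _ in range(5)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

x = [1, 1, 1, 0, 0]  # indicator vector of the triangle
phi = conductance_matrix_form(A, x)  # c_S = 1, 2 * m_S + c_S = 7
```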

and problem (3) can be written into the following equivalent form:

$$\min \ z$$
(5)
$${\text{subject}}\ {\text{to}}\,\ \frac{1}{{\sum }_{j}{x}_{j}}{\sum }_{j}{x}_{j}{w}_{j}\le B$$
(6)
$$z{\sum }_{i}{x}_{i}{D}_{ii}-{\sum }_{i}{\sum }_{j}\left({D}_{ij}-{A}_{ij}\right){x}_{i}{x}_{j}\ge 0$$
(7)
$${x}_{j}\in \{0,1\},\forall j\in V$$
(8)
$${x}_{s}=1$$
(9)
$${V}_{\rm{S}}=\{j:{x}_{j}=1,j\in V\}\,\,{\text{is}}\ {\text{from}}\ {\text{a}}\ {\text{connected}}\ {\text{subgraph}}\ {\text{of}}\,\,G$$
(10)

Note that the budget constraint (6) is a nonlinear function of binary variables {xj}. By setting $${\hat{w}}_{j}={w}_{j}-B$$, the constraint can be rewritten as

$${\sum }_{j}{x}_{j}{\hat{w}}_{j}\le 0$$
(11)
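The step from constraint (6) to constraint (11) can be spelled out as follows. Because xs = 1, the denominator in (6) is at least one, so both sides can be multiplied by it without changing the direction of the inequality:

$$\frac{1}{{\sum }_{j}{x}_{j}}{\sum }_{j}{x}_{j}{w}_{j}\le B\iff {\sum }_{j}{x}_{j}{w}_{j}\le B{\sum }_{j}{x}_{j}\iff {\sum }_{j}{x}_{j}({w}_{j}-B)\le 0$$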

By utilizing the fact that xj is binary, constraint (7) can be linearized using standard techniques56,57. Specifically, we set zi = zxi and impose the following constraints on zi:

$$\left\{\begin{array}{l}{z}_{i}\ge z-\alpha (1-{x}_{i})\\ {z}_{i}\ge \beta {x}_{i}\\ {z}_{i}\le z-\beta (1-{x}_{i})\\ {z}_{i}\le \alpha {x}_{i}\end{array}\right.$$
(12)

where α > z and β < z are constants chosen to be sufficiently large and sufficiently small, respectively. To linearize product xixj, we set xij = xixj and impose the following constraints:

$$\left\{\begin{array}{l}{x}_{ij}\le {x}_{i}\\ {x}_{ij}\le {x}_{j}\\ {x}_{ij}\ge {x}_{i}+{x}_{j}-1\\ {x}_{ij}\ge 0\end{array}\right.$$
(13)
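A quick way to see that constraints (12) and (13) reproduce the products is to compute, for each binary assignment, the interval that the constraints leave for the auxiliary variable and check that it collapses to the product. The sketch below does this by enumeration; the constants `ALPHA = 2.0` and `BETA = -1.0` are an illustrative choice that satisfies α > z and β < z for any z in [0, 1].

```python
ALPHA, BETA = 2.0, -1.0  # assumed bounds with BETA < z < ALPHA for z in [0, 1]

def zi_bounds(z, x_i, alpha=ALPHA, beta=BETA):
    """Interval allowed for z_i by the four inequalities in (12)."""
    lower = max(z - alpha * (1 - x_i), beta * x_i)
    upper = min(z - beta * (1 - x_i), alpha * x_i)
    return lower, upper

def xij_bounds(x_i, x_j):
    """Interval allowed for x_ij by the four inequalities in (13)."""
    lower = max(x_i + x_j - 1, 0)
    upper = min(x_i, x_j)
    return lower, upper

# Constraints (13): the interval collapses to x_i * x_j for all binary pairs.
product_ok = all(
    xij_bounds(x_i, x_j) == (x_i * x_j, x_i * x_j)
    for x_i in (0, 1)
    for x_j in (0, 1)
)

# Constraints (12): the interval collapses to z * x_i for sample values of z.
scaled_ok = all(
    zi_bounds(z, x_i) == (z * x_i, z * x_i)
    for z in (0.0, 0.25, 0.5, 1.0)
    for x_i in (0, 1)
)
```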

Finally, we use the single-commodity flow-based method58 to linearize the connectivity constraint (10). First, we replace each undirected edge (i, j) with two directed edges (i, j) and (j, i), and denote the new edge set as $$E^{\prime}$$. We then introduce an extra node r as the flow source with maximum total flow M, and connect it to the seed node with a directed edge (r, s). To describe the flow in the graph, we associate each directed edge (i, j) with a non-negative variable yij to indicate the amount of the flow from i to j. Our goal is to represent a connected subgraph by all edges carrying flow together with the corresponding vertices, which can be achieved by considering each vertex with a positive incoming flow as a sink consuming one unit of flow. Let r0 be the residual flow in the source. The summation of the residual flow and the flow injected into the network should be equal to the total flow:

$${r}_{0}+{y}_{rs}=M$$
(14)

To ensure that only vertices in an identified subgraph have positive incoming flow, we add constraints for each edge:

$${y}_{ij}\le M{x}_{j},\forall (i,j)\in E^{\prime}$$
(15)

For each vertex, we also add a constraint to enforce flow conservation; that is, the amount of incoming flow equals the amount of outgoing flow plus the amount of the consumed flow:

$${\sum }_{i:(i,j)\in E^{\prime} }{y}_{ij}={x}_{j}+{\sum }_{i:(j,i)\in E^{\prime} }{y}_{ji},\forall j\in V$$
(16)

Finally, the total flow consumed by sinks should equal the flow injected into the graph:

$${\sum }_{j\in V}{x}_{j}={y}_{rs}$$
(17)
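To illustrate why constraints (14) to (17) encode connectivity, the sketch below builds an explicit feasible flow for a connected vertex set: it grows a breadth-first tree from the seed and sends along each tree edge as many units as there are vertices in the subtree below it. It then checks constraints (15) to (17), together with the requirement from (14) that the injected flow does not exceed M. The toy graph, the source label `"r"` and the function names are assumptions made for illustration.

```python
from collections import deque

def build_flow(adj, subset, seed):
    """Return a flow {(i, j): y_ij} reaching every vertex of subset, or None."""
    S = set(subset)
    parent = {seed: None}
    order = []
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v in S and v not in parent:
                parent[v] = u
                queue.append(v)
    if len(order) != len(S):
        return None  # subset is not connected, so no such flow exists
    # Subtree sizes: children always appear after their parent in BFS order.
    size = {v: 1 for v in order}
    for v in reversed(order):
        if parent[v] is not None:
            size[parent[v]] += size[v]
    flow = {(parent[v], v): size[v] for v in order if parent[v] is not None}
    flow[("r", seed)] = len(S)  # y_rs: one unit per selected vertex
    return flow

def check_flow(adj, subset, seed, flow, M):
    """Verify constraints (15)-(17) and that y_rs <= M, as implied by (14)."""
    S = set(subset)
    x = {v: int(v in S) for v in adj}
    for (i, j), y in flow.items():
        if y < 0 or y > M * x[j]:          # constraint (15)
            return False
    for j in adj:
        inflow = sum(y for (a, b), y in flow.items() if b == j)
        outflow = sum(y for (a, b), y in flow.items() if a == j)
        if inflow != x[j] + outflow:       # constraint (16)
            return False
    y_rs = flow[("r", seed)]
    return sum(x.values()) == y_rs and y_rs <= M  # constraint (17) and r0 >= 0

adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"], "e": ["d"],
}
flow = build_flow(adj, {"a", "b", "c"}, "a")
feasible = check_flow(adj, {"a", "b", "c"}, "a", flow, M=len(adj))
```

Conversely, `build_flow` returns `None` for a disconnected set such as {a, e}, because no flow injected at the seed can reach a vertex that is not linked to it through selected vertices.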

By using the constraints above, the flow visits all vertices in a solution, which ensures the connectivity of an identified subgraph. By replacing constraint (10) with (14) to (17), we express the connectivity constraint as a set of linear constraints and obtain the following mixed-integer linear programming problem:

$$\begin{array}{l}\min \,z\\ \,{\text{subject}}\ {\text{to}}\,\ {\sum }_{j}{x}_{j}{\hat{w}}_{j}\le 0\end{array}$$
(18)
$$\begin{array}{l}{\sum }_{i}{z}_{i}{D}_{ii}-{\sum }_{i}{\sum }_{j}\left({D}_{ij}-{A}_{ij}\right){x}_{ij}\ge 0\\ {x}_{s}=1\\ {x}_{j}\in \{0,1\},\forall j\in V\\ {\rm{Constraints}}\,(12)-(17)\end{array}$$
(19)

which can be solved with standard optimization solvers (for example, CPLEX59).

#### Detecting subnetworks in an edge-weighted graph

A PPI network sometimes contains gene-interaction-strength information. The proposed method can be easily extended to detect subnetworks in an edge-weighted graph. Specifically, the edge-weight information can be incorporated into ϕ(GS), defined in equation (2), by replacing cS and mS with their weighted versions:

$${c}_{\rm{S}}={\sum }_{(i,j)\in E,i\in {V}_{\rm{S}},j\notin {V}_{\rm{S}}}{A}_{ij}$$
(20)
$${m}_{\rm{S}}={\sum }_{(i,j)\in E,i\in {V}_{\rm{S}},j\in {V}_{\rm{S}}}{A}_{ij}$$
(21)

where Aij is the edge weight for edge (i, j).
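A minimal sketch of the weighted conductance defined by equations (2), (20) and (21) is given below. The nested-dictionary graph representation, the edge weights and the function name are hypothetical.

```python
def weighted_conductance(adj, subset):
    """phi = c_S / (2 * m_S + c_S) with c_S and m_S as sums of edge weights."""
    S = set(subset)
    boundary = 0.0        # c_S, equation (20)
    internal_twice = 0.0  # 2 * m_S: each internal edge is visited from both ends
    for i in S:
        for j, weight in adj[i].items():
            if j in S:
                internal_twice += weight
            else:
                boundary += weight
    volume = internal_twice + boundary
    return boundary / volume if volume else 1.0  # convention for isolated vertices

# Toy weighted graph: a strongly connected triangle with one weak outgoing edge.
adj = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 2.0},
    "c": {"a": 1.0, "b": 2.0, "d": 0.5},
    "d": {"c": 0.5},
}
phi = weighted_conductance(adj, {"a", "b", "c"})  # m_S = 4.0, c_S = 0.5
```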

#### Identifying local graphs through random walk

Finally, we address the computational issue of the proposed method. A PPI network typically contains tens of thousands of nodes and hundreds of thousands of edges. It is thus computationally expensive to solve problem (18) directly. Note that cancer pathways reported in the literature usually contain only dozens of genes19. Thus, we can substantially reduce the computational complexity by first dropping the FDR constraint to identify a local graph around a seed and then solving the optimization problem on the local graph. In this way, the computational complexity of our method is only proportional to the size of a local graph, instead of the size of the entire network. Local graphs can be efficiently identified through random walk by using the PageRank-Nibble algorithm60. Specifically, given a seed, we first perform a random walk on a PPI network to compute a personalized PageRank score vector, where each element represents the probability that a random walk starting from the seed arrives at the corresponding node. Then, we normalize the score of each node by its degree, and extract a local graph by retaining the nodes with the K largest scores. Given that a random walk is performed only to provide a rough solution, our method is very robust with respect to parameter K and teleportation parameter γ used in the PageRank-Nibble algorithm (Supplementary Section 3.2). Hence, except for the FDR upper bound, our method does not have any critical parameter. Throughout the study, we set K = 400 and γ = 0.85. Because identified local graphs are much smaller than a PPI network, problem (18) can be solved efficiently.
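The local-graph extraction can be sketched as follows. Note that this snippet uses plain power iteration to compute personalized PageRank, not the approximate push procedure of PageRank-Nibble, and it assumes that γ is the probability of following an edge (with 1 − γ the probability of restarting at the seed); the text does not spell out this convention. The toy graph and function names are hypothetical.

```python
def personalized_pagerank(adj, seed, gamma=0.85, n_iter=100):
    """Power iteration for a random walk with restart at the seed."""
    p = {v: 0.0 for v in adj}
    p[seed] = 1.0
    for _ in range(n_iter):
        nxt = {v: 0.0 for v in adj}
        nxt[seed] += 1.0 - gamma  # restart mass returns to the seed
        for u, mass in p.items():
            if not adj[u]:
                nxt[seed] += gamma * mass  # dangling node: send mass to the seed
                continue
            share = gamma * mass / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        p = nxt
    return p

def local_graph(adj, seed, K, gamma=0.85):
    """Keep the K nodes with the largest degree-normalized scores (plus the seed)."""
    p = personalized_pagerank(adj, seed, gamma)
    score = {v: p[v] / len(adj[v]) for v in adj if adj[v]}
    keep = set(sorted(score, key=score.get, reverse=True)[:K]) | {seed}
    return {v: [u for u in adj[v] if u in keep] for v in keep}

adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"], "e": ["d"],
}
scores = personalized_pagerank(adj, "a")
sub = local_graph(adj, "a", K=3)
```

On this toy graph the three retained nodes are the triangle around the seed; the MILP would then be solved on `sub` rather than on the full network.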

#### Exploiting sparse structure of a PPI network

We further reduce the problem size by exploiting the sparse structure of a PPI network. After we extract a local graph of size K, the number of variables {xij} is equal to $${K}^{2}$$. However, {xij} only show up in constraint (19), and if Aij = 0 and i ≠ j, there is no need to define xij. Based on the above observation, we can substantially reduce the problem size by replacing $${\sum }_{i,j}({D}_{ij}-{A}_{ij}){x}_{ij}$$ with $${\sum }_{(i,j)\in {\mathcal{I}}}({D}_{ij}-{A}_{ij}){x}_{ij}$$, where $${\mathcal{I}}=\{(i,j)| {A}_{ij}\ne 0\}\bigcup \{(i,i)| i\in V\}$$. Let EK be the number of edges in a local graph. The number of variables xij to be determined is K + 2EK, which is much smaller than $${K}^{2}$$.
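The variable count can be checked on a small example. The sketch below builds the index set of required xij variables for a hypothetical local graph with K = 5 nodes and EK = 5 edges, assuming no self-loops and no duplicate edges.

```python
def product_variable_index(n_nodes, edges):
    """Index set I: both orientations of each edge plus all diagonal pairs."""
    index = {(i, i) for i in range(n_nodes)}
    for i, j in edges:
        index.add((i, j))
        index.add((j, i))
    return index

K = 5
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
index = product_variable_index(K, edges)

n_sparse = len(index)  # K + 2 * E_K = 5 + 10 = 15
n_dense = K * K        # 25 variables without exploiting sparsity
```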

#### Detailed implementation

Given a seed, we solve problem (18) to identify a densely connected subnetwork. Theoretically, given an FDR bound B, any gene with a local FDR less than B can be used as a seed. Because a significantly disrupted subnetwork usually contains multiple genes with small local FDRs (that is, cancer-related genes), using the above strategy would not only increase computational complexity but also lead to the identification of redundant subnetworks with only minor differences on their boundaries. Intuitively, a seed at the center of a subnetwork should be surrounded by multiple other seeds. We exploited this intuition to address the above issue by sorting the obtained seeds in descending order of the number of other seeds among their direct neighbors and skipping a seed if it had already been included in previous results. In addition, we only considered a subnetwork to be biologically meaningful if it contained more than one gene.
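The seed-ordering and skipping strategy can be sketched as follows. The callable `solve` is a hypothetical stand-in for solving problem (18) for one seed, seeds are taken to be genes with a local FDR no greater than the budget, and the tie-breaking rule (smaller local FDR first, then name) is an assumption that the text does not specify.

```python
def order_seeds(adj, w, budget):
    """Sort seeds by the number of seeds among their direct neighbors (descending)."""
    seeds = {v for v in adj if w[v] <= budget}

    def n_seed_neighbors(v):
        return sum(1 for u in adj[v] if u in seeds)

    # Ties are broken by local FDR and then by name (an assumption).
    return sorted(seeds, key=lambda v: (-n_seed_neighbors(v), w[v], v))

def detect_subnetworks(adj, w, budget, solve):
    """Run the per-seed solver, skipping seeds already covered by earlier results."""
    covered, results = set(), []
    for seed in order_seeds(adj, w, budget):
        if seed in covered:
            continue
        subnetwork = set(solve(seed))
        if len(subnetwork) > 1:  # single-gene results are not reported
            results.append(subnetwork)
            covered |= subnetwork
    return results

adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"], "e": ["d"],
}
w = {"a": 0.01, "b": 0.02, "c": 0.03, "d": 0.9, "e": 0.04}

def toy_solver(seed):
    # Hypothetical solver: returns the triangle for its members, else the seed alone.
    return {"a", "b", "c"} if seed in {"a", "b", "c"} else {seed}

seed_order = order_seeds(adj, w, budget=0.05)
subnetworks = detect_subnetworks(adj, w, budget=0.05, solve=toy_solver)
```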

### Evaluation metrics

We compared FDRnet with five other methods in terms of their ability to detect target genes and subnetworks, their control of FDRs and their computational complexity.

#### F score

We used F score to assess the ability of an algorithm to detect target genes. Let $${\mathcal{A}}$$ and $${\mathcal{B}}$$ be the lists of genes in detected subnetworks and target subnetworks, respectively. The F score is the harmonic mean of recall and precision, given by

$$F({\mathcal{A}},{\mathcal{B}})=\frac{2\times {\rm{recall}}({\mathcal{A}},{\mathcal{B}})\times {\rm{precision}}({\mathcal{A}},{\mathcal{B}})}{{\rm{recall}}({\mathcal{A}},{\mathcal{B}})+{\rm{precision}}({\mathcal{A}},{\mathcal{B}})}$$
(22)

where $${\rm{recall}}({\mathcal{A}},{\mathcal{B}})=| {\mathcal{A}}\bigcap {\mathcal{B}}| /| {\mathcal{B}}|$$ and $${\rm{precision}}({\mathcal{A}},{\mathcal{B}})=| {\mathcal{A}}\bigcap {\mathcal{B}}| /| {\mathcal{A}}|$$.
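Equation (22) translates directly into a few lines of code. In the sketch below, the function name, the gene identifiers and the convention of returning 0 when the two gene lists do not overlap are assumptions made for illustration.

```python
def f_score(detected, target):
    """Equation (22): harmonic mean of recall and precision of two gene sets."""
    A, B = set(detected), set(target)
    overlap = len(A & B)
    if overlap == 0:
        return 0.0  # convention when recall and precision are both zero
    recall = overlap / len(B)
    precision = overlap / len(A)
    return 2 * recall * precision / (recall + precision)

detected = {"g1", "g2", "g3", "g4"}
target = {"g3", "g4", "g5", "g6", "g7", "g8"}
score = f_score(detected, target)  # recall = 1/3, precision = 1/2
```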

#### Fsub score

F score is used to compare two sets of genes. As our goal is to detect significantly perturbed subnetworks, it tells only part of the story. To assess the ability of a method to detect target subnetworks, we proposed a new metric, referred to as the Fsub score, as a natural extension of the F score, for the comparison of two sets of subnetworks. Specifically, let $${\mathscr{A}}=\{{{\mathcal{A}}}_{1},\ldots ,{{\mathcal{A}}}_{M}\}$$ be M detected subnetworks and $${\mathscr{B}}=\{{{\mathcal{B}}}_{1},\ldots ,{{\mathcal{B}}}_{N}\}$$ be N target subnetworks. The F score of $${{\mathcal{A}}}_{i}$$ with respect to $${\mathscr{B}}$$ is defined as

$$F({{\mathcal{A}}}_{i}| {\mathscr{B}})=\mathop{\max }\limits_{j\in \{1,\ldots ,N\}}F({{\mathcal{A}}}_{i},{{\mathcal{B}}}_{j})$$
(23)

where $$F({{\mathcal{A}}}_{i},{{\mathcal{B}}}_{j})$$ is the F score between $${{\mathcal{A}}}_{i}$$ and $${{\mathcal{B}}}_{j}$$. Then, the Fsub score of $${\mathscr{A}}$$ with respect to $${\mathscr{B}}$$ is defined as the size-weighted average of F scores of individual detected subnetworks:

$${F}_{{\rm{sub}}}({\mathscr{A}}| {\mathscr{B}})=\frac{1}{{\sum }_{i}| {{\mathcal{A}}}_{i}| }\mathop{\sum }\nolimits_{i = 1}^{M}| {{\mathcal{A}}}_{i}| \mathop{\max }\limits_{j\in \{1,\ldots ,N\}}F({{\mathcal{A}}}_{i},{{\mathcal{B}}}_{j})$$
(24)
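A minimal sketch of equations (23) and (24) follows. The helper `f_score` repeats equation (22) so that the snippet is self-contained; the function names and the toy subnetworks are hypothetical.

```python
def f_score(detected, target):
    """Equation (22); returns 0 when the two sets do not overlap."""
    A, B = set(detected), set(target)
    overlap = len(A & B)
    if overlap == 0:
        return 0.0
    recall = overlap / len(B)
    precision = overlap / len(A)
    return 2 * recall * precision / (recall + precision)

def f_sub(detected_subnets, target_subnets):
    """Equation (24): size-weighted average of best-match F scores, equation (23)."""
    total_size = sum(len(set(A)) for A in detected_subnets)
    weighted = sum(
        len(set(A)) * max(f_score(A, B) for B in target_subnets)
        for A in detected_subnets
    )
    return weighted / total_size

detected_subnets = [{1, 2, 3}, {7, 8}]
target_subnets = [{1, 2, 3, 4}, {5, 6}]
score = f_sub(detected_subnets, target_subnets)
```

Here the first detected subnetwork matches the first target with F = 6/7 and the second matches nothing, so the size-weighted average is (3 × 6/7 + 2 × 0)/5 = 18/35.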

#### FDR control and computational complexity

The third metric that we used to compare different methods is how well a method controls FDRs. As discussed above, each detected subnetwork can contain both target and non-target genes, and thus cannot be declared a false discovery. Therefore, the FDR of a detected subnetwork is defined as the percentage of non-target genes in the subnetwork. For the simulation study we can compute exact FDRs by using the ground-truth information, and for the breast cancer and lymphoma data we reported only estimated FDRs as defined in equation (1). Finally, we compared the running time of the six competing methods.

## Data availability

The breast cancer somatic mutation and copy number data (dbGaP study accession no. phs000178) were downloaded from the TCGA Firehose website (https://gdac.broadinstitute.org). The iRefIndex9.0 PPI network, the BioGRID v3.5.187 PPI network and the ReactomeFI v2019 PPI network were downloaded from http://compbio-research.cs.brown.edu/pancancer/hotnet2/, https://thebiogrid.org and https://reactome.org, respectively, without any restriction. For the lymphoma study, the gene expression data and the interactome data (HPRD PPI network) were obtained from the BioNet package (https://www.bioconductor.org/packages/release/bioc/html/BioNet.html) without any restriction. Source data are provided with this paper.

## Code availability

The software and user manual are available at https://github.com/yangle293/FDRnet (https://doi.org/10.5281/zenodo.4121885; ref. 61) and www.acsu.buffalo.edu/~yijunsun/lab/FDRnet.html.

## References

1. Beroukhim, R. et al. The landscape of somatic copy-number alteration across human cancers. Nature 463, 899–905 (2010).
2. The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
3. Bailey, M. H. et al. Comprehensive characterization of cancer driver genes and mutations. Cell 173, 371–385 (2018).
4. Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218 (2013).
5. Dees, N. D. et al. MuSiC: identifying mutational significance in cancer genomes. Genome Res. 22, 1589–1598 (2012).
6. Stransky, N. et al. The mutational landscape of head and neck squamous cell carcinoma. Science 333, 1157–1160 (2011).
7. Chapman, M. A. et al. Initial genome sequencing and analysis of multiple myeloma. Nature 471, 467–472 (2011).
8. Raphael, B. J., Dobson, J. R., Oesper, L. & Vandin, F. Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine. Genome Med. 6, 5 (2014).
9. Ideker, T., Ozier, O., Schwikowski, B. & Siegel, A. F. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18, S233–S240 (2002).
10. Dittrich, M. T., Klau, G. W., Rosenwald, A., Dandekar, T. & Müller, T. Identifying functional modules in protein–protein interaction networks: an integrated exact approach. Bioinformatics 24, 223–231 (2008).
11. Vandin, F., Upfal, E. & Raphael, B. J. Algorithms for detecting significantly mutated pathways in cancer. J. Comput. Biol. 18, 507–522 (2011).
12. Ciriello, G., Cerami, E., Sander, C. & Schultz, N. Mutual exclusivity analysis identifies oncogenic network modules. Genome Res. 22, 398–406 (2012).
13. Iorio, F. et al. Pathway-based dissection of the genomic heterogeneity of cancer hallmarks’ acquisition with SLAPenrich. Sci. Rep. 8, 1–16 (2018).
14. Sohler, F., Hanisch, D. & Zimmer, R. New methods for joint analysis of biological networks and expression data. Bioinformatics 20, 1517–1521 (2004).
15. Nacu, Ş., Critchley-Thorne, R., Lee, P. & Holmes, S. Gene expression network analysis and applications to immunology. Bioinformatics 23, 850–858 (2007).
16. Leiserson, M. D. et al. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat. Genet. 47, 106–114 (2015).
17. Reyna, M. A., Leiserson, M. D. & Raphael, B. J. Hierarchical HotNet: identifying hierarchies of altered subnetworks. Bioinformatics 34, i972–i980 (2018).
18. Razick, S., Magklaras, G. & Donaldson, I. M. iRefindex: a consolidated protein interaction database with provenance. BMC Bioinformatics 9, 405 (2008).
19. Giurgiu, M. et al. CORUM: the comprehensive resource of mammalian protein complexes—2019. Nucleic Acids Res. 47, D559–D563 (2019).
20. Beisser, D., Klau, G. W., Dandekar, T., Müller, T. & Dittrich, M. T. BioNet: an R-package for the functional analysis of biological networks. Bioinformatics 26, 1129–1130 (2010).
21. Qiu, Y.-Q., Zhang, S., Zhang, X.-S. & Chen, L. Detecting disease associated modules and prioritizing active genes based on high throughput data. BMC Bioinformatics 11, 26 (2010).
22. Gu, J., Chen, Y., Li, S. & Li, Y. Identification of responsive gene modules by network-based gene clustering and extending: application to inflammation and angiogenesis. BMC Syst. Biol. 4, 47 (2010).
23. Barabasi, A.-L. & Oltvai, Z. N. Network biology: understanding the cell’s functional organization. Nat. Rev. Genet. 5, 101–113 (2004).
24. Oughtred, R. et al. The BioGRID interaction database: 2019 update. Nucleic Acids Res. 47, D529–D541 (2019).
25. Jassal, B. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 48, D498–D503 (2020).
26. Watson, I. R., Takahashi, K., Futreal, P. A. & Chin, L. Emerging patterns of somatic mutations in cancer. Nat. Rev. Genet. 14, 703–718 (2013).
27. Mermel, C. H. et al. Gistic2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 12, R41 (2011).
28. Forbes, S. A. et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 45, D777–D783 (2016).
29. Olivier, M., Hollstein, M. & Hainaut, P. TP53 mutations in human cancers: origins, consequences and clinical use. Cold Spring Harb. Perspect. Biol. 2, a001008 (2010).
30. Khatri, P. & Drăghici, S. Ontological analysis of gene expression data: current tools, limitations and open problems. Bioinformatics 21, 3587–3595 (2005).
31. Dustin, D., Gu, G. & Fuqua, S. A. W. ESR1 mutations in breast cancer. Cancer 125, 3714–3728 (2019).
32. Toy, W. et al. ESR1 ligand-binding domain mutations in hormone-resistant breast cancer. Nat. Genet. 45, 1439–1445 (2013).
33. Martínez-Iglesias, O., Alonso-Merino, E. & Aranda, A. Tumor suppressive actions of the nuclear receptor corepressor 1. Pharmacol. Res. 108, 75–79 (2016).
34. Soutourina, J. Transcription regulation by the Mediator complex. Nat. Rev. Mol. Cell Biol. 19, 262–274 (2018).
35. Eyboulet, F. et al. Mediator links transcription and DNA repair by facilitating Rad2/XPG recruitment. Genes Dev. 27, 2549–2562 (2013).
36. Rosenwald, A. et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. New Engl. J. Med. 346, 1937–1947 (2002).
37. Chapuy, B. et al. Molecular subtypes of diffuse large B cell lymphoma are associated with distinct pathogenic mechanisms and outcomes. Nat. Med. 24, 679–690 (2018).
38. Keshava Prasad, T. et al. Human Protein Reference Database—2009 update. Nucleic Acids Res. 37, D767–D772 (2008).
39. Xu-Monette, Z. Y. et al. Mutational profile and prognostic significance of TP53 in diffuse large B-cell lymphoma patients treated with R-CHOP: report from an international DLBCL Rituximab-CHOP Consortium Program Study. Blood 120, 3986–3996 (2012).
40. Lenz, G. & Staudt, L. M. Aggressive lymphomas. New Engl. J. Med. 362, 1417–1429 (2010).
41. Phelan, J. D. et al. A multiprotein supercomplex controlling oncogenic signalling in lymphoma. Nature 560, 387–391 (2018).
42. Munoz, J., Dhillon, N., Janku, F., Watowich, S. S. & Hong, D. S. STAT3 inhibitors: finding a home in lymphoma and leukemia. Oncologist 19, 536–544 (2014).
43. Hatzi, K. et al. A hybrid mechanism of action for BCL6 in B cells defined by formation of functionally distinct complexes at enhancers and promoters. Cell Rep. 4, 578–588 (2013).
44. Benson, A. R., Gleich, D. F. & Leskovec, J. Higher-order organization of complex networks. Science 353, 163–166 (2016).
45. Yin, H., Benson, A. R., Leskovec, J. & Gleich, D. F. Local higher-order graph clustering. In Proc. 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 555–564 (ACM, 2017); https://doi.org/10.1145/3097983.3098069
46. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).
47. Efron, B., Tibshirani, R., Storey, J. D. & Tusher, V. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 96, 1151–1160 (2001).
48. Efron, B. & Tibshirani, R. Using specially designed exponential families for density estimation. Ann. Stat. 24, 2431–2461 (1996).
49. Strimmer, K. fdrtool: a versatile R package for estimating local and tail area-based false discovery rates. Bioinformatics 24, 1461–1462 (2008).
50. Langaas, M., Lindqvist, B. H. & Ferkingstad, E. Estimating the proportion of true null hypotheses, with application to DNA microarray data. J. R. Stat. Soc. B 67, 555–572 (2005).
51. Efron, B. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J. Am. Stat. Assoc. 99, 96–104 (2004).
52. Hong, W.-J., Tibshirani, R. & Chu, G. Local false discovery rate facilitates comparison of different microarray experiments. Nucleic Acids Res. 37, 7483–7497 (2009).
53. Albert, R. Scale-free networks in cell biology. J. Cell Sci. 118, 4947–4957 (2005).
54. Dao, P. et al. Inferring cancer subnetwork markers using density-constrained biclustering. Bioinformatics 26, i625–i631 (2010).
55. Colak, R. et al. Dense graphlet statistics of protein interaction and random networks. In Pacific Symposium on Biocomputing 178–189 (World Scientific, 2009); https://doi.org/10.1142/9789812836939_0018
56. Adams, W. P. & Sherali, H. D. Linearization strategies for a class of zero-one mixed integer programming problems. Oper. Res. 38, 217–226 (1990).
57. Fan, N. & Pardalos, P. M. Multi-way clustering and biclustering by the ratio cut and normalized cut in graphs. J. Combin. Optim. 23, 224–251 (2012).
58. Dilkina, B. N. & Gomes, C. P. Solving connected subgraph problems in wildlife conservation. In 7th International Conference on the Integration of Constraint Programming, Artificial Intelligence and Operations Research 102–116 (ACM, 2010); https://doi.org/10.1007/978-3-642-13520-0_14
59. IBM, Inc. CPLEX Optimizer Studio 12.7 (2016); https://www.ibm.com/analytics/cplex-optimizer
60. Andersen, R., Chung, F. & Lang, K. Local graph partitioning using PageRank vectors. In 47th Annual IEEE Symposium on Foundations of Computer Science 475–486 (IEEE, 2006); https://doi.org/10.1109/FOCS.2006.44
61. Yang, L. FDRnet 1.0.0 (version 1.0.0) (2020); https://doi.org/10.5281/zenodo.4121885
62. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).

## Acknowledgements

This work is supported in part by NIH R01AI125982 (Y.S.), NIH R01DE024523195 (Y.S.) and NIH R01CA241123 (S.G.).

## Author information

### Contributions

L.Y., S.G. and Y.S. designed the study. L.Y., R.C. and Y.S. performed the data analysis. S.G. performed the biological discussions. L.Y., S.G. and Y.S. wrote the manuscript. All authors read and approved the final manuscript.

### Corresponding authors

Correspondence to Steve Goodison or Yijun Sun.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Peer review information Nature Computational Science thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Editor recognition statement Fernando Chirigati was the primary editor on this Article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Yang, L., Chen, R., Goodison, S. et al. An efficient and effective method to identify significantly perturbed subnetworks in cancer. Nat Comput Sci 1, 79–88 (2021). https://doi.org/10.1038/s43588-020-00009-4
