


Redefining false discoveries in cancer data analyses

The nature of biological networks still poses challenges related to computational complexity, result interpretability and statistical significance. Recent work proposes a new method that paves the way for addressing these issues when analyzing cancer genomic data.

For more than a decade, bioinformaticians have used biological networks as a means to analyze cancer genomic data. This approach has shown potential for gaining biological insights into cancer initiation, progression, and response to therapies. Nevertheless, this area of computational biology still faces challenges related to the complexity of biological networks (which usually exhibit non-trivial topological features) and the computationally hard nature of the associated numerical problems. Existing network biology tools often do not scale well and are not able to handle scale-free networks. Most importantly, they compute significance estimates via traditional null hypothesis testing, which is in some cases unsuitable and yields results that are not easily interpretable. In this issue of Nature Computational Science, Le Yang and colleagues1 reformulate the problem of identifying biological subnetworks that are ‘perturbed’ by genomic alterations in cancer. They redefine the concept of false discovery and implement a two-step approach that is computationally elegant and efficient. This brings in an alternative, more intuitive interpretation of significance for the outputted ‘hot subnetworks’ and tackles numerical issues that have challenged the community for a very long time.

Cancer arises from the accumulation of DNA alterations acquired after conception — that is, somatic mutations — that deregulate the normal activity of genes involved in a limited number of key biological processes2. These ‘driver’ mutations provide evolutionary advantages and specific physiological traits that allow normal cells to turn into cancer cells. The analysis of thousands of cancer genomes has allowed millions of somatic mutations to be catalogued3. A crucial task is to identify which of these actually drive cancer and which biological mechanisms they impact. A complication is that while a few genes are frequently somatically mutated in cancer patients, ‘long tails’ of rare genomic aberrations in seemingly unrelated genes are also observed. These are difficult to characterize as cancer drivers and are not clearly distinguishable from passenger events, which do not affect cellular physiology4. Because rare mutations might impinge on common biological mechanisms, a common approach has been to analyze somatic mutations in the context of biological pathway maps and reference signaling networks, which summarize the functional relationships occurring among gene products5,6. The aim is to seek statistical enrichments or combinatorial properties within sets of functionally connected genes and/or other topologically derived components — that is, subnetworks7,8. Such ‘hot’ subnetworks (1) are more interpretable than individual genes, (2) reveal mechanistic relations, and (3) unveil large sub-populations of cancer patients that lack mutations in established, frequently altered cancer drivers but host private mutations in rarely altered genes. Considering these collectively across patients and observing their functional links in the identified subnetworks elucidates common biological themes and sheds light on the mechanisms that drive cancer.

Several approaches have been proposed to overlay cancer somatic mutations on protein–protein-interaction networks, thus identifying impacted subnetworks and unveiling oncogenic network modules, new therapeutically exploitable cancer vulnerabilities, and immune tolerance mechanisms8,9. Most of these methods solve an optimization or a statistical combinatorial problem in which several proposed solutions — that is, candidate subnetworks — are probed against a reward function. Usually, this optimization does not control the false discovery rate, and computing the significance of the solutions and estimating their rate of false discoveries is delegated to a subsequent step. This creates some numerical challenges. First, considering each network solution as a ‘test unit’ involves performing an explosively large number of tests, which grows exponentially with the size of the initial reference network. This compromises the chances of obtaining any significant result after correcting the outcomes for multiple hypothesis testing. Second, in many cases the solutions are ‘nested’, with multiple network core components shared by many candidate solutions. Thus, the number of tests is inflated by counting the same set of genes multiple times. Finally, there is an interpretation problem. The false discovery rates (FDRs) outputted by existing methods are associated with individual network solutions, which ultimately are sets of genes connected in the reference network. A network solution with an FDR = 0.10 means that there is a 10% probability that the corresponding network is a false discovery. However, this does not reveal anything about the likelihood of the genes included in the solution being true cancer drivers.

Yang and colleagues propose a different definition of the FDR associated with a given network solution (Fig. 1): in their work, the FDR of a subnetwork is defined as the predicted fraction of genes included in the solution that are false positives — that is, non-cancer drivers. The advantage of this is twofold: the FDR of a network (1) has a physical meaning and is immediately interpretable; and (2) can be treated as a parameter of the reward function at optimization run time, which means that it can be controlled algorithmically. Another advantage of this approach is that it is implemented as a two-step procedure: the first step employs a PageRank method (a popular algorithm for webpage ranking) to identify subnetworks surrounding the nodes that are most likely to be true positive cancer genes. A second step then optimizes only these subnetworks instead of the whole reference network, taking into account not only FDRs but also network conductance, which is a measure of connection density. Working on a set of small ‘local subgraphs’ already moves the optimization procedure in the right direction, providing approximate solutions for the conductance minimization problem and thus substantially speeding up the whole process. Overall, the proposed method scales well, performs better than other existing tools, and also works on scale-free networks.
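To illustrate why this redefined FDR is immediately interpretable: if each gene carries a local false discovery rate — the posterior probability of not being a cancer driver — then the FDR of a candidate subnetwork is simply the average of these per-gene values. A minimal sketch of this plug-in estimate, with hypothetical gene names and scores (not taken from the study):

```python
def subnetwork_fdr(subnetwork, lfdr):
    """Estimate the FDR of a candidate subnetwork as the expected
    fraction of its genes that are false positives (non-drivers):
    the mean of the per-gene local false discovery rates."""
    return sum(lfdr[g] for g in subnetwork) / len(subnetwork)

# Hypothetical per-gene local FDRs (posterior probability of NOT
# being a cancer driver); gene names are illustrative only.
lfdr = {"TP53": 0.01, "KRAS": 0.02, "GENE_X": 0.25, "GENE_Y": 0.40}

print(round(subnetwork_fdr(["TP53", "KRAS", "GENE_X", "GENE_Y"], lfdr), 2))  # 0.17
```

Because this quantity is an average over the genes of the solution, it can be evaluated (and thus constrained) for every candidate subnetwork during the optimization itself, rather than being estimated afterwards.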

Fig. 1: Overview of FDRnet, proposed by Yang and colleagues.

Gene scores, representing the likelihood of individual genes being cancer drivers (or any other quantitative measurement), are computed with an empirical Bayesian method and projected onto the reference network. a, Based on these scores, a set of seed nodes is identified; for each seed node, the algorithm assembles a number of random walks and computes a PageRank vector with an entry for each node in the network. b,c, Neighborhoods of the seed nodes are identified (b) by considering only the nodes with the highest PageRank scores, and put forward to a further optimization phase (c) in which the reward function accounts for conductance (a metric of network connectivity) and false discovery rate.
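The two steps sketched in the figure — a seed-personalized PageRank that carves out a local neighborhood, followed by an evaluation of that neighborhood's conductance — can be illustrated with a toy example. This is a self-contained sketch of the general idea, not the authors' implementation; the graph, node names and parameter values are hypothetical:

```python
def personalized_pagerank(adj, seed, alpha=0.85, iters=100):
    """Power iteration for PageRank with restart at a single seed node:
    at each step a walker follows a random edge with probability alpha
    or teleports back to the seed with probability 1 - alpha."""
    rank = {v: float(v == seed) for v in adj}
    for _ in range(iters):
        new = {v: (1 - alpha) * (v == seed) for v in adj}
        for v in adj:
            share = alpha * rank[v] / len(adj[v])
            for u in adj[v]:
                new[u] += share
        rank = new
    return rank

def conductance(adj, subset):
    """Edges crossing the cut around `subset`, divided by the volume
    (sum of degrees) of the smaller side; low values mean a dense
    neighborhood that is weakly connected to the rest of the network."""
    subset = set(subset)
    cut = sum(1 for v in subset for u in adj[v] if u not in subset)
    vol_in = sum(len(adj[v]) for v in subset)
    vol_out = sum(len(adj[v]) for v in adj if v not in subset)
    return cut / min(vol_in, vol_out)

# Toy network: two triangles joined by a single bridge edge (C-D).
adj = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"],
    "D": ["C", "E", "F"], "E": ["D", "F"], "F": ["D", "E"],
}
rank = personalized_pagerank(adj, seed="A")
# The three top-ranked nodes recover the seed's own triangle {A, B, C},
# whose conductance is low (one cut edge over a volume of seven).
neighborhood = sorted(rank, key=rank.get, reverse=True)[:3]
print(neighborhood, conductance(adj, neighborhood))
```

Restricting the subsequent optimization to such PageRank neighborhoods, instead of searching the whole reference network, is what makes the conductance-minimization step tractable.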

The method proposed by Yang and colleagues can also be used with transcriptional, phosphoproteomic and any other gene/protein-level quantitative scores. Further studies are needed on how to extend this approach to multi-omic data integration and to the simultaneous use of heterogeneous gene/protein quantitative scores for guiding subnetwork optimization and searches. Nevertheless, this study nicely revamps a topic that has been extensively studied and breathes new life into a field that can be considered a cornerstone of systems and computational biology.


References

1. Yang, L., Chen, R., Goodison, S. & Sun, Y. Nat. Comput. Sci. (2021).

2. Stratton, M. R., Campbell, P. J. & Futreal, P. A. Nature 458, 719–724 (2009).

3. International Cancer Genome Consortium. Nature 464, 993–998 (2010).

4. Garraway, L. A. & Lander, E. S. Cell 153, 17–37 (2013).

5. Licata, L. et al. Nucleic Acids Res. 48, D504–D510 (2020).

6. Türei, D., Korcsmáros, T. & Saez-Rodriguez, J. Nat. Methods 13, 966–967 (2016).

7. Signorelli, M., Vinciotti, V. & Wit, E. C. BMC Bioinform. 17, 352 (2016).

8. Ciriello, G., Cerami, E., Sander, C. & Schultz, N. Genome Res. 22, 398–406 (2012).

9. Mathews, J. C. et al. Proc. Natl Acad. Sci. USA 117, 16339–16345 (2020).


Author information



Corresponding author

Correspondence to Francesco Iorio.

Ethics declarations

Competing interests

F.I. receives funding from Open Targets, a public–private initiative involving academia and industry and performs consultancy for the joint CRUK–AstraZeneca Functional Genomics Centre. All the other authors declare no competing interests.


Cite this article

Najgebauer, H., Perron, U. & Iorio, F. Redefining false discoveries in cancer data analyses. Nat Comput Sci 1, 22–23 (2021).
