The nature of biological networks still brings challenges related to computational complexity, interpretable results and statistical significance. Recent work proposes a new method that paves the way for addressing these issues when analyzing cancer genomic data.
For more than a decade, bioinformaticians have used biological networks to analyze cancer genomic data. This approach has shown potential for yielding biological insights into cancer initiation, progression, and response to therapies. Nevertheless, this area of computational biology still faces numerical challenges related to the complexity of biological networks (which usually have non-trivial topological features) and the computational hardness of the associated problems. Existing network biology tools often do not scale well and cannot handle scale-free networks. Most importantly, they compute significance estimates via traditional null hypothesis testing, which is unsuitable in some cases and yields results that are not easily interpretable. In this issue of Nature Computational Science, Le Yang and colleagues¹ reformulate the problem of identifying biological subnetworks that are ‘perturbed’ by genomic alterations in cancer. They redefine the concept of false discovery and implement a two-step approach that is computationally elegant and efficient. This provides an alternative, more intuitive interpretation of significance for the resulting ‘hot subnetworks’ and aims to solve some numerical issues that have long challenged the community.
Cancer arises from the accumulation of DNA alterations acquired since conception (that is, somatic mutations) that deregulate the normal activity of genes involved in a limited number of key biological processes². These ‘driver’ mutations provide evolutionary advantages and specific physiological traits that allow normal cells to turn into cancer cells. The analysis of thousands of cancer genomes has allowed millions of somatic mutations to be catalogued³. A crucial task is to identify which of these actually drive cancer and which biological mechanisms they affect. A complication is that, although only a few genes are frequently somatically mutated in cancer patients, ‘long tails’ of rare genomic aberrations in seemingly unrelated genes are also observed. It is difficult to characterize these as cancer drivers, and they are not clearly distinguishable from passenger events, which do not affect cellular physiology⁴. Because rare mutations might impinge on common biological mechanisms, a common approach has been to analyze somatic mutations in the context of biological pathway maps and reference signaling networks, which summarize the functional relationships among gene products⁵,⁶. The aim is to seek statistical enrichments or combinatorial properties within sets of functionally connected genes and/or other topologically derived components (that is, subnetworks)⁷,⁸. Such ‘hot’ subnetworks (1) are more interpretable than individual genes, (2) reveal mechanistic relations, and (3) unveil large sub-populations of cancer patients that lack mutations in established or frequently altered cancer drivers but harbor private mutations in rarely altered genes. Considering these mutations collectively across patients, and observing their functional links in the identified subnetworks, can elucidate common biological themes and shed light on the mechanisms that drive cancer.
Several approaches have been proposed to overlay cancer somatic mutations on protein–protein-interaction networks, thus identifying impacted subnetworks and unveiling oncogenetic network modules, new therapeutically exploitable cancer vulnerabilities, and immune tolerance mechanisms⁸,⁹. Most of these methods solve an optimization or a statistical combinatorial problem in which several proposed solutions (that is, candidate subnetworks) are scored against a reward function. Usually, this optimization does not control the false discovery rate, and a separate, subsequent step computes the significance of the solutions and estimates their rate of false discoveries. This creates some numerical challenges. First, treating each network solution as a ‘test unit’ involves performing an explosively large number of tests, which grows exponentially with the size of the initial reference network. This compromises the chances of obtaining any significant result once the outcomes are corrected for multiple hypothesis testing. Second, in many cases the solutions are ‘nested’, with multiple network core components shared by many candidate solutions. Thus, the number of tests is inflated by considering the same set of genes multiple times. Finally, there is an interpretation problem. False discovery rates (FDRs) reported by existing methods are associated with individual network solutions, which ultimately are sets of genes connected in the reference network. An FDR of 0.10 for a network solution means that the corresponding network has a 10% probability of being a false discovery. However, this reveals nothing about the likelihood that the genes included in the solution are true cancer drivers.
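The growth in the number of tests can be made concrete with a toy calculation. The following Python sketch is an illustration only and is not taken from any published tool: it assumes a hypothetical star-shaped network (one hub gene connected to a number of leaf genes), counts every connected gene set by brute force, and reports the per-test threshold that a Bonferroni correction (used here as a simple stand-in for multiple-testing correction) would then require. All function names are invented for this example.

```python
from itertools import combinations


def is_connected(nodes, adj):
    """Return True if `nodes` induce a connected subgraph of `adj`."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def count_connected_subnetworks(adj):
    """Brute-force count of connected gene sets (feasible for toy sizes only)."""
    genes = list(adj)
    total = 0
    for size in range(1, len(genes) + 1):
        for subset in combinations(genes, size):
            if is_connected(subset, adj):
                total += 1
    return total


def star_network(n_leaves):
    """Hypothetical toy network: one hub gene linked to `n_leaves` genes."""
    adj = {"hub": {f"g{i}" for i in range(n_leaves)}}
    for i in range(n_leaves):
        adj[f"g{i}"] = {"hub"}
    return adj


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    for n_leaves in (3, 6, 9, 12):
        m = count_connected_subnetworks(star_network(n_leaves))
        print(n_leaves + 1, "genes:", m, "candidate subnetworks,",
              "per-test threshold", bonferroni_threshold(0.05, m))
```

In this toy case every subset that contains the hub is connected, so the count roughly doubles with each added gene, and the corrected threshold shrinks at the same pace.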
Yang and colleagues propose a different definition of the FDR associated with a given network solution (Fig. 1): in their work, the FDR of a subnetwork is defined as the predicted fraction of genes included in the solution that are false positives (that is, non-cancer drivers). The advantage of this is twofold: the FDR of a network (1) has a physical meaning and is immediately interpretable; and (2) can be treated as a parameter of the reward function at optimization run time, which means that it can be controlled algorithmically. Another advantage of this approach is that it is implemented as a two-step procedure. The first step employs a PageRank method (a popular algorithm for webpage ranking) to identify subnetworks surrounding the nodes that are more likely to be true positive cancer genes. The second step then optimizes only these subnetworks instead of the whole reference network, taking into account not only FDRs but also network conductance, which is a measure of connection density. Working on a set of small ‘local subgraphs’ already moves the optimization procedure in the right direction, providing approximate solutions for the conductance minimization problem and thus substantially speeding up the whole process. Overall, the proposed method scales well, performs better than other existing tools, and also works on scale-free networks.
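The ingredients named above (a gene-level FDR, PageRank seeding and conductance) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The subnetwork FDR is estimated here as the mean of hypothetical per-gene local FDR values. Conductance uses the standard graph definition (edges leaving the set divided by the smaller of the two side volumes). The degree-normalized 'sweep' over personalized PageRank scores is a generic local-clustering heuristic swapped in for the actual optimization. The toy network, the scores and all function names are invented for the example.

```python
def subnetwork_fdr(genes, local_fdr):
    """Assumed estimator: expected fraction of non-driver genes in the set."""
    return sum(local_fdr[g] for g in genes) / len(genes)


def conductance(genes, adj):
    """Edges leaving the set divided by the smaller side's total degree."""
    s = set(genes)
    cut = sum(1 for u in s for v in adj[u] if v not in s)
    vol_s = sum(len(adj[u]) for u in s)
    vol_rest = sum(len(adj[u]) for u in adj if u not in s)
    denom = min(vol_s, vol_rest)
    return cut / denom if denom else float("inf")


def personalized_pagerank(adj, seed, restart=0.15, n_iter=100):
    """Power iteration for a random walk that restarts at `seed`.

    Assumes every node has at least one neighbor.
    """
    scores = {u: 0.0 for u in adj}
    scores[seed] = 1.0
    for _ in range(n_iter):
        nxt = {u: 0.0 for u in adj}
        for u, nbrs in adj.items():
            share = (1 - restart) * scores[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        nxt[seed] += restart
        scores = nxt
    return scores


def sweep(adj, local_fdr, seed, fdr_bound=0.1):
    """Among PageRank-ranked prefixes within the FDR bound, pick the one
    with the lowest conductance (illustrative heuristic only)."""
    scores = personalized_pagerank(adj, seed)
    ranked = sorted(adj, key=lambda u: scores[u] / len(adj[u]), reverse=True)
    best, best_phi = None, float("inf")
    for k in range(1, len(ranked)):
        cand = ranked[:k]
        if subnetwork_fdr(cand, local_fdr) > fdr_bound:
            continue
        phi = conductance(cand, adj)
        if phi < best_phi:
            best, best_phi = cand, phi
    return best, best_phi


# Hypothetical toy network: two triangles joined by the edge C-D.
adj = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
# Made-up local FDRs: A, B and C look like drivers; D, E and F do not.
local_fdr = {"A": 0.02, "B": 0.05, "C": 0.10, "D": 0.80, "E": 0.90, "F": 0.95}

if __name__ == "__main__":
    genes, phi = sweep(adj, local_fdr, seed="A")
    print(sorted(genes), round(phi, 3), round(subnetwork_fdr(genes, local_fdr), 3))
```

Because the FDR bound is checked while candidate sets are being scored, false discoveries are controlled during the search rather than in a separate testing step, which is the design point the paragraph above describes.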
The method proposed by Yang and colleagues can also be used with transcriptional, phosphoproteomic and any other gene/protein-level quantitative scores. Further studies are needed on how to extend this approach to multi-omic data integration and on how to use heterogeneous gene/protein quantitative scores simultaneously to guide subnetwork optimization and searches. Nevertheless, this study nicely revamps a topic that has been extensively studied and infuses new lifeblood into a field that can be considered a cornerstone of systems and computational biology.
1. Yang, L., Chen, R., Goodison, S. & Sun, Y. Nat. Comput. Sci. https://doi.org/10.1038/s43588-020-00009-4 (2021).
2. Stratton, M. R., Campbell, P. J. & Futreal, P. A. Nature 458, 719–724 (2009).
3. International Cancer Genome Consortium. Nature 464, 993–998 (2010).
4. Garraway, L. A. & Lander, E. S. Cell 153, 17–37 (2013).
5. Licata, L. et al. Nucleic Acids Res. 48, D504–D510 (2020).
6. Türei, D., Korcsmáros, T. & Saez-Rodriguez, J. Nat. Methods 13, 966–967 (2016).
7. Signorelli, M., Vinciotti, V. & Wit, E. C. BMC Bioinform. 17, 352 (2016).
8. Ciriello, G., Cerami, E., Sander, C. & Schultz, N. Genome Res. 22, 398–406 (2012).
9. Mathews, J. C. et al. Proc. Natl Acad. Sci. USA 117, 16339–16345 (2020).
F.I. receives funding from Open Targets, a public–private initiative involving academia and industry and performs consultancy for the joint CRUK–AstraZeneca Functional Genomics Centre. All the other authors declare no competing interests.
Najgebauer, H., Perron, U. & Iorio, F. Redefining false discoveries in cancer data analyses. Nat Comput Sci 1, 22–23 (2021). https://doi.org/10.1038/s43588-020-00008-5