Network-based machine learning and graph theory algorithms for precision oncology

Zhang, Wei; Chien, Jeremy; Yong, Jeongsik; Kuang, Rui

doi:10.1038/s41698-017-0029-7

Review Article
Open access
Published: 08 August 2017

npj Precision Oncology volume 1, Article number: 25 (2017) Cite this article

Abstract

Network-based analytics plays an increasingly important role in precision oncology. Growing evidence in recent studies suggests that cancer can be better understood through mutated or dysregulated pathways or networks rather than individual mutations and that the efficacy of repositioned drugs can be inferred from disease modules in molecular networks. This article reviews network-based machine learning and graph theory algorithms for integrative analysis of personal genomic data and biomedical knowledge bases to identify tumor-specific molecular mechanisms, candidate targets and repositioned drugs for personalized treatment. The review focuses on the algorithmic design and mathematical formulation of these methods to facilitate applications and implementations of network-based analysis in the practice of precision oncology. We review the methods applied in three scenarios to integrate genomic data and network models in different analysis pipelines, and we examine three categories of network-based approaches for repositioning drugs in drug–disease–gene networks. In addition, we perform a comprehensive subnetwork/pathway analysis of mutations in 31 cancer genome projects in the Cancer Genome Atlas and present a detailed case study on ovarian cancer. Finally, we discuss interesting observations, potential pitfalls and future directions in network-based precision oncology.

Introduction

The revolutionary large-scale genomic and sequencing technologies developed in the past two decades have enabled an understanding of cancer biology in individual tumors for personalized treatment. Coordinated national and international efforts for cancer genome projects have been launched to characterize tens of thousands of individual tumors by somatic mutation, gene expression, copy number variation, DNA methylation, and various other types of genomic and epigenomic aberrations.^{1, 2} The large volume of accumulated cancer genomic data has facilitated the identification of precise oncogenes and tumor suppressors for the development of personalized therapeutic strategies. One of the well-recognized new observations in these studies is that cancer is better characterized by frequently mutated or dysregulated pathways than driver mutations, which are often distinct in the tumors of the same type.³ For example, studies have reported that only a few altered genes occur in more than 10% of the samples and that many other altered genes occur in less than 5% of the samples in the same tumor type.⁴ Furthermore, certain cancer types, such as prostate cancer and pediatric cancers, are not driven by a few somatic mutations or copy number variations, and the mechanism might be better understood in the context of systems biology.⁴ This important observation has led to a great effort to develop a collection of network-based computational methods to detect cancer pathways or subnetworks by integration of various genomic data, as shown in Fig. 1a, and these methods can be classified into three categories depending on the scenario of applying the analysis pipeline.

Network-based analysis has also attracted considerable attention in drug repositioning to reduce the cost of new drug development by using repositioned existing drugs on novel targets in drug–target networks for precision oncology.⁵ Based on the hypothesis that drugs tend to be more effective on target genes within or in the vicinity of a disease module in a molecular network,^{5, 6} several network-based approaches have been used to explore networks of drugs, diseases and targets to reposition drugs for new targets, as listed in Fig. 1b. In these methods, the drug–target relations can be inferred by various measures in the network, combining drug–drug, drug–target, drug–disease and disease–gene relations as shown in the drug–disease–target network in Fig. 1d, e. As summarized in Fig. 1b, these methods can be classified into three categories based on the underlying computational formulation: methods using graph connectivity measures, link prediction methods and network-based classification methods.

The focus of this review article is to provide a comprehensive and unified survey of machine learning and graph theory algorithms for network analysis in precision oncology. We compare the methods by their distinctions in methodology and mathematical formulation so that they can be applied and improved appropriately for precision oncology. An overview of this article is given in Fig. 1. We not only review the resources of biomedical and molecular networks listed in Fig. 1g and the network-based methods listed in Fig. 1a, b but also present a comprehensive network-based pathway analysis of mutations in 31 cancer genome projects in the Cancer Genome Atlas (TCGA) listed in Fig. 1h and a case study on ovarian cancer to show the promise of applying network-based analysis.

Biomedical and molecular networks

In the literature, various biological and biomedical network databases have been compiled to support network analysis. Typically, the databases have been curated by the integration of high-throughput experimental screening results from studies in the literature and possibly computational predictions supervised by expert knowledge. The networks represent the collections of molecules, phenotypes and drugs as nodes and their relations as edges in graphs. In Table 1, we enumerate existing molecular networks, phenotype similarity networks or ontologies, and drug–target networks and the resources for obtaining these networks. The properties of these networks, including their nodes, edges and graph structures, are also shown in Table 1.

1. Molecular networks: Biological molecular networks describe relations among molecules, such as protein–protein interactions, gene co-expression, functional similarities, regulatory relations or biochemical reactions. The new-generation high-throughput technologies have provided extensive content to construct such molecular networks. Protein–protein interaction networks are available from several well-maintained databases.^{7,8,9,10,11,12} Primarily, these networks include physical interactions determined by experiments and computationally derived interactions. Proteome-wide protein–protein interactions capture the interplay among proteins based on the functional associations from co-membership of protein complexes and pathways. A functional linkage network is a more comprehensive compilation of functional relations, physical interactions and co-expression in one network.^{13,14,15} A transcriptional regulatory network models the molecular interactions between transcription factors/microRNAs and target genes to regulate transcript expression.^{16, 17} A transcriptional regulatory network is a directed graph in which the edges connect a regulator to its targets. A cellular metabolic network can be constructed by the co-membership of biochemical reactions among metabolites and enzymes.^{18, 19} Several graph structures can be used to represent metabolic pathways, e.g., labeled directed graphs, unions of bipartite graphs (per reaction) and hypergraphs, depending on the level of detail of metabolic reactions to be modeled with the graph.²⁰
2. Phenotype similarity networks and ontologies: Phenotypes, particularly disease phenotypes, are of special interest for cancer studies. The analysis of diseases in the context of other related diseases can offer insight into their genotypic drivers. Online Mendelian Inheritance in Man (OMIM) is a comprehensive compendium of human genes, genetic phenotypes and documentation of their phenotype–gene associations.²¹ Phenotype similarity networks can be constructed based on the genetic resemblance²² or the synopsis of the diseases and sometimes by mRNA expression.²³ Human Phenotype Ontology (HPO) is another more comprehensive organization of all human disease phenotypes in an ontology.²⁴ The ontology is a directed acyclic graph that can be used as a network structure for learning phenotype–gene associations.²⁵
3. Drug–target and drug–drug networks: Drug–target associations can be modeled by a bipartite network with connections between the drugs and their targets. The drug–target pairs are typically derived from FDA-approved or experimental drugs and their human protein targets available from various drug databases.^{26,27,28,29} Several different types of drug–drug similarity networks have been derived for drug repositioning. Drug–drug relations can be inferred based on similarity of molecular basis, chemical substructure, and phenotypes, such as known drug-indication relations, co-membership in drug combinations, and co-morbidity of diseases.³⁰

Table 1 List of molecular and biomedical networks

Network-based analysis of personal genomic profiles

The goal of applying network-based analysis to personal genomic profiles is to identify aberrant network modules that are both informative of cancer mechanisms and predictive of cancer phenotypes. These methods can be classified into three categories based on the design of the analysis pipeline in different scenarios, as shown in Fig. 2. In these scenarios, the detection of the network modules facilitates two other goals: predicting cancer phenotypes and detecting driver genes. Depending on how the network information is processed in the pipeline, the inputs and the outputs to the predictive models or network analysis methods can differ. Below, we describe the three categories of the methods listed in Fig. 1a and then discuss the advantages and limitations of each of the categories.

Model-based integration of whole-genomic profiles and a network

Model-based integration formulates a single unified machine learning framework to integrate genomic profiles with a network as illustrated in Fig. 2a. The core technique is to introduce a network-based regularization into machine learning models such that the coefficients learned on the feature variables form dense subnetworks. The most commonly used network-based regularization is the graph Laplacian regularizer shown in Fig. 3a. The graph Laplacian was first introduced for spectral graph analysis³¹ and then used for semi-supervised learning in machine learning.^{32, 33} The graph Laplacian regularization is a summation of smoothness terms on the variables to encourage similar coefficients on the genes or other genomic features that are connected in the network. Below, we describe the graph Laplacian regularized methods in different learning frameworks as shown in Fig. 3b–e. To precisely describe the models, we also list all the necessary notations in Table 2 and the exact mathematical formulations of the methods in Supplementary Table S2.
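To make the smoothness interpretation concrete, the following minimal sketch evaluates the graph Laplacian penalty on a toy network. It assumes the unnormalized Laplacian L = D − W, whereas the formulations in Supplementary Table S2 may use a normalized variant; the function names and the toy network are illustrative only.

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix W."""
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W

def laplacian_penalty(beta, W):
    """Smoothness penalty written as the quadratic form beta^T L beta."""
    beta = np.asarray(beta, dtype=float)
    return float(beta @ graph_laplacian(W) @ beta)

def edgewise_penalty(beta, W):
    """The same penalty written as a sum of squared differences over edges i < j."""
    W = np.asarray(W, dtype=float)
    n = len(beta)
    return float(sum(W[i, j] * (beta[i] - beta[j]) ** 2
                     for i in range(n) for j in range(i + 1, n)))

# Toy network (assumed): a path 0 - 1 - 2 plus an isolated gene 3.
W_toy = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 0]], dtype=float)
smooth_beta = np.array([1.0, 1.0, 1.0, 0.0])   # connected genes share a coefficient
rough_beta = np.array([1.0, -1.0, 1.0, 0.0])   # connected genes disagree
```

Coefficients that agree along every edge incur no penalty, while each disagreeing edge contributes its weighted squared difference, which is why the regularizer favors coefficient patterns that are constant within connected subnetworks.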

Table 2 Notations

In Fig. 3b, the widely used regression and survival models are extended to include the graph Laplacian constraint for the analysis of genomic data. The paper³⁴ proposed a network-constrained linear regression procedure that combines a graph Laplacian constraint with the L₁-norm sparse linear regression to capture the relations among the regression coefficients.³⁵ This network-based linear regression is equivalent to a standard LASSO optimization problem.³⁴ The paper³⁶ proposed a network-based Cox proportional hazards model (Net-Cox) for survival analysis. In Cox regression, the objective is to learn the regression coefficients β and the baseline hazard function h₀(t) such that the instantaneous risk of an event at time t for a patient x_i can be estimated by \(h(t|\boldsymbol{x}_i) = h_0(t)\exp(\boldsymbol{x}_i^{\rm T}\beta)\). Similarly, the graph Laplacian constraint is introduced on the regression coefficients β. By alternating between maximization with respect to β and h₀(t), a local optimum can be found.
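As an illustration of how the graph Laplacian constraint shapes regression coefficients, the sketch below solves a simplified version of the problem in closed form. It is an assumption-laden simplification: the L₁-norm term of the network-constrained regression is dropped so that only the squared loss and the Laplacian penalty remain, and the function name `laplacian_least_squares` and the parameter `lam` are hypothetical.

```python
import numpy as np

def laplacian_least_squares(X, y, L, lam):
    """Minimize ||y - X b||^2 + lam * b^T L b (no L1 term, a simplification).

    Setting the gradient to zero gives (X^T X + lam * L) b = X^T y.
    Assumes X^T X + lam * L is invertible, e.g. X has full column rank.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.linalg.solve(X.T @ X + lam * np.asarray(L, dtype=float), X.T @ y)

# Toy data (assumed): three features, features 0 and 1 are linked in the network.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=50)
L_toy = np.array([[1.0, -1.0, 0.0],
                  [-1.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0]])

beta_plain = laplacian_least_squares(X_toy, y_toy, L_toy, lam=0.0)
beta_network = laplacian_least_squares(X_toy, y_toy, L_toy, lam=1e4)
```

With a large `lam`, the coefficients of the two linked features are pulled toward a common value, while the unlinked third coefficient is not directly penalized.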

As shown in Fig. 3c, the graph Laplacian constraint can also be introduced into linear classification models such as logistic regression³⁷ and support vector machines (SVMs).³⁸ Given the binary response vector y = (y₁, ..., y_n)^T with y_i ∈ {1, 0}, a Bernoulli log-likelihood function minus both the L₁-norm and the graph Laplacian constraints is maximized to learn the linear coefficients. In the model, \(p(\boldsymbol{x}_i) = \frac{\exp(\beta_0 + \boldsymbol{x}_i^{\rm T}\beta)}{1 + \exp(\beta_0 + \boldsymbol{x}_i^{\rm T}\beta)}\) is the probability that the i-th sample is in class 1. The elastic-net procedure can be applied to maximize the regularized cost function. The paper³⁸ proposed a network-based SVM. Given the +1/−1 binary response vector y, the network-constrained SVM can be formulated as the sum of the hinge loss \(\sum_{i=1}^{n} [1 - y_i(\beta_0 + \boldsymbol{x}_i^{\rm T}\beta)]_+\) and the graph Laplacian constraint, where the subscript “+” denotes the positive part, i.e., z₊ = max{z, 0}.
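The two classification objectives can be written down directly from these definitions. The sketch below only evaluates the quantities (it does not optimize them); the weighting parameter `lam` and all function names are assumptions introduced for illustration, and the L₁-norm term of the logistic model is omitted.

```python
import numpy as np

def logistic_prob(X, beta0, beta):
    """p(x_i) = exp(b0 + x_i^T b) / (1 + exp(b0 + x_i^T b)) for every row of X."""
    z = beta0 + X @ beta
    return 1.0 / (1.0 + np.exp(-z))

def hinge_loss(X, y, beta0, beta):
    """Sum over samples of [1 - y_i (b0 + x_i^T b)]_+ with labels y_i in {+1, -1}."""
    margins = 1.0 - y * (beta0 + X @ beta)
    return float(np.maximum(margins, 0.0).sum())

def network_svm_objective(X, y, beta0, beta, L, lam):
    """Hinge loss plus lam times the graph Laplacian penalty b^T L b."""
    return hinge_loss(X, y, beta0, beta) + lam * float(beta @ L @ beta)

# Toy example (assumed): two features joined by one network edge.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y_toy = np.array([1.0, 1.0, -1.0])
L_toy = np.array([[1.0, -1.0], [-1.0, 1.0]])
```

For instance, the coefficient vector (1, 0) separates the toy samples with a hinge loss of 1 but also pays a Laplacian penalty of 1 because the two linked features receive different weights, so the network term favors the smoother solution (1, 1).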

Semi-supervised learning methods can more conveniently explore the structures among both the genomic features and the patient samples by learning with the graph Laplacians,^{39,40,41} as shown in Fig. 3d. In the bipartite graph formulation introduced in the paper,⁴⁰ gene expression data are represented as a bipartite graph with weighted edges between patient samples and genomic features. The bipartite graph captures the co-expression among the genes and the samples as bi-clusters in the graph such that both the sample clusters and feature modules are explored. In the hypergraph formulation introduced in the papers,^{39, 41} the gene expression data are represented as weighted hyperedges on the patient nodes, and a graph Laplacian on the hypergraph can be introduced for semi-supervised learning on the patient samples. An additional graph Laplacian of a protein–protein interaction (PPI) network is then introduced to incorporate network information among the genomic features.

It is also possible to regularize non-negative matrix factorization (NMF) models with a graph Laplacian,^{42, 43} as shown in Fig. 3e. NMF aims to find two non-negative matrices \(U_{m \times k}\) and \(H_{n \times k}\) whose product accurately approximates the data matrix X, i.e., \(X \approx UH^{\rm T}\). Combining the geometrically-based constraint with the original NMF leads to the graph-regularized NMF, in which the graph Laplacian constraint is expressed with Tr(⋅), the trace of a matrix.
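A minimal sketch of graph-regularized NMF is given below. It assumes the commonly used objective ‖X − UHᵀ‖² + λ·Tr(HᵀLH) with L = D − W and the standard multiplicative updates for that objective; it also assumes that the graph W connects the n items that index the rows of H. The exact formulation reviewed here is given in Supplementary Table S2 and may differ, and the names `gnmf`, `lam` and `eps` are illustrative.

```python
import numpy as np

def gnmf_objective(X, W, U, H, lam):
    """||X - U H^T||_F^2 + lam * Tr(H^T L H) with L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    return float(np.linalg.norm(X - U @ H.T) ** 2 + lam * np.trace(H.T @ L @ H))

def gnmf(X, W, k, lam=1.0, n_iter=200, seed=0, eps=1e-10):
    """Multiplicative updates for graph-regularized NMF (sketch).

    X: non-negative (m x n) data matrix; W: symmetric non-negative (n x n) graph.
    Returns non-negative factors U (m x k) and H (n x k).
    """
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + 0.1
    H = rng.random((n, k)) + 0.1
    D = np.diag(W.sum(axis=1))
    for _ in range(n_iter):
        # eps guards against division by zero; it is not part of the model.
        U *= (X @ H) / (U @ (H.T @ H) + eps)
        H *= (X.T @ U + lam * (W @ H)) / (H @ (U.T @ U) + lam * (D @ H) + eps)
    return U, H
```

Because every update multiplies by a ratio of non-negative quantities, the factors stay non-negative throughout, and the graph term pulls the rows of H for connected items toward each other.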

Preprocessing integration to detect network-based features

The preprocessing integration methods comprise two steps, as illustrated in Fig. 2b. First, the genomic profiles and the network are processed together to generate network-based features; second, standard learning models are applied with the network-based features for predictions. In this scenario, the integration of network and genomic data occurs before applying a learning model. The paper⁴⁴ first proposed a graph algorithm to detect discriminative subnetworks for classification of patient samples. Highly discriminative genes are used as seed genes in a greedy search in a PPI network to find discriminative subnetworks, and then gene expression in each subnetwork is normalized as one feature value for classification with standard logistic regression. A similar approach was later proposed for application with features of discriminative pathways instead of subnetworks.⁴⁵ In this approach, the gene expression in a pathway is normalized as one feature for the collection of pathways from a molecular signature database.⁴⁶ The paper⁴⁷ used disease-specific subnetworks as features, where a set of known disease genes are first mapped into the PPI network and then the subnetworks of the disease genes are identified as disease module features. The paper⁴⁸ proposed implementing label propagation on the mutation data of each patient on a PPI network to generate network-smoothed features for classification of the patients. The paper⁴⁹ proposed to find a small subnetwork to connect all differentially expressed genes in a PPI network and then use the genes in the subnetwork as features to classify patient samples. This setting is the Steiner tree problem in graph theory, and a heuristic algorithm coupled with randomization was designed to combine multiple suboptimal Steiner trees to find an optimum solution with a higher probability.
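The seed-and-extend idea behind discriminative subnetwork detection can be sketched as follows. This is a simplified illustration, not the published algorithm: the discriminative score is assumed to be a Welch-style t-statistic, the subnetwork activity is assumed to be the sum of z-scored expression divided by the square root of the module size, and the names `greedy_subnetwork`, `neighbors` and `max_size` are hypothetical.

```python
import numpy as np

def t_score(values, labels):
    """Absolute Welch-style t-statistic between samples labeled 1 and 0."""
    a, b = values[labels == 1], values[labels == 0]
    spread = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return abs(a.mean() - b.mean()) / (spread + 1e-12)

def subnetwork_activity(expr, genes):
    """Collapse the rows `genes` of expr (genes x samples) into one value per sample."""
    idx = sorted(genes)
    return expr[idx].sum(axis=0) / np.sqrt(len(idx))

def greedy_subnetwork(expr, labels, neighbors, seed, max_size=5):
    """Grow a module from `seed`, adding the neighbor that most improves the score."""
    module = {seed}
    best = t_score(subnetwork_activity(expr, module), labels)
    while len(module) < max_size:
        frontier = {v for u in module for v in neighbors[u]} - module
        scored = [(t_score(subnetwork_activity(expr, module | {v}), labels), v)
                  for v in frontier]
        if not scored:
            break
        score, gene = max(scored)
        if score <= best:      # stop when no neighbor improves discrimination
            break
        module.add(gene)
        best = score
    return module, best

# Toy data (assumed): genes 0 and 1 separate the classes, gene 2 does not.
labels_toy = np.array([1, 1, 1, 0, 0, 0])
expr_toy = np.array([[1.0, 1.2, 0.8, -1.0, -0.8, -1.2],
                     [0.8, 1.0, 1.2, -1.2, -1.0, -0.8],
                     [1.0, -1.0, 0.5, 1.0, -1.0, 0.5]])
neighbors_toy = {0: [1], 1: [0, 2], 2: [1]}
module_toy, score_toy = greedy_subnetwork(expr_toy, labels_toy, neighbors_toy, seed=0)
```

On the toy data the search starting from gene 0 adds gene 1 and then stops, because adding the uninformative gene 2 would lower the score. The resulting activity vector would then serve as a single feature for a standard classifier in the second step.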

This category of algorithms is a very useful generalization of the earlier gene-set-based methods^{50, 51} since the network structures suggest dynamic modules among the genes rather than a fixed set. These modules can be data-specific and disease-specific for improved results. Thus, the data-driven subnetwork discovery introduced by these methods is a key improvement over previous studies.^{50, 51}

Post-analysis of oncogenic alterations in networks

The post-analysis integration methods also consist of two steps, as illustrated in Fig. 2c. First, the genomic profiles are analyzed to generate a list of oncogenic alterations; second, the detected alterations are analyzed in the network. In this post-analysis integration, the network information is integrated in the analysis after the oncogenic alterations are first detected by standard statistical methods. The purpose of these methods is to assess how cancer-driving alterations disrupt a normal cellular system by examining the influences on network components.

The circuit flow algorithm⁵² first identifies differentially expressed genes and then the genomic aberrations by mutations and copy number variations (CNVs) associated with the differential gene expression. Next, a current flow algorithm is applied to find causal paths from the causal genes (altered genes) to the target genes (differentially expressed genes) in a PPI network. Finally, the causal genes are selected by a set-covering algorithm to explain all the differentially expressed target genes.
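The final selection step can be illustrated with the classical greedy approximation to set cover. This sketch assumes that the earlier steps have already produced, for each candidate causal gene, the set of target genes it can explain; the greedy heuristic, the function name and the toy data are illustrative and are not taken from the cited work.

```python
def greedy_set_cover(targets, explains):
    """Pick causal genes until every target is explained (greedy approximation).

    targets: iterable of target genes to be explained.
    explains: dict mapping each candidate causal gene to the set of targets
              it can reach (assumed to come from the path-finding step).
    Returns the chosen genes in selection order and any targets left uncovered.
    """
    uncovered = set(targets)
    chosen = []
    while uncovered:
        # Sorting the keys makes tie-breaking deterministic.
        gene = max(sorted(explains), key=lambda g: len(explains[g] & uncovered))
        gained = explains[gene] & uncovered
        if not gained:         # remaining targets cannot be explained by any gene
            break
        chosen.append(gene)
        uncovered -= gained
    return chosen, uncovered

# Toy example (assumed): four candidate causal genes and five target genes.
explains_toy = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}, "D": {5}}
chosen_toy, left_toy = greedy_set_cover({1, 2, 3, 4, 5}, explains_toy)
```

Here gene A is selected first because it explains three targets, and gene C then covers the remaining two, so two causal genes explain all five targets.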

HotNet⁵³ first maps gene alterations in a gene network and then employs a diffusion kernel⁵⁴ to build an influence graph with the edges weighted by the influence between each pair of genes. Then, a combinatorial problem is formulated to find the subnetworks of genes altered in a significant number of patients. Similarly, TieDIE⁵⁵ and HotNet2,⁵⁶ an extension of HotNet, apply network diffusion to analyze multiple types of genomic alterations, and NetPathID⁵⁷ applies network diffusion to analyze CNVs in 16 types of cancers.
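A diffusion kernel on a small network can be computed as sketched below. The sketch assumes the heat-kernel form K = exp(−tL) with the unnormalized Laplacian L = D − W and a diffusion time `t`; the exact influence measure and parameter choices of HotNet and its successors may differ, and the function name is illustrative.

```python
import numpy as np

def diffusion_kernel(W, t=1.0):
    """K = exp(-t L) with L = D - W, via eigendecomposition of the symmetric L."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W
    evals, evecs = np.linalg.eigh(L)
    # Scale each eigenvector (column) by exp(-t * eigenvalue), then recombine.
    return (evecs * np.exp(-t * evals)) @ evecs.T

# Toy network (assumed): a path 0 - 1 - 2 - 3.
W_path = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
K_path = diffusion_kernel(W_path, t=1.0)

# Influence of an altered gene 0 on every gene in the network.
influence_from_0 = K_path @ np.array([1.0, 0.0, 0.0, 0.0])
```

The entry K[i, j] can be read as the amount of "heat" that reaches gene j when a unit amount is placed on gene i, so it decays with network distance and can be used to weight the edges of an influence graph.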

PARADIGM⁵⁸ is a probabilistic graphical model framework used to model the gene transcription, translation and post-translational events. Each gene is modeled by a factor graph of DNA copy numbers, gene expression, protein levels and protein activities. The factor graphs of genes are connected based on their regulatory relations in a pathway. The genomic and proteomic data are analyzed in the graphical models for the inference of pathway activities in each patient to derive integrated pathway activity (IPA) scores. The significantly altered genes/pathways can be identified using the IPA scores.

The mutual exclusivity module (MEMo) method⁵⁹ is another widely used method in the TCGA project. MEMo first builds a matrix representation of genes that are significantly altered by mutations or CNVs. Then, the altered genes are connected according to their proximity in the HPRD PPI network.⁷ Finally, cliques (subgraphs in which all gene pairs are connected) are identified to analyze the mutual exclusivity in the patient data.

Signaling pathway impact analysis (SPIA)⁶⁰ and mixed integer linear programming (MILP)⁶¹ are two examples of earlier pathway-based methods for genomic data analysis. SPIA applies an iterative algorithm similar to a random walk to measure the pathway perturbations in the regulatory network such that the impact of differentially expressed genes on a pathway can be evaluated.⁶⁰ MILP is an optimization model to predict flux activity states of genes based on gene expression and a metabolic network.⁶¹

Comparison of the methods

Network-based analysis of genomic data is based on the assumptions that cancer-driven aberrations often target different genes in the same pathway or subnetwork in the molecular network and that such systematic behavior can be observed as a coordinated change of genes’ functions in pathways or network modules. Network-based analysis is an effective approach because it has been observed that mutated genes in a cancer pathway can either co-occur in the same patients or be mutually exclusive among the patients, and the systematic behavior is a more detectable and interpretable signal for the assessment of functional impacts of the aberrations.⁵⁹ It has also been shown that feature selection smoothed by graph Laplacian regularization based on the gene co-expression network is highly robust and generates more reproducible feature selections across independent datasets.⁶² Thus, the network-based approach is both well motivated and validated.

The three categories of methods have different relative advantages and disadvantages. Model-based integration methods are a fully supervised approach for both outcome prediction and subnetwork detection. The subnetworks are jointly discovered to contrast the control/case groups in the study based on a global optimization strategy, and thus these methods typically perform better in outcome prediction. In addition, the models can be tuned by a few clearly defined parameters, making it possible to train the models with cross-validation in contrast to the two-step methods in the other categories. The disadvantage is the need for more sophisticated optimization techniques, which are often less scalable. The preprocessing integration methods are more flexible in detecting customizable subnetwork features such that the detected features clearly reflect the hypothesized network-based characteristics. For example, the size and density of discriminative subnetworks can be precisely specified. However, it is not possible to guarantee that the detected subnetwork features are optimal features for prediction with the standard learning model in the second step. The post-analysis integration methods focus on associating mutations or other DNA aberrations with differential expression or certain other molecular phenotypes in the network context. Thus, these methods are highly informative regarding cancer mechanisms in the network.

In model-based integration, Graph LASSO is another choice of graph-based regularization other than the graph Laplacian regularizer.⁶³ Graph LASSO imposes a LASSO loss on each pair of connected variables in the network rather than a squared error as with the graph Laplacian regularizer. The LASSO loss terms force the coefficients of the connected pairs to be identical such that the inconsistent pairs are “sparse.” In practice, the assumption can be too strong in networks with overlapping clusters. In addition, optimization of Graph LASSO-constrained models is generally challenging, while the graph Laplacian regularizer is a quadratic constraint that is relatively straightforward to optimize. Thus, Graph LASSO is a less common choice for network-based integration methods.

Network-based methods for drug repositioning

Network-based algorithms have also been developed for drug repurposing by exploring drug–drug similarities, drug–target relations and gene-gene relations. These methods can be largely classified into three categories, i.e., graph connectivity measures, link prediction models and network-based classification methods, as illustrated in Fig. 4. The methods reviewed under each category are also listed in Fig. 1b. Below, we describe and compare the methods in the three categories.

Graph connectivity measures

The methods in this category are based on measuring the connectivity among the nodes in the graph, such as neighboring relations, the number of shared neighbors and shortest paths, to derive drug–drug, drug–target or drug–disease relations, as illustrated in Fig. 4a. Several early studies^{64, 65} showed that drugs with similar chemical structures, similar transcriptional responses following treatment or similar text-mined profiles often share the same target, which implies that a drug–drug network built from such similarities can be used to reposition a drug for the targets of similar drugs. The paper⁶⁴ derived drug–drug similarities by mining the side-effect descriptions from medical symptoms in the Unified Medical Language System ontology. The paper⁶⁵ developed a method to predict similarities in terms of drug effect by comparing gene expression profiles following drug treatment across multiple cell lines and dosages. Both studies validated the correlation between drug–drug similarity and the likelihood of two drugs sharing a common protein target. Based on these observations, the paper⁶⁶ proposed a recommendation technique for predicting drug–target relations from two inputs: the drug–drug similarity matrix W, computed from the structural similarity of the drugs and the sequence similarity of their targets, and the known drug–target matrix A. By a simple multiplication (R = WA), the scores in matrix R can be used to derive a ranking of the candidate targets for each drug.
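The multiplication R = WA can be illustrated on a toy example. In the sketch below, the similarity values and the drug–target matrix are made up for illustration, and hiding already known pairs before ranking is an added convenience rather than part of the cited method.

```python
import numpy as np

def rank_targets(W, A):
    """Score candidate targets by R = W A and rank them for each drug.

    W: (drugs x drugs) similarity matrix; A: (drugs x targets) known relations.
    Returns R and, per drug, target indices sorted from best to worst,
    with already known targets pushed to the end.
    """
    W = np.asarray(W, dtype=float)
    A = np.asarray(A, dtype=float)
    R = W @ A
    R_new = np.where(A > 0, -np.inf, R)    # hide known drug-target pairs
    order = np.argsort(-R_new, axis=1)
    return R, order

# Toy example (assumed): drug 0 is very similar to drug 1 and unlike drug 2.
W_toy = np.array([[1.0, 0.8, 0.1],
                  [0.8, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
A_toy = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 1, 1]])
R_toy, order_toy = rank_targets(W_toy, A_toy)
```

For drug 0, the top new candidate is target 1, the known target of its most similar drug, which is exactly the "similar drugs share targets" rule expressed as a matrix product.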

The paper²³ performed a large-scale analysis of ~7000 genomic expression profiles in the Gene Expression Omnibus with human disease and drug annotations to create a disease–drug network consisting of drug–drug, drug–disease and disease–disease relations. The study shows that the derived disease–disease relations are highly consistent with the definition in the Medical Subject Headings disease classification tree and that the drug–disease relations can be used to generate hypothesized drug repositioning and side effects. The paper⁶ further generalized the inference to drug–disease proximity in the network by the hypothesis that an effective drug for a disease must target proteins within or in the immediate vicinity of the corresponding disease module in the molecular interaction network. They applied a shortest-path-based measure coupled with a randomization normalization technique to derive the drug–disease proximity scores for the inference.
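A shortest-path proximity score with a randomization-based normalization can be sketched as follows. The sketch assumes one particular choice of measure (the average, over drug targets, of the distance to the closest disease gene) and draws random gene sets uniformly, whereas a published implementation may preserve node degrees or use other distance definitions. All function names are illustrative, and the network is assumed to be connected.

```python
from collections import deque
import random
import statistics

def bfs_distances(adj, source):
    """Unweighted shortest-path distances from `source` by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closest_distance(adj, drug_targets, disease_genes):
    """Mean over drug targets of the distance to the nearest disease gene."""
    total = 0
    for t in drug_targets:
        dist = bfs_distances(adj, t)
        total += min(dist[g] for g in disease_genes if g in dist)
    return total / len(drug_targets)

def proximity_z(adj, drug_targets, disease_genes, n_random=200, seed=0):
    """Observed proximity and its z-score against uniformly random gene sets."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    observed = closest_distance(adj, drug_targets, disease_genes)
    null = []
    for _ in range(n_random):
        rand_targets = rng.sample(nodes, len(drug_targets))
        rand_genes = rng.sample(nodes, len(disease_genes))
        null.append(closest_distance(adj, rand_targets, rand_genes))
    sd = statistics.pstdev(null)
    z = (observed - statistics.mean(null)) / sd if sd > 0 else float("nan")
    return observed, z

# Toy network (assumed): a path 0 - 1 - 2 - 3 - 4 - 5.
adj_toy = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
observed_toy, z_toy = proximity_z(adj_toy, drug_targets=[0, 1], disease_genes=[1, 2])
```

A negative z-score indicates that the drug targets lie closer to the disease genes than expected for random gene sets of the same sizes.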

A recent work in the paper⁶⁷ performed a correlation analysis of disease modules and drug targets in the functional linkage network. The differentially expressed disease genes and the drug–target genes are first overlapped in the functional linkage network, and a mutual predictability score is then computed based on the neighboring relations among the genes to evaluate the repositioning of the drug for the disease.

Link prediction models

Link prediction models predict the relations between drugs and targets based on the global structures of the known interactions in the networks with matrix completion or random-walk approaches, as illustrated in Fig. 4b. The paper⁶⁸ predicted drug–target relations for drug repositioning based on a network of three types of relations: drug–drug structural similarity, target–target sequence similarity and drug–target relations from DrugBank.²⁶ It was shown that exploring the network topology outperforms simple inference rules by graph connectivity measures such as similar drugs sharing the same target or similar targets sharing the same drug. The paper⁶⁹ applied an information-flow approach on a heterogeneous network of drug–drug, disease–disease and target–target similarities along with the known disease–drug and drug–target relations. The algorithm iteratively updates the disease–drug and drug–target relations and converges to stationary scores for the prediction of their relations.

The paper⁷⁰ introduced a bipartite graph-learning method based on kernel regression to learn a co-mapping of drugs in chemical space and targets (proteins) in genomic space into a common pharmacological space. In the pharmacological space, the correlation between compound-protein pairs can be conveniently calculated to predict their interactions for drug repositioning.

The paper⁷¹ proposed a collaborative matrix factorization method to factorize known drug–target relations to predict new relations constrained by the drug–drug similarity network and the target–target similarity network. The paper⁷² proposed a manifold regularization semi-supervised learning method in which two classifiers in drug space and target space are learned and then combined to give a final score for drug–target interaction prediction. The paper⁷³ applied several random-walk methods on a heterogeneous network of drug–drug similarities, target–target similarities and drug–target relations such that the global structure among all the networks can be used to improve the prediction of new drug–target pairs.
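A random walk with restart on a heterogeneous drug–target network can be sketched as below. This is a generic illustration rather than any of the cited algorithms: the three networks are simply stacked into one block matrix with column normalization, and the restart probability, tolerance and function name are assumptions.

```python
import numpy as np

def rwr_drug_target(W_dd, W_tt, A, drug, restart=0.3, tol=1e-10, max_iter=1000):
    """Random walk with restart from one drug; returns stationary target scores.

    W_dd: (drugs x drugs) similarities, W_tt: (targets x targets) similarities,
    A: (drugs x targets) known drug-target relations.
    """
    n_d, n_t = A.shape
    M = np.block([[W_dd, A], [A.T, W_tt]]).astype(float)
    col_sums = M.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # leave isolated nodes as zero columns
    P = M / col_sums                        # column-normalized transition matrix
    p0 = np.zeros(n_d + n_t)
    p0[drug] = 1.0
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (P @ p) + restart * p0
        converged = np.abs(p_next - p).sum() < tol
        p = p_next
        if converged:
            break
    return p[n_d:]

# Toy example (assumed): drug 0 resembles drug 1 far more than drug 2.
W_dd_toy = np.array([[0.0, 0.9, 0.1], [0.9, 0.0, 0.1], [0.1, 0.1, 0.0]])
W_tt_toy = np.array([[0.0, 0.2, 0.2], [0.2, 0.0, 0.2], [0.2, 0.2, 0.0]])
A_toy = np.eye(3)                           # drug i is known to hit target i
scores_toy = rwr_drug_target(W_dd_toy, W_tt_toy, A_toy, drug=0)
```

In the toy network, the walk started at drug 0 assigns a higher stationary score to target 1 (the target of the similar drug 1) than to target 2, which shows how multi-step paths across the stacked networks contribute to the prediction.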

Network-based classification methods

Network-based drug repositioning can also be reformulated as a classification problem such that standard classification methods can be applied to predict the new targets of each drug, as illustrated in Fig. 4c. These methods first extract the network topological features for all the targets in the networks. For each drug, a classifier can be trained with the known targets of the drug as positive samples and the others as negative samples. The learned classifiers can then be used to predict the new targets in the test set for each drug. The paper⁷⁴ proposed mapping disease-specific differentially expressed genes into a PPI network and using network topological features to detect new drug targets based on the known targets from the drug–target database by logistic regression. The paper⁷⁵ also applied a supervised bipartite model to predict the probability of each drug–target interaction based on the known drug targets as labels and the target–target interactions as features, where the bipartite model was augmented with additional training samples from the neighboring drug–target relations.

The paper⁷⁶ constructed a drug–drug kernel matrix based on chemical structure similarities and a target–target kernel matrix based on sequence similarities. For each drug, using the known targets as the positive training samples, an SVM classifier is built with the target–target kernel matrix to classify the candidate genes for new targets. In addition, for each target, using the known drugs as the positive training samples, an SVM classifier is built with the drug–drug kernel matrix to classify the drugs for new repositioned drugs. The paper⁷⁷ adopted a similar approach with two additional advanced kernel methods, applying diffusion-type kernels to integrate both the drug–drug kernel matrix and the target–target kernel matrix to predict the new targets of a drug or the new repositioned drugs for a target.

Comparison of the methods

The three categories of methods have different relative advantages and disadvantages, as shown in Fig. 4d. Graph connectivity measures are straightforward to implement based on standard graph algorithms, and the prediction results are easy to interpret with the edges and the paths in the graph. However, the prediction performance is typically worse since only relatively local information of the networks is considered by the graph algorithms. Link prediction models retrieve the global structures of the networks to predict drug–target interactions for better prediction performance. The disadvantages are the lack of a satisfactory interpretation of the predictions and that the implementation of the models often relies on advanced optimization algorithms. When sophisticated optimization is required, the scalability can be poor. Network-based classification methods are more accurate for repositioning drugs with many known targets as the training samples but are not applicable to drugs with few or no known targets. The prediction results can be interpreted by the network topological features extracted from the networks, depending on the feature extraction strategy.

Another important aspect of the comparison is whether a method can generate de novo predictions for drugs with no known targets or gene targets with no known drugs. Graph connectivity measures are often biased towards highly connected nodes in the graph such that new drugs or less-studied genes typically receive low rankings. Thus, de novo predictions are rarely made by graph connectivity measures. With no positive training pairs available, the network-based classification methods cannot handle the de novo cases at all. Link prediction models are often the most capable of making de novo predictions because global topological structures are generally less biased after proper normalization and control by randomization.
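One common way to realize the control by randomization is to recompute a score on networks that have been rewired while keeping every node's degree fixed, so that highly connected genes stay highly connected in the null model. The following is a minimal sketch under that assumption. The function name `degree_preserving_shuffle`, the double-edge-swap scheme and the toy edge list are illustrative choices and are not taken from any of the reviewed methods.

```python
import random
from collections import Counter

def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Rewire an undirected simple graph by double edge swaps.

    A swap turns edges (a, b) and (c, d) into (a, d) and (c, b), which
    leaves the degree of every node unchanged. Swaps that would create a
    self-loop or a duplicate edge are rejected.
    """
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        old1, old2 = edge_list[i], edge_list[j]
        a, b = old1
        c, d = old2
        if rng.random() < 0.5:          # pick one of the two possible pairings
            c, d = d, c
        if a == d or c == b:            # would create a self-loop
            continue
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if new1 in edge_set or new2 in edge_set:   # would duplicate an edge
            continue
        edge_set -= {old1, old2}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list

def degrees(edge_list):
    """Count how many edges touch each node."""
    return Counter(node for edge in edge_list for node in edge)

# Hypothetical toy network; real use would start from a PPI edge list.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E"), ("E", "F")]
shuffled = degree_preserving_shuffle(edges, n_swaps=20)
```

A prediction score computed on many such shuffled networks gives a null distribution against which the score on the real network can be compared.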

Network-based analysis of TCGA mutation data and a case study on ovarian cancer

To better discuss the network-based methods, we performed a network-based analysis of the mutated genes in the 31 cancer genome projects in TCGA^{78–101} and summarized the enriched KEGG pathways¹⁰² in Fig. 5. For the analysis, the mutation frequencies among the patients in the 31 TCGA provisional studies were downloaded from cBioPortal for Cancer Genomics.¹⁰³ In the network-based analysis, label propagation (λ = 0.5),^{48, 62} as described in Table S2 in the Supplementary Information, was applied to the HPRD PPI network⁷ in each cancer study to capture the highly mutated subnetworks. Label propagation was initialized with the gene mutation frequency among the patients in each cancer study. The sum of the stationary scores of the genes in a KEGG pathway was compared with the scores of 10,000 random gene sets of the same size to derive p-values. In the analysis without the network, the highly mutated genes in each cancer type were tested for overlap with KEGG pathways by enrichment analysis, with p-values derived by the hypergeometric test. The network-based analysis detects more significantly mutated pathways than the analysis without the network, as shown in Fig. 5a, b, respectively.
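The network-based procedure above can be sketched in a few lines. Table S2 is not reproduced in this section, so the sketch assumes the widely used iterative form f ← λ S f + (1 − λ) y with the symmetrically normalized adjacency matrix S; whether λ weights the network term or the initial scores in the original formulation is an assumption here. The function names and the toy network are hypothetical.

```python
import random

def label_propagation(adj, y, lam=0.5, tol=1e-10, max_iter=1000):
    """Iterate f <- lam * S f + (1 - lam) * y until the scores stop changing.

    adj: dict mapping each gene to the set of its neighbours (undirected)
    y:   dict mapping each gene to its initial score (e.g. mutation frequency)
    S is the symmetrically normalised adjacency D^{-1/2} A D^{-1/2} (assumed).
    """
    deg = {u: len(nbrs) for u, nbrs in adj.items()}
    f = dict(y)
    for _ in range(max_iter):
        new_f = {}
        for u, nbrs in adj.items():
            spread = sum(f[v] / (deg[u] * deg[v]) ** 0.5 for v in nbrs)
            new_f[u] = lam * spread + (1 - lam) * y[u]
        change = max(abs(new_f[u] - f[u]) for u in adj)
        f = new_f
        if change < tol:
            break
    return f

def pathway_p_value(scores, pathway, n_random=10000, seed=0):
    """Empirical p-value: fraction of random gene sets of the same size
    whose summed score is at least the pathway's summed score."""
    rng = random.Random(seed)
    genes = sorted(scores)
    observed = sum(scores[g] for g in pathway)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(genes, len(pathway))
        if sum(scores[g] for g in sample) >= observed:
            hits += 1
    return (hits + 1) / (n_random + 1)   # add-one correction avoids p = 0

# Hypothetical toy network and mutation frequencies.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
       "D": {"C", "E"}, "E": {"D", "F"}, "F": {"E"}}
y = {"A": 0.6, "B": 0.0, "C": 0.4, "D": 0.0, "E": 0.0, "F": 0.0}
scores = label_propagation(adj, y, lam=0.5)
p_value = pathway_p_value(scores, ["A", "B", "C"], n_random=1000)
```

In this toy example the unmutated gene B receives a non-zero score because both of its neighbours are mutated, which is the effect that lets a pathway with scattered mutations accumulate a high summed score.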

Interestingly, the network-based analysis in Fig. 5a indicates that the AMPK signaling pathway is affected in breast cancer (BRCA) and uterine corpus endometrial cancer (UCEC). Prior studies demonstrated that BRCA patients receiving metformin, a pharmacological activator of AMPK, showed complete pathologic response, implicating the role of AMPK in BRCA.¹⁰⁴ Similarly, the loss of the AMPK activator LKB1 promotes endometrial cancer progression and metastasis,^{105, 106} implicating the AMPK pathway in endometrial cancer, and metformin inhibits endometrial cancer cell proliferation.¹⁰⁷ The HIF-1 pathway has been predicted to be affected in renal clear cell carcinoma (KIRC), BRCA, endometrial cancer (UCEC), glioblastoma multiforme (GBM), cervical cancer (CESC), and lung cancer (LUAD), and these results are consistent with prior studies implicating the VHL/HIF-1 pathway in these cancers.^{90, 108} The Hippo pathway has been predicted to be affected in colorectal cancer, renal papillary carcinomas, stomach cancer, and liver cancer, and these results are consistent with recent cancer genomic studies.^{97, 109} Finally, the PI3K-Akt pathway has been identified as one of the most frequently affected pathways in several cancer types, and several components of this pathway were reported to be mutated or amplified in various cancer types.¹¹⁰ Collectively, these results suggest that network analysis can identify clinically relevant pathways that are altered in different cancer types.

In the case study on the ovarian cancer patients shown in Fig. 6, the mutation data of the 316 TCGA ovarian cancer patients were downloaded from the Xena Public Data Hubs.¹¹¹ Similar to the study in the paper,⁴⁸ label propagation (λ = 0.1) was applied on the same HPRD PPI network in each patient to detect the patient-specific highly mutated subnetworks. The initialization was 1 for the mutated genes and 0 for the other genes, normalized to sum to 1. As in the pan-cancer analysis, the sum of the stationary scores of the genes in a KEGG pathway was compared with the scores of 10,000 random gene sets of the same size to derive the p-value. In the analysis without the network, the mutated genes in each patient were tested for overlap with KEGG pathways by enrichment analysis, with p-values derived by the hypergeometric test. Hierarchical clustering was applied to cluster the patients into three groups using the –log₁₀ (p-values) as features. The clustering derived from the network-based analysis is significantly associated with survival (Fig. 6c). Notably, three subgroups of tumor samples can be identified from the network-based analysis shown in Fig. 6c, compared to four subgroups in the mutation-based analysis without the network in Fig. 6d. Although the subgroups identified by the mutation-based analysis without the network show no significant association with disease-free survival, two of the subgroups detected by the network-based analysis (Subgroup 1 and Subgroup 3) show significant association with disease-free survival relative to Subgroup 2. Interestingly, Subgroup 1 has the highest copy number alterations, whereas Subgroup 3 has the highest number of pathway alterations. These results are analogous to the spectrum of somatic alterations described by ref. 112. Although those authors placed ovarian cancer in class C, defined by extensive copy number alterations, the spectrum of somatic alterations can be further described as subgroups with higher copy number changes, mixed, and higher mutations within ovarian cancer. This case study shows that via network analysis, several subtypes of ovarian cancer can be grouped together for further assessment of clinical value, such as occurrence, relapse and treatment resistance. This information may also be valuable for the design or assessment of treatment strategies. Collectively, the network analysis unveils important cancer pathways and their correlation to subtypes of cancers that would not be identifiable by analysis of the original mutation data alone.
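For comparison, the baseline without the network can be sketched as a per-patient hypergeometric test followed by the –log₁₀ transform that produces the clustering features. This is an illustrative sketch only: the function names, the gene identifiers and the two toy pathways are hypothetical, and the hierarchical clustering step itself is omitted.

```python
from math import comb, log10

def hypergeom_p_value(n_background, n_pathway, n_mutated, n_overlap):
    """P(X >= n_overlap) when n_mutated genes are drawn without replacement
    from n_background genes, n_pathway of which belong to the pathway."""
    upper = min(n_pathway, n_mutated)
    total = comb(n_background, n_mutated)
    tail = sum(comb(n_pathway, k) * comb(n_background - n_pathway, n_mutated - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

def pathway_features(mutated, pathways, background):
    """Return {pathway name: -log10(p)} for one patient's mutated genes."""
    background = set(background)
    mutated = set(mutated) & background
    features = {}
    for name, members in pathways.items():
        members = set(members) & background
        overlap = len(mutated & members)
        p = hypergeom_p_value(len(background), len(members), len(mutated), overlap)
        features[name] = -log10(p)
    return features

# Hypothetical toy data: 20 background genes, two pathways, one patient.
background = [f"g{i}" for i in range(20)]
pathways = {"P1": ["g0", "g1", "g2", "g3", "g4"],
            "P2": ["g10", "g11", "g12", "g13", "g14"]}
patient = {"g0", "g1", "g2", "g15"}
features = pathway_features(patient, pathways, background)
```

Stacking one such feature vector per patient gives the matrix on which hierarchical clustering would be run. In the network-based variant, the hypergeometric p-values are replaced by the empirical p-values of the propagated scores.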

Discussion

Precision oncology tailors cancer treatment and repositions drugs based on personal genomic information. There are several promising aspects of the application of network-based analysis in precision oncology. With a network to capture the molecular organization in the cellular system, genomic data analysis is both more accurate and more descriptive. The smoothness constraint introduced into the model-based integration methods helps to eliminate false positives and false negatives in high-dimensional genomic data. The network analysis identifies molecular targets in the context of pathways or interaction partners in a subnetwork, which makes them interpretable in terms of molecular mechanisms. For example, in the case study in Fig. 6a, the mutation information of each individual patient is propagated on the PPI network to detect the patient-specific subnetwork, and the resulting patient clustering is significantly associated with survival. As a consequence, network-based analysis often reports consistent marker genes across different studies of the same cancer⁴⁰ or more comparable results in pan-cancer analysis.⁵⁶ Collectively, network-based methods employ molecular and biomedical networks to extract useful personal genomic information and build better predictive models for target identification, phenotype prediction and drug repositioning.

Conceptually, network-based analysis also exploits mutation patterns that are mutually exclusive or co-occurring. Mutually exclusively mutated genes are often located on the same pathway, and network analysis propagates the mutually exclusive signals so that the pathway emerges with a significant signal. Co-occurring mutated genes in a pathway or dense network module also mutually strengthen the mutation signals. The results in Fig. 6 support the view that label propagation captures these mutation patterns in the case study on ovarian cancer.

In drug repositioning, both molecular networks and drug–drug or phenotype similarity networks play important roles. It has been repeatedly observed that genes associated with the same (or similar) diseases tend to lie in a dense module in the PPI network. This observation has motivated effective network-based methods to predict new disease genes.⁴³ The analysis of gene modules of similar diseases in the PPI network has also suggested associations between diseases and gene functions or pathways.⁴³ When drug targets and disease genes are analyzed together in the PPI network, their proximities are useful for drug repositioning.⁶

The methods compared in Figs. 2 and 4 have different relative advantages and disadvantages. The considerations involve a variety of key properties, including the performance of the methods, the interpretation of the results, the difficulty of implementation, the scalability to genome-wide analysis, and the characteristics of the training data. The appropriate choice of a network-based method for a particular analysis can be customized based on the information gained from these comparisons. For example, drugs with more known targets can be repositioned by the network-based classification models, while drugs with no known targets in the candidates can be repositioned by the link prediction methods. Depending on whether the analysis must be highly scalable to a huge network, simple graph connectivity measures or link prediction methods can be used.

In the application of network-based analysis, there are also several practical issues and limitations.

1. Molecular networks often contain biased information. Well-studied genes tend to have more connections in the PPI network, and they are also targets of more drugs and are associated with more disease phenotypes. Typically, it is important to apply normalization and repeat the experiments on randomized networks to assess the statistical significance of the results. The biases also prevent the prediction of de novo disease genes or target genes if the gene has no association with known diseases or is not a target of any drug.²⁵
2. The empirical results of network-based methods rely on tuning parameters. The parameters often balance how much belief is imposed on the network topologies. When excessive weights are assigned to the network topology, there will be an “over-smoothing” effect such that nearly uniform scores are expected among the genes even in large and sparse neighborhoods. Thus, a proper procedure for determining the appropriate (optimal) parameters is critical, for example, by applying cross-validation and wet-lab validation.³⁶
3. Commonly, a molecular network describes a general relation, such as protein–protein physical interaction or functional linkage. In some cases, the relations can be either positive or negative, e.g., gene co-expression. A practical approach is to apply a signed graph Laplacian.¹¹³ The models applied with a signed graph Laplacian can be solved in a manner similar to those with the normal graph Laplacian by the same algorithms.
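To make the third point concrete, the sketch below builds a signed graph Laplacian under one common definition, L = D̄ − W with D̄ᵢᵢ = Σⱼ |Wᵢⱼ|, for which fᵀLf = Σ over edges of |Wᵢⱼ| (fᵢ − sign(Wᵢⱼ) fⱼ)². Whether this matches the exact definition in the cited work is an assumption, and the function names and the three-gene example are hypothetical.

```python
def signed_laplacian(nodes, weights):
    """Build L = D_bar - W for an undirected signed graph.

    nodes:   list of node names (fixes the row/column order)
    weights: dict {(u, v): w}, w > 0 for positive and w < 0 for negative
             relations (e.g. positive or negative co-expression)
    D_bar is diagonal with the sum of absolute edge weights per node.
    """
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    lap = [[0.0] * n for _ in range(n)]
    for (u, v), w in weights.items():
        i, j = index[u], index[v]
        lap[i][j] -= w
        lap[j][i] -= w
        lap[i][i] += abs(w)
        lap[j][j] += abs(w)
    return lap

def smoothness_penalty(lap, f):
    """Quadratic form f^T L f used as the network regulariser."""
    n = len(f)
    return sum(f[i] * lap[i][j] * f[j] for i in range(n) for j in range(n))

# Hypothetical example: a-b positively related, b-c negatively related.
nodes = ["a", "b", "c"]
weights = {("a", "b"): 1.0, ("b", "c"): -1.0}
lap = signed_laplacian(nodes, weights)
consistent = smoothness_penalty(lap, [1.0, 1.0, -1.0])    # signs agree with edges
inconsistent = smoothness_penalty(lap, [1.0, 1.0, 1.0])   # violates the b-c edge
```

Scores that take opposite signs across a negative edge incur no penalty, so the same quadratic regularizer used with the ordinary Laplacian rewards agreement on positive edges and opposition on negative ones.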

Finally, this article targets the scope of precision oncology, including steps for understanding cancer mechanisms, finding targets and repositioning drugs, while previous survey studies have focused on detecting cancer-driven aberrations and understanding of the aberrations in molecular networks/pathways.^{4, 114, 115} This article also surveys several categories of algorithms, including model-based integration and preprocessing integration with machine learning methods, while previous reviews^{4, 114, 115} primarily surveyed the methods in one of the three categories, namely, post-analysis integration of oncogenic alterations in networks. Thus, this article offers a different scope and a more comprehensive survey of computational methods.

Future directions

Several challenges remain in the application of network-based analysis in precision oncology. These challenges concern the data quality, deployment for research or clinical use, and scalability of network analysis.

To precisely model the molecular interactions and drug–target relations, networks of better quality are required. It is known that most molecular networks and drug–target databases are incomplete and biased towards well-studied proteins/genes. Thus, continuing effort to improve the networks with additional experimental data is important. In addition, network modeling at higher resolution is crucial to model complex molecular functions at higher precision. For example, proteins exist as isoforms of genes, and thus isoform–isoform interactions are the true interactions to model in a network^{116–118}; mutations or other structural variations of a protein can also change the protein–protein binding or drug–protein docking in a specific tumor. Furthermore, even within each tumor, heterogeneous cell populations exist, and the drug targets and molecular interactions could be different for each cell population if measured by single-cell RNA sequencing.¹¹⁹ To partially address this issue, several computational methods for quality control of PPI screening have been proposed to reduce the number of false-positive and false-negative PPIs due to spurious errors and systematic biases from the high-throughput techniques.^{120, 121} Currently, it is still impossible to construct these more accurate networks at a large scale due to the limitations of the current high-throughput experimental methods for measurement of molecular interactions or drug screening.

While many network-based methods have been developed to support precision oncology, the methods are implemented independently as non-standardized tools that are not readily accessible to oncologists as a coherent collection for research or clinical use. Thus, there is a strong need to develop a software platform that integrates standardized biomedical and biological network data with analytic software components to support comprehensive network-based analysis of patient genomic data and drug repositioning for precision oncology. This platform should be based on a sophisticated system design to meet oncologists’ requirements and support customization of the analysis pipeline. Part of the concept of such a platform was proposed in the paper⁵ as an integrative network-based infrastructure to identify new druggable targets and repositionable drugs through the targeting of significantly mutated genes identified in human cancer genomes. In the future, the existing tools can be reimplemented as apps on a platform such as Cytoscape¹²² or another software environment similar to Galaxy for NGS data analysis¹²³ to facilitate the development and deployment of the software system for precision oncology.

Finally, scalability is always an issue in network-based analysis since it is common to model millions of genomic features, hundreds of thousands of drugs and tens of thousands of phenotypes in a very large network. For example, in an isoform–isoform interaction network, hundreds of thousands of nodes are contained in a single graph that cannot be loaded onto a computer with less than 100 GB of memory. Such big-data analysis will require more scalable algorithms and efficient computing platforms. For example, the standard label propagation can be applied to low-rank approximations of big graphs, enabling work with networks of millions of nodes.^{124, 125} Parallel implementations of the network-analysis methods, especially the optimization algorithms in those model-based approaches, are also necessary.
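A short derivation suggests why a low-rank approximation makes label propagation tractable on very large graphs. It assumes the common closed-form solution of label propagation with the symmetrically normalized adjacency matrix S, and a rank-k eigen-approximation with orthonormal columns. The symbols U_k, M_k and μ_i are introduced here for illustration, and the cited methods may use a different formulation.

```latex
\[
f^{*} = (1-\lambda)\,(I - \lambda S)^{-1} y,
\qquad
S \approx U_k M_k U_k^{\top},\quad
U_k^{\top} U_k = I_k,\quad
M_k = \operatorname{diag}(\mu_1,\dots,\mu_k).
\]
\[
(I - \lambda U_k M_k U_k^{\top})^{-1}
  = I + U_k \operatorname{diag}\!\left(\frac{\lambda\mu_i}{1-\lambda\mu_i}\right) U_k^{\top}
\quad\Longrightarrow\quad
f^{*} \approx (1-\lambda)\left[\, y
  + U_k \operatorname{diag}\!\left(\frac{\lambda\mu_i}{1-\lambda\mu_i}\right) U_k^{\top} y \right].
\]
```

Because |μ_i| ≤ 1 for the symmetrically normalized adjacency matrix and 0 < λ < 1, every denominator is non-zero. Once the k leading eigenpairs are available, each propagation costs O(nk) operations and O(nk) memory instead of requiring the full n × n matrix.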

References

Weinstein, J. N. et al. The cancer genome atlas pan-cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
Hudson, T. J. et al. International network of cancer genome projects. Nature 464, 993–998 (2010).
Krogan, N. J., Lippman, S., Agard, D. A., Ashworth, A. & Ideker, T. The cancer cell map initiative: defining the hallmark networks of cancer. Mol. Cell 58, 690–698 (2015).
Creixell, P. et al. Pathway and network analysis of cancer genomes. Nat. Methods 12, 615–621 (2015).
Cheng, F., Zhao, J., Fooksa, M. & Zhao, Z. A network-based drug repositioning infrastructure for precision cancer medicine through targeting significantly mutated genes in the human cancer genomes. J. Am. Med. Inform. Assoc. 23, 681–691 (2016).
Guney, E., Menche, J., Vidal, M. & Barabási, A.-L. Network-based in silico drug efficacy screening. Nat. Commun. 7, 10331 (2016).
Prasad, T. K. et al. Human protein reference database-2009 update. Nucleic Acids Res. 37, D767–D772 (2009).
Stark, C. et al. BioGRID: A general repository for interaction datasets. Nucleic Acids Res. 34, D535–D539 (2006).
Chatr-Aryamontri, A. et al. MINT: the molecular interaction database. Nucleic Acids Res. 35, D572–D574 (2007).
Xenarios, I. et al. DIP, the database of interacting proteins: A research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 30, 303–305 (2002).
Szklarczyk, D. et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, D447–D452 (2015).
Hermjakob, H. et al. IntAct: an open source molecular interaction database. Nucleic Acids Res. 32, D452–D455 (2004).
Zhang, B. & Horvath, S. A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol. 4, 1–45 (2005).
Li, W. et al. Integrative analysis of many weighted co-expression networks using tensor computation. PLoS Comput. Biol. 7, e1001106 (2011).
Huttenhower, C. et al. Exploring the human genome with functional maps. Genome Res. 19, 1093–1106 (2009).
Han, H. et al. TRRUST: a reference database of human transcriptional regulatory interactions. Sci. Rep. 5, 11432 (2015).
Liu, Z.-P., Wu, C., Miao, H. & Wu, H. RegNetwork: an integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse. Database 2015, bav095 (2015).
Wishart, D. S. et al. HMDB: the human metabolome database. Nucleic Acids Res. 35, D521–D526 (2007).
Caspi, R. et al. The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res. 36, D623–D631 (2008).
Lacroix, V., Cottret, L., Thebault, P. & Sagot, M. F. An introduction to metabolic networks and their structural analysis. IEEE/ACM Trans. Comput. Biol. Bioinform. 5, 594–617 (2008).
Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A. & McKusick, V. A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005).
Goh, K.-I. et al. The human disease network. Proc. Natl. Acad. Sci. 104, 8685–8690 (2007).
Hu, G. & Agarwal, P. Human disease-drug network based on genomic expression profiles. PLoS One 4, e6536 (2009).
Köhler, S. et al. The human phenotype ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 42, D966–D974 (2014).
Petegrosso, R., Park, S., Hwang, T. H. & Kuang, R. Transfer learning across ontologies for phenome-genome association prediction. Bioinformatics 33, 529–536 (2017).
Wishart, D. S. et al. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 36, D901–D906 (2008).
Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107 (2012).
Chen, X., Ji, Z. L. & Chen, Y. Z. TTD: therapeutic target database. Nucleic Acids Res. 30, 412–415 (2002).
Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M. & Hirakawa, M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 38, D355–D360 (2010).
Wu, Z., Wang, Y. & Chen, L. Network-based drug repositioning. Mol. Biosyst. 9, 1268–1281 (2013).
Chung, F. R. Spectral graph theory, Vol. 92 (American Mathematical Society, 1997).
Zhou, D., Bousquet, O., Lal, T. N., Weston, J. & Schölkopf, B. Learning with local and global consistency. In Advances in Neural Information Processing Systems 321–328 (MIT Press, 2004).
Zhu, X. & Ghahramani, Z. Learning from labeled and unlabeled data with label propagation. Technical Report (CMU, 2002).
Li, C. & Li, H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics 24, 1175–1182 (2008).
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Stat. Methodol. 58, 267–288 (1996).
Zhang, W. et al. Network-based survival analysis reveals subnetwork signatures for predicting outcomes of ovarian cancer treatment. PLoS Comput. Biol. 9, e1002975 (2013).
Sun, H. & Wang, S. Penalized logistic regression for high-dimensional DNA methylation data with case-control studies. Bioinformatics 28, 1368–1375 (2012).
Chen, L., Xuan, J., Riggins, R. B., Clarke, R. & Wang, Y. Identifying cancer biomarkers by network-constrained support vector machines. BMC Syst. Biol. 5, 1 (2011).
Hwang, T., Tian, Z., Kuang, R. & Kocher, J.-P. Learning on weighted hypergraphs to integrate protein interactions and gene expressions for cancer outcome prediction. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining 293–302 (IEEE Computer Society, 2008).
Hwang, T. et al. Robust and efficient identification of biomarkers by classifying features on graphs. Bioinformatics 24, 2023–2029 (2008).
Tian, Z., Hwang, T. & Kuang, R. A hypergraph-based learning algorithm for classifying gene expression and arrayCGH data with prior knowledge. Bioinformatics 25, 2831–2838 (2009).
Cai, D., He, X., Han, J. & Huang, T. S. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1548–1560 (2011).
Hwang, T. et al. Co-clustering phenome-genome for phenotype classification and disease gene discovery. Nucleic Acids Res. 40, e146–e146 (2012).
Chuang, H.-Y., Lee, E., Liu, Y.-T., Lee, D. & Ideker, T. Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 3, 140 (2007).
Lee, E., Chuang, H.-Y., Kim, J.-W., Ideker, T. & Lee, D. Inferring pathway activity toward precise disease classification. PLoS Comput. Biol. 4, e1000217 (2008).
Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).
He, D., Liu, Z.-P. & Chen, L. Identification of dysfunctional modules and disease genes in congenital heart disease by a network-based approach. BMC Genomics 12, 592 (2011).
Hofree, M., Shen, J. P., Carter, H., Gross, A. & Ideker, T. Network-based stratification of tumor mutations. Nat. Methods 10, 1108–1115 (2013).
Jahid, M. J. & Ruan, J. A Steiner tree-based method for biomarker discovery and classification in breast cancer metastasis. BMC Genomics 13, S8 (2012).
Guo, Z. et al. Towards precise classification of cancers based on robust gene functional expression profiles. BMC Bioinformatics 6, 58 (2005).
Edelman, E. et al. Analysis of sample set enrichment scores: assaying the enrichment of sets of genes for individual samples in genome-wide expression profiles. Bioinformatics 22, e108–e116 (2006).
Kim, Y.-A., Wuchty, S. & Przytycka, T. M. Identifying causal genes and dysregulated pathways in complex diseases. PLoS Comput. Biol. 7, e1001095 (2011).
Vandin, F., Upfal, E. & Raphael, B. J. Algorithms for detecting significantly mutated pathways in cancer. J. Comput. Biol. 18, 507–522 (2011).
Kondor, R. I. & Lafferty, J. D. Diffusion kernels on graphs and other discrete input spaces. In Proceedings of the Nineteenth International Conference on Machine Learning, Vol. 2, 315–322 (Morgan Kaufmann Publishers Inc., 2002).
Paull, E. O. et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 29, 2757–2764 (2013).
Leiserson, M. D. et al. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat. Genet. 47, 106–114 (2015).
Hwang, T. H. et al. Large-scale integrative network-based analysis identifies common pathways disrupted by copy number alterations across cancers. BMC Genomics 14, 440 (2013).
Vaske, C. J. et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 26, i237–i245 (2010).
Ciriello, G., Cerami, E., Sander, C. & Schultz, N. Mutual exclusivity analysis identifies oncogenic network modules. Genome Res. 22, 398–406 (2012).
Tarca, A. L. et al. A novel signaling pathway impact analysis. Bioinformatics 25, 75–82 (2009).
Shlomi, T., Cabili, M. N., Herrgård, M. J., Palsson, B. Ø. & Ruppin, E. Network-based prediction of human tissue-specific metabolism. Nat. Biotechnol. 26, 1003–1010 (2008).
Zhang, W., Hwang, B., Wu, B. & Kuang, R. Network propagation models for gene selection. In IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS) 1–4 (IEEE, 2010).
Friedman, J., Hastie, T. & Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441 (2008).
Campillos, M., Kuhn, M., Gavin, A.-C., Jensen, L. J. & Bork, P. Drug target identification using side-effect similarity. Science 321, 263–266 (2008).
Iorio, F. et al. Discovery of drug mode of action and drug repositioning from transcriptional responses. Proc. Natl. Acad. Sci. 107, 14621–14626 (2010).
Alaimo, S., Pulvirenti, A., Giugno, R. & Ferro, A. Drug-target interaction prediction through domain-tuned network-based inference. Bioinformatics 29, 2004–2008 (2013).
Chen, H.-R., Sherr, D. H., Hu, Z. & DeLisi, C. A network based approach to drug repositioning identifies plausible candidates for breast cancer and prostate cancer. BMC Med. Genomics 9, 51 (2016).
Cheng, F. et al. Prediction of drug-target interactions and drug repositioning via network-based inference. PLoS Comput. Biol. 8, e1002503 (2012).
Wang, W., Yang, S., Zhang, X. & Li, J. Drug repositioning by integrating target information through a heterogeneous network model. Bioinformatics 30, 2923–2930 (2014).
Yamanishi, Y., Araki, M., Gutteridge, A., Honda, W. & Kanehisa, M. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24, i232–i240 (2008).
Zheng, X., Ding, H., Mamitsuka, H. & Zhu, S. Collaborative matrix factorization with multiple similarities for predicting drug-target interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1025–1033 (ACM, 2013).
Xia, Z., Wu, L.-Y., Zhou, X. & Wong, S. T. Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces. In BMC Systems Biology, Vol. 4, S6 (BioMed Central Ltd, 2010).
Chen, X., Liu, M.-X. & Yan, G.-Y. Drug-target interaction prediction by random walk on the heterogeneous network. Mol. Biosyst. 8, 1970–1978 (2012).
Emig, D. et al. Drug target prediction and repositioning using an integrated network-based approach. PLoS One 8, e60618 (2013).
Mei, J.-P., Kwoh, C.-K., Yang, P., Li, X.-L. & Zheng, J. Drug-target interaction prediction by learning from local information and neighbors. Bioinformatics 29, 238–245 (2013).
Bleakley, K. & Yamanishi, Y. Supervised prediction of drug-target interactions using bipartite local models. Bioinformatics 25, 2397–2403 (2009).
van Laarhoven, T., Nabuurs, S. B. & Marchiori, E. Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics 27, 3036–3043 (2011).
Ley, T. J. et al. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N. Engl. J. Med. 368, 2059–2074 (2013).
Zheng, S. et al. Comprehensive pan-genomic characterization of adrenocortical carcinoma. Cancer Cell 29, 723–736 (2016).
Cancer Genome Atlas Research Network. et al. Comprehensive molecular characterization of urothelial bladder carcinoma. Nature 507, 315–322 (2014).
Ciriello, G. et al. Comprehensive molecular portraits of invasive lobular breast cancer. Cell 163, 506–519 (2015).
Cancer Genome Atlas Network. et al. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
The Cancer Genome Atlas Research Network. Integrated genomic and molecular characterization of cervical cancer. Nature 543, 378–384 (2017).
Davis, C. F. et al. The somatic genomic landscape of chromophobe renal cell carcinoma. Cancer Cell 26, 319–330 (2014).
Cancer Genome Atlas Network. et al. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012).
Cancer Genome Atlas Research Network. et al. Comprehensive, integrative genomic analysis of diffuse lower-grade gliomas. N. Engl. J. Med. 372, 2481–2498 (2015).
Brennan, C. W. et al. The somatic genomic landscape of glioblastoma. Cell 155, 462–477 (2013).
McLendon, R. et al. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061–1068 (2008).
Cancer Genome Atlas Network. et al. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature 517, 576–582 (2015).
Cancer Genome Atlas Research Network. et al. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 499, 43–49 (2013).
Cancer Genome Atlas Research Network. et al. Comprehensive molecular profiling of lung adenocarcinoma. Nature 511, 543–550 (2014).
Cancer Genome Atlas Research Network. et al. Comprehensive genomic characterization of squamous cell lung cancers. Nature 489, 519–525 (2012).
Ceccarelli, M. et al. Molecular profiling reveals biologically discrete subsets and pathways of progression in diffuse glioma. Cell 164, 550–563 (2016).
The Cancer Genome Atlas Research Network. Integrated genomic characterization of oesophageal carcinoma. Nature 541, 169–175 (2017).
Cancer Genome Atlas Research Network. et al. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–615 (2011).
Campbell, J. D. et al. Distinct patterns of somatic genome alterations in lung adenocarcinomas and squamous cell carcinomas. Nat. Genet. 48, 607–616 (2016).
Cancer Genome Atlas Research Network. et al. Comprehensive molecular characterization of papillary renal-cell carcinoma. N. Engl. J. Med. 374, 135–145 (2016).
Cancer Genome Atlas Research Network. et al. Integrated genomic characterization of papillary thyroid carcinoma. Cell 159, 676–690 (2014).
Cancer Genome Atlas Research Network. et al. The molecular taxonomy of primary prostate cancer. Cell 163, 1011–1025 (2015).
Cancer Genome Atlas Research Network. et al. Comprehensive molecular characterization of gastric adenocarcinoma. Nature 513, 202–209 (2014).
Cancer Genome Atlas Research Network. et al. Integrated genomic characterization of endometrial carcinoma. Nature 497, 67–73 (2013).
Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30 (2000).
Gao, J. et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci. Signal. 6, pl1 (2013).
Jiralerspong, S. et al. Metformin and pathologic complete responses to neoadjuvant chemotherapy in diabetic patients with breast cancer. J. Clin. Oncol. 27, 3297–3302 (2009).
Contreras, C. M. et al. Loss of LKB1 provokes highly invasive endometrial adenocarcinomas. Cancer Res. 68, 759–766 (2008).
Peña, C. G. et al. LKB1 loss promotes endometrial cancer progression via CCL2-dependent macrophage recruitment. J. Clin. Invest. 125, 4063–4076 (2015).
Cantrell, L. A. et al. Metformin is a potent inhibitor of endometrial cancer cell proliferation: implications for a novel treatment strategy. Gynecol. Oncol. 116, 92–98 (2010).
Pansare, V. et al. Increased expression of hypoxia-inducible factor 1α in type I and type II endometrial carcinomas. Mod. Pathol. 20, 35–43 (2007).
Harvey, K. F., Zhang, X. & Thomas, D. M. The Hippo pathway and human cancer. Nat. Rev. Cancer 13, 246–257 (2013).
Yuan, T. & Cantley, L. PI3K pathway alterations in cancer: variations on a theme. Oncogene 27, 5497–5510 (2008).
Goldman, M. et al. The UCSC cancer genomics browser: update 2015. Nucleic Acids Res. 43, D812 (2015).
Ciriello, G. et al. Emerging landscape of oncogenic signatures across human cancers. Nat. Genet. 45, 1127–1133 (2013).
Zhang, W., Johnson, N., Wu, B. & Kuang, R. Signed network propagation for detecting differential gene expressions and DNA copy number variations. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine 337–344 (ACM, 2012).
Kidd, B. A., Readhead, B. P., Eden, C., Parekh, S. & Dudley, J. T. Integrative network modeling approaches to personalized cancer medicine. Personal. Med. 12, 245–257 (2015).
Dimitrakopoulos, C. M. & Beerenwinkel, N. Computational approaches for the identification of cancer genes and pathways. Wiley Interdiscip. Rev. Syst. Biol. Med. 9 (2017).
Zhang, W. et al. Network-based isoform quantification with RNA-seq data for cancer transcriptome analysis. PLoS Comput. Biol. 11, e1004465 (2015).
Tseng, Y.-T. et al. IIIDB: a database for isoform-isoform interactions and isoform network modules. BMC Genomics 16, S10 (2015).
Li, W. et al. Pushing the annotation of cellular activities to a higher resolution: predicting functions at the isoform level. Methods 93, 110–118 (2016).
Sultan, M. et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321, 956–960 (2008).
Vazquez, A., Rual, J.-F. & Venkatesan, K. Quality control methodology for high-throughput protein-protein interaction screening. Netw. Biol. Methods Appl. 781, 279–294 (2011).
Hosur, R. et al. A computational framework for boosting confidence in high-throughput protein-protein interaction datasets. Genome Biol. 13, R76 (2012).
Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
Giardine, B. et al. Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 15, 1451–1455 (2005).
Petegrosso, R., Zhang, W., Li, Z., Saad, Y. & Kuang, R. Low-rank label propagation for semi-supervised learning with 100 millions samples. Preprint at https://arxiv.org/abs/1702.08884 (2017).
Tian, Z. & Kuang, R. Global linear neighborhoods for efficient label propagation. In Proceedings of the 2012 SIAM International Conference on Data Mining 863–872 (SIAM, 2012).

Acknowledgements

The results are based upon data generated by The Cancer Genome Atlas (TCGA), established by the NCI and NHGRI. Information about TCGA and the investigators and institutions that constitute the TCGA research network can be found at http://cancergenome.nih.gov. The dbGaP accession number for the specific version of the TCGA dataset is phs000178.v9.p8. This research was supported by a grant from the National Science Foundation, USA (NSF III 1149697).

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Minnesota Twin Cities, Minneapolis, MN, USA
Wei Zhang & Rui Kuang
Department of Cancer Biology, University of Kansas Medical Center, Kansas City, KS, USA
Jeremy Chien
Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota Twin Cities, Minneapolis, MN, USA
Jeongsik Yong

Contributions

W.Z. and R.K. drafted the manuscript and designed the experiments. W.Z. performed the experiments and analyzed the results. J.C. and J.Y. analyzed the results. W.Z., J.C., J.Y. and R.K. wrote the manuscript.

Corresponding author

Correspondence to Rui Kuang.

Ethics declarations

Competing Interests

The authors declare that they have no competing financial interests.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Supplementary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhang, W., Chien, J., Yong, J. et al. Network-based machine learning and graph theory algorithms for precision oncology. npj Precision Onc 1, 25 (2017). https://doi.org/10.1038/s41698-017-0029-7

Received: 02 February 2017
Revised: 28 June 2017
Accepted: 29 June 2017
Published: 08 August 2017
DOI: https://doi.org/10.1038/s41698-017-0029-7
