Network-Based Biomarkers Enable Cross-Disease Biomarker Discovery

Biomarkers lie at the heart of precision medicine, biodiversity monitoring, agricultural pathogen detection, amongst others. Surprisingly, while rapid genomic profiling is becoming ubiquitous, the development of biomarkers almost always involves the application of bespoke techniques that cannot be directly applied to other datasets. There is an urgent need for a systematic methodology to create biologically-interpretable molecular models that robustly predict key phenotypes. We therefore created SIMMS: an algorithm that fragments pathways into functional modules and uses these to predict phenotypes. We applied SIMMS to multiple data-types across four diseases, and in each it reproducibly identified subtypes, made superior predictions to the best bespoke approaches, and identified known and novel signaling nodes. To demonstrate its ability on a new dataset, we measured 33 genes/nodes of the PIK3CA pathway in 1,734 FFPE breast tumours and created a four-subnetwork prediction model. This model significantly out-performed existing clinically-used molecular tests in an independent 1,742-patient validation cohort. SIMMS is generic and can work with any molecular data or biological network, and is freely available at: https://cran.r-project.org/web/packages/SIMMS.


Abstract
Biomarkers lie at the heart of precision medicine, biodiversity monitoring, agricultural pathogen detection, amongst others. Surprisingly, while rapid genomic profiling is becoming ubiquitous, the development of biomarkers almost always involves the application of bespoke techniques that cannot be directly applied to other datasets. There is an urgent need for a systematic methodology to create biologically-interpretable molecular models that robustly predict key phenotypes. We therefore created SIMMS: an algorithm that fragments pathways into functional modules and uses these to predict phenotypes. We applied SIMMS to multiple data-types across four diseases, and in each it reproducibly identified subtypes, made superior predictions to the best bespoke approaches, and identified known and novel signaling nodes. To demonstrate its ability on a new dataset, we measured 33 genes/nodes of the PIK3CA pathway in 1,734 FFPE breast tumours and created a four-subnetwork prediction model. This model significantly out-performed existing clinically-used molecular tests in an independent 1,742-patient validation cohort. SIMMS is generic and can work with any molecular data or biological network, and is freely available at: https://cran.rproject.org/web/packages/SIMMS. Most human disease is complex, caused by interaction of genetic, epigenetic and environmental insults. A single disease phenotype can often arise in many ways, allowing a diversity of molecular underpinnings to yield a smaller number of phenotypic consequences. This molecular heterogeneity within a single disease is believed to underlie poor clinical trial results for some therapies 1 and the modest performance of many genome-wide association studies [2][3][4] .
Researchers thus face two challenges. First, molecular markers are needed to personalize and optimize treatment decisions by predicting patient outcome (prognosis/residual risk) and response to therapy. Second, clinical heterogeneity in patient phenotypes needs to be molecularly rationalized to allow targeting of the mechanistic underpinnings of disease. For example, if a single pathway is dysregulated in multiple ways, drugs targeting the pathway could be applied.
Several approaches have been taken to solve these challenges. The most common has been to measure mRNA abundances as a snapshot of cellular state, and to construct a predictive model from them 5,6 . Unfortunately these studies have been limited by noise and disease heterogeneity. Further, RNA is rarely directly functional 7 . Several groups have integrated multiple data-types using network and systems biology approaches identifying patient subtypes, with limited post-hoc clinical evaluation [8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26] . These algorithms have not yet clearly shown how the interplay between different pathways underpins disease etiology, nor generated biomarkers with systematically demonstrated reproducibility on independent patient cohorts across multiple indications 27 .
There is thus an urgent need to generate accurate and actionable biomarkers that integrate diverse molecular, functional and clinical information. We developed a subnetwork-based approach, called SIMMS, which uses arbitrary molecular data types to identify dysregulated pathways and create functional biomarkers. We validate SIMMS across 5 tumour types and 11,392 patients, using it to create biomarkers from a diverse range of molecular assays and uncovering unanticipated pan-cancer similarities.

SIMMS prioritization of candidate prognostic subnetworks
SIMMS acts upon a collection of subnetwork modules, where each node is a molecule (e.g. a gene or metabolite) and each edge is an interaction (physical or functional) between those molecules. Molecular data is projected onto these subnetworks using topology measurements that represent the impact of and synergy between different molecular features. To allow modeling of biological processes with different network architectures, we devised three network topology measurements: (nodes/molecules only), E (edges/interactions only) and N+E (nodes and edges). While the N model assumes independent and additive effects of parts of a subnetwork, the E and N+E models incorporate the impact of dysregulated interactions (Online Methods). SIMMS fits one of these models and computes a 'module-dysregulation score' (MDS) for each subnetwork that measures their strength of association with a specific disease, phenotype or outcome (Supplementary Figure 1).

Characteristics and benchmarking of SIMMS identified prognostic subnetworks
A key challenge faced by translational research is to extend the single gene biomarkers paradigm to clinically actionable metagenes/pathways. Thus, we tested the prognostic value of pathway-derived subnetworks using Cox modeling to quantify how effectively a subnetwork stratifies patients into groups with differential risk (Online Methods). SIMMS can use any network, and we chose to evaluate it using 449 gene-centric pathways from the high-quality, manually-curated NCI-Nature Pathway Interaction database 28 . These pathways comprise 500 non-overlapping subnetworks (Supplementary Table 1 This association between subnetwork size (number of genes) and prognostic power was consistent in breast, NSCLC and ovarian cancers, even though different pathways were altered in each but not in colon cancers. Second, the prognostic ability of Model N was significantly superior to that of Model N+E and E; a trend which was maintained across all four cancer types (one-way ANOVA, Tukey HSD multiple comparison test; Supplementary Figure 5). This suggests that mRNA abundance of functionally-related genesets alone is a strong predictor of patient outcome. We therefore focused solely on Model N moving forward, while recognizing that in other diseases different regulatory architectures may be disrupted.
Next we compared how SIMMS subnetwork scores perform against five wellknown machine learning algorithms treating genes as individual features in multivariate setting (Supplementary Results section 2). SIMMS identified an equal or significantly greater number of prognostic subnetworks compared to models based on genes in each of these subnetworks for these methods (P<0.01, proportion test; Figures 1a-d).

Multi-cancer analysis reveals recurrently dysregulated subnetworks
We next quantitatively determined the similarity between different tumour types at the pathway level. Cross disease assessment of significantly prognostic subnetworks (P<0.05) revealed well-known oncogenic pathways such as Aurora Kinase A and B signaling, apoptosis, DNA repair, RAS signaling, telomerase regulation and P53 activity in all four tumour types (Supplementary Tables 6-9).
By limiting to highly prognostic subnetworks (|log 2 HR|>1.5 and P<0.05) in each tumour type, 17 recurrently prognostic subnetworks (at least three tumour types) were identified (Figure 1e, Supplementary Figure 6). Significant overlap between prognostic subnetworks was observed for breast, colon and NSCLC (14 subnetworks: P overlap =0.009, 10 5 permutations; Figure 1f). These results can inform prospective clinical trials on repurposing strategies of anti-cancer drugs targeting the pathways underlying these subnetworks.
In breast cancer, subnetworks modules encompassing proliferation pathways (Mitosis, PLK1, AURKA and AURKB) were highly prognostic (Supplementary Table 6b). To ensure these are not driven by common proliferation genes, we tested gene overlap in these subnetworks and found them highly divergent Spearman's ρ=-0.59, P < 2.2 x 10 -16 ) (Figures 2e-f), both of which were associated with good outcome. Consistent with a recent breast cancer study 31 , naïve immune and stromal content estimates were only weakly associated with patient outcome (Supplementary Figures 7b-e), whilst SIMMS's MDS of T-cell receptor signaling not only provides accurate identification of patients who may benefit from immunotherapy but also indicates associated molecular targets.

Subnetwork-based biomarkers predict patient outcome
As SIMMS accurately identified individual prognostic subnetworks, we hypothesized that modeling of multiple aspects of tumour biology through these subnetwork into a single molecular biomarker could better rationalise patient heterogeneity emerging from alternative pathways of disease progression. First, to initialise the number of subnetworks, 10 million random sets of subnetworks of different sizes (1 to 250) were generated. These were tested for prognostic potential in a multivariate Cox model, thereby generating an empirical null distribution which allowed us to select the optimal number of pathways that  Table 14). Therefore SIMMS provided a consistent and unified approach to generating highly accurate biomarkers.
Focusing on breast cancer as a disease with well-defined molecular subtypes, we tested SIMMS on Metabric breast cancer cohort (n=1,970) 29 . Our prognostic classifier revealed two primary patient clusters with distinct pathway activities.
These clusters were highly correlated with the PAM50 32 intrinsic subtypes of breast cancer (f-measure=0.81; Figure 4a). Since breast cancer is a heterogeneous disease with distinct molecular and clinical characteristics 32 , we asked whether SIMMS could identify subtype-specific prognostic markers. To evaluate this, we classified patients into PAM50, ER+ and ER-subtypes and created SIMMS classifiers for each subtype. SIMMS classifiers were able to identify subgroups of patients at a significantly higher risk of relapse (P<0.05) in each of the Luminal-A, Normal-like and ER+ subtypes (Figure 4b). Importantly, these subgroups of patients present differential pathway activity (as quantified by SIMMS), and hence may benefit from aggressive/alternative treatments targetting these pathways. We further validated the efficacy of SIMMS when trained and tested for reproducibulity across different genomic platforms  Figure 19). Taken together these results demonstrate that pathway-driven subnetwork modelling can flexibly integrate diverse assays emerging from multiple platforms.

A PIK3CA signaling residual risk predictor in early breast cancer
While the public data used to evaluate SIMMS is valuable, it does not closely represent that used in clinical settings. To better represent this scenario, we focused on the PI3K-signaling pathway, which is frequently mutated in breast cancer and is the subject of several targeted therapies. We evaluated 1,734 samples from the Phase III TEAM clinical trial and measured mRNA abundance of 33 PI3K signaling genes in clinically-relevant FFPE samples. All samples were ER positive "luminal" breast cancers from the TEAM pathology study 33 (Supplementary Table 15

PIK3CA signaling modules outperform existing markers
To benchmark SIMMS's PI3K modules signature against current clinicallyvalidated approaches, we compared its performance to a clinically-used proteinbased residual risk predictor, IHC4 34 . IHC4 was assessed using quantitative IHC measurements of ER, PgR, Ki67 and HER2 35

Discussion
Patients with complex human diseases present highly heterogeneous molecular profiles, ranging from a few aberrant genes to a set of dysregulated pathways.
Because many different molecular aberrations can give rise to a single clinical phenotype, the importance of generating multi-modal datasets is increasingly appreciated 8,29,37,38 . Indeed, a single whole-genome sequencing experiment generates information about single nucleotide variations, copy number aberrations and genomic rearrangements. SIMMS puts this molecular variability into the context of existing knowledge of biological pathways using subnetwork information. Several other groups have considered the value of network models in predicting breast cancer outcome 11,12,39 and in subtyping glioblastoma 40 .
However no such tools have yet been developed to be generalizable to a broad range of diseases or to arbitrary topological measures that might be used to estimate interaction weights in network-models of biology 41,42 or to work with physical, functional, transcriptional or metabolic networks 43,44 . SIMMS provides this generalizability and flexibility by treating molecular profiles as generic featuers and not just genes.
Most previous biomarker studies have focused on establishing biomarkers using mRNA abundance profiles, with pathway-level analysis used post hoc to characterize the most interesting genes [45][46][47] . Our approach inverts this strategy, taking known pathways a priori and thus creating immediately interpretable and clinically actionable biomarkers 13 . For example, our PIK3CA risk predictor Precision molecular medicine is predicated on the concept of giving each patient the right drug in the right dose at the right time. This type of personalized treatment requires the development of robust biomarkers that precisely predict clinical phenotypes. Current clinical biomarkers are typically derived from a small number of genes, and do not yet recapitulate the full complexity of disease. SIMMS takes a step towards integrating diverse cellular processes into a singular model, and is well-positioned to take into account the influx of clinical sequencing data now being generated. However, as -omic techniques evolve to rapidly analyze and quantify cellular metabolites, network models may need to change from being gene-centric to including metabolites as core nodes. Further, single-cell analysis methods may allow accurate interrogation of the interactions between different cell-types, perhaps requiring simultaneous fitting of multiple distinct, but interacting network models. The continued development of robust, general biomarker discovery algorithms is thus required to generate the accurate and reproducible biomarkers needed for transforming medical care.

Pathways data-preprocessing
The pathway dataset was downloaded from the NCI-Nature Pathway Interaction database 28 in PID-XML format (Supplementary Table 1). The XML dataset was parsed to extract protein-protein interactions from all the pathways using custom Perl (v5.8.8) scripts (Supplementary File 1). The protein identifiers extracted from the XML dataset were further mapped to Entrez gene identifiers using Ensembl BioMart (version 62). Whereever annotations referred to a class of proteins, all members of the class were included in the pathway, in some case using additional annotations from Reactome and Uniprot databases. The protein-protein interactions, once mapped to the Entrez gene identifiers, were grouped under respective pathways for subsequent processing. The initial dataset contained 1,159 variable size subnetworks (Supplementary Figure 2ab). In order to identify redundant subnetworks, we tested the overlap between all pairs of subnetworks. When a pair of subnetworks had a two-way overlap above 80% (if two modules shared over 80% their network components; nodes and edges), we eliminated the smaller module. Additionally, all subnetworks modules containing less than 3 edges were excluded. In total, these criteria removed 659 subnetworks, resulting in 500 subnetworks.

mRNA abundance data pre-processing
All pre-processing was performed in R statistical environment (v2.13.0). Raw datasets from breast, colon, NSCLC and ovarian cancer studies  Results section 4).

TEAM study population
The TEAM trial is a multinational, randomized, open-label, phase III trial in which postmenopausal women with hormone receptor-positive luminal 53

Univariate data analyses
In order to avoid dataset-specific bias, all included studies were analyzed independently (Supplementary Table 2). First, each dataset was pre-processed independently, as described in the 'mRNA abundance data preprocessing' section above. Next, genes across all the datasets were evaluated for their prognostic power using a univariate Cox proportional hazards model followed by the Wald-test (R package: survival v2. . For breast, NSCLC and ovarian cancers with different survival end-points, overall survival (OS) was used as the survival time variable; for the studies that did not report OS, we used the closest alternative endpoint available in that study (e.g. diseasespecific survival or distant metastasis-free survival). For colon cancer, all studies reported relapse/disease free survival and hence was used as survival endpoint. All the genes were subsequently ranked by the Wald-test p-value within each study. The top genes across all studies were compared on multiple criterion (Supplementary Results section 1):

-Rank Product
The Rank Product 56 of each gene was computed as: Here k represents the number of studies which had the mRNA abundance measure available for gene g. r i is the rank of gene g in study i. The overall ranking table was used as a benchmark to identify datasets in which a given gene was ranked farthest when its rank product was compared to studywise ranks. The farthest dataset count was computed for the overall top ranked (100, 200, 300,…, 1000, 2000) genes (Supplementary Figure 3a-e).

-Percentile ranks
The p-value (Wald-test) based ranking was transformed into percentile ranks within each study. These ranks were used as a measure of gene's position with reference to the benchmark rank derived in the step 1 to evaluate deviation of genes' ranks for each study (Supplementary Figure 3f-l).
The mRNA abundance profiles of common genes across all studies were extracted and patient wise Spearman rank correlation coefficient was estimated (R package: stats v2.13.0). The correlation coefficient was used to further analyze intra-and inter-study correlation in order to identify any outlier studies ( Supplementary Figure 3j-l).

Eliminating redundant mRNA profiles (breast cancer data)
The Spearman rank correlation coefficient was also used to establish a nonredundant set of patients. This is important not only to identify any patients that might have participated in more than one study or duplicate data used in multiple papers, but also to train a robust model thereby preventing model over-

1-Gene hazard ratio
The hazard ratio for all the genes by combining samples from all the training datasets was estimated using the univariate Cox proportional hazards model.
The Cox model was fit to the median dichotomized grouping of mRNA abundance profiles of the samples as opposed to continuous measure of mRNA abundance.

2-Interaction hazard ratio
The hazard ratio for all the protein-protein interactions gathered from the NCI-Nature pathway interaction database were estimated using a multivariate Cox proportional hazards model. A Cox model, shown below, was fit to median dichotomized patient grouping of each of the interacting gene pairs: where X G1 and X G2 represent patient's group for gene 1 and gene 2. X G1.G2 represents patient's binary interaction measure between the gene 1 and gene 2, as shown below: where Å represents exclusive disjunction between the grouping of each gene.
The expression encodes XNOR boolean function emulating true (1) whenever both the interacting genes belong to the same group.

Subnetwork module-dysregulation score (MDS)
The pathway-based subnetworks were scored using three different models.
These models compute a module-dysregulation score (MDS) by incorporating the hazard ratio of nodes and edges that form the subnetwork:

3-Edges only
where n and e represent total number of nodes (genes) and edges (interactions) in a subnetwork respectively. HR represents the hazard ratios of genes and the protein-protein interactions in a subnetwork (P < 0.05) (section: Meta-analysis).
The subnetworks were ranked according to their MDS, thereby identifying candidate prognostic features.

Patient risk score
The subnetwork MDS was used to draw a list of the top n subnetwork features for each of the three models (section: Subnetwork module-dysregulation score).
These features were subsequently used to estimate patient risk scores using Model N+E, N and E. The patient risk score for each of the subnetworks (risk SN ) was expressed using the following models:

-Edges only
where n and e represent the total number of nodes (genes) and edges (interactions) in a subnetwork (SN), respectively. HR is the hazard ratio of genes and the protein-protein interactions (P < 0.5; only to filter out genes where Cox model fails to fit estimating large/unstable coefficients) (section: Meta-analysis) in a subnetwork. x and y are the two nodes connected by an edge e j and ω is the scaled intensity of an arbitrary molecular profile (e.g. mRNA abundance, copy number aberrations, DNA methylation beta values etc).
A univariate Cox proportional hazards model was fitted to the training set, and applied to the validation set for each of the subnetworks. The prognostic power of all three models was compared using non-parametric two sample Wilcoxon rank-sum test (R package: stats v2.13.0).

Subnetwork feature selection
In order to narrow down the size of subnetwork features in each of the three models yet maintaining the prognostic power, we fitted a Cox model based Generalised Linear Model (L1-penalty) in 10-fold cross validation setting on the training cohort (R package: glmnet v1.9-8). SIMMS package supports additional machine learning algorithms including elastic-nets (ridge to lasso), backward elimination and forward selection (R package: MASS v7. [3][4][5][6][7][8][9][10][11][12]. The fitted coefficients ( b ) were subsequently used to estimate an overall measure of per patient risk score for the validation set using the following formula: where Y ij is the i th patient's risk score for subnetwork j. The training set HRs of the nodes and edges were used to compute Y ij (section: Patient risk score).
Next, we median dichotomized the validation cohort into low-and high-risk patients (or quartiles) using the median risk score (or quartiles) estimated on the training set. The risk group classification was assessed for potential association with patient survival data using Cox proportional hazards model and Kaplan-Meier survival analysis.

Randomization of candidate subnetwork markers
Jackknifing was performed over the subnetwork marker space for four tumour types; breast, colon, NSCLC and ovarian.     Tables 10-13). P values were estimated using Wald-test. Illumina dataset (Metabric) was used for subtype-specific models. For these, the Metabric-published training and validation cohorts were maintained for training and validation purposes. Numbers in parenthesis indicate the size of the validation cohort. Asterisks represent statistical significance of differential outcome between the predicted low-and high-risk groups (* P<0.05, ** P<0.01, *** P<0.001, Wald-test).