Pathway-based subnetworks enable cross-disease biomarker discovery

Biomarkers lie at the heart of precision medicine. Surprisingly, while rapid genomic profiling is becoming ubiquitous, the development of biomarkers usually involves the application of bespoke techniques that cannot be directly applied to other datasets. There is an urgent need for a systematic methodology to create biologically-interpretable molecular models that robustly predict key phenotypes. Here we present SIMMS (Subnetwork Integration for Multi-Modal Signatures): an algorithm that fragments pathways into functional modules and uses these to predict phenotypes. We apply SIMMS to multiple data types across five diseases, and in each it reproducibly identifies known and novel subtypes, and makes superior predictions to the best bespoke approaches. To demonstrate its ability on a new dataset, we profile 33 genes/nodes of the PI3K pathway in 1734 FFPE breast tumors and create a four-subnetwork prediction model. This model out-performs a clinically-validated molecular test in an independent cohort of 1742 patients. SIMMS is generic and enables systematic data integration for robust biomarker discovery.


Supplementary Tables 3-5
List of colon 14-17, NSCLC 1,18-22 and ovarian 1,23-27 cancer studies used for training and validation of prognostic models using SIMMS. Studies within each cancer type were divided into training and independent validation cohorts while keeping cohort sizes and platforms as homogeneous as possible; remaining heterogeneity was in part addressed through 10-fold cross-validation and permutation analyses. Datasets referred to as colon cancer cohorts in this study contained both colon and colorectal cancer patients.

Supplementary Table 6
Hazard ratios (with 95% CI, P values, size of the validation cohort and q values; Benjamini & Hochberg method) for risk-score-based classification of patients in breast cancer. Tables 6a, 6b and 6c represent Model N+E (nodes and interactions/edges), Model N (nodes only) and Model E (interactions/edges only), respectively. A univariate Cox proportional hazards model was fitted to each of the subnetwork markers and subsequently applied to predict patient risk score in the validation cohort. The survival differences between the median-dichotomised risk scores (low- and high-risk groups) were assessed using Kaplan-Meier analysis. Table 6d shows univariate Cox model coefficients (log2(HR)) of genes involved in the T-cell receptor signalling subnetwork, which were used to estimate the per-patient risk score. Genes highlighted in green (Wald-test P < 0.5) were selected to contribute toward risk score estimation.
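The risk-score step described in this caption can be illustrated with a minimal Python sketch. It assumes the per-patient score is a coefficient-weighted sum of gene abundances over the genes passing the Wald-test threshold; that functional form, the gene names (geneA, geneB, geneC) and all numbers are hypothetical and are not taken from Table 6d.

```python
from statistics import median

def risk_scores(expression, coefficients, p_values, p_threshold=0.5):
    """Coefficient-weighted sum of abundances over genes with P < p_threshold.

    expression: dict mapping patient -> dict mapping gene -> abundance.
    coefficients / p_values: dicts mapping gene -> Cox coefficient / Wald P.
    """
    selected = [g for g in coefficients if p_values[g] < p_threshold]
    return {
        patient: sum(coefficients[g] * profile[g] for g in selected)
        for patient, profile in expression.items()
    }

def median_dichotomize(scores):
    """Label patients 'high' if their score exceeds the cohort median, else 'low'."""
    cutoff = median(scores.values())
    return {p: ("high" if s > cutoff else "low") for p, s in scores.items()}

# Hypothetical toy data (not from the study).
coefficients = {"geneA": 0.8, "geneB": -0.5, "geneC": 0.1}
p_values = {"geneA": 0.01, "geneB": 0.2, "geneC": 0.9}  # geneC is filtered out
expression = {
    "p1": {"geneA": 1.0, "geneB": 2.0, "geneC": 5.0},
    "p2": {"geneA": 2.0, "geneB": 0.0, "geneC": 1.0},
    "p3": {"geneA": 0.5, "geneB": 1.0, "geneC": 0.0},
    "p4": {"geneA": 3.0, "geneB": 1.0, "geneC": 2.0},
}

scores = risk_scores(expression, coefficients, p_values)
groups = median_dichotomize(scores)
print(groups)
```

In this sketch the cutoff is computed on the cohort being scored; whether the study used a training-derived or validation-derived median is not stated in the caption.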

Supplementary Tables 7-9
Hazard ratios (with 95% CI, P values, size of the validation cohort and q values; Benjamini & Hochberg method) for risk-score-based classification of patients in colon, NSCLC and ovarian cancer (Table 7a-c: colon, Table 8a-c: NSCLC and Table 9a-c: ovarian). As for Supplementary Table 6, a univariate Cox proportional hazards model was fitted to each of the subnetwork markers and subsequently applied to predict patient risk score in the validation cohort. The survival differences between the median-dichotomised risk scores (low- and high-risk groups) were assessed using Kaplan-Meier analysis. Table 9d contains co-occurrence analysis of platinum responders/nonresponders and SIMMS-predicted risk groups in the TCGA ovarian cancer cohort. Previously published data on treatment response were used 28.

Supplementary Tables 10-13
List of subnetwork modules following feature selection performed through Cox model using generalized linear models with LASSO (L1-regularization) in breast (

Supplementary Table 15
REMARK 29 information for the TEAM study, describing the total number of samples recruited, eligible and analysed.

Supplementary Table 16
Univariate prognostic assessment of mRNA abundance profiles. mRNA abundance profiles in the TEAM training cohort were median-dichotomized into low- and high-risk groups, except for ERBB2 (HER2). ERBB2 dichotomization was performed using expectation-maximization clustering. DRFS was used as the survival end point. A Cox proportional hazards model was used to estimate the hazard ratios, followed by the Wald test for the significance of the difference between the risk groups. P values were corrected for multiple comparisons using the Benjamini & Hochberg method. The varying n across cohorts is an artefact of normalisation/log2 transformation of zero (0 abundance) for some patients.
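For readers unfamiliar with the multiple-comparison correction named in this caption, the Benjamini & Hochberg step-up adjustment can be sketched in a few lines of Python. The function name and the example P values are illustrative only and are not the study's data.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini & Hochberg adjusted P values (q values), in input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity of the q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Illustrative P values (not from the study).
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(q_values)
```

Each raw P value is scaled by m/rank, and a running minimum from the largest P value downward keeps the adjusted values non-decreasing in the original ordering.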

Supplementary Table 17
Univariate prognostic assessment of clinical variables and mutational profiles in the TEAM training cohort. DRFS was used as the survival end point. A Cox proportional hazards model was used to estimate the hazard ratios. The significance of association between DRFS and dichotomous variables (age, HER2 status, and mutational profiles) was assessed using the Wald test, whereas the log-rank test was used for multi-category variables (grade, T-stage and N-stage). Prognostic assessment of grade and stage was conducted such that grade 2 and 3 patients were compared against the baseline grade 1; N Stage 1, 2 and 3 were compared against N Stage 0 (node-negative); and T Stage 2 and 3 were compared against the baseline T Stage 1. For mutational profiles, wild-type carriers were compared against mutated samples for a given gene (HER2, PIK3CA, AKT1 and RAS).

Supplementary Table 18
List of PIK3CA pathway modules and corresponding genes. Modules were derived on the basis of underlying biological functionality.

Supplementary Table 19
Multivariate prognostic model derived from PIK3CA modules. Model parameters were estimated using a multivariate Cox proportional hazards model initialized with eight mRNA modules (Supplementary Table 18), age, grade, pathological size and N-stage. The model was further refined using backwards elimination, resulting in the variables presented in this table. The model was trained using the TEAM training cohort. The refined model was subsequently used to predict patient risk score in the TEAM validation cohort. Survival differences between the median-dichotomized risk scores, as well as between quartiles of the risk scores, were assessed using Kaplan-Meier analysis.
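The Kaplan-Meier analysis referred to in this and earlier captions rests on the product-limit estimator, sketched below in plain Python for a single risk group. The function name and the follow-up times are hypothetical; the study's own analysis would also compare the curves between groups, which this sketch does not do.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate at each distinct event time.

    times: follow-up time per patient.
    events: 1 if the event (e.g. distant recurrence) was observed, 0 if censored.
    Returns a list of (time, survival probability) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after time t.
        at_risk -= removed
    return curve

# Hypothetical follow-up data for one risk group (not from the study).
curve = kaplan_meier(times=[1, 2, 2, 3, 4], events=[1, 1, 0, 1, 0])
print(curve)
```

Applying the same function separately to the low- and high-risk groups (or to each quartile) yields the curves that a log-rank or Cox-based test would then compare.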

Supplementary Table 20
Distribution of patients' tumour and clinical characteristics in randomly assigned Training and Validation cohorts. Numbers in parentheses indicate the relative proportion within each group. Unequal distribution of patient characteristics across the randomly assigned Training and Validation cohorts was tested using Fisher's exact test, followed by adjustment of P values for multiple comparisons. Patients within the pathology research study were matched to the overall TEAM trial cohort, as described in a previous publication 30.
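As an illustration of the balance test named in this caption, a two-sided Fisher's exact test for a 2x2 table can be sketched with the Python standard library. The restriction to 2x2 tables is an assumption made for brevity (several characteristics in the table have more than two categories), and the counts below are illustrative rather than taken from the study.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)

# Illustrative counts: rows are cohorts (Training, Validation),
# columns are presence/absence of a characteristic.
p = fisher_exact_2x2(3, 1, 1, 3)
print(round(p, 4))
```

The resulting P values, one per characteristic, would then be adjusted for multiple comparisons as stated in the caption.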