Personalized Integrated Network Modeling of the Cancer Proteome Atlas

Personalized (patient-specific) approaches have recently emerged with a precision medicine paradigm that acknowledges the fact that molecular pathway structures and activity might be considerably different within and across tumors. The functional cancer genome and proteome provide rich sources of information to identify patient-specific variations in signaling pathways and activities within and across tumors; however, current analytic methods lack the ability to exploit the diverse and multi-layered architecture of these complex biological networks. We assessed pan-cancer pathway activities for >7700 patients across 32 tumor types from The Cancer Proteome Atlas by developing a personalized cancer-specific integrated network estimation (PRECISE) model. PRECISE is a general Bayesian framework for integrating existing interaction databases, data-driven de novo causal structures, and upstream molecular profiling data to estimate cancer-specific integrated networks, infer patient-specific networks and elicit interpretable pathway-level signatures. PRECISE-based pathway signatures, can delineate pan-cancer commonalities and differences in proteomic network biology within and across tumors, demonstrates robust tumor stratification that is both biologically and clinically informative and superior prognostic power compared to existing approaches. Towards establishing the translational relevance of the functional proteome in research and clinical settings, we provide an online, publicly available, comprehensive database and visualization repository of our findings (https://mjha.shinyapps.io/PRECISE/).

were additional new findings: for example, signatures of Core reactive, TSC/mTOR and Hormone receptor pathways were selected for a rare cancer, Lymphoid Neoplasm Diffuse Large B-cell Lymphoma (DLBC) by using the criteria, that were not found by our previous FDR-adjustment (see Supplementary Figure S21 for boxplots of the p-values of associations of 1000 random signatures). Because the two approaches provide distinct information on the prognostic power, we proved an additional table for the union set of findings using FDR and random signature approaches (Supplementary Table S9).

Section S1.2 Pan-cancer multiple `omics data and Preprocessing
Using TCGA-assembler 1 , we downloaded mRNA expression, microRNA expression, DNA methylation data, and clinical data from TCGA Data Coordinating Center. For mRNA expression data, RNASeqV2 data (generated by illuminaga or iluminahiseq) were downloaded and preprocessed using `DownloadRNASeqData' and `ProcessRNASeqData' functions for gene-level expression. For microRNA expression data, miRNASeq data (generated by illuminaga or illuminahiseq) were downloaded using the `DownloadmiRNASeqData' function. Using the `ProcessmiRNASeqData' function, we processed the miRNASeq data, using the Hg19 reference genomes to map the reads. Then we annotated the microRNAs to genes using the `microRNA' package in R version 3.2.0. DNA methylation data (generated by Illumina Infinium HumanMethylation 27K or 450K) were downloaded and preprocessed using `DownloadMethylationData', `DownloadMethylation27Data' and `DownloadMethylation450Data'. For cancer types that had both types of assays, the overlapping features between both assay types were used. Then we applied ComBat 2 to adjust for the known batch effects using an empirical Bayes framework. We assumed that DNA methylation affects protein expression levels by influencing mRNA expression. For a gene, we selected a CpG site by computing the absolute correlations between the mRNA expression and CpG sites outside the `gene body' region. We decomposed the expression changes (!) of each gene into two components, which are explained by methylation (") and mechanisms other than methylation: ! = ! $ + ! $ , ! $ = '", where ' is a regression coefficient that represents the `gene methylation' effect. A CpG site corresponding to a gene is chosen by regression on the combined data, including data from 6,844 patients across all 32 lineages.

Section S1.4 Calculation of concordance score
For each cancer type, we divided the whole data set into two parts, a training data set and a test data set. Using the test data, we predicted the protein expression value for each node under the model evaluated from the training data. We computed the concordance score as follows: for each cancer and each pathway, where 0 HIJH is the number of samples in the test data, ) is the number of proteins (nodes) in the pathway, and L .6 is the observed value for the / HX test sample and Y HX protein. We based our analysis on 10fold cross-validation.

Section S1.5 Calculation of permutation p-values for network connectivity scores
We investigated on the relationship between sample sizes and the connectivity score (CS), which was the constructed as the ratio of the observed number of edges in the network to the total number of possible edges (see Supplementary Figure S24 (left panel)). We did find a marginal positive correlation between sample size and CS with Spearman correlation, 0.55 --which indicates that CS values are (somewhat) driven by the sample sizes, although there is some variability. This is not surprising given the fact that we have more statistical power to detect true network edges as sample size increases.
To investigate this further, we generated the null CS distribution for a given pathway and tumor type as follows. Briefly, for each tumor type and pathway, we randomly select the same number of proteins in a given pathway while the covariates from upstream platforms (mRNA, microRNA and Methylation) are matched to the this randomly selected set of proteins. After constructing networks from 1000 random permutations of the proteins, a null distribution of CS is obtained for the pathway and cancer type. For the hypothesis, that a pathway shows high level of cross-signaling than expected at random, we compute the permutation p-value, is the CS value obtained from /th permutation and [\ is the CS value from the data. These ) ZJ values were not correlated with sample size (Supplementary Figure S24, right panel) and thus provide an inherent sample size adjustment.

Section S1.6 Robustness of cancer-specific integrative networks
We have validated the robustness of our cancer-specific integrative networks by comparing with a recent method OncoPPI (Li et al., 2017) and a resampling method. The OncoPPI network (Li et al., 2017) expands the lung-cancer associated protein interaction landscape for discovery of novel cancer targets and connects tumor suppressors to available drugs, offering an experimental resource for exploitation of PPI-mediated cancer vulnerabilities. In their analysis, 83 genes were selected based on frequency of alterations in lung cancer and known involvement in cancer signaling pathways and only 12 genes were overlapped with the genes used for our analysis. We applied OncoPPI (http://oncoppi.emory.edu) to the 12 overlapping genes (see Supplementary Figure S25 for OncoPPI sub-network for the 12 genes). The PRECISE networks for LUAD and LUSC are displayed in Supplementary Figure S26 and showed the same edge structure between the two tumor types, except for the directed and correlative edges between AKT1, AKT2, AKT3 and PTEN in LUAD and LUSC samples, respectively. The OncoPPI network had two edges AKT1-RAF1 and AKT1-MAPK14 that were not included in the PRECISE network (PRECISE network had a correlative edge between RAF1 and MAPK14). We could conclude that the two approaches provide distinct networks and highlight several differences in the methodologies: (1) We estimate PPI network from observational data integrating upstream data such as mRNA expression, microRNA expression and DNA methylation. Thus, the PPI network we obtain is a conditional network after adjusting for regulators from upstream data as well as protein regulators in a pathway. On the other hand, OncoPPI is based on a PPI detection platform that identifies marginal direct interactions. (2) Our approach uses pathway-based procedure with the two main reasons: (1) for conditional networks (see item (1)), adjusting for only the regulators within pathways is an efficient strategy to use prior knowledge; and (2) we have a final aim of measuring individual-specific pathway activities and finding pathway-based prognostic models. Note that OncoPPI selected a set of 83 genes based on frequency of alterations in lung cancer and known involvement in cancer signaling pathways, and obtained marginal interaction data for each pair of the proteins. (3) OncoPPI uses the previously reported PPIs by hard-thresholding, edge inclusion/exclusion (Figure 1c of Li et al., 2017). Our PRECISE estimation method for cancer-specific networks conflates Bayesian variable selection and causal structure learning. Thus, if the data does not support a particular a priori network structure, it will exhibit low posterior probabilities in the final estimands and will thus be naturally filtered out. In addition, instead of performing hard-thresholding (e.g. OncoPPI) on known PPIs, we use soft-thresholding approach by calibrating the priors of regressions based on the combined interaction scores from data-driven de novo causal structure and existing PPI information.
For assessing robustness of our estimated protein network, we selected TSC/mTOR pathway for LUAD patients that had a significant connectivity score (CS), 0.8 with the permutation p-value, 0.044 with 7 correlative edges and one directed edge. To make a sub-sample set, 285 samples (80%) are randomly sampled from the 356 LUAD samples after matching RPPA, mRNA expression, miRNA expression and DNA methylation data. For each sub-sample set, we built LUAD-specific TSC/mTOR pathway using our PRECISE algorithm. We investigate the stability of an edge in the LUAD-specific TSC/mTOR network across 100 sub-samples by computing the proportion of the sub-sample sets that produced the network with the edge (Supplementary Table S9). 6 out of 8 edges were kept across all 100 sub-sample sets. For the correlative edge, RB1-RPS6KB1 was conserved across 48 sets, 23 had the directed edge, RB1-> RPS6KB1 (23 sets selected RB1 as a covariate of RPS6KB1 and did not select RPS6KB1 as a covariate of RB1 in the Bayesian regressions), and other 29 sets selected no edges between RB1 and RPS6KB1. The directed edge RPS6->RB1 is conserved for 88 sub-sample sets.

Section S1.7 Normalized mutual information
To compare the clustering performance, we computed the normalized mutual information (NMI) scores to evaluate the dependence between clusters and tumor types for 0 samples. Given two partitions, b = {b * , … , b e } and g = {g * , … , g e }, the contingency table that represents the overlap between b and g is denoted by [ = [ .6 e×h , where [ .6 is the number of samples that the clusters b . and g 6 share. The NMI score between b and g are defined as follows: NMI is a quantitative measure of the overlap between two clusters, taking values from 0 to 1 --with values close to 0, when the two clusters are totally dissimilar, and 1 where they are identical.    [Supplementary Figure S15] Scatterplot of -log10 p-values of PRECISE versus PRECISE with no prior for BRCA patients. The vertical and horizontal lines indicate p-values at 0.05.