Many forms of cancer have multiple subtypes with different causes and clinical outcomes. Somatic tumor genome sequences provide a rich new source of data for uncovering these subtypes but have proven difficult to compare, as two tumors rarely share the same mutations. Here we introduce network-based stratification (NBS), a method to integrate somatic tumor genomes with gene networks. This approach allows for stratification of cancer into informative subtypes by clustering together patients with mutations in similar network regions. We demonstrate NBS in ovarian, uterine and lung cancer cohorts from The Cancer Genome Atlas. For each tissue, NBS identifies subtypes that are predictive of clinical outcomes such as patient survival, response to therapy or tumor histology. We identify network regions characteristic of each subtype and show how mutation-derived subtypes can be used to train an mRNA expression signature, which provides similar information in the absence of DNA sequence.
Cancer is a disease that is not only complex, i.e., driven by a combination of genes, but also wildly heterogeneous, in that gene combinations can vary greatly between patients. To gain a better understanding of these complexities, researchers involved in projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) are systematically profiling thousands of tumors at multiple layers of genome-scale information, including mRNA and microRNA expression, DNA copy number and methylation, and DNA sequence1,2,3. There is now a strong need for informatics methods that can integrate and interpret genome-scale molecular information to provide insight into the molecular processes driving tumor progression. Such methods are also of pressing need in the clinic, where the impact of genome-scale tumor profiling has been limited by the inability to derive clinically relevant conclusions from the data4,5.
One of the fundamental goals of cancer informatics is tumor stratification, whereby a heterogeneous population of tumors is divided into clinically and biologically meaningful subtypes as determined by similarity of molecular profiles. Most prior attempts to stratify tumors with molecular profiles have used mRNA expression data2,6,7,8,9, resulting in the discovery of informative subtypes in diseases such as glioblastoma and breast cancer. On the other hand, in TCGA cohorts including colorectal adenocarcinoma and small-cell lung cancer, subtypes derived from expression profiles do not correlate with any clinical phenotype including patient survival and response to chemotherapy2,10. These results might be due to limitations of expression-based analysis11 such as issues with RNA sample quality, lack of reproducibility between biological replicates and ample opportunities for overfitting of data.
A promising new source of data for tumor stratification is the somatic mutation profile, in which high-throughput sequencing is used to compare the genome or exome of a patient's tumor to that of the germ line to identify mutations that have become enriched in the tumor cell population12. As this set of mutations is presumed to contain the causal drivers of tumor progression13, similarities and differences in mutations across patients could provide invaluable information for stratification. Although individual mutations in cancer genes have long been used to stratify patients14,15,16,17, stratification based on the entire mutation profile has been more challenging. Somatic mutations are fundamentally unlike other data types such as expression or methylation, in which nearly all genes or markers are assigned a quantitative value in every patient. Instead, somatic mutation profiles are extremely sparse, with typically fewer than 100 mutated bases in an entire exome (Supplementary Fig. 1). They are also remarkably heterogeneous, such that it is very common for clinically identical patients to share no more than a single mutation2,18,19.
Here we report that these problems can be largely overcome by integrating somatic mutation profiles with knowledge of the molecular network architecture of human cells. It is widely appreciated that cancer is a disease not of individual mutations, nor of genes, but of combinations of genes acting in molecular networks corresponding to hallmark processes such as cell proliferation and apoptosis20,21. We postulated that, although two tumors may not have any mutations in common, they may share the networks affected by these mutations (as per Waddington's original theory of 'genetic canalization'22). Although current cancer pathway maps are incomplete, much relevant information is available in public databases of human protein-protein, functional and pathway interactions. An increasing number of studies have successfully integrated these network databases with tumor molecular profiles to map the molecular pathways of cancer23,24,25,26,27. Here we focus on the orthogonal problem of using network knowledge to stratify a cohort into meaningful subsets. Using this knowledge, we were able to cluster somatic mutation profiles into robust tumor subtypes that are biologically informative and have a strong association to clinical outcomes such as patient survival time and emergence of drug resistance. As a proof of principle, we applied this method to stratify the somatic mutation profiles of three major cancers cataloged in TCGA: ovarian, uterine and lung adenocarcinoma.
Overview of network-based stratification
NBS combines genome-scale somatic mutation profiles with a gene interaction network to produce a robust subdivision of patients into subtypes (Fig. 1a). Briefly, somatic mutations for each patient are represented as a profile of binary (1, 0) states on genes, in which a '1' indicates a gene for which mutation (a single-nucleotide base change or the insertion or deletion of bases) has occurred in the tumor relative to germ line. For each patient, we project the mutation profile onto a human gene interaction network obtained from public databases28,29,30. Next we apply network propagation31 to spread the influence of each mutation over its network neighborhood (Fig. 1b). The resulting matrix of 'network-smoothed' patient profiles is clustered into a predefined number of subtypes (k = 2, 3, ..., 12) via non-negative matrix factorization32 (NMF, Fig. 1c), an unsupervised technique. Finally, to promote robust cluster assignments, we use consensus clustering33, aggregating the results of 1,000 different subsamples from the entire data set into a single clustering result (Fig. 1d). For further details, see Online Methods. To evaluate the impact of different sources of network data, we used three interaction databases for this analysis: search tool for the retrieval of interacting genes (STRING)29, HumanNet28 and PathwayCommons30. Supplementary Table 1 summarizes the number of genes and interactions used in our analysis from each of these three networks. Our implementation of NBS is available as Supplementary Software; for updated versions, NBS may be downloaded from http://idekerlab.ucsd.edu/software/NBS/.
Benchmarking and performance analysis
In an initial exploration of NBS, we simulated a somatic mutation data set using the structure of the TCGA ovarian tumor mutation data and the STRING gene interaction network (Fig. 2a). Mutation profiles were permuted, and patients were divided randomly and uniformly into a predefined number of subtypes (k = 4). Next we reassigned a fraction of mutations in each patient to fall within genes of a single 'network module' characteristic of that patient's subtype (the 'driver' mutation frequency f, varied from 0% to 15%); the remaining mutations were left to occur randomly. We selected the network modules randomly from the set of all network modules in STRING, defined as sets of densely interacting genes with size range s = 10–250 (see Online Methods for details and justification for the ranges of k, f and s). Although it is unknown whether these assumptions completely mirror the biology of cancer, they provide a reasonable model of a pathway-based genetic disease that is (i) driven by genetic circuits corresponding to a molecular network whose activity can be altered by mutations at multiple genes and (ii) characterized by many additional mutations that are noncausal 'passengers'.
Using this simulation framework, we measured the ability of NBS to recover the correct subtype assignments in comparison to a standard consensus clustering approach not based on network knowledge (Online Methods). NBS showed a striking improvement in performance, especially for large network modules, as these can be associated with any of numerous different mutations across the patient population (Fig. 2b). As module size decreased, the chance of observing the same mutated gene in patients of the same subtype increased, and the standard clustering algorithm performed increasingly well. We found that the high performance of NBS depended not only on network smoothing but also on the NMF clustering approach; substitution of NMF with an alternative method such as hierarchical clustering resulted in relatively poor performance (Fig. 2b).
Next we investigated how NBS performance was affected as a function of mutation frequency (Fig. 2c). Standard consensus clustering was sufficient for stratification at high mutation frequencies and for small modules, for which there is substantial overlap in mutations among patients of the same subtype (Fig. 2d); however, NBS was able to accurately recover the correct subtypes for a much larger range of both variables. Applying NBS on a permuted network resulted in poor performance (Fig. 2e), which is on par with that observed with standard consensus clustering. These results were qualitatively similar when we used multiple network modules per patient (2–6) and/or a different network (Supplementary Fig. 2).
Network-based stratification of tumor mutations
We next sought to apply NBS to stratify patients profiled by TCGA full-exome sequencing for uterine, ovarian and lung cancers (see Online Methods for further details). In each of the three cancers, we observed that NBS resulted in robust subtype structure, whereas standard consensus clustering was unable to stratify the patient cohort (Fig. 3a for uterine cancer; Supplementary Figs. 3a and 4a for ovarian and lung cancers, respectively). Similar results were obtained when we used any of the three human networks (STRING, HumanNet and PathwayCommons).
To determine the biological importance of the identified subtypes, we investigated whether they were predictive of observed clinical data. In uterine cancer, NBS subtypes (Supplementary Table 2) were closely associated with the recorded subtype on a histological basis (Fig. 3b,c and Supplementary Fig. 5). Survival analysis was not possible owing to low mortality rates for this cohort. In ovarian cancer, the identified subtypes (Supplementary Table 3) were significant predictors of patient survival time (log-rank P = 1.59 × 10^−5; Fig. 3d,e and Supplementary Fig. 3b,c). Patients with the most aggressive ovarian tumor NBS subtype had a mean survival of approximately 32 months, compared to more than 80 months for those with the least aggressive NBS subtype (Supplementary Fig. 3d,e). Moreover, the NBS subtypes were predictive of survival independently of clinical covariates including tumor stage, age, mutation rate and residual tumor presence after surgery (Supplementary Fig. 6; likelihood ratio test, P = 3.75 × 10^−5) and were also predictive of time to relapse after treatment with platinum chemotherapy ('platinum-free interval') (Supplementary Fig. 3f), as measured using a Kaplan-Meier analysis of platinum-free survival34. Finally, in lung cancer the identified NBS subtypes (Supplementary Table 4) were also significant predictors of patient survival (log-rank P = 1.95 × 10^−6, Fig. 3f,g; median survival of 12 months versus approximately 50 months for the best-surviving subtype, Supplementary Fig. 4), with predictive value beyond known clinical covariates such as tumor stage, grade, mutation frequency, age at diagnosis and smoking status (likelihood ratio test, P = 3.3 × 10^−4). Stratification using a network in which the mapping between mutated genes and the network was permuted, which disrupted the relationship between mutations and network structure, resulted in degraded predictive performance (Fig. 3b,d,f).
We compared these results to subtypes derived from other data types in the TCGA, including copy-number variation (CNV), methylation, mRNA expression, microRNA expression and protein profiles. For ovarian cancer, all other data types had inferior ability to predict survival beyond what could be predicted from clinical covariates (Fig. 4a) and led to different subtype assignments than NBS (Fig. 4b). In lung cancer, both NBS subtypes and those based on RNA-seq had good predictive power (Fig. 4c) and had some overlap in terms of patient assignments (Fig. 4d), whereas other data types were not predictive of survival. In uterine cancer, subtypes derived from all data types were highly predictive of histology (Fig. 4e; CNVs had highest predictive power overall) and also had very high overlap with NBS subtype assignments (Fig. 4f).
Distinct network modules associate with each tumor subtype
We next sought to identify the regions of the network that are most responsible for discriminating the somatic mutation profiles of tumors of different subtypes. Focusing on ovarian cancer as a proof of principle, for each subtype we identified genes for which the network-smoothed mutation state differs significantly for patients of that subtype versus the others (false discovery rate <0.05; Online Methods). This set of genes was projected onto the HumanNet network and visualized using Cytoscape35. The network for subtype 1 (Fig. 5), which had the worst overall survival and shortest platinum-free interval, contained over 20 genes in the fibroblast growth factor (FGF) signaling pathway, which has previously been implicated as a driver of tumor progression and associated with resistance to platinum and anti-VEGF therapy36. The network for subtype 2 was enriched in DNA damage–response genes including ATM, ATR, BRCA1, BRCA2, RAD51 and CHEK2 (Supplementary Fig. 7). Collectively these highlighted pathways are characteristic of a functional deficit in response to DNA damage, which has been referred to as 'BRCAness'7,37. Consistent with this finding, this subtype also included the vast majority of patients with BRCA1 and BRCA2 germ-line mutations (15 of 20 and 5 of 6 patients in the cohort, respectively). The network for subtype 3 was enriched for genes in the NF-κB pathway (Supplementary Fig. 8), whereas subtype 4 was enriched for genes involved in cholesterol transport and fat and glycogen metabolism (Supplementary Fig. 9). A similar analysis in uterine and lung cancers produced other subnetworks with unique characteristics, including enrichments for DNA-damage response, WNT signaling and histone modification (Supplementary Figs. 10,11,12,13,14,15,16). Thus, the NBS approach not only can stratify patients into clinically informative subtypes but may help identify the molecular network regions commonly mutated in each subtype.
Translation to predictive signatures
For NBS to be applicable to new patients not in the TCGA, it is necessary to complement it with a procedure for assigning a patient to one of the existing NBS subtypes. For this purpose, we explored the nearest shrunken centroid approach38, a standard method for sample classification that summarizes each subtype with a class 'centroid' and assigns new samples to the subtype with closest centroid. We found that this method was able to classify the network-smoothed mutation profile of an individual patient with over 95% accuracy (Fig. 6a; tenfold cross-validation).
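The centroid-based assignment described above can be illustrated with a minimal sketch. It implements plain nearest-centroid classification only and omits the shrinkage step of the nearest shrunken centroid method, so it is a simplification rather than the procedure used in this study. The function names `fit_centroids` and `predict_subtype` and the toy data are hypothetical.

```python
import numpy as np

def fit_centroids(profiles, subtypes):
    """Average the (network-smoothed) profiles of each subtype into one centroid.

    profiles: patients x genes array; subtypes: one label per patient.
    NOTE: no shrinkage is applied here, unlike the nearest shrunken
    centroid method cited in the text.
    """
    labels = sorted(set(subtypes))
    subtypes = np.asarray(subtypes)
    centroids = np.stack([profiles[subtypes == c].mean(axis=0) for c in labels])
    return labels, centroids

def predict_subtype(new_profiles, labels, centroids):
    """Assign each new profile to the subtype with the closest centroid."""
    dist = np.linalg.norm(new_profiles[:, None, :] - centroids[None, :, :], axis=2)
    return [labels[i] for i in dist.argmin(axis=1)]

# Toy example: four training patients, two genes, two subtypes.
train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
train_subtypes = [1, 1, 2, 2]
labels, centroids = fit_centroids(train, train_subtypes)
assigned = predict_subtype(np.array([[0.2, 0.3], [5.0, 6.0]]), labels, centroids)
```

In this toy case the two new profiles fall next to the subtype 1 and subtype 2 centroids, respectively.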
However, mRNA expression data are presently much more widely available than are full genome or exome sequences: there are numerous existing cohorts of cancer patients that have been profiled in mRNA expression but not in somatic mutations7,39,40,41,42. We therefore sought to test whether, having used NBS to define subtypes within TCGA somatic mutation data, we could assign a new patient to these subtypes using an expression signature. To explore this idea, we used the mRNA expression profiles available for the TCGA ovarian tumor cohort to learn an expression signature for each subtype defined earlier by NBS, again using the nearest shrunken centroid approach38. We found that expression performed as an adequate surrogate for mutation profile, albeit at a reduced accuracy (Fig. 6a; >95% for mutations, ∼60% for expression and ∼30% at random). This expression signature was nonetheless able to recover stratification predictive of survival (Fig. 6b).
We examined the predictive value of this gene expression signature in two independent studies of serous ovarian tumors by Tothill et al.40 and Bonome et al.42 as well as in a meta-analysis including over 1,000 patients, which subsumes Tothill, Bonome and TCGA samples that included expression profiles but lacked somatic mutation profiles41 (Fig. 6c and Supplementary Fig. 17) and incorporates an unknown number of nonserous ovarian cancer samples. Using the expression signature we had learned from NBS analysis of TCGA data, all patients could be assigned to one of the four NBS subtypes. In the Tothill data set, the subtype assignments were found to be significantly predictive of patient survival and platinum drug resistance (log-rank P = 6.1 × 10^−3 and 1.65 × 10^−6, respectively; Fig. 6c and Supplementary Fig. 17), following the same trends observed in the original TCGA cohort. In the Bonome and the meta-analysis data sets, the recovered subtypes were again significantly associated with patient survival (log-rank P = 1.40 × 10^−3 and 1.22 × 10^−4, respectively; Supplementary Fig. 17). We note that the proportions of the recovered subtypes in each of the three independent expression cohorts appeared to be different (Supplementary Table 2), a phenomenon possibly due to different criteria for inclusion in each study (for example, the TCGA ovarian cohort is primarily composed of high-grade, late-stage patients) or possibly differences due to population substructure. As a final control, we performed clustering of the Tothill expression profiles independent of NBS subtypes; this resulted in a different set of subtypes that associated with survival to a more limited extent (P = 0.01, Supplementary Fig. 18). These results show that tumor subtypes defined by NBS can be identified in independent data sets when gene expression is used as a surrogate biomarker.
Effects of different classes of mutation on stratification
We studied the impacts of different classes of somatic mutation on the NBS approach. We first tested the effect on NBS of disrupting synonymous mutations by reassigning them to new randomly chosen gene locations. For uterine and lung cancers (Fig. 7a and Supplementary Fig. 19, respectively), disruption of synonymous mutations had little effect on NBS performance. In sharp contrast, disruption of nonsynonymous mutations or of all mutations greatly affected stratification performance. Interestingly, in the ovarian cancer cohort (Fig. 7b), disruption of either synonymous or nonsynonymous mutations was detrimental to performance.
We also studied the effect of removing mutations judged to be nonfunctional in cancer by methods such as MutationAssessor43, cancer-specific high-throughput annotation of somatic mutations (CHASM)13 and the variant effect scoring tool (VEST)44, which use features such as sequence conservation and protein structural information to assess the likely impact of mutations. Filtering mutations with these tools resulted in decreased association of NBS subtypes with patient survival in all three cancers (Fig. 7c–e, with the possible exception of VEST for ovarian tumors: Fig. 7d). Finally, we studied the effect of removing genes with long sequences or late cell-cycle replication times: both of these characteristics have been postulated to accrue high numbers of mutations that may be unrelated to tumor progression45. We found that removal of long genes substantially degraded the ability to identify ovarian and lung subtypes predictive of survival (Fig. 7d,e). However, removal of late-replicating genes had little effect and, in the case of the lung tumor cohort, actually increased predictive power (Fig. 7e).
Here we have reported the discovery that, through the use of prior knowledge captured in molecular networks, a set of tumor mutation profiles can be stratified into subtypes that are both biologically and clinically informative. These subtypes are distinct from those recovered through stratification of other types of data and are independent of other clinical markers known to be associated with survival. We can identify network modules characteristic of each subtype, which may provide new insight into the biological mechanisms driving tumor progression. To our knowledge, this is the first time that somatic mutation profiles have been used to stratify patients in an unsupervised fashion.
One might consider at least three potential reasons for the good performance of NBS. First, somatic mutations represent a digital signal in that a given gene can be considered either mutated or not, whereas most other data layers are analog signals representing measurements of continuous values. In general, digital systems have improved accuracy and reproducibility and are more robust to noise46. Second, somatic mutation profiles are differential measurements between tumor and normal tissue, whereas expression and other 'omics profiles are absolute measurements in each patient. The differential analysis filters out mutations or variants present in the patient's germ line, leaving only tumor-specific changes. In contrast, it has been difficult to identify a true 'baseline' gene expression state for a tissue, as these measurements are dynamic and highly context specific. Finally, the somatic mutation profile captures the causal genetic events underlying tumor progression, whereas mRNA or protein expression profiles are a functional readout of the current cell state and are influenced by external factors that may be unrelated to tumor biology.
The network modules we identified as characteristic for each tumor subtype provide new insights into the biology of cancer and raise many new questions. One particularly promising finding was the prominence of the FGF pathway in ovarian tumor subtype 1 (Fig. 5). This pathway has been implicated in tumor proliferation and angiogenesis, and many inhibitors for this pathway are in clinical development47. Specifically, it has been shown that increased expression of FGF1 is associated with poor survival in ovarian cancer48, and inhibition of FGFR1 and FGFR2 increases sensitivity to cisplatin in ovarian cancer cell lines36. An intriguing question for future work is whether subtype 1 patients are particularly responsive to therapy directed at network-identified targets, such as treatment with inhibitors of FGFR1.
Another interesting observation is that several network modules are enriched for long genes. For example, for ovarian tumor subtype 2, a total of 12 of 176 genes in the module are in the top 2% by length (P = 2.3 × 10^−4). One prominent example is TTN, the longest known coding gene. Although prominent 'gold-standard' catalogs of cancer genes, such as the Catalogue of Somatic Mutations in Cancer (COSMIC) cancer gene census49 and the list of Vogelstein et al.50, are also enriched for long genes (for example, 17 of 125 in the Vogelstein list, P = 5.11 × 10^−10), there remains some controversy about the roles these genes may play in cancer. On the one hand, it is possible that long genes are highly mutated not because they are drivers of cancer but simply owing to chance because they are a bigger 'target' to hit. On the other hand, there is no definitive evidence that mutations in long genes are not functional or do not contribute to tumor progression. Our analysis provides some evidence that these long genes should not be ignored. In the molecular network, long mutated genes were highly interconnected to other functionally related genes of all lengths, which are also found to be mutated in patients of that subtype. For example, the network region for ovarian tumor subtype 1 (Fig. 5) showed TTN interconnected to genes such as NEB, ANK1 and MYOM2, all of which are also mutated in patients of this subtype. These genes encode components of the cytoskeleton thought to have both structural and signaling roles51. Although TTN is a long gene and thus might accrue mutations by chance, it is striking that other members of the same protein interaction neighborhood are also found to be mutated in tumors of the same subtype. Using permutation analysis, we estimated the probability of TTN having an immediate network neighborhood with this same number of mutations by chance to be P < 0.0001.
Thus, one possibility is that TTN and other cytoskeletal components are required for platinum-induced, P53-independent apoptosis, and that mutation in either structural or signaling proteins in this pathway leads to platinum resistance. In support of this theory is prior work demonstrating that cell shape is associated with chemotherapy response in ovarian cancer52.
Another interesting observation is that synonymous mutations, though dispensable for stratification of uterine and lung tumors, appear to have some predictive power in stratification of ovarian tumors. In support of this finding, a number of high-profile studies have suggested that synonymous mutations may indeed play a causal role in cancer progression53,54,55,56. Further study is needed to understand whether ovarian cancer is indeed the outlier in this respect and whether and how synonymous mutations truly function in this disease.
Finally, we see many opportunities to improve upon the basic concept of NBS in future work. First, integrating multiple layers of information beyond somatic mutations (for example, CNVs, epigenome and transcriptome) into a composite stratification method might further expand our ability to identify subtypes with clinically relevant differences. Second, although we have shown the utility of three sources of gene-gene interactions, there are other types of networks worth exploring, such as those involved in signaling, metabolism or transcription. Third, although this study focused on uterine, ovarian and lung cancers, the NBS method is broadly applicable to any cohort of cancer patients for which somatic mutations are known. Finally, analyzing NBS subtypes across all cancers simultaneously (i.e., a pan-cancer analysis) will offer the intriguing opportunity to explore whether the genes and networks underlying the progression of a tumor are more informative of clinical outcome than its tissue of origin.
Expanded overview of network-based stratification.
The technique of network-based stratification (NBS) combines genome-scale somatic mutation profiles with a gene interaction network to produce a robust subdivision of patients into subtypes (Fig. 1a). Briefly, somatic mutations for each patient are represented as a profile of binary (1, 0) states on genes, in which a '1' indicates a gene for which mutation has occurred in the tumor relative to germ line (i.e., a single-nucleotide base change or the insertion or deletion of bases). For each patient independently we project the mutation profiles onto a human gene interaction network obtained from public databases28,29,30. Next, the technique of network propagation31 is applied to spread the influence of each mutation profile over its network neighborhood (Fig. 1b). The result is a 'network-smoothed' profile in which the state of each gene is no longer binary but reflects its network proximity to the mutated genes in that patient along a continuous range [0, 1]. Following this 'network smoothing', patient profiles are clustered into a predefined number of subtypes (k = 2, 3, ... 12) using the unsupervised technique of non-negative matrix factorization32 (NMF; Fig. 1c). For NBS we use a variant of NMF that encourages the selection of gene sets supporting each subtype according to high network connectivity (NetNMF)58. Finally, to promote robust cluster assignments, we use the technique of consensus clustering33, in which the above procedure is repeated for 1,000 different subsamples in which subsets of 80% of patients and genes are drawn randomly without replacement from the entire data set. The results of all 1,000 runs are aggregated into a (patient × patient) co-occurrence matrix, which summarizes the frequency with which each pair of patients has cosegregated into the same cluster. This co-occurrence matrix is then clustered a second time to recover a final stratification of the patients into clusters/subtypes (Fig. 1d). 
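The consensus step described above (repeated subsampling of 80% of patients and genes, followed by aggregation into a patient × patient co-occurrence matrix) can be sketched as follows. This is an illustrative outline, not the released Matlab implementation. The clusterer `two_means` is a deliberately simple stand-in for NetNMF, and the function names, the toy data and the use of 100 runs instead of 1,000 are assumptions made to keep the example small.

```python
import numpy as np

def two_means(X, n_iter=20):
    """Toy stand-in clusterer (NOT NMF): 2-means with farthest-point initialization."""
    c0 = X[0]
    c1 = X[np.argmax(np.linalg.norm(X - c0, axis=1))]
    centers = np.stack([c0, c1]).astype(float)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(2):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def consensus_matrix(X, cluster_fn, n_runs=100, frac=0.8, seed=0):
    """Fraction of runs in which each pair of patients was clustered together,
    counted only over runs in which both patients were sampled."""
    rng = np.random.default_rng(seed)
    n_patients, n_genes = X.shape
    together = np.zeros((n_patients, n_patients))
    sampled = np.zeros((n_patients, n_patients))
    for _ in range(n_runs):
        # Draw 80% of patients and 80% of genes without replacement.
        rows = rng.choice(n_patients, size=int(frac * n_patients), replace=False)
        cols = rng.choice(n_genes, size=int(frac * n_genes), replace=False)
        labels = cluster_fn(X[np.ix_(rows, cols)])
        same = labels[:, None] == labels[None, :]
        together[np.ix_(rows, rows)] += same
        sampled[np.ix_(rows, rows)] += 1
    return np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)

# Toy data: 20 patients x 5 genes forming two well-separated groups.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.1, (10, 5)), rng.normal(5.0, 0.1, (10, 5))])
C = consensus_matrix(X, two_means)
```

The resulting matrix `C` would then be clustered a second time to obtain the final subtypes; that last step is not shown here.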
Our implementation of the NBS method is available for download as a Matlab package from http://idekerlab.ucsd.edu/software/NBS/ or as Supplementary Software. The former should be used for obtaining the most up-to-date versions.
Processing of patient mutation profiles.
High-grade serous ovarian cancer, uterine endometrial carcinoma and lung adenocarcinoma somatic mutation data were downloaded from the TCGA data portal on 8 August 2012, 1 January 2013 and 1 January 2013, respectively. Only mutation data generated using the Illumina GAIIx platform were retained for subsequent analysis, and patients with fewer than 10 mutations were discarded. This left 356 patients with mutations in 9,850 genes for the TCGA ovarian cohort, 248 patients with mutations in 17,968 genes for the TCGA uterine endometrial cohort and 381 patients with mutations in 15,967 genes in the TCGA lung adenocarcinoma cohort. Patient mutation profiles were constructed as binary vectors such that a bit is set if the gene corresponding to that position in the vector harbors a mutation in that patient. Additional details on processing and organization of the data are available in a previous TCGA publication2.
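As an illustration of the binary encoding described above, the following sketch builds patient-by-gene profiles from a list of mutation calls and drops patients below a minimum mutation count. The input format, a list of (patient, gene) pairs, and the function name `build_mutation_profiles` are hypothetical; the actual TCGA file parsing is not shown.

```python
from collections import defaultdict

def build_mutation_profiles(calls, min_mutations=10):
    """Return (patients, genes, matrix) with matrix[i][j] = 1 if patient i
    carries at least one mutation in gene j, else 0.

    calls: iterable of (patient_id, gene_symbol) pairs, one per mutation call
    (an assumed input format). Patients with fewer than `min_mutations`
    calls are discarded.
    """
    per_patient = defaultdict(list)
    for patient, gene in calls:
        per_patient[patient].append(gene)
    kept = {p: set(g) for p, g in per_patient.items() if len(g) >= min_mutations}
    patients = sorted(kept)
    genes = sorted(set().union(*kept.values())) if kept else []
    matrix = [[1 if g in kept[p] else 0 for g in genes] for p in patients]
    return patients, genes, matrix

# Toy example with a threshold of 2 so that the filtering is visible.
calls = [("P1", "TP53"), ("P1", "TTN"), ("P1", "TP53"),
         ("P2", "BRCA1"),
         ("P3", "TTN"), ("P3", "BRCA1")]
patients, genes, matrix = build_mutation_profiles(calls, min_mutations=2)
```

Here patient P2 is dropped because it has a single call, and repeated calls in the same gene (TP53 in P1) still yield a single set bit.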
Sources of molecular network data.
Patient mutation profiles were mapped onto gene interaction networks from three sources: STRING v.9 (ref. 29), HumanNet v.1 (ref. 28) and PathwayCommons30 (Supplementary Table 1). STRING integrates protein-protein interactions from literature curation, computationally predicted interactions, and interactions transferred from model organisms based on orthology. HumanNet uses a naïve Bayes approach to weight different types of evidence together into a single interaction score focusing on data collected in humans, yeast, worms and flies. PathwayCommons aggregates interactions from several pathway and interaction databases, focused primarily on physical protein-protein interactions (PPIs) and functional relationships between genes in canonical regulatory, signaling and metabolic pathways (including hallmark pathways of cancer). Supplementary Table 1 summarizes the number of genes and interactions used in our analysis from each of these three networks.
All network sources comprise a combination of interaction types, including direct protein-protein interactions between a pair of gene products and indirect genetic interactions representing regulatory relationships between pairs of genes (for example, coexpression or TF activation). The PathwayCommons network was filtered to remove any nonhuman genes and interactions, and all remaining interactions were used for subsequent analysis. Only the most confident 10% of interactions for both the STRING and HumanNet networks were used for this work, ordered according to the quantitative interaction score provided as part of both networks. This threshold was chosen using an independent ROC analysis with respect to a set of Gene Ontology–derived gold standards (data not shown). After filtering of edges, all networks were used as unweighted, undirected networks.
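The edge-filtering step above (keeping the most confident 10% of scored interactions and then discarding weights and direction) might look like the sketch below. The function name `top_confidence_edges`, the triple-based input format, and the handling of duplicate pairs and self-loops are assumptions for illustration, not details taken from the study.

```python
def top_confidence_edges(scored_edges, fraction=0.10):
    """Keep the highest-scoring `fraction` of interactions as a set of
    unweighted, undirected edges (each edge is a frozenset of two genes).

    scored_edges: iterable of (gene_a, gene_b, score) triples (assumed format).
    Assumptions: self-loops are dropped, and if a pair appears more than
    once its highest score is used.
    """
    best = {}
    for a, b, score in scored_edges:
        if a == b:
            continue
        key = frozenset((a, b))
        best[key] = max(score, best.get(key, float("-inf")))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    n_keep = max(1, int(len(ranked) * fraction)) if ranked else 0
    return {key for key, _ in ranked[:n_keep]}

# Toy example: five distinct gene pairs, keep the top 40% (two edges).
edges = [("A", "B", 0.90), ("B", "A", 0.95), ("B", "C", 0.20),
         ("C", "D", 0.80), ("D", "E", 0.10), ("E", "F", 0.30)]
kept = top_confidence_edges(edges, fraction=0.4)
```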
After mapping a patient mutation profile onto a molecular network, network propagation31 is applied to 'smooth' the mutation signal across the network. Network propagation uses a process that simulates a random walk on a network (with restarts) according to the function
$$F_{t+1} = \alpha F_t A + (1 - \alpha) F_0$$
$F_0$ is a patient-by-gene matrix, and $A$ is a degree-normalized adjacency matrix of the gene interaction network, created by multiplying the adjacency matrix by a diagonal matrix with the inverse of its row (or column) sums on the diagonal. $\alpha$ is a tuning parameter governing the distance that a mutation signal is allowed to diffuse through the network during propagation. The optimal value of $\alpha$ is network dependent (0.7, 0.5 and 0.7 for HumanNet, PathwayCommons and STRING, respectively), but the specific value seems to have only a minor effect on the results of NBS over a sizable range (for example, 0.5–0.8). The propagation function is run iteratively with $t = 0, 1, 2, \ldots$ until $F_{t+1}$ converges (the matrix norm $\lVert F_{t+1} - F_t \rVert < 1 \times 10^{-6}$). Following propagation, the rows of the resultant matrix $F_t$ are quantile normalized to ensure that the smoothed mutation profile for each patient follows the same distribution.
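A minimal NumPy sketch of the propagation and row-wise quantile normalization steps is given below. It assumes the standard random-walk-with-restart update, in which the next state is α times the current state multiplied by A, plus (1 − α) times the initial profile. It also assumes row normalization of the adjacency matrix and the Frobenius norm as the convergence criterion. The function names and the `max_iter` safeguard are illustrative and not taken from the original implementation.

```python
import numpy as np

def normalize_adjacency(adj):
    """Divide each row of the adjacency matrix by its degree (row sum)."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_deg = np.divide(1.0, deg, out=np.zeros_like(deg), where=deg > 0)
    return adj * inv_deg[:, None]

def propagate(F0, A, alpha=0.7, tol=1e-6, max_iter=1000):
    """Iterate F <- alpha * F @ A + (1 - alpha) * F0 until convergence."""
    F0 = np.asarray(F0, dtype=float)
    F = F0.copy()
    for _ in range(max_iter):
        F_next = alpha * F @ A + (1 - alpha) * F0
        if np.linalg.norm(F_next - F) < tol:  # Frobenius norm (assumed)
            return F_next
        F = F_next
    return F

def quantile_normalize_rows(F):
    """Give every row (patient) the same distribution of values.

    Each value is replaced by the mean, across patients, of the values
    holding the same within-row rank. Ties are broken arbitrarily.
    """
    ranks = np.argsort(np.argsort(F, axis=1), axis=1)
    mean_sorted = np.sort(F, axis=1).mean(axis=0)
    return mean_sorted[ranks]
```

With a row-normalized A and no isolated genes, each propagated row keeps the same total as the corresponding input profile, which gives a quick sanity check on the update.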
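Network-regularized NMF is an extension of non-negative matrix factorization (NMF) that constrains the factorization to respect the structure of an underlying gene interaction network. This is accomplished by minimizing the following objective function using an iterative method32,58,59:
$$\min_{W,H \geq 0} \; \lVert F - WH \rVert_F^2 + \mathrm{trace}(W^t K W)$$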
W and H form a decomposition of the patient × gene matrix F (resulting from network smoothing as described above) such that W is a collection of basis vectors, or 'metagenes', and H is the basis vector loadings. The trace(W^t K W) term constrains the basis vectors in W to respect local network neighborhoods. The term K is an adjacency matrix of a nearest-neighbor network derived from the graph Laplacian of an influence distance matrix23 that is derived from the original network. The degree to which local network topology versus global network topology constrains W is determined by the number of nearest neighbors. We experimented with including between 5 and 50 neighbors in the nearest-neighbor network and observed only small changes in outcome (data not shown). For the work presented in this manuscript, the 11 most influential neighbors of each gene in the network, as determined by network influence distance, were used.
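To make the role of the network penalty concrete, the sketch below implements one common way of fitting such a model: the multiplicative updates of graph-regularized NMF. It is not necessarily the iterative method used in this work. It makes several assumptions that are not in the text: the penalty uses the graph Laplacian L = D − S of a symmetric nearest-neighbor adjacency matrix S (`knn_adj`), an explicit weight `lam` scales the penalty, and the factorization is written as F ≈ H Wᵗ so that W is gene × k and the matrix dimensions agree.

```python
import numpy as np

def network_regularized_nmf(F, knn_adj, k, lam=1.0, n_iter=200, seed=0, eps=1e-9):
    """Graph-regularized NMF via multiplicative updates (illustrative).

    F       : non-negative patient x gene matrix (smoothed profiles)
    knn_adj : symmetric non-negative gene x gene adjacency matrix
    Returns W (gene x k metagenes) and H (patient x k loadings).
    """
    rng = np.random.default_rng(seed)
    n_pat, n_gene = F.shape
    H = rng.random((n_pat, k))
    W = rng.random((n_gene, k))
    D = np.diag(knn_adj.sum(axis=1))  # degree matrix of the kNN graph
    for _ in range(n_iter):
        # Loadings: ordinary NMF update.
        H *= (F @ W) / (H @ (W.T @ W) + eps)
        # Metagenes: NMF update plus the network smoothness term.
        W *= (F.T @ H + lam * knn_adj @ W) / (W @ (H.T @ H) + lam * D @ W + eps)
    return W, H

def nmf_objective(F, W, H, knn_adj, lam=1.0):
    """Reconstruction error plus the Laplacian penalty on the metagenes."""
    L = np.diag(knn_adj.sum(axis=1)) - knn_adj
    return np.linalg.norm(F - H @ W.T) ** 2 + lam * np.trace(W.T @ L @ W)
```

Each patient can then be assigned to the metagene with the largest loading, for example `H.argmax(axis=1)`, which is the cluster label consumed by a consensus clustering step.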
Clustering was performed with a standard consensus clustering framework, discussed in detail by Monti et al.33 and used in previous TCGA publications2,18,60. Briefly, we used network-regularized NMF (see above) to derive a stratification of the input cohort. In order to ensure robust clustering, network-regularized NMF was performed 1,000 times on subsamples of the data set. In each subsample, we sampled 80% of the patients and 80% of the mutated genes at random without replacement. The set of clustering outcomes for the 1,000 samples was then transformed into a co-clustering matrix. This matrix records the frequency with which each patient pair was observed to have membership in the same subtype over all clustering iterations in which both patients of the pair were sampled. The result is a similarity matrix of patients, which we then used to stratify the patients by applying either average linkage hierarchical clustering or a second symmetric NMF step. Patients showing poor cluster association to a single subtype were excluded from further analysis.
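The subsampling and co-clustering steps can be sketched as below. The clustering routine is passed in as a hypothetical callable `cluster_fn` that takes a patient × gene submatrix and returns one label per row. In this work that role is played by network-regularized NMF, but any function with that interface serves for the illustration. The final hierarchical clustering or symmetric NMF step is omitted.

```python
import numpy as np

def coclustering_matrix(F, cluster_fn, n_iter=1000, frac=0.8, seed=0):
    """Fraction of subsamples in which each patient pair shares a cluster,
    counted only over subsamples that contain both patients."""
    rng = np.random.default_rng(seed)
    n_pat, n_gene = F.shape
    together = np.zeros((n_pat, n_pat))
    sampled = np.zeros((n_pat, n_pat))
    for _ in range(n_iter):
        # Sample 80% of patients and 80% of genes without replacement.
        pats = rng.choice(n_pat, size=int(frac * n_pat), replace=False)
        genes = rng.choice(n_gene, size=int(frac * n_gene), replace=False)
        labels = np.asarray(cluster_fn(F[np.ix_(pats, genes)]))
        same = labels[:, None] == labels[None, :]
        sampled[np.ix_(pats, pats)] += 1
        together[np.ix_(pats, pats)] += same
    # Pairs never sampled together are left at 0.
    return np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)
```

The returned matrix is symmetric with values in [0, 1] and can be treated as a patient similarity matrix for the final stratification step.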
Simulation of somatic mutation cohorts.
We used simulations to determine the ability of NBS to recover subtypes from somatic mutation profiles. In order to quantify the performance of NBS, we needed a cohort with specified subtypes as a 'ground truth' reference and to be able to control the properties of the simulated signal determining the different subtypes. We simulated a somatic mutation cohort as follows. Patient mutation profiles were sampled with replacement from the TCGA ovarian data set. For each patient, the mutation profile was permuted, whereas the per-patient mutation frequency was kept invariant; this resulted in a background mutation matrix with no subtype signal. For simulation of an underlying network structure for NBS to detect, a network-based signal was added to the patient-by-mutation matrix as follows. First, we established a set of network communities (i.e., connected components enriched for edges shared within community members) in the input network (STRING, HumanNet or PathwayCommons) using the network community detection algorithm QCut61. Next, we divided the patient cohort randomly into four equal-sized subtypes (four was selected as reasonable owing to the four expression-based subtypes that have been identified for glioblastoma, ovarian and breast cancers2,18,60,62). Each subtype was assigned a small number (for example, 1–6) of network modules that together had a combined size s ranging from 10 to 250 genes. These network modules represent 'driver' subnetworks characterizing the subtype. For each patient, we reassigned a fraction of the patient's mutations f to genes covered by the driver modules for that patient's subtype. This procedure resulted in a patient × gene mutation matrix with underlying network structure while maintaining the per-patient mutation frequency.
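The per-patient part of this simulation can be sketched as follows. This is one possible reading of the procedure, not the original code: the function name `simulate_patient` is illustrative, genes are represented by integer indices, and reassignment is implemented by moving mutations from non-driver genes to currently unmutated driver-module genes so that the per-patient mutation count is preserved.

```python
import random

def simulate_patient(profile, driver_genes, f, rng):
    """Permute one binary mutation profile, then plant a driver signal.

    profile      : list of 0/1 values over genes
    driver_genes : gene indices in the driver modules of the patient's subtype
    f            : fraction of the patient's mutations to reassign
    rng          : random.Random instance
    """
    n_genes = len(profile)
    n_mut = sum(profile)
    drivers = set(driver_genes)

    # Background: same number of mutations at random positions.
    mutated = set(rng.sample(range(n_genes), n_mut))

    # Move a fraction f of mutations onto driver-module genes.
    free_drivers = sorted(drivers - mutated)
    movable = sorted(mutated - drivers)
    n_move = min(round(f * n_mut), len(free_drivers), len(movable))
    for old, new in zip(rng.sample(movable, n_move), rng.sample(free_drivers, n_move)):
        mutated.remove(old)
        mutated.add(new)

    return [1 if g in mutated else 0 for g in range(n_genes)]
```

A full cohort would apply this function to profiles sampled with replacement from the real data, using the driver modules of whichever of the four subtypes each simulated patient was assigned to.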
The number of driver mutations in a tumor was recently proposed to plausibly range between 2 and 8 (ref. 50). We note that in our simulation framework, a 4% mutation rate corresponds to between 1 and 9 mutations with a median of 3, which is on par with the aforementioned estimate. In order to estimate the appropriate size of cancer pathways (s), we examined the known cancer pathways in the NCI-Nature pathway interaction database63. We observe that pathways in the database are of varying sizes, 2–139 genes, with a median size of 34, and more than 23% of pathways include more than 50 genes.
Identifying differentially mutated subnetworks.
After applying NBS, we identified genes that were enriched for mutation in each of the subtypes relative to the whole cohort. To do this we applied the significance analysis of microarrays (SAM) method64 on the network-smoothed mutation profiles. This is a nonparametric method developed for discovering differentially expressed genes in microarray experiments. We used a rank-based Wilcoxon-type statistic and compared each subtype against the remaining cohort. Significance was assessed using the SAM permutation scheme with 1,000 permutations. The resulting set of genes for each subtype was overlaid on the network used for network smoothing.
Survival analysis was performed using the R “survival” package. We fit a Cox proportional hazards model65 to determine the relationship between the NBS-assigned subtypes and patient survival. A likelihood-ratio test and associated P value were calculated by comparing the full model, which includes subtypes and clinical covariates, against a baseline model that includes covariates only. Clinical covariates available in TCGA and included in the model were age, grade, stage, residual surgical resection and mutation rate, as well as cigarette smoking status for the lung cancer cohort.
Comparing predictive power and overlap with TCGA subtypes.
Added predictive power is estimated using a likelihood-ratio test comparing the Cox proportional hazards model that includes subtypes and clinical covariates (age, stage, grade, mutation frequency and residual tumor presence after surgery) against a covariate-only model. Significance of overlap is assessed using Pearson's χ2 test of independence between NBS subtypes obtained with a specific network and number of subtypes (ovarian, HumanNet, four subtypes; lung, HumanNet, six subtypes; uterine, STRING, three subtypes) and the subtypes reported by the TCGA for the different data types, with varying numbers of subtypes, derived using consensus-clustering NMF. TCGA subtypes were downloaded from the Firehose run from 25 May 2012 (http://gdac.broadinstitute.org/runs/analyses__2012_05_25/reports/cancer/OV/).
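For reference, the overlap statistic can be computed from two subtype assignments as in the sketch below. The function name `chi2_independence` is illustrative. The sketch returns only the Pearson χ2 statistic and its degrees of freedom. A P value would then be read from the upper tail of the χ2 distribution with that many degrees of freedom, using a statistics library.

```python
from collections import Counter

def chi2_independence(labels_a, labels_b):
    """Pearson chi-square statistic and degrees of freedom for the
    contingency table of two label vectors over the same patients."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    row = Counter(labels_a)
    col = Counter(labels_b)
    stat = 0.0
    for a in row:
        for b in col:
            # Expected count under independence of the two labelings.
            expected = row[a] * col[b] / n
            stat += (joint[(a, b)] - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof
```

The statistic is 0 for two labelings that are exactly independent in the sample and grows as the subtype assignments overlap more strongly.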
Shrunken-centroid prediction on expression profiles.
We used shrunken centroids to derive an expression signature equivalent to the somatic mutation-based NBS subtypes. Expression data were provided by Győrffy et al.41, who aggregated several expression data sets as part of a meta-analysis of ovarian cancer. In this analysis, all data were regularized using quantile and MAS5 normalization. We performed this analysis on the Tothill et al.40 (ovarian serous samples only), Bonome et al.42 and TCGA data sets, as well as across the full meta-analysis cohort. We used the “pamr” R package with default parameters to train a shrunken-centroid model38 on mRNA expression levels for all genes in the TCGA ovarian data set with subtype assignment as the class label. The trained model was next used to predict subtype labels on the held-out Tothill et al. and Bonome et al. data or the full meta-analysis expression cohort (excluding any TCGA samples included in the training set).
We include a table of the class centroids for each of the three TCGA somatic mutation cohorts and the four expression cohorts of ovarian cancer included in this study (Supplementary Table 5).
Missense mutations were scored using three methods: CHASM13, VEST44 and MutationAssessor43. CHASM and VEST use supervised machine learning to score mutations. The CHASM training set is composed of a positive class of driver mutations from the COSMIC database and a negative class of synthetic passenger mutations simulated according to the mutation spectrum observed in the tumor type under study. The VEST training set comprises a positive class of disease mutations from the Human Gene Mutation Database66 and a negative class of variants detected in the ESP6500 (http://evs.gs.washington.edu/EVS/) cohort with an allele frequency of >1%. MutationAssessor uses patterns of conservation from protein alignments of large numbers of homologous sequences to assess the functional impact of missense mutations. CHASM and VEST scores were obtained from the CRAVAT webserver44 (http://www.cravat.us/). MutationAssessor precomputed mutation scores were downloaded from http://mutationassessor.org/. After using each method to score all mutations across all patients, we picked a permissive threshold for retaining mutations to use for NBS (retaining the top 75% of mutations as scored by CHASM and VEST and using MutationAssessor with the “low threshold” setting).
RepliSeq67 data for the GM12878 cell line were downloaded from the ENCODE project website (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwRepliSeq/, downloaded March 2013). Summed normalized tag densities were used as a proxy for replication time (higher counts indicating that a transcript was replicated earlier in the cell cycle). Normalized tag densities for RefSeq protein coding regions were retrieved using bigWigAverageOverBed68 with RefSeq gene sequence features in .gff3 format downloaded from http://www.yandell-lab.org/software/VAAST/data/hg19/Features/refGene_hg19.gff3. Tag densities were averaged for each transcript, and the longest transcript was selected to represent each gene.
We would like to thank all members of the Ideker lab, specifically J. Dutkowski, R. Srivas, G. Bean, M. Yu and M. Choueiri, for many fruitful discussions during various stages of this project. We also thank G. Hofree for her input, patience and support. J.P.S. is supported in part by grants from the Marsha Rivkin Center for Ovarian Cancer Research and the Conquer Cancer Foundation of the American Society of Clinical Oncology. This work was supported by US National Institutes of Health grants P41 GM103504 and P50 GM085764.