Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

# Integrative pathway enrichment analysis of multivariate omics data

## Abstract

Multi-omics datasets represent distinct aspects of the central dogma of molecular biology. Such high-dimensional molecular profiles pose challenges to data interpretation and hypothesis generation. ActivePathways is an integrative method that discovers significantly enriched pathways across multiple datasets using statistical data fusion, rationalizes contributing evidence and highlights associated genes. As part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2658 cancers across 38 tumor types, we integrated genes with coding and non-coding mutations and revealed frequently mutated pathways and additional cancer genes with infrequent mutations. We also analyzed prognostic molecular pathways by integrating genomic and transcriptomic features of 1780 breast cancers and highlighted associations with immune response and anti-apoptotic signaling. Integration of ChIP-seq and RNA-seq data for master regulators of the Hippo pathway across normal human tissues identified processes of tissue regeneration and stem cell regulation. ActivePathways is a versatile method that improves systems-level understanding of cellular organization in health and disease through integration of multiple molecular datasets and pathway annotations.

## Introduction

Pathway enrichment analysis is an essential step for interpreting high-throughput (omics) data that uses current knowledge of genes and biological processes. A common application determines statistical enrichment of molecular pathways, biological processes and other functional annotations in long lists of candidate genes1. Genomic, transcriptomic, proteomic and epigenomic experiments emphasize complementary aspects of underlying biology and are best analyzed integratively, as is now routinely done in large-scale projects such as The Cancer Genome Atlas (TCGA)2, Clinical Proteome Tumor Analysis Consortium (CPTAC), International Cancer Genome Consortium (ICGC)3,4, Genotype-Tissue Expression (GTEx)5, and others6. Thus, simultaneous analysis of multiple candidate gene lists for characteristic pathways is increasingly needed.

Numerous approaches are available for interpreting single gene lists. For example, the GSEA algorithm can detect upregulated and downregulated pathways in gene expression datasets7. Web-based methods such as Panther8, ToppCluster9, and g:Profiler10 detect significantly enriched pathways amongst ranked or unranked gene lists and are generally applicable to genes and proteins from various analyses. Some approaches allow analysis of multiple input gene lists however these primarily rely on visualization rather than data integration to evaluate the contribution of distinct gene lists towards each detected pathway9,10. Finally, no methods are available for unified pathway analysis of coding and non-coding mutations from whole-genome sequencing (WGS) data, or integrating these with other types of DNA aberrations such as copy number changes and balanced genomic rearrangements.

Cancer genomes are characterized by multiple classes of mutations, including single nucleotide variants (SNVs), small insertions–deletions (indels), copy number alterations, and translocations. These affect a small number of frequently mutated pan-cancer driver genes such as TP53, less-frequent and tissue-specific genes such as SPOP in prostate cancer, and numerous infrequently mutated genes. The majority of currently known driver mutations of SNVs and indels affect protein-coding sequence11 and only few high-confidence non-coding drivers have been found, such as the mutation hotspots in the TERT promoter12. Discovery of coding and non-coding driver mutations is a major goal of large cancer whole genome sequencing efforts. The PCAWG Consortium aggregated whole genome sequencing data from 2658 cancers across 38 tumor types generated by the ICGC and TCGA projects. These sequencing data were re-analysed with standardized, high-accuracy pipelines to align to the human genome (reference build hs37d5) and identify germline variants and somatically acquired mutations, as described in the PCAWG marker paper4. A consensus analysis of variant calls in PCAWG tumors generated a high-confidence catalog of driver mutations in protein-coding driver genes (CDS) and non-coding regions of 5′ and 3′ untranslated elements (UTRs), promoters and enhancers13. A consensus pathway and network analysis of PCAWG driver mutations used knowledge of molecular pathways and gene interaction networks as priors to further discover infrequent candidate driver variants including those in the non-coding genome14.

Here we report the development of the ActivePathways method that uses data fusion techniques to address the challenge of integrative pathway analysis of multi-omics data. It detects significantly enriched pathways across multiple datasets, including those pathways that are not apparent in any individual dataset. We present several analyses to demonstrate this method. First, we integrate cancer driver genes with coding and non-coding mutations predicted using the PCAWG dataset13 and reveal numerous processes and additional genes with frequent coding and non-coding mutations. Second, we integrate patient clinical information with transcriptomic and copy number alterations in breast cancers of the METABRIC project15 to discover prognostic pathways and processes in breast cancer subtypes. Third, we integrate transcriptomic data of normal tissues of the GTEx project5 with ChIP-seq data to infer gene regulatory networks and biological processes downstream of the Hippo pathway of tissue growth control and regeneration. Thus ActivePathways is a versatile method for combining diverse multiomics datasets.

## Results

### Multi-omics pathway enrichment analysis with ActivePathways

ActivePathways is a simple three-step method that extends our earlier work10 (Fig. 1). It requires two input datasets. The first input is a table of P-values with genes listed in rows and evidence from distinct datasets listed in columns. The columns can include P-values of differential gene expression, gene essentiality, mutation or copy number alteration burden and many others that are derived using platform-specific quantification methods. The second input is a collection of gene sets that represents collective knowledge of gene function and interactions we refer to as pathways. The most common analysis utilizes biological processes from gene ontology16 (GO) and molecular pathways from the Reactome database17. Depending on the hypothesis, these data may also include many other types of gene sets such as targets of transcription factors or microRNAs.

In the first step of ActivePathways, we derive an integrated gene list that for each input gene aggregates significance from multiple omics datasets. The integrated gene list is compiled by a fusion of gene significance from different omics datasets (i.e., evidence) using the Brown’s extension18 of the Fisher’s combined probability test. The Brown’s method considers dependencies between datasets and thus provides more conservative estimates of significance for genes that are supported by multiple similar omics datasets. The integrated input gene list is then ranked by decreasing significance and filtered using a lenient cut-off, designed to capture additional candidate genes with sub-significant signals while discarding the bulk of insignificant genes (unadjusted Brown Pgene < 0.1). In the second step, a pathway enrichment analysis is conducted on the integrated gene using a ranked hypergeometric test10 and a collection of gene sets (i.e., biological processes, molecular pathways, and other gene annotations). The ranked hypergeometric test is designed to capture smaller pathways tightly associated with few top-ranking genes and also broader processes associated with larger subsets of input genes. The family-wise multiple testing correction method by Holm19 is then applied across tested pathways to select the pathways significantly enriched in the integrated gene list (Qpathway < 0.05). In the third step, we perform a similar analysis on the gene lists of individual omics datasets separately to determine the omics evidence supporting the integrative pathway analysis results determined in step 2. Importantly, the third step also highlights pathways that are only found through data integration and are not apparent in any single omics dataset separately, providing the added value of integrated analysis. Finally, the method provides input files for Enrichment Map20 for visualizing resulting pathways with the corresponding omics evidence.

### Integrating coding and non-coding drivers in 2658 cancer genomes

We performed an integrative pathway analysis of driver genes predicted in the PCAWG project based on somatic SNVs and indels. This analysis comprised 29 cancer patients cohorts of histological tumor types and 18 meta-cohorts combining multiple types of tumors, with 47 cohorts in total (Supplementary Table 1). ActivePathways identified at least one significantly enriched process or pathway in 89% of these cohorts (42/47, Qpathway < 0.05, ranked hypergeometric test) (Fig. 2a). We analyzed the evidence supporting predictions of enriched pathways: most cohorts showed enrichments in pathways supported by protein-coding mutations in genes (37/47 or 79%). This serves as a positive control since the majority of currently-known cancer driver genes have frequent protein-coding mutations.

Non-coding mutations in genes also contributed broadly to the discovery of frequently mutated biological processes and pathways: 24/47 cohorts (51%) showed significantly enriched pathways that were apparent when only analyzing non-coding driver scores corresponding to UTRs, promoters or enhancers. The majority of PCAWG tumor cohorts (41/47 or 87%) revealed some frequently mutated pathways that were apparent when integrating coding and non-coding mutations however remained undetected when considering either coding or non-coding mutations separately, emphasizing the value of our integrative approach. As expected, cohorts with more patient tumor samples generated more significantly enriched pathways (Spearman ρ = 0.74, P = 2.3 × 10−9; Supplementary Fig. 1), suggesting that larger datasets are better powered to distinguish rarely mutated genes involved in biological pathways and processes. Discovery of pathways enriched in non-coding mutations suggests that our integrative pathway analysis is an attractive strategy for illuminating the dark matter of the non-coding cancer genome.

We studied the adenocarcinoma cohort of 1773 samples of 16 tumor types. Integrative pathway analysis highlighted 432 genes that were significantly enriched in 526 pathways (Qpathway < 0.05) (Fig. 2b). As expected, the majority of pathways were only supported by genes with frequent coding mutations (328/526 or 62%). However, an additional set of 101 pathways (19%) was supported by both coding and non-coding gene mutations, 72 pathways (14%) were only apparent in the integrated analysis of both coding and non-coding mutations, and 25 (5%) were only enriched in non-coding mutations. Accumulation of these individually infrequent non-coding mutations into relevant pathways and processes is apparent in our integrative analysis and remains undetected in a gene-focused analysis.

The major biological themes with frequent protein-coding mutations included hallmark cancer processes such as ‘apoptotic signaling’ (24 genes; Qpathway = 4.3 × 10−5) and ‘mitotic cell cycle’ (8 genes; Qpathway = 0.0026), and additional biological processes such as chromatin modification and RNA splicing that are increasingly recognized in cancer biology. Thus, ActivePathways captures the expected cancer pathways enriched in driver genes with protein-coding mutations as positive controls. In contrast to these solely protein-coding driver associations, a large group of developmental processes and signal transduction pathways was detected as enriched in both coding and non-coding mutations in genes; for example ‘embryo development process’ was supported by mutations in exons, 3′UTRs and gene promoters (68 genes; Qpathway = 2.9 × 10−12), while the Reactome pathway ‘repression of WNT target genes’ was only apparent in the integrated analysis of coding and non-coding mutations but not in either dataset alone (5 genes, Qpathway = 0.016). In summary, these data show that ActivePathways is a sensitive approach for integrating multi-omics signals such as coding and non-coding mutations, interpreting supporting omics evidence, and finding additional functional associations that are not apparent in any single input dataset.

### Pathway-based prioritization of rarely mutated cancer genes

Pathway analysis can identify candidate genes that would otherwise remain undetected in gene-based analyses. ActivePathways enhances such discovery by integrating signals across multiple datasets. In the pathway analysis of coding and non-coding mutations in PCAWG, we focused on a group of processes involved in kidney development that were exclusively detected through the integration of coding and non-coding mutations (Fig. 2c, d). ActivePathways found 18 genes involved in these processes, only five of which were predicted as driver genes in the consensus driver analysis of the PCAWG project13. Additional known cancer genes included the oncogene MYC with 13 patients with 3′UTR mutations (driver P-value PUTR3 = 4.8 × 10−4), the transcription factor SMAD3 of the TGF-β pathway with 14 patients with protein-coding mutations (PCDS = 4.0 × 10−4) and the growth inhibitory tumor suppressor gene TSC1 with 23 patients with protein-coding mutations (PCDS = 1.4 × 10−4) as well as candidate cancer genes such as IQGAP1 with ten patients with promoter mutations (Ppromoter = 8.2 × 10−4) that encodes a signaling protein involved in cell motility and morphology. The additional genes remained below the FDR-adjusted significance cut-off in the gene-focused consensus driver analysis (Q = 0.17–0.62), however were found by ActivePathways due to their associations with kidney development. Thus ActivePathways can exploit functional gene annotations and multiple omics signals to find further candidate genes that remain undetected in gene-focused analyses.

We evaluated the effects of our data integration strategy and examined all 333 pathway-associated candidate genes detected in the adenocarcinoma cohort (Fig. 2e). As expected, these included a considerable proportion of known cancer genes. First, as positive controls we found 60/64 significantly mutated genes that were also identified in the PCAWG consensus driver analysis13. In addition, we found a set of 47/333 known cancer genes annotated in the COSMIC Cancer Gene Census database11 that were not detected in the driver analysis, significantly more than expected by chance alone (seven genes expected, Fisher’s exact P = 4.0 × 10−24). Those included well-known cancer driver genes MYC, IDH1, NF1, and BCL9. ActivePathways was able to detect these additional genes for several reasons. First, the integrated gene list was filtered using a lenient statistical cut-off in ActivePathways (Pgene < 0.1) that allowed a long tail of 273/333 genes with less-frequent mutations to be detected through pathway associations. Second, certain genes were upgraded through the data fusion procedure as a single stronger P-value per gene was derived by combining multiple weaker P-values corresponding to the coding regions, promoters, UTRs, enhancers of the gene. This affected 17/333 pathway-associated genes including six known cancer genes (HNRNPA2B1, STAG2, TCF7L2, SUZ12, CLTC, and ZNF521), Thus, the integration procedure prioritized specific pathway-related genes compared to their original rankings in individual mutation datasets. However, the majority of genes showed reduced significance after the fusion and were excluded from the pathway analysis: 3543 genes had at least one significant P-value prior to data fusion (Pgene < 0.1) while 88% of these (3112) were considered insignificant following the Brown P-value combination step. In contrast, the majority of pathway-associated genes (220/333) showed improved rankings in the integrated gene list compared to their original rankings in individual input datasets. This formal combination of P-values across omics datasets is therefore more conservative than a naïve approach of selecting a top P-value for every gene. Thus, ActivePathways finds additional candidate genes that remain undetected in gene-by-gene analyses and are highlighted due to their multiple omics signals in pathways.

### Integrating prognostic CNA and mRNA signals in breast cancer

To demonstrate an integrative analysis of patient clinical information with multiple types of omics data, we studied the pathways and processes associated with patient prognosis in breast cancer. We leveraged the METABRIC dataset15 of 1780 breast cancer samples drawn from all four subtypes (HER2-enriched, basal-like, luminal-A, luminal-B) and evaluated all genes using three types of prognostic evidence. mRNA abundance profiles were deconvolved between mRNA abundance levels in tumor cells (TC) and tumor-adjacent cells (TAC) using the ISOpure algorithm21,22. mRNA values were associated with patient survival using median dichotomization and log-rank tests. Gene copy number alterations (CNA) were included as the third type of evidence and associated with patient survival using log-rank tests.

ActivePathways identified 192 significantly enriched GO biological processes and Reactome pathways across the four subtypes of breast cancer, of which nine pathways were enriched in multiple cancer subtypes and 33 pathways were only apparent through the integrative pathway analysis but not in any of the CNA or mRNA datasets alone. The major findings enriched in prognostic signatures in breast cancer subtypes involved the processes and pathways of immune response, apoptosis, ribosome biogenesis and chromosome segregation (Fig. 3a).

Immune activity was associated with prognostic genes in basal-like and HER2-enriched breast cancers with significant enrichment of GO processes such as ‘immune system development’ (Qbasal = 3.0 × 10−4, 113 genes; QHER2 = 0.035, 61 genes; ranked hypergeometric test) and ‘lymphocyte differentiation’ (QHER2 = 6.8 × 10−4, 46 genes; Qbasal = 8.4 × 10−4, 45 genes). The majority of related genes were associated with improved patient prognosis upon increased mRNA abundance in tumor cells or tumor-adjacent cells, comprising 50/61 genes in the HER2-enriched subtype and 78/113 genes in the basal subtype (Fig. 3b). Interestingly, only a minority of these genes (10) were significant in both of the two breast cancer subtypes, suggesting different modes of immune activity in subtypes and emphasizing the power of our pathway-based approach. Basal-like breast cancers were associated with additional 67 terms involving immune response and blood cells, however no immune-related terms were enriched for luminal subtypes of breast cancers. Prognostic features of immune-related genes in HER2-enriched and basal-like breast cancers are well known23,24. Our pathway-based findings indicate that immune activity in breast tumor cells and in the surrounding microenvironment negatively affects tumor progression and improves prognosis.

Apoptosis was associated with patient prognosis in HER2-enriched and luminal-A breast cancers through enriched GO processes such as ‘negative regulation of apoptotic process’ (QHER2 = 0.030, 122 genes; QluminalA = 0.015, 228 genes) and ‘programmed cell death’ (QHER2 = 0.015, 125 genes; QluminalA = 0.016, 231 genes) (Fig. 3c). Interestingly, anti-apoptotic pathways were only detected in the integrative analysis and not in genomic and transcriptomic gene signatures separately. Among the genes negatively regulating apoptosis, DUSP1 provided the strongest prognostic signal in HER2-enriched breast cancers. This was apparent in the molecular stratification of samples by mRNA of tumor cells (log-rank PTC = 0.019, HR = 1.5) and tumor-adjacent cells (PTAC = 8.3 × 10−4, HR = 1.8) as well as gene copy number amplifications (PCNA = 9.8 × 10−4, HR = 2.8) (Fig. 3d). DUSP1 encodes a phosphatase signaling protein of the MAPK pathway that is over-expressed in malignant breast cancer cells and inhibits apoptotic signaling25. HER2 over-expression is known to suppress apoptosis in breast cancer26. Anti-apoptotic signaling is a hallmark of cancer and expectedly associated with worse patient prognosis.

ActivePathways also identified prognostic pathway associations that were only apparent in single subtypes of breast cancer. For example, the prognostic genes for luminal-B subtype were enriched in processes related to ‘chromosome segregation’ (QluminalB = 0.017, 41 genes) that have been associated with worse outcome in breast cancer27. As another example, luminal-A breast cancers were associated with prognosis in ribosomal and RNA processing genes, such as ‘ribosome biogenesis’ (QluminalA = 6.9 × 10−10, 60 genes) and ‘rRNA metabolic process’ (QluminalA = 1.8 × 10−13, 64 genes). Although not described specifically in the luminal-A subtype, ribosomal mRNA abundance has been shown to be prognostic in breast cancer as a marker of cell proliferation28,29. In summary, ActivePathways can be used for integrating clinical data with multiomics information of molecular alterations. Such analyses can provide leads for functional studies and biomarker development.

### Interpreting co-expressed and DNA-bound targets of Hippo TFs

To demonstrate the use of ActivePathways for studying gene regulation, we analyzed transcriptomes of non-cancerous human tissues from the GTEx project5. We focused on the Hippo signaling pathway involved in organ size control, tissue homeostasis and cancer30,31 and studied regulatory networks downstream of the two transcription factors (TFs) YAP and TAZ (encoded by YAP1 and WWTR1). YAP and TAZ are evolutionarily conserved master regulators of Hippo signaling in mammals that respond to intracellular and extracellular signals of cell–cell interactions, cell polarity, mechanical cues, G protein-coupled receptor signaling, and cellular energy status30,31.

We performed an integrative pathway enrichment analysis of transcriptomics and epigenomics data of the two master regulators of the Hippo pathway. First, we predicted transcriptional target genes for YAP and TAZ (1898 and 1319 genes, respectively), using co-expression analysis and robust rank aggregration32 over 9642 transcriptomes of 40 tissue types in GTEx (Q < 0.05). Second, we studied the set of 2356 target genes of YAP that have DNA-binding sites at gene promoters derived from a YAP ChIP-seq study33 reanalyzed in the ReMap database34 (Q < 0.05). The three gene lists with corresponding significance values were used as input to ActivePathways for the integrative analysis.

Integrative analysis of transcriptional and DNA-binding target genes of YAP/TAZ resulted in 225 significantly enriched GO processes and Reactome pathways (Qpathway < 0.05) (Fig. 4a). The resulting pathways are expected in the context of Hippo signaling and included development and morphogenesis, cell motility, organization of actin cytoskeleton and cell–cell junctions, signal transduction pathways such as EGFR, Wnt, Robo, TGF-beta, rho GTPase, and others. ActivePathways highlighted 2066 genes in these pathways. We examined known members of the Hippo pathway and found 13 of 32 genes among pathway-associated genes (WWTR1, LATS1, WWC1, YWHAQ, YAP1, MAPK8, MAPK9, STK3, TNIK, TEAD3, MAP4K3, MINK1, TEAD1), more than expected from chance alone (four genes expected, P = 3.6 × 10−5, Fisher’s exact test). We also compiled an extended list of 106 Hippo-related genes30,31,35 and confirmed that these were enriched in pathway-associated genes (36/308 genes expected, P = 9.1 × 10−26, Fisher’s exact test) (Fig. 4b). A large fraction of pathways (55 or 24%) was only identified in the integrated analysis and not in any input dataset alone, underlining the advantage of detecting significantly enriched pathways across multiple complementary omics datasets. Similarly, Hippo-related genes were either supported by ChIP-seq data alone (54 genes), mRNA data alone (30 genes), or both by mRNA and ChIP-seq data (22 genes), concordant with the notion that RNA-seq and ChIP-seq show limited agreement regarding TF target genes and provide complementary insights into gene regulation. Thus, our integrative analysis of co-expression networks and TF–DNA interactions of Hippo master regulators expectedly converges on Hippo-related genes and pathways.

In addition to GO terms and pathways, ActivePathways can be used to interpret omics data with other classes of functional gene sets such as TF target genes. To further elucidate gene regulatory networks downstream of YAP/TAZ, we considered potential enrichments of DNA-binding target genes of 161 TFs profiled in ChIP-seq studies of the ENCODE project36. We found a regulatory network of 17 TFs and 1426 target genes enriched in the YAP/TAZ regulome (Q < 0.05, ranked hypergeometric test) (Fig. 4c). These included the DNA-binding targets of the master regulators of pluripotent stem cells37 NANOG (208/774 target genes, Q = 1.3 × 10−6) and POU5F1 (107/406 target genes, Q = 7.3 × 10−11). This finding is in agreement with the role of Hippo pathway activity in stem cell regulation38. The regulatory network was significantly enriched in 50 Hippo-related genes (25 expected, P = 1.2 × 10−7, Fisher’s exact test) and six core Hippo genes (two expected, P = 0.030; WWTR1, VGLL4, TNIK, MAPK8, MOB1A, LATS1), similarly to the pathway-based analysis above. However, the two analyses revealed distinct genes: 886 genes were commonly found by both analyses, 1180 genes were found only in the pathway-based analysis, and 540 genes were found only in the TF-based analysis (Fig. 4d). Thus, integrative enrichment analysis of TF target genes provides complementary information to GO terms and pathways. In summary, this analysis highlights genes and pathways related to Hippo signaling in human tissues and demonstrates the use of ActivePathways for studying gene regulatory networks across complementary omics datasets and technology platforms.

### Evaluating the robustness and sensitivity of ActivePathways

We carefully benchmarked ActivePathways using the dataset of cancer driver genes predicted by PCAWG13. First, we compared the performance of ActivePathways with six methods used in the PCAWG pathway and network analysis working group14 (Hierarchical HotNet39,40, SSA−ME41, NBDI42, induced subnetwork analysis40, CanIsoNet43, hypergeometric test). These diverse methods used molecular interaction networks, functional gene sets and/or transcriptomics data to analyze the PCAWG pan-cancer dataset of predicted cancer driver genes. A subsequent consensus analysis defined pathway-implicated driver gene lists with protein-coding and non-coding mutations, based on a majority vote of the pathway and network analysis methods14. ActivePathways recovered these consensus gene lists with the highest accuracy: 100% of coding driver genes (87/87) and 85% of non-coding candidate genes (79/93) were detected (Fig. 5a). Thus ActivePathways agreed the most with the ensemble of several other methods in prioritizing known and candidate cancer driver genes using pathway and network context.

We also compared the performance of ActivePathways with a standard approach of pathway enrichment analysis that considers a single statistically-filtered gene list using a ranked hypergeometric test1,10 (Fig. 5b). To this end, we analyzed individual gene lists of protein-coding and non-coding drivers of the PCAWG adenocarcinoma cohort using multiple gene selection thresholds (Q < 0.05, Q < 0.1, Q < 0.25, P < 0.1). ActivePathways showed increased sensitivity of pathway enrichment analysis compared to the standard approach, in particular for pathways involving non-coding mutations that were not promimently represented in any single gene lists. Even compared to the analysis employing the most lenient gene selection filter (P < 0.1), ActivePathways identified 72 additional pathways that were only apparent through the integration of coding and non-coding mutations and remained undetected in the analysis of individual datasets. Thus, our method provides additional information to common approaches that focus on single gene lists.

We evaluated the robustness of ActivePathways to parameter variations and missing data. We varied the parameter Pgene that determines the ranked gene lists used in the pathway enrichment analysis (default threshold Pgene < 0.1). The majority of PCAWG cohorts (40/47 or 85%) retrieved significantly enriched pathways even with a considerably more stringent threshold (Pgene < 0.001), however 67% fewer pathways were found compared to the default threshold in the median cohort (Supplementary Fig. 2). We then evaluated the robustness of ActivePathways to missing data by randomly removing subsets of driver scores from the initial dataset. Even when removing 50% of gene driver P-values with P < 0.001, the majority of cohorts (37/47 or 79%) had at least one significantly enriched pathway, however 66% fewer pathways were found on average (Supplementary Fig. 3).

We evaluated the expected false positive rates of ActivePathways. We tested 1,000 simulated datasets for each of 47 patient cohorts and expectedly found no significant pathways in 92% of these simulations (Supplementary Fig. 4). Simulated data were obtained by randomly reassigning P-values of driver predictions to different genomic elements, a conservative approach that disrupts pathway annotations of genes while retaining the presence of strong P-values observed in the real data. The median family-wise false discovery rate of ActivePathways computed across cohorts (7.2%) slightly exceeded the applied multiple testing correction (Q < 0.05). Higher rates were observed in cohorts including melanoma tumors, potentially due to abundant promoter mutations caused by impaired nucleotide excision repair in protein-bound genomic regions44. We evaluated quantile-quantile (QQ) plots of pathway-based ranked hypergeometric P-values from ActivePathways and found that these often deviated from the expected uniform distribution and appeared statistically inflated (Supplementary Fig. 5). However, P-values derived from simulated gene scores showed no inflation in our simulations. Anticipating that the most significant cancer driver genes associated with protein-coding mutations, we performed partial simulations. We constructed datasets with simulated gene P-values for protein-coding mutations and true P-values for non-coding mutations. As expected, partially simulated datasets showed a lesser extent of P-value inflation, suggesting that highly significant known cancer genes are responsible for inflation due to their involvement in many pathways. Statistical testing of highly redundant pathways and processes violates the independence assumption of statistical tests and multiple testing procedures, a known caveat of pathway enrichment analysis1, which likely explains the observed distribution of significance values of our method. Collectively, these benchmarks show that ActivePathways is a sensitive and robust method for detecting significantly enriched pathways and processes through integrative analysis of multi-omics data.

## Discussion

Integrative pathway enrichment analysis helps distill thousands of high-throughput measurements to a smaller number of pathways and biological themes that are most characteristic of the experimental data at hand, ideally leading to mechanistic insights and candidate genes for follow-up studies. In particular, a joint analysis of complementary datasets often leads to insights that are unavailable in any particular dataset. ActivePathways provides a generally available framework for systematically prioritizing genes and pathways across multiple omics datasets that utilizes fusion of gene significance. This allows us to identify pathways and processes that stand out when the data are combined, yet are not apparent in any single analyzed dataset. The use of ActivePathways is demonstrated in the three case studies above. In our example of cancer driver discovery, pathway analysis provides complementary insights to gene-focused driver discovery as it also focuses on sub-significant genes that are clustered into relevant biological processes and carry coding and non-coding mutations. In the integrative analysis of molecular alterations in breast cancer subtypes, we find a spectrum of genes and pathways whose molecular signatures in the tumors or the microenvironment have potential prognostic significance. A subset of these findings, such as anti-apoptotic signaling, is only apparent through data integration. In the final example, we use ActivePathways to associate the co-expression and DNA-binding networks of Hippo master regulators to downstream pathways and processes with multi-omics evidence.

Our general pathway analysis strategy is applicable to diverse kinds of datasets where well-calibrated P-values are available for the entire set of genes or proteins. A multi-omics study may quantify genes using a series of genomic, transcriptomic and proteomic experiments and compute corresponding P-values. Data from epigenomic experiments and genome-wide association studies can be also analyzed after signals have been appropriately mapped to genes, for example by identifying ChIP-seq peaks in gene promoters similarly to our GTEx analysis. Clinical and phenotypic information of patients can be also included through association and survival statistics, as shown in our analysis of prognostic signatures of breast cancer subtypes. Our method is expected to work with raw, unadjusted P-values and also with Q-values adjusted for multiple testing, however it is primarily intended for unadjusted P-values for increased sensitivity. ActivePathways conducts multiple testing correction at the pathway level and reports significantly enriched pathways at a family-wise error rate cutoff, regardless of the gene-specific multiple testing correction applied upstream. Quantification of genes and proteins through P-values is more robust than quantification through their abundance measures such as counts or fold-changes. P-values provided to ActivePathways need to be computed using dedicated methods for individual omics platforms such that inherent biases in the data are accounted for prior to pathway analysis. In our example of cancer driver discovery, appropriately computed P-values of driver gene predictions from the PCAWG project13 account for confounding factors of somatic mutations, such as gene sequence length and nucleotide content, mutation signatures active in different types of tumors45 and biological correlates of mutation frequency such as transcription and replication timing46. On the one hand, considering all such variations directly in the pathway enrichment analysis would require substantially more complex models. On the other hand, directly analyzing pathways using simpler metrics (such as mutation counts or frequencies) would propagate any upstream biases to the pathway enrichment analysis and cause challenges with false positives and data interpretation. Thius, given appropriately derived P-values for genes, proteins and other molecules, ActivePathways can be applied to a wide range of analyses.

Our method comes with important caveats. First, we only evaluate genes and proteins annotated in pathway databases. Such databases have variable coverage, rely on frequent data updates47 and may miss sparsely annotated candidate genes. The most general type of pathway enrichment analysis considers biological processes and molecular pathways however many kinds of gene sets available in resources such as MSigDB48 can be used to expand the scope of ActivePathways. Second, pathway information is highly redundant and analyses of rich molecular datasets often result in many significant results reflecting the same underlying pathway. We address this redundancy by visualizing and summarizing pathway results as enrichment maps1,20 that summarize multiple similar pathways and processes into general biological themes. Statistical inflation of results accompanied by biological redundancy is addressed by a stringent multiple testing correction. Third, the analysis treats pathways as gene sets and does not consider interactions of genes in pathways. This simplified strategy allows us to consider a wider repertoire of pathways and processes as reliable mechanistic interactions are often context-specific and limited to a small subset of well-studied signaling pathways. Several advanced methods such as HotNet39, PARADIGM49, and GeneMania50 model pathways through gene and protein interactions.

Translation of discoveries into improved human health through actionable mechanistic insights, biomarkers, and molecular therapies is a long-standing goal of biomedical research. For example, next-generation cancer genomics projects such as ICGC-ARGO aim to collect multi-omics datasets with detailed clinical profiles that will present new challenges for pathway and network analysis techniques. In summary, ActivePathways is integrative pathway analysis method that improves systems-level understanding of cellular organization in health and disease.

## Methods

### Integrated and evidence-based gene lists

The main input of ActivePathways is a matrix of P-values where rows include all genes of a genome and columns correspond to evidence from omics datasets. To interpret multiple omics datasets, a combined P-value is computed for each gene using a data fusion approach, resulting in an integrated gene list. The integrated gene list is computed by merging all P-values of a given gene into one combined P-value using the Brown’s extension18 of the Fisher’s combined probability test that accounts for overall covariation of P-values from different sources of evidence. The integrated gene list of Brown P-values is then ranked in order of decreasing significance and filtered using a lenient threshold (unadjusted P < 0.1 by default). Evidence-based gene lists representing different omics datasets are based on ranked P-values from individual columns of the input matrix and filtered using the same significance threshold.

### Statistical enrichment of pathways

Statistical enrichment of pathways in ranked lists of candidate genes is carried out with the ranked hypergeometric test. The test considers one pathway gene set at a time and analyzes increasing subsets of input genes from the top of the ranked gene list. The same procedure is used for integrated and evidence-based gene lists. At each iteration, the test computes the hypergeometric enrichment statistic and P-value for the set of genes shared by the pathway and top sub-list of the input gene list. For optimal processing speed, only gene lists ending with a pathway-related gene are considered. The ranked hypergeometric statistic selected the input gene sub-list that achieves the strongest enrichment and the smallest P-value as the final result for the given pathway, as:

$$\left( {P_{{\mathrm{pathway}}},G} \right) = \left\{ {{\mathrm{min}},{\mathrm{argmin}}_n} \right\}\mathop {\sum }\limits_{x = k}^{{\mathrm{min}}\left( {n,K} \right)} \frac{{\left( {\begin{array}{*{20}{c}} K \\ k \end{array}} \right)\left( {\begin{array}{*{20}{c}} {N - K} \\ {n - k} \end{array}} \right)}}{{\left( {\begin{array}{*{20}{c}} N \\ n \end{array}} \right)}},$$

where Ppathway stands for the hypergeometric P-value of the pathway enrichment at the optimal sub-list of the significance-ranked candidate genes, G represents the length of the optimal sub-list, i.e., the number of top genes from the input gene list, N is the number of protein-coding genes with annotations in the pathway database, i.e., in Gene Ontology and Reactome, K is the total number of genes in a given pathway, n is the number of genes in a given gene sub-list considered, and k is the number of pathway genes in the considered sub-list. For a conservative estimate of pathway enrichment, we consider as background N the universe of genes contained in input gene sets (terms from pathway databases and ontologies) rather than the complete repertoire of protein-coding genes. To obtain candidate genes involved in the pathway of interest, we intersect pathway genes with the optimal sub-list of candidate genes. The ranked hypergeometric P-value is computed for all pathways and resulting P-values are corrected for multiple testing using the Holm-Bonferroni method of family-wise error rate (FWER)19. Significant pathways are reported by default (Q < 0.05).

### Evaluating omics evidence of enriched pathways

Each evidence-based gene list derived from a single omics dataset is also analyzed for enriched pathways with the ranked hypergeometric test. Pathways found in the integrated gene list are then labeled for supporting evidence in the case they are also found as significant in any evidence-based gene list. A pathway is considered to be found only through data integration and labeled as combined-only if it is identified as enriched in the integrated gene list but not identified as enriched in any of the evidence-based gene lists at equivalent significance cut-offs (default Q < 0.05). Each detected pathway is additionally annotated with pathway genes apparent in the optimal sub-list of candidate genes separately for the integrated gene list and each evidence-based gene list.

### Pathways and processes

We used gene sets corresponding to biological processes of Gene Ontology16 and molecular pathways of the Reactome database17 downloaded from the g:Profiler web server10 on Jan 26th 2018. Large general gene sets with more than a thousand genes and small specific gene sets with less than five genes were excluded to avoid statistical inflation of large gene lists and interpretation challenges with very small lists.

### Enrichment map visualization

ActivePathways creates input files for the EnrichmentMap app20 of Cytoscape51 for network visualization of similar pathways and their coloring according to supporting evidence. Enrichment maps for adenocarcinoma driver mutations, prognostic quantification of molecular alterations of breast cancer, and transcriptional networks downstream of Hippo signaling were visualized with stringent pathway similarity scores (Jaccard and overlap combined coefficient 0.6) and manually curated for the most representative groups of similar pathways and processes. Singleton pathways that were redundant with larger groups of pathways were merged with the latter or discarded. Coloring of pathways in the adenocarcinoma enrichment map was rearranged by merging colors of pathways supported by non-coding mutation scores of promoters, enhancers and/or UTRs into one group.

### Coding and non-coding mutations of the PCAWG dataset

We used ActivePathways to analyze driver predictions of coding and non-coding mutations across white-listed 2583 whole cancer genomes of the ICGC-TCGA PCAWG project. P-values of driver predictions were computed separately for protein-coding sequences, promoters, enhancers and untranslated regions (UTR3, UTR5 across multiple subsets of samples representing histological tumor types and pan-cancer cohorts as reported in the PCAWG driver discovery study13. We used gene-enhancer mapping predictions provided by PCAWG, excluded enhancers with more than five target genes, and selected the most significant enhancer for each gene, if any. Unadjusted P-values for coding sequences, promoters, enhancers and UTRs were compiled as input matrices and analyzed as described above. Missing P-values were conservatively interpreted as ones. Results from ActivePathways were validated with two lists of cancer genes. Predicted drivers from the PCAWG consensus analysis13 were selected as statistically significant findings (Q < 0.05) following a stringent multiple testing correction spanning all types of elements (exons, UTRs, promoter, enhancers). The curated list of known cancer genes was retrieved from the COSMIC Cancer Gene Census (CGC) database11. One-tailed Fisher’s exact tests were used to estimate enrichment of these genes in our results, using all human protein-coding genes as the statistical background set.

### Prognostic CNA and mRNA signals in breast cancer

ActivePathways was used to evaluate prognostic pathways in breast cancer subtypes. Multiple types of omics data were used for an integrative analysis: mRNA gene expression data and gene copy number alteration (CNA) data of the were derived from the METABRIC cohort of 1991 patients with a single primary fresh frozen breast cancer specimen each15. Curtis et al.15 classified the patients into the intrinsic breast cancer subtypes using the PAM50 mRNA-based classifier52 resulting in 330 basal-like breast cancers, 238 HER2-enriched breast cancers, 721 luminal-A breast cancers, 491 luminal-B breast cancers. Using these data, we computationally deconvolved tumor cell (TC) mRNA and tumor adjacent cell (TAC) mRNA abundance levels from the bulk profiled specimens. TC mRNA was deconvolved using the ISOpure21 method release 2010b in MATLAB. TAC mRNA abundance profiles were computed using the ISOpure.calculate.tac function from the R package ISOpureR22 v1.1.2. The deconvolution analyses were performed independently for each breast cancer subtype. The mRNA univariate survival analysis was conducted as follows. For each gene, patients were dichotomized based on mRNA abundance. Dichotomization was either based on the median mRNA abundance for that gene or a fixed value of 6.5. Based on the mRNA abundance distribution of genes on the Y chromosome in female samples, the value 6.5 was estimated as the threshold for noise for non-expressed genes. Median dichotomization was used if the median was above 6.5 or if there were no events in one of the groups when dichotomizing based on 6.5. The high and low groups based on mRNA abundance were compared by univariate log-rank tests for overall patient survival. TC and TAC mRNA abundance values were evaluated independently. Survival modeling was performed in the R statistical environment (v3.4.3) using the survival package (v2.42–3). The CNA univariate survival analysis was conducted as follows. For each gene, we assessed whether more gains or losses were apparent. The copy number status with a higher count was subsequently used to separate patients into two groups: those with the chosen copy number status and the remaining patients. The two groups were then used for overall survival modeling with log-rank tests.

### Hippo pathway target genes in mRNA and ChIP-seq data

This analysis included two types of omics data, mRNA abundance measurements from RNA-seq experiments and transcription factor DNA-binding measurements from chromatin immunoprecipitation sequencing (ChIP-seq) experiments. The RNA-seq dataset of human tissues was downloaded from GTEx v7 data portal (https://www.gtexportal.org/home/). The dataset included transcript abundance values of 21,518 protein-coding genes in 11,688 samples across 53 tissues. Tissues with less than 25 available samples and low gene expression (mean TPM < 1.0; transcripts per million) were excluded from further analysis, resulting in 40 tissues and 9672 samples with mRNA abundance profiles of 19,025 genes. Transcriptional target gene lists for the master transcription factors YAP and TAZ (encoded by YAP1, WWTR1) were predicted separately in the following two steps. First, we computed pairwise Pearson correlations between a given TF and all other genes within a tissue of interest and ranked these by the significance P-value of positive correlations. Second, the resulting ranked gene lists were then aggregated into one master target gene list across the GTEx tissues using the robust rank aggregation (RRA) method32 with default parameters (Qgene < 0.05). FDR-adjusted values of genes from RRA for YAP and TAZ were used as the first and second evidence for input of ActivePathways, respectively. For DNA-binding targets of YAP, we retrieved ChIP-seq binding sites in three cell lines (CCLP1, MSTO, and HUCCT1) from an earlier study33 that were reprocessed in the ReMap database34. Binding sites were filtered by statistical significance (Q < 0.05) and mapped to gene promoters of the human genome (hg19) using gene promoters defined in the PCAWG driver analysis13. If a promoter had multiple binding sites, the site with the strongest FDR-value was selected as a representative site for that gene. FDR-adjusted values of genes with YAP DNA-binding sites were used as the third evidence for input of ActivePathways. Significantly enriched pathways among the putative target genes of YAP and TAZ were subsequently detected using ActivePathways. We compiled a list of 308 Hippo-related genes from the KEGG pathway database35 and two recent review papers30,31. To validate the analysis, we tested the overall sets of genes identified by ActivePathways for enrichment of Hippo-related genes using Fisher’s exact tests. The Hippo analysis was conducted separately for two collections of functional gene sets. First, we tested GO biological processes and Reactome pathways similarly to analyses described above. Second, we used gene sets corresponding to transcription factor binding sites (TFBS) derived from the ENCODE project36.

### Method benchmarking

We benchmarked ActivePathways using multiple approaches, including simulated datasets of P-values, method parameter variations, and partial replacement of P-values scores with insignificant P-values. Benchmarking was performed on the PCAWG predictions of cancer driver genes with coding and non-coding mutations. To evaluate the false discovery rate of ActivePathways, we created simulated datasets by reassigning all observed P-values to random genes and their genomic elements. Simulations were conducted separately for different tumor cohorts. One thousand simulated datasets were analyzed with ActivePathways and those with at least one significantly detected pathway counted towards false discovery rates. A separate set of simulations maintained the positions of non-coding P-values among genes and randomly reassigned P-values corresponding to protein-coding mutations, expectedly leading to a reduction in detected pathways as the PCAWG datasets primarily included strong P-values for genes with frequent protein-coding mutations. Quantile-quantile analysis and QQ-plots were used to compare P-value distributions of pathways discovered from true P-values, partially shuffled P-values (true non-coding and shuffled protein-coding P-values), and fully shuffled P-values. To evaluate robustness of ActivePathways, we randomly replaced a fraction of significant driver P-values in the true dataset (P < 0.001) with insignificant P-values (P = 1). We tested different fractions of missing values (10, 25, and 50%) across a thousand datasets of randomly selected missing data points. We concluded that most PCAWG cohorts included significantly enriched pathways even with large fractions of missing data. To further evaluate robustness, we tested different values of the Brown P-value threshold used to select the integrated gene list for pathway enrichment analysis. The default parameter value (Pgene < 0.1) was compared to alternative values (0.001, 0.01, 0.05, and 0.2). We concluded that ActivePathways found enriched pathways in most tumor cohorts even at more stringent gene selection levels.

### Ethical review

Sequencing of human subjects’ tissue was performed by ICGC and TCGA consortium members under a series of locally approved Institutional Review Board (IRB) protocols as described in Hudson et al. Informed consent was obtained from all human participants. Ethical review of the current data analysis project was granted by the University of Toronto Research Ethics Board (REB) under protocol #30278, “Pan-cancer Analysis of Whole Genomes: PCAWG”.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

## Data availability

Somatic and germline variant calls, mutational signatures, subclonal reconstructions, transcript abundance, splice calls and other core data generated by the ICGC/TCGA Pan-cancer Analysis of Whole Genomes Consortium is described in the marker paper4 and available for download at https://dcc.icgc.org/releases/PCAWG. Additional information on accessing the data, including raw read files, can be found at https://docs.icgc.org/pcawg/data/. In accordance with the data access policies of the ICGC and TCGA projects, most molecular, clinical and specimen data are in an open tier which does not require access approval. To access potentially identifiable information, such as germline alleles and underlying sequencing data, researchers will need to apply to the TCGA Data Access Committee (DAC) via dbGaP (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page = login) for access to the TCGA portion of the dataset, and to the ICGC Data Access Compliance Office (DACO; http://icgc.org/daco) for the ICGC portion. In addition, to access somatic single nucleotide variants derived from TCGA donors, researchers will also need to obtain dbGaP authorization. Derived PCAWG datasets described specifically in this manuscript can be found at the locations listed below. Additional relevant datasets are listed in the PCAWG study of pathways and networks14.

LabelSynapse IDICGC DCC URLICGC DCC FilenameAccess (Open/Controlled)
PCAWG driver P-valuessyn8494939https://dcc.icgc.org/releases/PCAWG/networks/final_integration_results_2017_03_16.tar.gzOpen
PCAWG pathway and network method resultssyn21413360https://dcc.icgc.org/releases/PCAWG/networks/pathway_and_network_method_results.tar.gzOpen
PCAWG pathway and network consensus resultssyn11654843https://dcc.icgc.org/releases/PCAWG/networks/method_results_2017_10_10.tar.gzOpen
Enhancer-gene mappingssyn7201027https://dcc.icgc.org/releases/PCAWG/networks/map.enhancer.gene.txt.gzOpen
Coding and non-coding elementssyn21416282https://dcc.icgc.org/releases/PCAWG/networks/gene-coding-and-non-coding-elements.tar.gzOpen

## Code availability

ActivePathways is freely available as an R package and source code on the GitHub repository https://github.com/reimandlab/ActivePathways and in the Comprehensive R Archive Network (CRAN). Additional scripts are available upon request. The core computational pipelines used by the PCAWG Consortium for alignment, quality control and variant calling are available to the public at https://dockstore.org/search?search=pcawg under the GNU General Public License v3.0, which allows for reuse and distribution.

## References

1. 1.

Reimand, J. et al. Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, cytoscape and enrichmentmap. Nat. Protoc. 14, 482–517 (2019).

2. 2.

Weinstein, J. N. et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).

3. 3.

Hudson, T. J. et al. International network of cancer genome projects. Nature 464, 993–998 (2010).

4. 4.

The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature https://doi.org/10.1038/s41586-020-1969-6 (2020).

5. 5.

GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).

6. 6.

Sinha, A. et al. The proteogenomic landscape of curable prostate cancer. Cancer Cell 35, 414–427 e416 (2019).

7. 7.

Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).

8. 8.

Mi, H., Muruganujan, A. & Thomas, P. D. PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res. 41, D377–D386 (2013).

9. 9.

Kaimal, V., Bardes, E. E., Tabar, S. C., Jegga, A. G. & Aronow, B. J. ToppCluster: a multiple gene list feature analyzer for comparative enrichment clustering and network-based dissection of biological systems. Nucleic Acids Res. 38, W96–W102 (2010).

10. 10.

Reimand, J., Kull, M., Peterson, H., Hansen, J. & Vilo, J. g:Profiler—a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic acids Res. 35, W193–W200 (2007).

11. 11.

Futreal, P. A. et al. A census of human cancer genes. Nat. Rev. Cancer 4, 177–183 (2004).

12. 12.

Huang, F. W. et al. Highly recurrent TERT promoter mutations in human melanoma. Science 339, 957–959 (2013).

13. 13.

Rheinbay, E. et al. Analyses of non-coding somatic drivers in 2,658 cancer whole genomes. Nature https://doi.org/10.1038/s41586-020-1965-x (2020).

14. 14.

Reyna, M. A. et al. Pathway and network analysis of more than 2500 whole cancer genomes. Nat. Commun. https://doi.org/10.1038/s41467-020-14367-0 (2020).

15. 15.

Curtis, C. et al. The genomic and transcriptomic architecture of 2000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).

16. 16.

Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).

17. 17.

Fabregat, A. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 46, D649–D655 (2018).

18. 18.

Brown, M. B. A method for combining non-independent, one-sided tests of significance. Biometrics 31, 987–992 (1975).

19. 19.

Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65–70 (1979).

20. 20.

Merico, D., Isserlin, R., Stueker, O., Emili, A. & Bader, G. D. Enrichment map: a network-based method for gene-set enrichment visualization and interpretation. PLoS ONE 5, e13984 (2010).

21. 21.

Quon, G. et al. Computational purification of individual tumor gene expression profiles leads to significant improvements in prognostic prediction. Genome Med. 5, 29 (2013).

22. 22.

Fox, N. S., Haider, S., Harris, A. L. & Boutros, B. C. Landscape of transcriptomic interactions between breast cancer and its microenvironment. Nat. Commun. 10, 3116 (2019).

23. 23.

Adams, S. et al. Prognostic value of tumor-infiltrating lymphocytes in triple-negative breast cancers from two phase III randomized adjuvant breast cancer trials: ECOG 2197 and ECOG 1199. J. Clin. Oncol. 32, 2959–2966 (2014).

24. 24.

Sabatier, R. et al. A gene expression signature identifies two prognostic subgroups of basal breast cancer. Breast Cancer Res. Treat. 126, 407–420 (2011).

25. 25.

Wang, H. Y., Cheng, Z. & Malbon, C. C. Overexpression of mitogen-activated protein kinase phosphatases MKP1, MKP2 in human breast cancer. Cancer Lett. 191, 229–237 (2003).

26. 26.

Carpenter, R. L. & Lo, H. W. Regulation of apoptosis by HER2 in breast cancer. J. Carcinog. Mutagen. https://doi.org/10.4172/2157-2518.S7-003 (2013).

27. 27.

Denu, R. A. et al. Centrosome amplification induces high grade features and is prognostic of worse outcomes in breast cancer. BMC Cancer 16, 47 (2016).

28. 28.

Belin, S. et al. Dysregulation of ribosome biogenesis and translational capacity is associated with tumor progression of human breast cancer cells. PLoS ONE 4, e7147 (2009).

29. 29.

Guimaraes, J. C. & Zavolan, M. Patterns of ribosomal protein expression specify normal and malignant human cells. Genome Biol. 17, 236 (2016).

30. 30.

Varelas, X. The Hippo pathway effectors TAZ and YAP in development, homeostasis and disease. Development 141, 1614–1626 (2014).

31. 31.

Yu, F. X., Zhao, B. & Guan, K. L. Hippo pathway in organ size control, tissue homeostasis, and cancer. Cell 163, 811–828 (2015).

32. 32.

Kolde, R., Laur, S., Adler, P. & Vilo, J. Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics 28, 573–580 (2012).

33. 33.

Galli, G. G. et al. YAP drives growth by controlling transcriptional pause release from dynamic enhancers. Mol. Cell 60, 328–337 (2015).

34. 34.

Cheneby, J., Gheorghe, M., Artufel, M., Mathelier, A. & Ballester, B. ReMap 2018: an updated atlas of regulatory regions from an integrative analysis of DNA-binding ChIP-seq experiments. Nucleic Acids Res. 46, D267–D275 (2018).

35. 35.

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2017).

36. 36.

ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

37. 37.

Loh, Y. H. et al. The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells. Nat. Genet. 38, 431–440 (2006).

38. 38.

Mo, J. S., Park, H. W. & Guan, K. L. The Hippo signaling pathway in stem cell biology and cancer. EMBO Rep. 15, 642–656 (2014).

39. 39.

Leiserson, M. D. et al. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat. Genet. 47, 106–114 (2015).

40. 40.

Reyna, M. A., Leiserson, M. D. M. & Raphael, B. J. Identifying hierarchies of altered subnetworks. Bioinformatics 34, i972–i980 (2018).

41. 41.

Pulido-Tamayo, S., Weytjens, B., De Maeyer, D. & Marchal, K. SSA-ME Detection of cancer driver genes using mutual exclusivity by small subnetwork analysis. Sci. Rep. 6, 36257 (2016).

42. 42.

Verbeke, L. P. et al. Pathway relevance ranking for tumor samples through network-based data integration. PLoS ONE 10, e0133503 (2015).

43. 43.

Kahraman, A. & von Mering, C. Pathogenic impact of isoform switches in 1209 cancer samples covering 27 cancer types using an isoform-specific interaction network. bioRxiv https://doi.org/10.1101/742379 (2019).

44. 44.

Sabarinathan, R., Mularoni, L., Deu-Pons, J., Gonzalez-Perez, A. & Lopez-Bigas, N. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 532, 264–267 (2016).

45. 45.

Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).

46. 46.

Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218 (2013).

47. 47.

Wadi, L., Meyer, M., Weiser, J., Stein, L. D. & Reimand, J. Impact of outdated gene annotations on pathway enrichment analysis. Nat. Methods 13, 705–706 (2016).

48. 48.

Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).

49. 49.

Vaske, C. J. et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 26, i237–i245 (2010).

50. 50.

Warde-Farley, D. et al. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res. 38, W214–W220 (2010).

51. 51.

Cline, M. S. et al. Integration of biological networks and gene expression data using cytoscape. Nat. Protoc. 2, 2366–2382 (2007).

52. 52.

Parker, J. S. et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol. 27, 1160–1167 (2009).

## Acknowledgements

We would like to thank Drs. Benjamin Raphael, Matthew Reyna, Lincoln Stein and Josh Stuart for valuable discussions on the manuscript. This work was funded by Ontario Institute for Cancer Research (OICR) Investigator Awards to J.R. and P.C.B. provided by the Government of Ontario; Operating Grant to J.R. from Cancer Research Society (CRS) (#21089); Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant to J.R. (#RGPIN-2016-06485), and the Canada First Research Excellence Fund, University of Toronto Medicine by Design to J.R. H.Z. was supported by a CIHR Canadian Graduate Scholarship. J.B. was supported by a BioTalent Canada Student Internship. P.C.B. was supported by TFRI and CIHR New Investigator Awards. We acknowledge the contributions of the many clinical networks across ICGC and TCGA who provided samples and data to the PCAWG Consortium, and the contributions of the Technical Working Group and the Germline Working Group of the PCAWG Consortium for collation, realignment and harmonized variant calling of the cancer genomes used in this study. We thank the patients and their families for their participation in the individual ICGC and TCGA projects.

## Author information

Authors

### Contributions

J.B., H.Z., J.R. developed the method and the software package. M.P., H.Z., N.S., N.S.F., D.A.R., M.W.M., J.R. analysed the data. P.C.B. contributed to study design and supervision. J.R. designed and supervised the study. Members of the PCAWG Drivers and Functional Interpretation Working Group (co-led by Mark Gerstein, Gad Getz, Michael S. Lawrence, Jakob Skou Pedersen, Benjamin J. Raphael, Joshua M. Stuart, and David A. Wheeler) contributed datasets and improved study design. All authors discussed the results and contributed to the final manuscript.

### Corresponding author

Correspondence to Jüri Reimand.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Peer review information Nature Communications thanks Zhaohui Qin and other anonymous reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Reprints and Permissions

Paczkowska, M., Barenboim, J., Sintupisut, N. et al. Integrative pathway enrichment analysis of multivariate omics data. Nat Commun 11, 735 (2020). https://doi.org/10.1038/s41467-019-13983-9

• Accepted:

• Published:

• DOI: https://doi.org/10.1038/s41467-019-13983-9

• ### The current issues and future perspective of artificial intelligence for developing new treatment strategy in non-small cell lung cancer: harmonization of molecular cancer biology and artificial intelligence

• Ichidai Tanaka
• Taiki Furukawa
• Masahiro Morise

Cancer Cell International (2021)

• ### Subterranean Microbiome Affiliations of Plantain (Musa spp.) Under Diverse Agroecologies of Western and Central Africa

• Manoj Kaushal
• Yao Kolombia
• Cargele Masso

Microbial Ecology (2021)

• ### Decoding human cancer with whole genome sequencing: a review of PCAWG Project studies published in February 2020

• Simona Giunta

Cancer and Metastasis Reviews (2021)

• ### Mining transcriptome data to identify genes and pathways related to lemon taste using supervised and unsupervised data learning methods

• Zahra Zinati
• Sima Sazegari

Horticulture, Environment, and Biotechnology (2021)

• ### Prospecting whole cancer genomes

• Jason F. Kreisberg
• Trey Ideker
• Gordon Mills

Nature Cancer (2020)