Integrative pathway enrichment analysis of multivariate omics data

Multi-omics datasets represent distinct aspects of the central dogma of molecular biology. Such high-dimensional molecular profiles pose challenges to data interpretation and hypothesis generation. ActivePathways is an integrative method that discovers significantly enriched pathways across multiple datasets using statistical data fusion, rationalizes contributing evidence and highlights associated genes. As part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2658 cancers across 38 tumor types, we integrated genes with coding and non-coding mutations and revealed frequently mutated pathways and additional cancer genes with infrequent mutations. We also analyzed prognostic molecular pathways by integrating genomic and transcriptomic features of 1780 breast cancers and highlighted associations with immune response and anti-apoptotic signaling. Integration of ChIP-seq and RNA-seq data for master regulators of the Hippo pathway across normal human tissues identified processes of tissue regeneration and stem cell regulation. ActivePathways is a versatile method that improves systems-level understanding of cellular organization in health and disease through integration of multiple molecular datasets and pathway annotations.

P athway enrichment analysis is an essential step for interpreting high-throughput (omics) data that uses current knowledge of genes and biological processes. A common application determines statistical enrichment of molecular pathways, biological processes and other functional annotations in long lists of candidate genes 1 . Genomic, transcriptomic, proteomic and epigenomic experiments emphasize complementary aspects of underlying biology and are best analyzed integratively, as is now routinely done in large-scale projects such as The Cancer Genome Atlas (TCGA) 2 , Clinical Proteome Tumor Analysis Consortium (CPTAC), International Cancer Genome Consortium (ICGC) 3,4 , Genotype-Tissue Expression (GTEx) 5 , and others 6 . Thus, simultaneous analysis of multiple candidate gene lists for characteristic pathways is increasingly needed.
Numerous approaches are available for interpreting single gene lists. For example, the GSEA algorithm can detect upregulated and downregulated pathways in gene expression datasets 7 . Webbased methods such as Panther 8 , ToppCluster 9 , and g:Profiler 10 detect significantly enriched pathways amongst ranked or unranked gene lists and are generally applicable to genes and proteins from various analyses. Some approaches allow analysis of multiple input gene lists however these primarily rely on visualization rather than data integration to evaluate the contribution of distinct gene lists towards each detected pathway 9,10 . Finally, no methods are available for unified pathway analysis of coding and non-coding mutations from whole-genome sequencing (WGS) data, or integrating these with other types of DNA aberrations such as copy number changes and balanced genomic rearrangements.
Cancer genomes are characterized by multiple classes of mutations, including single nucleotide variants (SNVs), small insertions-deletions (indels), copy number alterations, and translocations. These affect a small number of frequently mutated pan-cancer driver genes such as TP53, less-frequent and tissuespecific genes such as SPOP in prostate cancer, and numerous infrequently mutated genes. The majority of currently known driver mutations of SNVs and indels affect protein-coding sequence 11 and only few high-confidence non-coding drivers have been found, such as the mutation hotspots in the TERT promoter 12 . Discovery of coding and non-coding driver mutations is a major goal of large cancer whole genome sequencing efforts. The PCAWG Consortium aggregated whole genome sequencing data from 2658 cancers across 38 tumor types generated by the ICGC and TCGA projects. These sequencing data were re-analysed with standardized, high-accuracy pipelines to align to the human genome (reference build hs37d5) and identify germline variants and somatically acquired mutations, as described in the PCAWG marker paper 4 . A consensus analysis of variant calls in PCAWG tumors generated a high-confidence catalog of driver mutations in protein-coding driver genes (CDS) and non-coding regions of 5′ and 3′ untranslated elements (UTRs), promoters and enhancers 13 . A consensus pathway and network analysis of PCAWG driver mutations used knowledge of molecular pathways and gene interaction networks as priors to further discover infrequent candidate driver variants including those in the non-coding genome 14 .
Here we report the development of the ActivePathways method that uses data fusion techniques to address the challenge of integrative pathway analysis of multi-omics data. It detects significantly enriched pathways across multiple datasets, including those pathways that are not apparent in any individual dataset. We present several analyses to demonstrate this method. First, we integrate cancer driver genes with coding and noncoding mutations predicted using the PCAWG dataset 13 and reveal numerous processes and additional genes with frequent coding and non-coding mutations. Second, we integrate patient clinical information with transcriptomic and copy number alterations in breast cancers of the METABRIC project 15 to discover prognostic pathways and processes in breast cancer subtypes. Third, we integrate transcriptomic data of normal tissues of the GTEx project 5 with ChIP-seq data to infer gene regulatory networks and biological processes downstream of the Hippo pathway of tissue growth control and regeneration. Thus Acti-vePathways is a versatile method for combining diverse multiomics datasets.

Results
Multi-omics pathway enrichment analysis with ActivePathways. ActivePathways is a simple three-step method that extends our earlier work 10 (Fig. 1). It requires two input datasets. The first input is a table of P-values with genes listed in rows and evidence from distinct datasets listed in columns. The columns can include P-values of differential gene expression, gene essentiality, mutation or copy number alteration burden and many others that are derived using platform-specific quantification methods. The second input is a collection of gene sets that represents collective knowledge of gene function and interactions we refer to as pathways. The most common analysis utilizes biological processes from gene ontology 16 (GO) and molecular pathways from the Reactome database 17 . Depending on the hypothesis, these data may also include many other types of gene sets such as targets of transcription factors or microRNAs.
In the first step of ActivePathways, we derive an integrated gene list that for each input gene aggregates significance from multiple omics datasets. The integrated gene list is compiled by a fusion of gene significance from different omics datasets (i.e., evidence) using the Brown's extension 18 of the Fisher's combined probability test. The Brown's method considers dependencies between datasets and thus provides more conservative estimates of significance for genes that are supported by multiple similar omics datasets. The integrated input gene list is then ranked by decreasing significance and filtered using a lenient cut-off, designed to capture additional candidate genes with subsignificant signals while discarding the bulk of insignificant genes (unadjusted Brown P gene < 0.1). In the second step, a pathway enrichment analysis is conducted on the integrated gene using a ranked hypergeometric test 10 and a collection of gene sets (i.e., biological processes, molecular pathways, and other gene annotations). The ranked hypergeometric test is designed to capture smaller pathways tightly associated with few top-ranking genes and also broader processes associated with larger subsets of input genes. The family-wise multiple testing correction method by Holm 19 is then applied across tested pathways to select the pathways significantly enriched in the integrated gene list (Q pathway < 0.05). In the third step, we perform a similar analysis on the gene lists of individual omics datasets separately to determine the omics evidence supporting the integrative pathway analysis results determined in step 2. Importantly, the third step also highlights pathways that are only found through data integration and are not apparent in any single omics dataset separately, providing the added value of integrated analysis. Finally, the method provides input files for Enrichment Map 20 for visualizing resulting pathways with the corresponding omics evidence.
Integrating coding and non-coding drivers in 2658 cancer genomes. We performed an integrative pathway analysis of driver genes predicted in the PCAWG project based on somatic SNVs and indels. This analysis comprised 29 cancer patients cohorts of histological tumor types and 18 meta-cohorts combining multiple types of tumors, with 47 cohorts in total (Supplementary Table 1). ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-019-13983-9 ActivePathways identified at least one significantly enriched process or pathway in 89% of these cohorts (42/47, Q pathway < 0.05, ranked hypergeometric test) (Fig. 2a). We analyzed the evidence supporting predictions of enriched pathways: most cohorts showed enrichments in pathways supported by proteincoding mutations in genes (37/47 or 79%). This serves as a positive control since the majority of currently-known cancer driver genes have frequent protein-coding mutations.
Non-coding mutations in genes also contributed broadly to the discovery of frequently mutated biological processes and pathways: 24/47 cohorts (51%) showed significantly enriched pathways that were apparent when only analyzing non-coding driver scores corresponding to UTRs, promoters or enhancers. The majority of PCAWG tumor cohorts (41/47 or 87%) revealed some frequently mutated pathways that were apparent when integrating coding and non-coding mutations however remained undetected when considering either coding or non-coding mutations separately, emphasizing the value of our integrative approach. As expected, cohorts with more patient tumor samples generated more significantly enriched pathways (Spearman ρ = 0.74, P = 2.3 × 10 −9 ; Supplementary Fig. 1), suggesting that larger datasets are better powered to distinguish rarely mutated genes involved in biological pathways and processes. Discovery of pathways enriched in non-coding mutations suggests that our (1) P-value matrix gene x evidence a b c Fig. 1 Method overview. a ActivePathways requires as input (1) a matrix of gene P-values for different omics datasets, and (2) a collection of gene sets corresponding to biological pathways and processes. b In step (1), gene P-values are merged using the Brown procedure and filtered to produce an integrated gene list that combines evidence from datasets and is ranked by decreasing significance with a lenient threshold. In step (2), pathway enrichment analysis is conducted on the integrated gene list using the ranked hypergeometric test that determines the optimal level of enrichment in the ranked gene sub-list for every pathway. In step (3), separate gene lists are compiled from individual input datasets and analyzed for pathway enrichment using the ranked hypergeometric test, to find supporting evidence for each pathway from the integrative analysis. c ActivePathways provides a list of enriched pathways in the integrated gene list, the associated genes with significant Brown P-values, and annotations of evidence supporting each pathway.
Results of ActivePathways are visualized as enrichment maps where nodes correspond to pathways and pathways with many shared genes are connected into networks representing broader biological themes. integrative pathway analysis is an attractive strategy for illuminating the dark matter of the non-coding cancer genome. We studied the adenocarcinoma cohort of 1773 samples of 16 tumor types. Integrative pathway analysis highlighted 432 genes that were significantly enriched in 526 pathways (Q pathway < 0.05) (Fig. 2b). As expected, the majority of pathways were only supported by genes with frequent coding mutations (328/526 or 62%). However, an additional set of 101 pathways (19%) was supported by both coding and non-coding gene mutations, 72 pathways (14%) were only apparent in the integrated analysis of both coding and non-coding mutations, and 25 (5%) were only enriched in non-coding mutations. Accumulation of these 0 250 500 750 P a n c a n − n o − ly m p h P a n c a n − n o − s k in − m e la n o m a P a n c a n P a n c a n − n o − s k in  individually infrequent non-coding mutations into relevant pathways and processes is apparent in our integrative analysis and remains undetected in a gene-focused analysis.
The major biological themes with frequent protein-coding mutations included hallmark cancer processes such as 'apoptotic signaling' (24 genes; Q pathway = 4.3 × 10 −5 ) and 'mitotic cell cycle' (8 genes; Q pathway = 0.0026), and additional biological processes such as chromatin modification and RNA splicing that are increasingly recognized in cancer biology. Thus, ActivePathways captures the expected cancer pathways enriched in driver genes with protein-coding mutations as positive controls. In contrast to these solely protein-coding driver associations, a large group of developmental processes and signal transduction pathways was detected as enriched in both coding and non-coding mutations in genes; for example 'embryo development process' was supported by mutations in exons, 3′UTRs and gene promoters (68 genes; Q pathway = 2.9 × 10 −12 ), while the Reactome pathway 'repression of WNT target genes' was only apparent in the integrated analysis of coding and non-coding mutations but not in either dataset alone (5 genes, Q pathway = 0.016). In summary, these data show that ActivePathways is a sensitive approach for integrating multi-omics signals such as coding and non-coding mutations, interpreting supporting omics evidence, and finding additional functional associations that are not apparent in any single input dataset.
Pathway-based prioritization of rarely mutated cancer genes. Pathway analysis can identify candidate genes that would otherwise remain undetected in gene-based analyses. ActivePathways enhances such discovery by integrating signals across multiple datasets. In the pathway analysis of coding and non-coding mutations in PCAWG, we focused on a group of processes involved in kidney development that were exclusively detected through the integration of coding and non-coding mutations (Fig. 2c, d). ActivePathways found 18 genes involved in these processes, only five of which were predicted as driver genes in the consensus driver analysis of the PCAWG project 13 . Additional known cancer genes included the oncogene MYC with 13 patients with 3′UTR mutations (driver P-value P UTR3 = 4.8 × 10 −4 ), the transcription factor SMAD3 of the TGF-β pathway with 14 patients with protein-coding mutations (P CDS = 4.0 × 10 −4 ) and the growth inhibitory tumor suppressor gene TSC1 with 23 patients with protein-coding mutations (P CDS = 1.4 × 10 −4 ) as well as candidate cancer genes such as IQGAP1 with ten patients with promoter mutations (P promoter = 8.2 × 10 −4 ) that encodes a signaling protein involved in cell motility and morphology. The additional genes remained below the FDR-adjusted significance cut-off in the gene-focused consensus driver analysis (Q = 0.17-0.62), however were found by ActivePathways due to their associations with kidney development. Thus ActivePathways can exploit functional gene annotations and multiple omics signals to find further candidate genes that remain undetected in genefocused analyses.
We evaluated the effects of our data integration strategy and examined all 333 pathway-associated candidate genes detected in the adenocarcinoma cohort (Fig. 2e). As expected, these included a considerable proportion of known cancer genes. First, as positive controls we found 60/64 significantly mutated genes that were also identified in the PCAWG consensus driver analysis 13 . In addition, we found a set of 47/333 known cancer genes annotated in the COSMIC Cancer Gene Census database 11 that were not detected in the driver analysis, significantly more than expected by chance alone (seven genes expected, Fisher's exact P = 4.0 × 10 −24 ). Those included well-known cancer driver genes MYC, IDH1, NF1, and BCL9. ActivePathways was able to detect these additional genes for several reasons. First, the integrated gene list was filtered using a lenient statistical cut-off in ActivePathways (P gene < 0.1) that allowed a long tail of 273/333 genes with less-frequent mutations to be detected through pathway associations. Second, certain genes were upgraded through the data fusion procedure as a single stronger P-value per gene was derived by combining multiple weaker P-values corresponding to the coding regions, promoters, UTRs, enhancers of the gene. This affected 17/333 pathway-associated genes including six known cancer genes (HNRNPA2B1, STAG2, TCF7L2, SUZ12, CLTC, and ZNF521), Thus, the integration procedure prioritized specific pathway-related genes compared to their original rankings in individual mutation datasets. However, the majority of genes showed reduced significance after the fusion and were excluded from the pathway analysis: 3543 genes had at least one significant P-value prior to data fusion (P gene < 0.1) while 88% of these (3112) were considered insignificant following the Brown P-value combination step. In contrast, the majority of pathway-associated genes (220/333) showed improved rankings in the integrated gene list compared to their original rankings in individual input datasets. This formal combination of P-values across omics datasets is therefore more conservative than a naïve approach of selecting a top P-value for every gene. Thus, ActivePathways finds additional candidate genes that remain undetected in gene-by-gene analyses and are highlighted due to their multiple omics signals in pathways.
Integrating prognostic CNA and mRNA signals in breast cancer. To demonstrate an integrative analysis of patient clinical information with multiple types of omics data, we studied the pathways and processes associated with patient prognosis in breast cancer. We leveraged the METABRIC dataset 15 of 1780 breast cancer samples drawn from all four subtypes (HER2enriched, basal-like, luminal-A, luminal-B) and evaluated all genes using three types of prognostic evidence. mRNA abundance Fig. 2 Pathway enrichment analysis of cancer driver genes with ActivePathways. a Bar plot shows number of significantly enriched pathways (Q < 0.05) among predicted driver genes with coding and non-coding mutations in the PCAWG dataset. The majority of pathways detected by ActivePathways are supported by protein-coding mutations, as expected (dark green bars), while non-coding mutations (orange, red) reveal additional pathways. Pathways shown in dark red are found only in the integrated gene list of coding and non-coding mutations but not in gene lists of individual mutation scores. b Enrichment map shows pathways enriched in frequently mutated genes in the adenocarcinoma cohort of 1773 tumors. Nodes in the network represent pathways and similar pathways with many common genes are connected. Groups of similar pathways are indicated. Nodes are colored by supporting evidence from coding and non-coding cancer mutations. Arrow indicates kidney developmental processes. c The group of enriched kidney developmental processes is apparent from integrated evidence of coding and non-coding mutations but is not found among coding or non-coding candidate genes separately. d Heatmap shows P-values of driver genes involved in kidney developmental processes, including driver genes found in the driver analysis (indicated with #) and additional genes only found in the pathway analysis. Top row shows merged P-values from the Brown procedure. Genes listed in the Cancer Gene Census (CGC) database are indicated in boldface letters. e Pathway analysis recovers most genes of the driver list from PCAWG (orange asterisks), as well as additional infrequently mutated genes apparent due to their pathway associations. Additional known cancer genes detected in the pathway analysis are listed (green dots) and occur more frequently than expected from chance alone. NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-019-13983-9 ARTICLE NATURE COMMUNICATIONS | (2020) 11:735 | https://doi.org/10.1038/s41467-019-13983-9 | www.nature.com/naturecommunications profiles were deconvolved between mRNA abundance levels in tumor cells (TC) and tumor-adjacent cells (TAC) using the ISOpure algorithm 21,22 . mRNA values were associated with patient survival using median dichotomization and log-rank tests. Gene copy number alterations (CNA) were included as the third type of evidence and associated with patient survival using logrank tests.
ActivePathways identified 192 significantly enriched GO biological processes and Reactome pathways across the four subtypes of breast cancer, of which nine pathways were enriched in multiple cancer subtypes and 33 pathways were only apparent through the integrative pathway analysis but not in any of the CNA or mRNA datasets alone. The major findings enriched in prognostic signatures in breast cancer subtypes involved the processes and pathways of immune response, apoptosis, ribosome biogenesis and chromosome segregation (Fig. 3a).
Immune activity was associated with prognostic genes in basal-like and HER2-enriched breast cancers with significant enrichment of GO processes such as 'immune system development' (Q basal = 3.0 × 10 −4 , 113 genes; Q HER2 = 0.035, 61 genes; ranked hypergeometric test) and 'lymphocyte differentiation' (Q HER2 = 6.8 × 10 −4 , 46 genes; Q basal = 8.4 × 10 −4 , 45 genes). The majority of related genes were associated with improved patient prognosis upon increased mRNA abundance in tumor cells or tumor-adjacent cells, comprising 50/61 genes in the HER2-enriched subtype and 78/113 genes in the basal subtype (Fig. 3b). Interestingly, only a minority of these genes (10) were significant in both of the two breast cancer subtypes, suggesting different modes of immune activity in subtypes and emphasizing the power of our pathway-based approach. Basal-like breast cancers were associated with additional 67 terms involving immune response and blood cells, however no immune-related terms were enriched for luminal subtypes of breast cancers. Prognostic features of immune-related genes in HER2-enriched and basal-like breast cancers are well known 23,24 . Our pathwaybased findings indicate that immune activity in breast tumor cells and in the surrounding microenvironment negatively affects tumor progression and improves prognosis.
Apoptosis was associated with patient prognosis in HER2enriched and luminal-A breast cancers through enriched GO processes such as 'negative regulation of apoptotic process' (Q HER2 = 0.030, 122 genes; Q luminalA = 0.015, 228 genes) and 'programmed cell death' (Q HER2 = 0.015, 125 genes; Q luminalA = 0.016, 231 genes) (Fig. 3c). Interestingly, anti-apoptotic pathways were only detected in the integrative analysis and not in genomic and transcriptomic gene signatures separately. Among the genes negatively regulating apoptosis, DUSP1 provided the strongest prognostic signal in HER2-enriched breast cancers. This was apparent in the molecular stratification of samples by mRNA of tumor cells (log-rank P TC = 0.019, HR = 1.5) and tumor-adjacent cells (P TAC = 8.3 × 10 −4 , HR = 1.8) as well as gene copy number amplifications (P CNA = 9.8 × 10 −4 , HR = 2.8) (Fig. 3d). DUSP1 encodes a phosphatase signaling protein of the MAPK pathway that is over-expressed in malignant breast cancer cells and inhibits apoptotic signaling 25 . HER2 over-expression is known to suppress apoptosis in breast cancer 26 . Anti-apoptotic signaling is a hallmark of cancer and expectedly associated with worse patient prognosis.
ActivePathways also identified prognostic pathway associations that were only apparent in single subtypes of breast cancer. For example, the prognostic genes for luminal-B subtype were enriched in processes related to 'chromosome segregation' (Q luminalB = 0.017, 41 genes) that have been associated with worse outcome in breast cancer 27 . As another example, luminal-A breast cancers were associated with prognosis in ribosomal and RNA processing genes, such as 'ribosome biogenesis' (Q luminalA = 6.9 × 10 −10 , 60 genes) and 'rRNA metabolic process' (Q luminalA = 1.8 × 10 −13 , 64 genes). Although not described specifically in the luminal-A subtype, ribosomal mRNA abundance has been shown to be prognostic in breast cancer as a marker of cell proliferation 28,29 . In summary, ActivePathways can be used for integrating clinical data with multiomics information of molecular alterations. Such analyses can provide leads for functional studies and biomarker development.
Interpreting co-expressed and DNA-bound targets of Hippo TFs. To demonstrate the use of ActivePathways for studying gene regulation, we analyzed transcriptomes of non-cancerous human tissues from the GTEx project 5 . We focused on the Hippo signaling pathway involved in organ size control, tissue homeostasis and cancer 30,31 and studied regulatory networks downstream of the two transcription factors (TFs) YAP and TAZ (encoded by YAP1 and WWTR1). YAP and TAZ are evolutionarily conserved master regulators of Hippo signaling in mammals that respond to intracellular and extracellular signals of cell-cell interactions, cell polarity, mechanical cues, G protein-coupled receptor signaling, and cellular energy status 30,31 .
We performed an integrative pathway enrichment analysis of transcriptomics and epigenomics data of the two master regulators of the Hippo pathway. First, we predicted transcriptional target genes for YAP and TAZ (1898 and 1319 genes, respectively), using co-expression analysis and robust rank aggregration 32 over 9642 transcriptomes of 40 tissue types in GTEx (Q < 0.05). Second, we studied the set of 2356 target genes of YAP that have DNA-binding sites at gene promoters derived from a YAP ChIP-seq study 33 reanalyzed in the ReMap database 34 (Q < 0.05). The three gene lists with corresponding significance values were used as input to ActivePathways for the integrative analysis.
Integrative analysis of transcriptional and DNA-binding target genes of YAP/TAZ resulted in 225 significantly enriched GO processes and Reactome pathways (Q pathway < 0.05) (Fig. 4a). The resulting pathways are expected in the context of Hippo signaling and included development and morphogenesis, cell motility, organization of actin cytoskeleton and cell-cell junctions, signal transduction pathways such as EGFR, Wnt, Robo, TGF-beta, rho GTPase, and others. ActivePathways highlighted 2066 genes in these pathways. We examined known members of the Hippo pathway and found 13 of 32 genes among pathway-associated genes (WWTR1, LATS1, WWC1, YWHAQ, YAP1, MAPK8, MAPK9, STK3, TNIK, TEAD3, MAP4K3, MINK1, TEAD1), more than expected from chance alone (four genes expected, P = 3.6 × 10 −5 , Fisher's exact test). We also compiled an extended list of 106 Hippo-related genes 30,31,35 and confirmed that these were enriched in pathway-associated genes (36/308 genes expected, P = 9.1 × 10 −26 , Fisher's exact test) (Fig. 4b). A large fraction of pathways (55 or 24%) was only identified in the integrated analysis and not in any input dataset alone, underlining the advantage of detecting significantly enriched pathways across multiple complementary omics datasets. Similarly, Hippo-related genes were either supported by ChIP-seq data alone (54 genes), mRNA data alone (30 genes), or both by mRNA and ChIP-seq data (22 genes), concordant with the notion that RNA-seq and ChIP-seq show limited agreement regarding TF target genes and provide complementary insights into gene regulation. Thus, our integrative analysis of co-expression networks and TF-DNA interactions of Hippo master regulators expectedly converges on Hippo-related genes and pathways.
In addition to GO terms and pathways, ActivePathways can be used to interpret omics data with other classes of functional gene sets such as TF target genes. To further elucidate gene regulatory networks downstream of YAP/TAZ, we considered potential enrichments of DNA-binding target genes of 161 TFs profiled in ChIP-seq studies of the ENCODE project 36 . We found a regulatory network of 17 TFs and 1426 target genes enriched in the YAP/TAZ regulome (Q < 0.05, ranked hypergeometric test) (Fig. 4c). These included the DNA-binding targets of the master regulators of pluripotent stem cells 37 NANOG (208/774 target genes, Q = 1.3 × 10 −6 ) and POU5F1 (107/406 target genes, Q = 7.3 × 10 −11 ). This finding is in agreement with the role of Hippo pathway activity in stem cell regulation 38 . The regulatory network was significantly enriched in 50 Hippo-related genes (25 expected, P = 1.2 × 10 −7 , Fisher's exact test) and six core Hippo genes (two   , similarly to the pathway-based analysis above. However, the two analyses revealed distinct genes: 886 genes were commonly found by both analyses, 1180 genes were found only in the pathway-based analysis, and 540 genes were found only in the TF-based analysis (Fig. 4d). Thus, integrative enrichment analysis of TF target genes provides complementary information to GO terms and pathways. In summary, this analysis highlights genes and pathways related to Hippo signaling in human tissues and demonstrates the use of ActivePathways for studying gene regulatory networks across complementary omics datasets and technology platforms.
Evaluating the robustness and sensitivity of ActivePathways. We carefully benchmarked ActivePathways using the dataset of cancer driver genes predicted by PCAWG 13 . First, we compared the performance of ActivePathways with six methods used in the PCAWG pathway and network analysis working group 14 (Hierarchical HotNet 39,40 , SSA−ME 41 , NBDI 42 , induced subnetwork analysis 40 , CanIsoNet 43 , hypergeometric test). These diverse methods used molecular interaction networks, functional gene sets and/or transcriptomics data to analyze the PCAWG pancancer dataset of predicted cancer driver genes. A subsequent consensus analysis defined pathway-implicated driver gene lists with protein-coding and non-coding mutations, based on a majority vote of the pathway and network analysis methods 14 . ActivePathways recovered these consensus gene lists with the highest accuracy: 100% of coding driver genes (87/87) and 85% of non-coding candidate genes (79/93) were detected (Fig. 5a). Thus ActivePathways agreed the most with the ensemble of several other methods in prioritizing known and candidate cancer driver genes using pathway and network context.
We also compared the performance of ActivePathways with a standard approach of pathway enrichment analysis that considers a single statistically-filtered gene list using a ranked hypergeometric test 1,10 (Fig. 5b). To this end, we analyzed individual gene lists of protein-coding and non-coding drivers of the PCAWG adenocarcinoma cohort using multiple gene selection thresholds (Q < 0.05, Q < 0.1, Q < 0.25, P < 0.1). ActivePathways showed increased sensitivity of pathway enrichment analysis compared to the standard approach, in particular for pathways involving noncoding mutations that were not promimently represented in any single gene lists. Even compared to the analysis employing the most lenient gene selection filter (P < 0.1), ActivePathways identified 72 additional pathways that were only apparent through the integration of coding and non-coding mutations and remained undetected in the analysis of individual datasets. Thus, our method provides additional information to common approaches that focus on single gene lists.
We evaluated the robustness of ActivePathways to parameter variations and missing data. We varied the parameter P gene that determines the ranked gene lists used in the pathway enrichment analysis (default threshold P gene < 0.1). The majority of PCAWG cohorts (40/47 or 85%) retrieved significantly enriched pathways even with a considerably more stringent threshold (P gene < 0.001), however 67% fewer pathways were found compared to the default threshold in the median cohort ( Supplementary Fig. 2). We then evaluated the robustness of ActivePathways to missing data by randomly removing subsets of driver scores from the initial dataset. Even when removing 50% of gene driver P-values with P < 0.001, the majority of cohorts (37/47 or 79%) had at least one significantly enriched pathway, however 66% fewer pathways were found on average (Supplementary Fig. 3).
We evaluated the expected false positive rates of ActivePathways. We tested 1,000 simulated datasets for each of 47 patient cohorts and expectedly found no significant pathways in 92% of these simulations (Supplementary Fig. 4). Simulated data were obtained by randomly reassigning P-values of driver predictions to different genomic elements, a conservative approach that disrupts pathway annotations of genes while retaining the presence of strong P-values observed in the real data. The median family-wise false discovery rate of ActivePathways computed across cohorts (7.2%) slightly exceeded the applied multiple testing correction (Q < 0.05). Higher rates were observed in cohorts including melanoma tumors, potentially due to abundant promoter mutations caused by impaired nucleotide excision repair in protein-bound genomic regions 44 . We evaluated quantile-quantile (QQ) plots of pathway-based ranked hypergeometric P-values from ActivePathways and found that these often deviated from the expected uniform distribution and appeared statistically inflated ( Supplementary Fig. 5). However, P-values derived from simulated gene scores showed no inflation in our simulations. Anticipating that the most significant cancer driver genes associated with protein-coding mutations, we performed partial simulations. We constructed datasets with simulated gene P-values for protein-coding mutations and true Pvalues for non-coding mutations. As expected, partially simulated datasets showed a lesser extent of P-value inflation, suggesting that highly significant known cancer genes are responsible for inflation due to their involvement in many pathways. Statistical testing of highly redundant pathways and processes violates the independence assumption of statistical tests and multiple testing procedures, a known caveat of pathway enrichment analysis 1 , which likely explains the observed distribution of significance values of our method. Collectively, these benchmarks show that ActivePathways is a sensitive and robust method for detecting significantly enriched pathways and processes through integrative analysis of multi-omics data.

Discussion
Integrative pathway enrichment analysis helps distill thousands of high-throughput measurements to a smaller number of pathways and biological themes that are most characteristic of the experimental data at hand, ideally leading to mechanistic insights and candidate genes for follow-up studies. In particular, a joint analysis of complementary datasets often leads to insights that are unavailable in any particular dataset. ActivePathways provides a generally available framework for systematically prioritizing genes Fig. 3 Prognosis-associated pathways in four molecular subtypes of breast cancer. a Enrichment maps of prognostic pathways and processes were found in an integrative analysis of mRNA abundance in tumor cells (TC), tumor-adjacent cells (TAC), and gene copy number alterations (CNA) of the METABRIC dataset. Multicolored nodes indicate pathways that were prognostic according to several types of molecular evidence. Blue nodes indicate pathways that were only apparent through merging of molecular signals. b Hazard ratios (HR) of prognostic genes related to immune system development in basal and HER2enriched subtypes of breast cancer. Strongest HR value of TC, TAC is shown. Genes commonly found in basal and HER2-enriched tumors are shown. Known cancer genes are shown in boldface. c Heatmap shows genes, corresponding log-rank P-values, and merged Brown P-values related to the GO process 'negative regulation of apoptotic process' that was found by integrating prognostic omics data in HER2-enriched breast cancer. d Kaplan-Meier plots show the strongest prognostic signal related to apoptotic signaling, the phosphatase DUSP1 that significantly associates with reduced patient survival through increased tumor-adjacent mRNA level (left), increased tumor mRNA level (center) and gene copy number amplification (right). Log-rank P-values are shown. and pathways across multiple omics datasets that utilizes fusion of gene significance. This allows us to identify pathways and processes that stand out when the data are combined, yet are not apparent in any single analyzed dataset. The use of ActivePathways is demonstrated in the three case studies above. In our example of cancer driver discovery, pathway analysis provides complementary insights to gene-focused driver discovery as it also focuses on sub-significant genes that are clustered into relevant biological processes and carry coding and noncoding mutations. In the integrative analysis of molecular alterations in breast cancer subtypes, we find a spectrum of genes and pathways whose molecular signatures in the tumors or the microenvironment have potential prognostic significance. A subset of these findings, such as anti-apoptotic signaling, is only apparent through data integration. In the final example, we use ActivePathways to associate the co-expression and DNA-binding networks of Hippo master regulators to downstream pathways and processes with multi-omics evidence.
Our general pathway analysis strategy is applicable to diverse kinds of datasets where well-calibrated P-values are available for the entire set of genes or proteins. A multi-omics study may quantify genes using a series of genomic, transcriptomic and proteomic experiments and compute corresponding P-values. Data from epigenomic experiments and genome-wide association studies can be also analyzed after signals have been appropriately mapped to genes, for example by identifying ChIP-seq peaks in gene promoters similarly to our GTEx analysis. Clinical and phenotypic information of patients can be also included through association and survival statistics, as shown in our analysis of prognostic signatures of breast cancer subtypes. Our method is expected to work with raw, unadjusted P-values and also with Qvalues adjusted for multiple testing, however it is primarily intended for unadjusted P-values for increased sensitivity. Acti-vePathways conducts multiple testing correction at the pathway level and reports significantly enriched pathways at a family-wise error rate cutoff, regardless of the gene-specific multiple testing correction applied upstream. Quantification of genes and proteins through P-values is more robust than quantification through their abundance measures such as counts or fold-changes. P-values provided to ActivePathways need to be computed using dedicated Fig. 4 Integrative pathway enrichment analysis of Hippo target genes across human tissues. a Enrichment map of GO processes and Reactome pathways enriched in the target genes of transcriptional regulators YAP and TAZ of the Hippo pathway. Co-expressed target genes of YAP and TAZ across normal human tissues of the GTEx dataset (pathways are shown in green and yellow, respectively) and DNA-binding target genes of YAP from ChIP-seq experiments (shown in blue) were analyzed. Pathways only found through the integration of ChIP-seq and RNA-seq data are shown in red. b Euler diagram shows 106 Hippo-related genes that were significantly enriched in the detected pathways and supported by a combination of signals in RNA-seq and ChIPseq datasets. Core Hippo genes detected in the analysis are listed. c Regulatory network of 17 TFs and 1,426 target genes detected in the ActivePathways analysis of gene sets representing transcription factor target genes. Transcription factors with enriched target genes in the YAP/TAZ regulome are shown in multi-colored circles. Target genes are colored by increasing statistical significance (turquoise to red). d Integrative analysis of pathways and GO processes is complementary to the analysis of transcription factor targets. Euler diagram shows total number of pathway-associated identified in the analysis of GO and Reactome terms (left) and TF target genes from ENCODE (right). Numbers of Hippo-related genes are shown in brackets. ActivePathways: integrated gene list + ranked hypergeometric test (Q < 0.05) Single gene lists by cutoff + ranked hypergeometric test (Q < 0.05) Fig. 5 Benchmarking of ActivePathways. a Comparison of ActivePathways with six additional pathway and network analysis methods used in the PCAWG pathway and network consensus analysis. ActivePathways best recovers the consensus lists of pathway-implicated driver genes with coding and noncoding mutations (indicated by asterisk). The consensus lists are shown in the leftmost bars of the plot and have been compiled through a majority vote of the seven methods in the PCAWG pathway and network analysis working group. b Comparison of ActivePathways (leftmost bars) and common pathway enrichment analysis using multiple significance cut-offs of PCAWG gene lists with protein-coding and non-coding mutations. ActivePathways shows increased sensitivity of pathway analysis even at the most lenient gene list significance cut-offs and recovers additional pathways only detected through integration of multiple datasets (dark red). methods for individual omics platforms such that inherent biases in the data are accounted for prior to pathway analysis. In our example of cancer driver discovery, appropriately computed Pvalues of driver gene predictions from the PCAWG project 13 account for confounding factors of somatic mutations, such as gene sequence length and nucleotide content, mutation signatures active in different types of tumors 45 and biological correlates of mutation frequency such as transcription and replication timing 46 . On the one hand, considering all such variations directly in the pathway enrichment analysis would require substantially more complex models. On the other hand, directly analyzing pathways using simpler metrics (such as mutation counts or frequencies) would propagate any upstream biases to the pathway enrichment analysis and cause challenges with false positives and data interpretation. Thius, given appropriately derived P-values for genes, proteins and other molecules, ActivePathways can be applied to a wide range of analyses.
Our method comes with important caveats. First, we only evaluate genes and proteins annotated in pathway databases. Such databases have variable coverage, rely on frequent data updates 47 and may miss sparsely annotated candidate genes. The most general type of pathway enrichment analysis considers biological processes and molecular pathways however many kinds of gene sets available in resources such as MSigDB 48 can be used to expand the scope of ActivePathways. Second, pathway information is highly redundant and analyses of rich molecular datasets often result in many significant results reflecting the same underlying pathway. We address this redundancy by visualizing and summarizing pathway results as enrichment maps 1,20 that summarize multiple similar pathways and processes into general biological themes. Statistical inflation of results accompanied by biological redundancy is addressed by a stringent multiple testing correction. Third, the analysis treats pathways as gene sets and does not consider interactions of genes in pathways. This simplified strategy allows us to consider a wider repertoire of pathways and processes as reliable mechanistic interactions are often context-specific and limited to a small subset of well-studied signaling pathways. Several advanced methods such as HotNet 39 , PARADIGM 49 , and GeneMania 50 model pathways through gene and protein interactions.
Translation of discoveries into improved human health through actionable mechanistic insights, biomarkers, and molecular therapies is a long-standing goal of biomedical research. For example, next-generation cancer genomics projects such as ICGC-ARGO aim to collect multi-omics datasets with detailed clinical profiles that will present new challenges for pathway and network analysis techniques. In summary, ActivePathways is integrative pathway analysis method that improves systems-level understanding of cellular organization in health and disease.

Methods
Integrated and evidence-based gene lists. The main input of ActivePathways is a matrix of P-values where rows include all genes of a genome and columns correspond to evidence from omics datasets. To interpret multiple omics datasets, a combined P-value is computed for each gene using a data fusion approach, resulting in an integrated gene list. The integrated gene list is computed by merging all P-values of a given gene into one combined P-value using the Brown's extension 18 of the Fisher's combined probability test that accounts for overall covariation of P-values from different sources of evidence. The integrated gene list of Brown P-values is then ranked in order of decreasing significance and filtered using a lenient threshold (unadjusted P < 0.1 by default). Evidence-based gene lists representing different omics datasets are based on ranked P-values from individual columns of the input matrix and filtered using the same significance threshold.
Statistical enrichment of pathways. Statistical enrichment of pathways in ranked lists of candidate genes is carried out with the ranked hypergeometric test. The test considers one pathway gene set at a time and analyzes increasing subsets of input genes from the top of the ranked gene list. The same procedure is used for integrated and evidence-based gene lists. At each iteration, the test computes the hypergeometric enrichment statistic and P-value for the set of genes shared by the pathway and top sub-list of the input gene list. For optimal processing speed, only gene lists ending with a pathway-related gene are considered. The ranked hypergeometric statistic selected the input gene sub-list that achieves the strongest enrichment and the smallest P-value as the final result for the given pathway, as: where P pathway stands for the hypergeometric P-value of the pathway enrichment at the optimal sub-list of the significance-ranked candidate genes, G represents the length of the optimal sub-list, i.e., the number of top genes from the input gene list, N is the number of protein-coding genes with annotations in the pathway database, i.e., in Gene Ontology and Reactome, K is the total number of genes in a given pathway, n is the number of genes in a given gene sub-list considered, and k is the number of pathway genes in the considered sub-list. For a conservative estimate of pathway enrichment, we consider as background N the universe of genes contained in input gene sets (terms from pathway databases and ontologies) rather than the complete repertoire of protein-coding genes. To obtain candidate genes involved in the pathway of interest, we intersect pathway genes with the optimal sub-list of candidate genes. The ranked hypergeometric P-value is computed for all pathways and resulting P-values are corrected for multiple testing using the Holm-Bonferroni method of family-wise error rate (FWER) 19 . Significant pathways are reported by default (Q < 0.05).
Evaluating omics evidence of enriched pathways. Each evidence-based gene list derived from a single omics dataset is also analyzed for enriched pathways with the ranked hypergeometric test. Pathways found in the integrated gene list are then labeled for supporting evidence in the case they are also found as significant in any evidence-based gene list. A pathway is considered to be found only through data integration and labeled as combined-only if it is identified as enriched in the integrated gene list but not identified as enriched in any of the evidence-based gene lists at equivalent significance cut-offs (default Q < 0.05). Each detected pathway is additionally annotated with pathway genes apparent in the optimal sub-list of candidate genes separately for the integrated gene list and each evidence-based gene list.
Pathways and processes. We used gene sets corresponding to biological processes of Gene Ontology 16 and molecular pathways of the Reactome database 17 downloaded from the g:Profiler web server 10 on Jan 26th 2018. Large general gene sets with more than a thousand genes and small specific gene sets with less than five genes were excluded to avoid statistical inflation of large gene lists and interpretation challenges with very small lists.
Enrichment map visualization. ActivePathways creates input files for the EnrichmentMap app 20 of Cytoscape 51 for network visualization of similar pathways and their coloring according to supporting evidence. Enrichment maps for adenocarcinoma driver mutations, prognostic quantification of molecular alterations of breast cancer, and transcriptional networks downstream of Hippo signaling were visualized with stringent pathway similarity scores (Jaccard and overlap combined coefficient 0.6) and manually curated for the most representative groups of similar pathways and processes. Singleton pathways that were redundant with larger groups of pathways were merged with the latter or discarded. Coloring of pathways in the adenocarcinoma enrichment map was rearranged by merging colors of pathways supported by non-coding mutation scores of promoters, enhancers and/or UTRs into one group.
Coding and non-coding mutations of the PCAWG dataset. We used Active-Pathways to analyze driver predictions of coding and non-coding mutations across white-listed 2583 whole cancer genomes of the ICGC-TCGA PCAWG project. Pvalues of driver predictions were computed separately for protein-coding sequences, promoters, enhancers and untranslated regions (UTR3, UTR5 across multiple subsets of samples representing histological tumor types and pan-cancer cohorts as reported in the PCAWG driver discovery study 13 . We used geneenhancer mapping predictions provided by PCAWG, excluded enhancers with more than five target genes, and selected the most significant enhancer for each gene, if any. Unadjusted P-values for coding sequences, promoters, enhancers and UTRs were compiled as input matrices and analyzed as described above. Missing Pvalues were conservatively interpreted as ones. Results from ActivePathways were validated with two lists of cancer genes. Predicted drivers from the PCAWG consensus analysis 13 were selected as statistically significant findings (Q < 0.05) following a stringent multiple testing correction spanning all types of elements (exons, UTRs, promoter, enhancers). The curated list of known cancer genes was retrieved from the COSMIC Cancer Gene Census (CGC) database 11 . One-tailed Fisher's exact tests were used to estimate enrichment of these genes in our results, using all human protein-coding genes as the statistical background set.
Prognostic CNA and mRNA signals in breast cancer. ActivePathways was used to evaluate prognostic pathways in breast cancer subtypes. Multiple types of omics data were used for an integrative analysis: mRNA gene expression data and gene copy number alteration (CNA) data of the were derived from the METABRIC cohort of 1991 patients with a single primary fresh frozen breast cancer specimen each 15 . Curtis et al. 15 classified the patients into the intrinsic breast cancer subtypes using the PAM50 mRNA-based classifier 52 WWTR1) were predicted separately in the following two steps. First, we computed pairwise Pearson correlations between a given TF and all other genes within a tissue of interest and ranked these by the significance P-value of positive correlations. Second, the resulting ranked gene lists were then aggregated into one master target gene list across the GTEx tissues using the robust rank aggregation (RRA) method 32 with default parameters (Q gene < 0.05). FDR-adjusted values of genes from RRA for YAP and TAZ were used as the first and second evidence for input of ActivePathways, respectively. For DNA-binding targets of YAP, we retrieved ChIPseq binding sites in three cell lines (CCLP1, MSTO, and HUCCT1) from an earlier study 33 that were reprocessed in the ReMap database 34 . Binding sites were filtered by statistical significance (Q < 0.05) and mapped to gene promoters of the human genome (hg19) using gene promoters defined in the PCAWG driver analysis 13 . If a promoter had multiple binding sites, the site with the strongest FDR-value was selected as a representative site for that gene. FDR-adjusted values of genes with YAP DNA-binding sites were used as the third evidence for input of Active-Pathways. Significantly enriched pathways among the putative target genes of YAP and TAZ were subsequently detected using ActivePathways. We compiled a list of 308 Hippo-related genes from the KEGG pathway database 35 and two recent review papers 30,31 . To validate the analysis, we tested the overall sets of genes identified by ActivePathways for enrichment of Hippo-related genes using Fisher's exact tests. The Hippo analysis was conducted separately for two collections of functional gene sets. First, we tested GO biological processes and Reactome pathways similarly to analyses described above. Second, we used gene sets corresponding to transcription factor binding sites (TFBS) derived from the ENCODE project 36 .
Method benchmarking. We benchmarked ActivePathways using multiple approaches, including simulated datasets of P-values, method parameter variations, and partial replacement of P-values scores with insignificant P-values. Benchmarking was performed on the PCAWG predictions of cancer driver genes with coding and non-coding mutations. To evaluate the false discovery rate of Active-Pathways, we created simulated datasets by reassigning all observed P-values to random genes and their genomic elements. Simulations were conducted separately for different tumor cohorts. One thousand simulated datasets were analyzed with ActivePathways and those with at least one significantly detected pathway counted towards false discovery rates. A separate set of simulations maintained the positions of non-coding P-values among genes and randomly reassigned P-values corresponding to protein-coding mutations, expectedly leading to a reduction in detected pathways as the PCAWG datasets primarily included strong P-values for genes with frequent protein-coding mutations. Quantile-quantile analysis and QQplots were used to compare P-value distributions of pathways discovered from true P-values, partially shuffled P-values (true non-coding and shuffled protein-coding P-values), and fully shuffled P-values. To evaluate robustness of ActivePathways, we randomly replaced a fraction of significant driver P-values in the true dataset (P < 0.001) with insignificant P-values (P = 1). We tested different fractions of missing values (10, 25, and 50%) across a thousand datasets of randomly selected missing data points. We concluded that most PCAWG cohorts included significantly enriched pathways even with large fractions of missing data. To further evaluate robustness, we tested different values of the Brown P-value threshold used to select the integrated gene list for pathway enrichment analysis. The default parameter value (P gene < 0.1) was compared to alternative values (0.001, 0.01, 0.05, and 0.2). We concluded that ActivePathways found enriched pathways in most tumor cohorts even at more stringent gene selection levels.
Ethical review. Sequencing of human subjects' tissue was performed by ICGC and TCGA consortium members under a series of locally approved Institutional Review Board (IRB) protocols as described in Hudson et al. Informed consent was obtained from all human participants. Ethical review of the current data analysis project was granted by the University of Toronto Research Ethics Board (REB) under protocol #30278, "Pan-cancer Analysis of Whole Genomes: PCAWG".
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Somatic and germline variant calls, mutational signatures, subclonal reconstructions, transcript abundance, splice calls and other core data generated by the ICGC/TCGA Pancancer Analysis of Whole Genomes Consortium is described in the marker paper 4 and available for download at https://dcc.icgc.org/releases/PCAWG. Additional information on accessing the data, including raw read files, can be found at https://docs.icgc.org/ pcawg/data/. In accordance with the data access policies of the ICGC and TCGA projects, most molecular, clinical and specimen data are in an open tier which does not require access approval. To access potentially identifiable information, such as germline alleles and underlying sequencing data, researchers will need to apply to the TCGA Data Access Committee (DAC) via dbGaP (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page = login) for access to the TCGA portion of the dataset, and to the ICGC Data Access Compliance Office (DACO; http://icgc.org/daco) for the ICGC portion. In addition, to access somatic single nucleotide variants derived from TCGA donors, researchers will also need to obtain dbGaP authorization. Derived PCAWG datasets described specifically in this manuscript can be found at the locations listed below. Additional relevant datasets are listed in the PCAWG study of pathways and networks 14 .