Integrative pathway enrichment analysis of multivariate omics data

Multi-omics datasets quantify complementary aspects of molecular biology and thus pose challenges to data interpretation and hypothesis generation. ActivePathways is an integrative method that discovers significantly enriched pathways across multiple omics datasets using a statistical data fusion approach, rationalizes contributing evidence and highlights associated genes. We demonstrate its utility by analyzing coding and non-coding mutations from 2,583 whole cancer genomes, revealing frequently mutated hallmark pathways and a long tail of known and putative cancer driver genes. We also studied prognostic molecular pathways in breast cancer subtypes by integrating genomic and transcriptomic features of tumors and tumor-adjacent cells and found significant associations with immune response processes and anti-apoptotic signaling pathways. ActivePathways is a versatile method that improves systems-level understanding of cellular organization in health and disease through integration of multiple molecular datasets and pathway annotations.


Introduction 27
Pathway enrichment analysis is an essential step for interpreting high-throughput (omics) data 28 that uses current knowledge of genes and biological processes. A common application 29 determines statistical enrichment of molecular pathways, biological processes and other 30 functional annotations in long lists of candidate genes 1,2 . Genomic, transcriptomic, proteomic and 31 epigenomic experiments emphasize distinct and complementary aspects of underlying biology 32 and are best analyzed integratively, as is now routinely done in large-scale projects such as The 33 Cancer Genome Atlas (TCGA) 3 , Clinical Proteome Tumor Analysis Consortium (CPTAC), 34 International Cancer Genome Consortium (ICGC) 4 , Genotype-Tissue Expression (GTEx) 5 and 35 others. Thus, simultaneous analysis of multiple candidate gene lists for characteristic pathways 36 is increasingly needed. Numerous approaches are available for interpreting single gene lists. For 37 example, the GSEA algorithm can detect up-and down-regulated pathways in gene expression 38 datasets 6 . Web-based methods such as Panther 7 , ToppCluster 8 and g:Profiler 9 detect significantly 39 enriched pathways amongst ranked or unranked gene lists and are generally applicable to genes 40 and proteins from various analyses. Some approaches allow analysis of multiple input gene lists 41 however these primarily rely on visualization rather than data integration to evaluate the 42 contribution of distinct gene lists towards each detected pathway 8,9 . Finally, no methods are 43 available for unified pathway analysis of coding and non-coding mutations from whole-genome 44 sequencing (WGS) data, or integrating these with other types of DNA aberrations such as copy 45 number changes and balanced genomic rearrangements. We report the development of the 46 ActivePathways method that uses data fusion techniques to address the challenge of integrative 47 pathway analysis of multi-omics data. We demonstrate the method by analyzing known and 48 candidate cancer driver genes with coding and non-coding somatic mutations in 2,583 whole 49 cancer genomes of the ICGC-TCGA PCAWG project 10,11 , prognostic pathways in breast cancer 50 subtypes, and regulatory networks of tissue transcriptomes using the GTEx 5 compendium. 51 Characterization of genes and somatic mutations that drive oncogenesis is a central goal of 52 cancer genomics research. Cancer genomes are characterized by few frequently mutated pan-53 cancer drivers such as TP53, less-frequent drivers with primarily tissue-specific effects and 54 numerous infrequently mutated genes often referred to as the long tail. The majority of currently 55 known driver mutations affect protein-coding sequence 12 and only few high-confidence non-56 coding drivers have been found, such as the mutation hotspots in the TERT promoter 13 . Discovery 57 of non-coding driver mutations is a major goal of large cancer whole genome sequencing efforts 58 such as PCAWG 10,11 . Pathway and network analysis of cancer mutations is a powerful approach that uses knowledge of coding driver genes and their pathway annotations as priors to assist in 60 detection of weak driver variants including those in the non-coding genome 1 . The PCAWG project 61 has produced a consensus dataset of predicted protein-coding driver genes (CDS) and non-62 coding regions of 5' and 3' untranslated elements (UTRs), promoters and enhancers of protein-63 coding genes across 2,583 whole cancer genomes of multiple cancer types 14 . Driver gene p-64 values in the dataset reflect the frequency and functional impact of somatic single nucleotide 65 variants (SNVs) and small insertions-deletions (indels) in these protein-coding and non-coding 66 genomic regions. Here we used our ActivePathways method to interpret these driver predictions 67 with pathway information including biological processes of Gene Ontology 15 and molecular 68 pathways defined by Reactome 16 . Two further case studies focused on prognostic molecular 69 pathways of breast cancer through integration of genomic and transcriptional alterations, and 70 gene regulatory networks associated with organ growth control in healthy human tissues. 71

Multi-omics pathway enrichment analysis with ActivePathways 74
ActivePathways is a simple four-step method that extends our earlier work 9 (Figure 1) These could include p-values evaluating the significance of differential gene expression in tissues 78 of interest, gene essentiality, mutation or copy number alteration burden, and many others. 79 Second, a collection of gene sets represents molecular pathways, biological processes and other 80 gene annotations we refer to as pathways. Depending on the hypothesis, pathways may also 81 include other types of gene sets such as targets of transcription factors or microRNAs. In the first 82 step of ActivePathways, we derive an integrated gene list that aggregates significance from all 83 types of evidence for each input gene. The integrated gene list is compiled by fusion of gene 84 significance from different types of evidence using the Brown's extension 17 of the Fisher's 85 combined probability test, which conservatively adjusts for overall correlations of p-values in 86 estimating the overall significance of every gene. The integrated input gene list is then ranked by 87 decreasing significance and filtered using a lenient cut-off to capture a long tail of candidate genes 88 and to filter the bulk of insignificant ones (unadjusted Pgene<0.1). The integrated gene list is 89 analyzed with a ranked hypergeometric test for each pathway to capture smaller pathways tightly 90 associated with few top-ranking genes and broader processes with abundant albeit weaker 91 signals from larger subsets of input genes. The stringent family-wise multiple testing correction method by Holm 18 is applied across pathways to reduce false positives (Qpathway<0.05). In the third 93 step, candidate gene lists corresponding to distinct types of evidence are separately evaluated 94 using the above procedure. This step determines which pathways are significantly supported by 95 each of the input omics datasets and also reveals corresponding genes in each pathway. 96 Importantly, the step also highlights pathways that are only found through data integration and 97 are not apparent in any single type of omics evidence alone. In the fourth step, the method 98 provides input files for Enrichment Map 19 for visualizing and reducing the redundant set of all 99 detected pathways to a narrower, focused network of biological themes. 100   Qpathway<0.05) (Figure 2a). We analyzed the omics evidence supporting predictions of enriched 120 pathways and found that most cohorts showed enrichments in pathways supported by protein-121 coding driver scores of genes (37/47 or 79%). This serves as a positive control since the majority 122 of currently known cancer driver genes have frequent protein-coding mutations. 123 Non-coding mutations in genes also contributed to the discovery of frequently mutated biological 124 processes and pathways: 24/47 cohorts (51%) showed significantly enriched pathways that were 125 apparent when only analyzing non-coding driver scores separately for UTRs, promoters or 126 enhancers. The majority of cohorts (41/47 or 87%) revealed enriched pathways that were 127 apparent in the integrated gene list but not in any gene lists ranked by element-specific driver 128 scores, emphasizing the value of our integrative approach. As expected, cohorts with more patient 129 tumor samples generated more significantly enriched pathways (Spearman ρ=0.74, P=2.3x10 -9 ; 130 Supplementary Figure 1), suggesting that larger datasets are better powered to distinguish 131 rarely mutated genes involved in biological pathways and processes. Discovery of pathways 132 enriched in non-coding mutations suggests that pathway analysis is an attractive strategy for 133 illuminating the dark matter of the non-coding cancer genome.

160
We studied the adenocarcinoma meta-cohort with 1,773 samples of 16 tumor types whose 161 integrated list of 432 candidate genes (unadjusted Pgene<0.1) associated with 526 significantly 162 enriched pathways (Qpathway<0.05) (Figure 2b). As expected, the majority of pathways were only 163 supported by genes with frequent coding mutations (328/526 or 62%). However, 101 pathways 164 were supported by both coding and non-coding gene mutations, 72 were only apparent in the 165 integrated analysis of all evidence, and 25 were only found among genes with significant non-166 coding mutations, thus expanding the set of candidate driver mutations in the non-coding cancer 167 genome and demonstrating the value of integrative pathway analysis. 168 The major biological themes with frequent protein-coding mutations included hallmark cancer 169 processes like apoptotic signaling pathway (24 genes; Qpathway=4.3x10 -5 ) and mitotic cell cycle (8 170 genes; Qpathway=0.0026), and additional biological processes such as chromatin modification and 171 RNA splicing that are increasingly recognized in cancer biology. Thus, our method captures the 172 expected cancer pathways among driver genes with protein-coding mutations as positive controls.
In contrast to these solely protein-coding driver associations, a large group of developmental 174 processes and signal transduction pathways was enriched in genes with coding as well as non-175 coding mutations; for example embryo development process was supported by mutations in 176 exons, 3'UTRs and gene promoters (68 genes; Qpathway=2.9x10 -12 ), while repression of WNT target 177 genes was only apparent in the integrated analysis of coding and non-coding mutations but not 178 in either alone (5 genes, Qpathway=0.016; REAC:4641265). Thus, our method evaluates 179 contribution of omics evidence towards pathway enrichments and finds additional associations 180 that are not apparent in any provided dataset. protein that regulates cell motility and morphology. The additional genes remained below the 194 FDR-adjusted significance cut-off in the gene-focused consensus driver analysis, however were 195 found by ActivePathways due to pathway associations with frequently mutated developmental 196 genes. These results highlight the potential of our method to find known and candidate cancer 197 genes with rare coding and non-coding driver mutations through pathway-driven data integration. genes that did not pass significance cut-offs of the integration procedure. Thus, ActivePathways 218 finds additional cancer genes in the long tail of mutations that are highlighted due to their pathway 219 associations but remain below the significance cut-off in the gene-by-gene analysis. 220 221

Benchmarking demonstrates the robustness and sensitivity of ActivePathways 222
We carefully benchmarked ActivePathways using multiple approaches. First, we compared its 223 performance with six diverse methods used in the PCAWG pathway and network analysis working 224 group 20 (Hierarchical HotNet 21,22 , SSA−ME 23 , NBDI 24 , induced subnetwork analysis 22 , 225 CanIsoNet [Kahraman et al, in prep] , and hypergeometric test). The methods used molecular pathway and 226 network information to analyze the PCAWG dataset of predicted cancer driver genes 14 , and a 227 subsequent consensus procedure derived pathway-implicated driver (PID) gene lists with coding 228 (PID-C) and non-coding (PID-N) mutations based on a majority vote. Our method recovered PID-229 C and PID-N gene lists with the highest accuracy: 100% of coding driver genes (87/87) and 85% 230 of non-coding candidates (79/93) were detected (Figure 2f). 231 We evaluated the robustness of ActivePathways to parameter variations and missing data. We 232 varied the parameter Pgene that determines the ranked gene lists used in the pathway enrichment 233 analysis (default threshold Pgene<0.1). The majority of cohorts (40/47 or 85%) retrieved 234 significantly enriched pathways even with a considerably more stringent threshold (Pgene<0.001), 235 however 67% fewer pathways were found compared to the default threshold in the median cohort 236 (Supplementary Figure 2). We then evaluated the robustness of ActivePathways to missing data 237 by randomly removing subsets of driver scores from the initial dataset. Even when removing 50% 238 of gene driver scores with P<0.001, the majority of cohorts (37/47 or 79%) were found to have at least one significantly enriched pathway however 66% fewer pathways were found on average 240 (Supplementary Figure 3). 241 We tested ActivePathways with data simulations through 1,000 datasets for each of 47 patient 242 cohorts and found no significant pathways in 92% of simulations (Supplementary Figure 4). 243 Simulated data were obtained by randomly reassigning driver scores to different genomic 244 elements, a conservative approach that disrupts gene and pathway annotations while retaining 245 strong scores in the data. The median family-wise false discovery rate across cohorts (7.2%) 246 slightly exceeded the applied multiple testing correction (Q<0.05). Higher rates were observed in 247 cohorts including melanoma tumors, potentially due to abundant promoter mutations caused by 248 impaired nucleotide excision repair in protein-bound genomic regions 25 . We evaluated quantile-249 quantile (QQ) plots of pathway-based p-values from ActivePathways and found that p-values from 250 observed gene scores often deviated from the expected uniform distribution and appeared 251 statistically inflated (Supplementary Figure 5). However, p-values derived from simulated gene 252 scores showed no inflation in our simulations. Anticipating that the strongest cancer driver scores 253 associate with protein-coding sequence, we studied datasets with simulated protein-coding gene 254 scores and true non-coding scores. As expected, these partially simulated datasets expectedly 255 showed less p-value inflation, suggesting that highly significant known cancer genes involved in 256  (Figure 3d).

DUSP1 encodes a phosphatase signaling protein of the MAPK pathway that is over-expressed in 305 malignant breast cancer cells and inhibits apoptotic signaling 31 . HER2 over-expression is known 306
to suppress apoptosis in breast cancer 30 . Anti-apoptotic signaling is a hallmark of cancer and 307 expectedly associated with worse patient prognosis. 308 ActivePathways also identified prognostic pathway associations in single subtypes of breast 309 cancer. For example, the prognostic genes for luminal-B subtype were enriched for chromosome 310 segregation (QluminalB=0.017, 41 genes) and related biological processes of GO. In agreement with 311 this finding, problems with chromosome segregation have been associated with worse outcome 312 in breast cancer 32 . As another example, luminal-A breast cancers were associated with prognosis 313 in ribosomal and RNA processing genes, such as ribosome biogenesis (QluminalA=6.9x10 -10 , 60 314 genes), and rRNA metabolic process (QluminalA=1.8x10 -13 , 64 genes). Although not described 315 specifically in the luminal-A subtype, ribosomal mRNA abundance has been shown to be 316 prognostic in breast cancer as a marker of cell proliferation 33,34 . In summary, ActivePathways can 317 be used for integrating clinical data with multi-omics information of molecular alterations. Such 318 analyses can provide leads for functional studies and biomarker development.

Co-expression analysis of Hippo master regulators across 54 human tissues recovers 333 associated biological processes and genes 334
To study the use of ActivePathways in the context of healthy human tissues, we analyzed the 335 dataset of 11,688 transcriptomes of 54 tissues from the GTEx project 5 , focusing on the Hippo 336 signaling pathway involved in organ size control, tissue homeostasis and cancer 35  Integrative pathway enrichment analysis helps distill thousands of high-throughput measurements 374 to a smaller number of pathways and biological themes that are most characteristic of the 375 experimental data, ideally leading to mechanistic insights and novel candidate genes for follow-376 up studies. The primary advantage of our method is the fusion of gene significance across multiple omics datasets. This allows us to identify additional pathways and processes that are not apparent 378 individually in any analyzed dataset. In our example of cancer driver discovery, pathway analysis 379 is complementary to gene-focused driver discovery as it also focuses on sub-significant genes 380 with coding and non-coding mutations clustered into known and novel biological processes of 381 cancer. In the clinical analysis of breast cancer subtypes, we find prognostic genes and pathways 382 active in tumor cells, the microenvironment, or both. A subset of these findings, such as anti-383 apoptotic signaling, is only apparent through data integration. 384 Our general pathway analysis strategy is applicable to diverse kinds of omics datasets where 385 well-calibrated p-values are available for the entire set of genes or proteins. One may study a 386 series of genomic, transcriptomic, or proteomic experiments or combine these into a multi-omics 387 analysis. Data from epigenomic experiments and genome-wide association studies can be 388 analyzed after genome-wide signals have been appropriately mapped to genes. Clinical and 389 phenotypic information of patients can be also included through association and survival statistics. 390 Our method is expected to work with unadjusted as well as multiple-testing adjusted p-values, 391 however it is primarily intended for un-adjusted p-values for increased sensitivity. P-value 392 adjustment for multiple testing is conducted at the pathway level rather than at a gene level. P-393 values from omics datasets are easier to interpret than raw signals as gene- ActivePathways were validated with two lists of cancer genes. Predicted drivers from the gene-500 focused PCAWG driver analysis 14 were selected as statistically significant findings (Q<0.05) 501 following a stringent multiple testing correction spanning all types of elements (exons, UTRs, 502 promoter, enhancers). The curated list of known cancer genes was retrieved from the COSMIC 503 Cancer Gene Census (CGC) database 12 . One-tailed Fisher's exact tests were used to estimate 504 enrichment of these genes using all protein-coding genes as background. 505 Analysis of prognostic genes in breast cancer. ActivePathways was used to evaluate 506 prognostic pathways in breast cancer using multiple types of omics data. mRNA gene expression data and gene copy number alteration (CNA) data of the were derived from the METABRIC cohort 508 of 1,991 patients with a single primary fresh frozen breast cancer specimen each 26  dichotomized based on mRNA abundance. Dichotomization was either based on the median 518 mRNA abundance for that gene or a fixed value of 6.5. Based on the mRNA abundance 519 distribution of genes on the Y chromosome in female samples, 6.5 was estimated as the threshold 520 for noise for non-expressed genes. Median dichotomization was used if the median was above 521 6.5 or if there were no events in one of the groups when dichotomizing based on 6.5. The high 522 and low mRNA abundance groups were compared by univariate log-rank tests for overall survival. 523 TC and TAC mRNA abundance were evaluated independently. Survival modelling was performed 524 in the R statistical environment (v3.4.3) using the survival package (v2.42-3). The CNA univariate 525 survival analysis was conducted as follows. For each gene, we assessed whether more gains or 526 losses were apparent. The copy number status with a higher count was subsequently used to 527 separate patients into two groups: those with the chosen copy number status and the remaining 528 patients. The two groups were then used for overall survival modelling with log-rank tests in the 529 R statistical environment (v3.4.3) using the survival package (v2.42-3). individual tissues and ranked by statistical significance of correlation tests. Tissue-specific ranked 537 correlations of target genes were then integrated into two master lists of target genes of YAP and 538 TAZ, respectively, reflecting target genes that were consistently positively co-regulated with 539 corresponding transcripts across a significant subset of considered human tissues. We used the 540 robust rank aggregation (RRA) method developed by Kolde et al 37 and filtered co-expressed genes by significance using the default parameters of RRA (Qgene<0.05). Significantly enriched 542 pathways among the putative target genes of YAP and TAZ were detected using ActivePathways. 543 We validated the pathways by investigating their agreement with known Hippo-related genes from 544 recent review papers 35,36 . We tested each pathway for enrichment of literature-derived Hippo 545 genes using Fisher's exact tests and filtered significant findings after multiple testing correction 546 (Q<0.05). 547 Method benchmarking. We benchmarked ActivePathways using multiple approaches, including 548 simulated datasets, parameter variations, and partial replacement of strong scores with missing 549 values. Benchmarking was carried out with the PCAWG dataset of coding and non-coding cancer 550 driver predictions. To evaluate false discovery rates of ActivePathways, we created simulated 551 datasets by randomly reassigning all observed driver scores to random genes and genomic 552 elements. Simulations were conducted separately for different tumor cohorts. One thousand 553 simulated datasets were analyzed with ActivePathways and those with at least one significantly 554 detected pathway counted towards false discovery rates. Additional simulations maintained the 555 positions of non-coding driver scores among gene scores and randomly reassigned protein-556 coding driver scores, expectedly leading to a reduction in detected pathways as the input datasets 557 primarily included strong scores in protein-coding gene regions. Quantile-quantile analysis and 558 QQ-plots were used to compare p-value distributions of pathways discovered from true driver 559 scores, driver scores with shuffled driver scores, and driver scores shuffled entirely. To evaluate 560 robustness of ActivePathways, we randomly replaced a fraction of significant driver p-values in 561 input matrices (P<0.001) with insignificant p-values (P=1). We tested different fractions of missing 562 values (10%, 25%, 50%) across a thousand datasets of driver scores with randomly selected 563 missing data points and concluded that most cohorts included significantly enriched pathways 564 even with large fractions of missing data. To further evaluate robustness, we tested different 565 values of the Brown P-value threshold used to select the integrated gene list for pathway 566 enrichment analysis. The default parameter value (Pgene<0.1) was compared to alternative values 567 (0.001, 0.01, 0.05, 0.2). We concluded that ActivePathways found enriched pathways in most 568 tumor cohorts even at more stringent gene selection levels. 569 Availability. ActivePathways is freely available as an R package and source code on the GitHub 570 repository https://github.com/reimandlab/ActivePathways and the Comprehensive R Archive 571 Network (CRAN). 572 573