Robust and clinically relevant prediction of response to anti-cancer drugs via network integration of molecular profiles

In order to tackle heterogeneity of cancer samples and high data space dimensionality, we propose a method NEAmarker for finding sensitive and robust biomarkers at the pathway level. In this method, scores from network enrichment analysis transform the original space of altered genes into a lower-dimensional space of pathways, which is then correlated with phenotype variables. The analysis was first done on in vitro anti-cancer drug screen datasets and then on clinical data. In parallel, we tested a panel of state-of-the-art enrichment methods. In this comparison, our method proved superior in terms of 1) universal applicability to different data types with a possibility of cross-platform integration, 2) consistency of the discovered correlates between independent drug screens, and 3) ability to explain differential survival of treated patients. Our new in vitro screen validated performance of the discovered multivariate models. Finally, NEAmarker was the only method to discover predictors of both in vitro response and patient survival given administration of the same drug.


130
The placement of three cancer cell lines HuH-7, NCI-H684, and RT-112 in a 2-dimensional space of pathways 'PPAR signaling' 131 and 'WNT signaling' (KEGG#03320 and KEGG#04310) (A, B, C, and D) or, alternatively, in a space of two key genes from these 132 pathways (E) was done by using cell line-specific altered gene sets, AGS, which originated from transcriptomics data and 133 contained 226, 143, and 48 member genes, respectively (AGS of class significant.affymetrix_ccle).

134
A. ORA: enrichment of the three AGSs was analyzed against the two pathways (or, more generally, functional gene sets, FGS) 135 using the overrepresentation analysis. The pathway enrichment scores were calculated from overlap between the gene sets.

136
For clarity we here denote the pathway size NPW which corresponds to NFGS elsewhere in the article. Due to the relatively small 137 gene sets sizes (NPW and NAGS), a noticeable (N∩ > 1) and significant overlap was observed in just one out of six cases, which 138 could limit the ORA sensitivity.

143
The fold change values determine relative influence of the pathway genes.   The methods in Figure 1 implied different input, processing and output (Table 1). Accordingly, our data 158 analysis procedure included the method-specific steps for sample/patient characterization, enrichment 159 analysis, and phenotype modeling. In order to maximally adapt GSEA to our applications, we tested two  (Table 2).
163 Table 2. Steps of analysis using alternative methods from Table 1.

165
Step What was evaluated Figure Fig. 2F). The high score for MPP89 ( Fig. 2A) [31]. Since FDCD itself is a member of the "One carbon pool by folate", 191 the pathway could be, in principle, detected by another enrichment algorithm. But how have the alternative 192 tested enrichment methods dealt with this pattern? Any noticeable correlations were absent. This might be 193 explained by the fact that FDCD was the only consistently deregulated gene out of the whole pathway, which 194 was a challenging situation for each of these methods. ORA is not well fit for cases of such an overlap (N=1).

195
In GSEA, enrichment via a single highly ranked list member is usually not detectable. In its turn, SPIA could 196 not gain enough statistical power in absence of consistent (adjoining) patterns of dysregulation in multiple 197 genes. Finally, expression of FTCD itself did not significantly correlate with methotrexate sensitivity in CCLE 198 and CGP transcriptomics datasets. More broadly, we did not find any genes of the "One carbon pool by 199 folate" and the adjoining pathway KEGG#00790 "Folate biosynthesis" which would significantly (by requiring 200 q-value <0.05) correlate with methotrexate sensitivity at either gene expression or somatic mutation levels.

201
For comparison, AGS of the other resistant cell line, ECGI10, did not share any genes with the target 202 pathway (Fig. 2B), although still received a higher NEA score. In this case, the summarized connectivity was 203 not dominated by a single network node of the AGS or of the FGS. Here the drug resistance could potentially 204 have been mediated by the DNA repair protein XRCC5 or by the adenosylhomocysteine hydrolase AHCY, 205 which were earlier reported to be implicated in methotrexate resistance [32] and folate metabolism [33], 206 respectively. Unlike the upregulated FTCD in MPP89, both these genes were strongly downregulated in 207 ECGI10. This emphasizes another feature of NEA: genes may be included into AGS due to alterations in an 208 arbitrary direction, i.e. both over-and under-expression, hyper-and hypo-methylation, increased and 209 decreased copy number etc. Therefore higher and, respectively, lower NEA scores cannot be traditionally 210 interpreted as activation or suppression of the given pathway (FGS) but rather indicate a general 'pathway 211 perturbation'. Hence the pathway "One carbon pool by folate" was unperturbed in the low-scoring cell lines 212 A2780 and RS411, i.e. the latter did not exhibit features that could connect specifically to the pathway. Figure 2 Network enrichment analysis of four cell line AGSs with differential response to methotrexate.

215
While using AGSs of class significant.affymetrix_ccle, the response of cancer cell lines to methotrexate in 216 CGP screen correlated with NEA scores (pane F) in regard to FGS "One carbon pool by folate" (pane C

227
In order to analyze data from the in vitro cancer cell screens and the primary tumor samples (TCGA) in the 228 same manner, we constructed AGSs by following the same platform-specific approaches. Intuitively, having 229 an AGS that is too big or too small could deteriorate specificity or sensitivity of NEA. Therefore, in order to 230 prove that differences are not due to selecting AGS genes in a specific way, we tested and compared a

241
An alternative to using gene copy number data would be to account for respective mRNA expression levels.

242
While this approach is subject of ongoing discussion, we have observed [34] that many known copy 243 number drivers did not exhibit this correlation and therefore we decided not to filter copy number data by the

259
Applying this approach to the in vitro drug screen data, we evaluated features of different types and classes.

260
Respectively, in TCGA data we measured correlations of features with survival of patients who received one 261 of the 42 frequently used drugs in any of the eight cohorts. We systematically compared different feature 262 types, i.e. original data from high-throughput platforms and NEA scores as well as classes within the types 263 (e.g. transcriptomics data from Affymetrix vs. Agilent vs. RNA sequencing). We also analyzed the relative 264 performance of different AGS classes.

265
Overall, the NEA scores at both pathway level (PWNEA) and individual gene node level (GNEA) contained 266 either approximately the same or larger amounts of information on drug sensitivity compared to the original 267 gene profiles (Fig. 3). In the drug screen data analysis, the ORA, PWNEA, and GNEA features performed 268 apparently better than the respective original point mutation, gene copy number, and gene expression data.

269
In the TCGA data analysis, the advantage of PWNEA and GNEA over both ORA and original gene profiles 270 for particular drugs was even more pronounced, although not always overall significant. Among the platforms 271 for the in vitro screens, Affymetrix data by far outperformed mutation data, copy number, and combined 272 AGSs. In TCGA datasets, RNA sequencing performed better than Affymetrix (the former data was not 273 available for the cell lines). In general, transcriptomics datasets much more frequently manifested 274 correlations with drug sensitivity than gene mutations and copy number datasets (Suppl. Fig. 3). While this 275 observation is not new [9], the most obvious explanation should be that most of the genome alterations 276 were insufficiently frequent for the statistical tests. As an example, less than 10% of the genes in the BRCA 277 cohort had point mutations in more than 1% of the tumors and therefore the analysis did not gain enough 278 statistical power. mRNA expression profiles were, on the contrary, available for most of the genes.We also  Kolmogorov-Smirnov test as in Figure 3 on the gene copy number and expression datasets for cell lines and 286 TCGA samples (Suppl. Table 1). This evaluation, however, did not lead to an unequivocal conclusion. In the cell line datasets, the fixed size AGSs performed significantly better, while in the TCGA datasets the situation 288 was rather opposite. Figure 3. Comparison of the potential performance of different features, methods, and data types.

290
The top 5 boxplot rows (labeled ".kegg") present results obtained using the limited set of 197 KEGG pathways using gene 291 expression data (label "GE"). Next, since ORA, PWNEA, and GNEA did not require intra-pathway topology and could 292 accept any data type, the rest of boxplots present tests on the full set of 328 FGS (the respective PWNEA and ORA values 293 for GE might differ, since the KGML gene sets were somewhat different from the core KEGG version).

294
Each boxplot element combines correlation values between either features for a given class (labeled at the vertical axis) 295 and for either the in vitro response to drugs in the three screens (left pane; in total 365 tests of 320 distinct drugs) or for the 296 survival of patients who had been treated with drugs (right pane; 42 drugs in the eight TCGA cohorts). As an example, we

306
(original data, ORA, PWNEA, and GNEA as grey, brown, dark green and bright green, respectively) and by data type (point 307 mutations, PM; copy number alterations, CN; and gene expression, GE). The p-value is shown when a category produced 308 significantly (p<0.001 by Kolmogorov-Smirnov test) more non-zero patterns (i.e. fractions with FDR<0.1) than the 309 respective baseline category (labeled "BL"). The boxes contain data points within 25-75th percentile intervals (i.e. between 310 quartiles Q1 and Q3). The maximal whisker length, MWL, is defined as 1.5 times the Q1-Q3 interquartile range (i.e. the box 311 length). Whiskers can extend to either the MWL or the maximal available data point when the latter is below MWL. Markers 312 thus correspond to data points that extend off the box by more than the MWL value.

314
While the original features manifested considerable correlations in a number of classes, fractions of 316 significant correlations were largely inferior when compared to NEA classes. In general, the different 317 methods could be ranked by potential sensitivity in the following order: original gene profiles < [either ORA or 318 ZGSEA] < [either AGSEA or SPIA]< [either PWNEA or GNEA]. However even upon adjustment for multiple 319 testing, we did not draw ultimate conclusions from significance of these correlations. This exploratory 320 analysis only informed us on the relative Type II error rates (i.e. sensitivity, or statistical power to detect 321 correlation), suggesting that multiple alternative methods and data types were potentially predictive of drug 322 sensitivity. In order to evaluate robustness of these predictions we proceeded to the validation step as 323 described below. We also note that only ORA, PWNEA, and GNEA could provide means for integrating 324 omics data from different platform (by simple merging of AGS lists), where ORA was apparently inferior.

348
For each drug shared by any two of the three in vitro drug screens (in total 47 cases), we calculated rank correlation 349 between drug-feature rank correlation coefficients in the two screens.

394
The predictive models for four compounds tested in the published CTD screen were validated in our ACT screen.

395
Elastic net models were built under multiple cross-validation inside the training set (columns 1 and 3, blue) and then 396 tested on non-overlapping sets of cell lines of the ACT screen (columns 2 and 4, green). Input variables were either

406
Overall, the performance of the original profile models on the validation sets appeared comparable to that of 408 PWNEA. However importantly, the former had much more freedom in model term selection since the initial 409 feature space was around two orders of magnitude larger than that in PWNEA. Consequently, despite the   Table 3). Therefore we coupled this calculation with a significance test 440 by randomly permuting feature and sample labels. Altogether, the permutation tests indicated that point 441 mutation and copy number data had zero true discovery rates (TDR), i.e. their correlation p-values were 442 preserved not more than expected by chance (see column 3 in Table 3). On the contrary, the TDR levels 443 were substantial (0.02…0.805) for gene expression data and for AGSs processed with each of the 444 enrichment analyses.

445
At the next step (remaining columns of Table 3) we calculated the numbers of significant cases that would 446 also be practically usable, i.e. had both lower p-values (<0.001) and rank correlation values higher than 0.2.

447
No such cases were identified in the gene expression data. ORA, PWNEA, and GNEA yielded 0.8%, 3.5%,   The purpose of this analysis was to prove systematically significance of the produced correlates, and we 464 reiterate that using the original data did not seem efficient: although many transcriptomics profiles correlated 465 with drug sensitivity, those patterns could not be traced back to the in vitro screens.

466
Most of the consistent NEA features were obtained for AGS based on gene expression data (Suppl. Table 2).

467
They were identified for docetaxel, gemcitabine, and paclitaxel in BRCA (see the cancer cohort notation in 468 Table 3

475
The cancer emergence and progression were earlier linked to tissue inflammation through the NOD-like 476 receptor signaling [40] . We found that the corresponding pathway score correlated with survival in ovarian 477 carcinoma patients treated with topotecan (Fig. 6A).
478 Figure 6. Clinical performance of NEA features discovered in drug screens.

480
Each TCGA cohort was split into four categories by two factors: 1) administration of the specific drug and 2) predictive 481 feature value (pathway or individual gene score, indicated in the plot header), each above and below a threshold. The 482 primary feature evaluation employed p-values calculated in the continuous score space, i.e. without splitting the 483 patient cohort into binary classes by factor (2). Then the binary classifications by the both factors were used for 484 visualization (as "treated/untreated" for the drug and at the quantile "optimal threshold" value for the quantitative NEA  ) and with overall survival of OV patients (p(H 0 )=0.003; Fig. 6D).

514
This setup could not eliminate possible confounding effects from multi-drug treatment history and clinical 515 factors that might determine administration of specific drugs. Nonetheless, the NEA scores apparently 516 explained the differential sensitivity to anti-cancer drugs in a much more robust and efficient manner than the 517 original data.

518
A visual inspection of the survival curves in Fig. 6 sheds light on usefulness of these tentative biomarkers in 519 a clinical setting. As an example, in a 1-year survival perspective, relative risks (RR) would either increase 520 (Fig. 6A,C) or decrease (Fig. 6B,D) given higher NEA scores of the patient samples. By using this fixed 521 follow-up interval and the cohorts of limited size, the confidence intervals at the 95% level would be rather our analysis these methods demonstrated performance comparable to that of NEA ( Fig. 3 and 4).

562
A common problem of method benchmarking is the unavailability of ground truth. In our case, too, we did not  Table 1).

578
Further in the analysis of consistency in vitro versus clinical results, these classes were almost equally 579 represented (Suppl. Table 2). We have also seen differences between different filtering approaches in AGSs 580 of classes significant.mini and significant.maxi (Suppl. Fig. 2). Therefore an issue to be

589
The statistical power of NEA was obviously far from full. As an example, there were 13 drugs for which the 590 numbers of tested cell lines and patients treated in TCGA cohorts were sufficient for a significant estimation.

591
For four drugs out of these 13, no reliable correlates could be found. One instructive example could be 592 irinotecan, prescribed to 25 and 22 patients in COAD and GBM cohorts, respectively. The interesting feature 593 of irinotecan is that its pharmacokinetic pathway involves the same enzymes as that of gemcitabine ( Fig.   594 6B), namely CES1, CES2, CYP3A4, CYP3A5 and some others 595 (https://en.wikipedia.org/wiki/Irinotecan#Interactive_pathway_map)although the enzymes here work in an opposite direction: they activate irinotecan rather than degrade as they do to gemcitabine. Nonetheless, 597 relevant GNEA scores might have been informative for response to irinotecan. The patients' response was 598 sufficiently differential, too: while all the irinotecan-treated patients relapsed, the time to relapse varied from 599 78 to 1265 days. However, we did not observe almost any sensible correlation of the pathway genes neither 600 as GNEA features nor as raw gene expression profiles. In the GNEA framework, this elucidated a lack of 601 network linkage between the AGSs of responders (or non-responders) to the irinotecan pathway.

602
Further, our FGSs were created by third-party sources and never meant to be used in NEA. Thus, another 603 step for NEA-based biomarker discovery would be the compilation of novel, specifically optimized FGSs.

604
Ultimately, one could compile de novo pathways -similarly to the approach by

CTD screen
The authors (Basu et al., 2013) [27] mainly used areas under curve (AUC) for their quantitative analysis of 667 203 drugs in 242 cell lines. We reproduced this approach in our study. In completely insensitive cases, the 668 full area under eight experimental points reached 8, whereas 0 stood for full sensitivity. Thus, the scale of 669 this screen was inverted compared to the other screens, which was considered in all calculations.   Following the same approach, we employed TCGA data on somatic point mutations reported in MAF files.

694
The column 'Variant_Classification' contained a number (more than 15) different codes, most frequent being 695 Missense_Mutation, Nonsense_Mutation, and Silent. Since the latter constituted around 25% of the total 696 number of somatic mutations reported in the eight cancers -which would not significantly affect the false 697 positive and true discovery rates -we analyzed records with any such codes as potentially associate ed with 698 drug response.

700
We evaluated a number of existing multivariate, enrichment-based, and/or network analysis methods that

746
Even though individual enrichment scores can be correlated with phenotypes, they have still been rarely 747 used in predictor models. In the case of ORA and GSEA, the major reason was that the enrichment is mostly 748 detectable for large FGSs (hundreds to thousands genes), but such are unlikely to characterize functional 749 differences between tumors -while compact, specific, and discriminatory gene sets tend to escape their 750 limits of statistical power. Nonetheless, Drier et al. [68] have explored cancer cohorts with pathway-level 751 sample scores derived from gene expression data in a quantitative way and found that certain sample clusters can be associated with patient survival. On the other hand, the network-based methods have been 753 developed only recently and are therefore 'too young' to have been exploited fully. Above, we have also 754 mentioned the network-based regularization of multiple regression models where inclusion of gene terms into 755 the models was essentially coupled to their co-expression.

756
We finally decided to include in our testing, in parallel with PWNEA (pathway level NEA) and GNEA (gene 757 node level NEA), the following methods:

774
[69]), mostly due to the latter broadly using prokaryotic evidence and therefore less specific in cancer-related 775 analyses. The second conclusion from the benchmark was that adding to the FunCoup network edges of 776 curated databases significantly improved its performance. We therefore added the FunCoup-based network

816
WikiPathways and literature). Another approach was applied to enable compatibility with GSEA and SPIA.

817
These methods were designed and are most suitable for analyzing expression data and, apart from that,

818
SPIA was applicable only to pathways with well characterized intra-pathway topology. We therefore   Node connectivity values (numbers of all edges for each given node) were pre-calculated by the algorithm in 833 advance, given the input network. Then N AGS and N FGS reported the sums of connectivities of member nodes of AGS and FGS, respectively, and N total was the number of edges in the whole network. Since it was models. We thus included only covariates most likely associated with the disease prognosis, such as tumor 916 degree, pathological tumor stage, immunohistochemical statuses in BRCA, Gleason score in PRAD,

917
Karnofsky score in GBM (Suppl . Table 4). Next, we reasoned that when the association "feature -drug 918 response" truly exists, we should observe it specifically in the patients who did receive the drug in the given 919 TCGA cohort. Our survival models of the form 920 contained, apart from the covariates C 1 …C k and the residual term ε, main effects "drug" D and "feature" F as 921 well as the interaction term D*F. A significant main effect of a drug could be interpreted as patients' benefit in 922 total and irrespective of the feature value, e.g. regardless of a gene mutation, or a gene expression, or a 923 NEA-based pathway score. Conversely, a significant feature effect indicated that the feature correlated with 924 survival directly, i.e. no matter if the drug was administered or not. Finally, significance of the interaction 925 indicated efficacy of the drug specifically in patients with feature values either above or below a threshold, so 926 that respective patterns could be explained by neither of the main effects. The interaction term was thus 927 central for our purpose of detecting drug-feature correlations, whereas the significance of main effects of 928 "feature" and "drug" was allowed although not required. As an example, a feature may or may not exhibit a 929 significant correlation with survival in patients who did not receive the drug.