Pan-cancer characterisation of microRNA across cancer hallmarks reveals microRNA-mediated downregulation of tumour suppressors

microRNAs are key regulators of the human transcriptome across a number of diverse biological processes, such as development, aging and cancer, where particular miRNAs have been identified as tumour suppressive and oncogenic. In this work, we elucidate, in a comprehensive manner, across 15 epithelial cancer types comprising 7316 clinical samples from the Cancer Genome Atlas, the association of miRNA expression and target regulation with the phenotypic hallmarks of cancer. Utilising penalised regression techniques to integrate transcriptomic, methylation and mutation data, we find evidence for a complex map of interactions underlying the relationship of miRNA regulation and the hallmarks of cancer. This highlighted high redundancy for the oncomiR-1 cluster of oncogenic miRNAs, in particular hsa-miR-17-5p. In addition, we reveal extensive miRNA regulation of tumour suppressor genes such as PTEN, FAT4 and CDK12, uncovering an alternative mechanism of repression in the absence of mutation, methylation or copy number changes.

gene signatures generally consist of a set of tens to 11 several hundred coding genes, for which a summary 12 metric of their collective expression is associated 13 with a known hallmark, and may help with defining 14 therapeutic strategies [3]. Encapsulated within this 15 methodology and these signatures is a vast amount 16 of biological discovery for particular genes impli-17 cated in the development and progression of these 18 hallmarks. However, since the more recent publica- 19 tion of the updated hallmarks in 2011 [25], there has 20 been a second revolution in the field of genomics; 21 namely, the discovery of the diverse, critical roles of 22 non-coding RNAs in cancer. 23 Previously thought to be 'junk DNA,' non- 24 coding RNA are those RNA derived from DNA 25 that do not code for proteins, and consists of a di- 26 verse family of evolutionarily conserved species, in-27 cluding long non-coding RNA (lncRNA), circular 28 RNA (circRNA), and microRNA (miRNA), among 29 others [23,40,41]. Much effort has focused on 30 the characterisation of these non-coding RNA, and 31 early work has shown that these species, particularly 32 miRNA, are involved in a number of cellular de- 33 velopmental, and differentiation processes [50]. In 34 addition, miRNA have been implicated in a num- 35 ber of human diseases, ranging from diabetes to 36 cancer, and in oncology, recent work has led to 37 the discovery of tumour-suppressive and oncogenic 38 miRNA [7,42,44,49]. miRNA exert their function 39 within the cell primarily as repressors of protein 40 production, functioning as post-transcriptional reg-41 ulators of mRNA, inhibiting translation or encour-42 aging transcript degradation. miRNA exert their 43 effects by complementary base-pair binding to a 44 short 7-8 nucleotide 'seed region' typically located 45 on the 3' untranslated region of the messenger RNA 46 which they inhibit [40]. A single miRNA is thought 47 Our results point towards a scenario wherein the 94 trancriptome of the cancer cell, known to be driven 95 by dysregulation of tumour suppressor genes and 96 oncogenes, is heavily regulated by miRNAs. We 97 show that predicted miRNA-target associations that 98 retain significance across multiple cancer types in-99 volve a number of critical tumour suppressor genes 100 and oncogenes. Study of these tumour suppressor 101 genes yields novel conclusions about their regula-102 tion, particularly with respect to their repression by 103 miRNA, methylation and mutation, and the exclu-104 sivity of the occurrence of these modes of regulation 105 across human cancers. 106

107
Evaluation of Hallmark gene signatures 108 across cancers 109 The first prerequisite to our study was to identify 110 suitable biomarkers to infer cancer phenotype. In or-111 der to achieve this, we chose 24 previously identified 112 gene expression signatures (Supplementary S1) that 113 have already been shown to be representative for a 114 wide number of samples, and a number of fundamen-115 tal phenotypic properties, with the hopes of alleviat-116 ing issues related to highly tissue-specific expression 117 patterns. With this in mind, we applied sigQC, an 118 R package encapsulating a robust methodology for 119 the evaluation of gene signatures on various datasets 120 for the basic statistical properties underlying their 121 applicability [16]. We ran this package on all combi-122 nations of 15 datasets and 24 signatures considered 123 in this study, and tested the consistency of signature 124 performance across cancer types, giving confidence 125 in the application of the signatures to these datasets. 126 All summary plots from the sigQC quality control 127 protocol are presented in Supplementary Section S2. 128 Each of the signatures considered over the 15 epithe-129 lial cancer datasets showed good applicability, strong 130 signature gene expression, moderate-strong compact-131 ness, and good gene signature score variability, as 132 well as strong autocorrelation of signature metrics. 133 The previous validation of these signatures, and our 134 study-specific quality control results, justify our sub-135 sequent use of these signatures in a pan-cancer man-136 ulatory network 141 To determine the association of gene signatures to 142 miRNA expression, we set the signature score (see 143 Online Methods) for each signature equal to a linear  To verify the validity of these predictions, we 176 considered the example case of miRNAs found to 177 associate significantly with the hypoxia signatures 178 considered. Hypoxia is one of the most studied mi-179 croenvironmental perturbations in the context of 180 miRNA regulation, and one with a very well-defined 181 pathway, controlled largely by a single transcrip-182 tion factor, HIF-1α [48]. Taking the intersection 183 of the sets of miRNAs found to associate positively 184 with the two previously validated hypoxia gene sig-185 natures (Hypoxia, Buffa et al. [5], and Hypoxia, 186 MSigDb [34]), we obtained high confidence predic-187 tions for hypoxia-associated miRNAs.

202
In the context of all gene signatures considered, 203 we identify a global underlying 'map' connecting 204 each miRNA to each gene signature with which we 205 have found an association. As shown in Figure 1d, 206 this is a highly interconnected and complex network, 207 with the conservation of a core set of miRNAs shared 208 across the hallmarks of cancer. A similar analysis re-209 veals an analogous result for the miRNA-hallmarks 210 network for the miRNA negatively associated with 211 both signatures, as described in Supplementary Sec-212 tion S4. To validate the reproducibility of these re-213 sults, we rebuilt the signature-miRNA linear model 214 using a large independent dataset, the Metabric breast 215 cancer cohort [13]. The miRNA identified as posi-216 tively and negatively associated with gene signatures 217 in this dataset show highly significant concordance 218 over a majority of signatures with the correspond-219 ing miRNAs identified from analysis of the TCGA 220 dataset (Supplementary Figure ??, Section S5). 221    ilies [28,29]. Interestingly, grouping the miR- was p = 0.017, again suggesting strong significance 287 in the enrichment for TSG among the repressed 288 targets of the signature-associated miRNA. This sug-289 gests that miRNA-mediated repression of tumour 290 suppressor genes may be relatively common, signifi-291 cant, and associated with the phenotypic hallmarks 292 of cancer.

293
A different picture emerged upon repeating this 294 analysis for oncogenes, and for the miRNAs found 295 to be significantly negatively associated with one 296 or more hallmark signature. We identified 1283 sig-297 nificantly anti-correlated miRNA-target pairs for 298 these downregulated hallmark-associated miRNAs. 299 Likewise, analysing all predicted miRNA-oncogene 300 interactions among the 231 COSMIC oncogenes, 301 there were only 2 showing significant anticorrela-302 tion across tumour types with their predicted target 303 miRNA (ESR1 and ABL2). Taking the intersection 304 of these lists of 2 COSMIC oncogenes and the 1283 305 miRNA-oncogene pairs associated with gene signa-306 tures identified only ESR1 (interacting with miR-18a-307 5p and miR-130b-3p) in common (p = 1.2 · 10 −5 , 308 Fisher's exact test). This suggests that ESR1, estro-309 gen receptor alpha, may play a significant role across 310 the hallmarks of cancer, and de-repression by reduc-311 tion of its miRNA-mediated repression may play 312 a role in cancer phenotype, and ultimately, onco-313 genesis [35,52]. On the other hand, this result is 314 also a strong negative control for our analysis, and it 315 concurs in supporting the common oncogenic role 316 of miRNAs via co-ordinated repression of tumour 317 suppressor genes.

318
A core set of tumour suppressor genes are 319 associated with the hallmark gene signatures 320 across cancer types 321 Next, we asked whether our results could be bi-322 ased by the initial selection of miRNA, namely the 323 ones associated with the cancer hallmarks. To an-324 swer this, we conducted a complementary analy-325 sis, namely we sought to determine which of the 326 miRNA-mediated tumour suppressor genes showed 327 significance in downregulation, in the context of 328 all other tumour suppressor genes. Thus, we re-329 peated the previous analysis extended to all predicted  Across signatures, an overlap of on average 54% was 375 observed for signature-associated miRNA, showing 376 high statistical significance for miRNA positively 377 and negatively associated with signatures (p < 10 −19 378 in all cases, by Fisher's exact test).

379
Examining the targets of these positively 380 signature-associated miRNA from normal tissues, 381 we identified 233 recurrently negatively correlated 382 miRNA-target pairs, of which two contain miRNA-383 TSG pairs (CEBPA and NCOA4). However, this 384 overlap of the 142 unique genes among the 233 385 miRNA-target pairs with the 141 COSMIC tumour 386 suppressor genes does not show significance, and 387 may be due to chance alone (p = 0.26 by Fisher's 388 exact test). Thus, while the biology captured by 389 the phenotypes of the gene signatures may be con-390 sistent, more than chance alone would predict, be-391 tween tumour and normal samples, the resultant 392 miRNA-target interactions are significantly differ-393 ent, and miRNA-TSG enrichment is not retained 394 among normal tissue samples, highlighting the con-395 text dependency of these associations.

396
Analysis of modes of regulation confirms 397 that copy number and mutational status are 398 key determinants of TSG expression 399 With a set of TSG purported to be significantly regu-400 lated by miRNA in relation to phenotype identified, 401 we next sought to characterise the determinants of 402 their expression. In particular, we consider an ap-403 proach integrating multiple lines of genomic infor-404 mation; namely, methylation status, copy number, 405 miRNA expression, and mutational status (see Meth-406 ods), with the linear model depicted in Figure 3a. 407 Notably, when considering the impact of miRNA in 408 this model, we considered all reported miRNA to po-409 tentially discover novel miRNA-target interactions. 410 We then fit this model with penalised linear regres-411 sion over the various cancer types, and then subse-412 quently aggregated coefficients by the rank product 413 statistic to identify recurrently positive and nega-414 tive predictors across cancer types, for each of the 415 8 tumour suppressor genes identified in Figure 2e.
Common, signature-associated, significantly negatively regulated TSG

428
Likewise, the identified modes of negative regula-429 tion give expected results, with non-sense mutations 430 and frame shift deletions consistently negatively as-431 sociated with TSG mRNA expression. Further, be-432 cause this analysis was done with all miRNA, and 433 not just those predicted to have a given TSG target, 434 these results may implicate novel miRNA-TSG in-435 teractions. The complete rank product tables and 436 all autocorrelation matrices can be found in Supple-437 mentary Section S8. Next, we sought to identify associations with tu-492 mour molecular subtypes, and as an initial analy-493 sis chose the molecular subtypes of breast cancer, 494 owing to both the well-defined subtypes and the 495 relatively large number of cases available for each 496 subtype. An analysis of the eight identified tumour 497 suppressor genes consistently negatively downregu-498 lated by miRNA across cancer types shows that in 499 many cases, their mRNA levels are inversely associ-500 ated with breast cancer molecular subtype. In par-501 ticular, the basal subtype shows the lowest median 502 expression of ARHGEF12, SFRP4, and TGFBR2, as 503 compared to normal tissue, luminal A, B, Her2 am-504 plified, or normal subtypes of breast cancer as shown 505 in Supplementary Figure ?? in Section S11, and this 506 association is retained when cases are restricted to 507 wildtype expression of ARHGEF12, SFRP4, and 508 TGFBR2. At the level of the associated miRNA iden-509 tified as negative regulators of these TSG, we show 510 that the median expression of these miRNA is also 511 significantly associated with breast cancer molecular 512 subtype, and inversely related to TSG mRNA expres-513 sion by subtype. We have also shown that these asso-514 ciations are preserved when samples with non-silent 515 mutations in the TSG are removed. For further vali-516 dation, we also show reproducibility of these TSG 517 and miRNA associations to breast cancer subtype in 518 the independent Metabric dataset (N = 1293) [13]. 519   -log(p)

541
Our work begins with ensuring applicability of 542 the gene signatures, and then for each signature, we 543 gain an understanding of the miRNA both signifi-544 cantly up-and down-regulated in association with 545 the signature score. From this, we obtain the net-546 work shown in Figure 1, which describes for the 547 first time in a detailed fashion, and across cancer 548 types, the contribution of individual miRNA to the 549 complex cancer phenotype. We also show repro-550 ducibility of this network in an independent dataset, 551 by considering the overlap with the network recon-552 9/22 Empiric CDF percentile   using more than a single miRNA as a therapeutic 622 agent, the off-target effects that have significantly 623 limited development in this field may be mitigated, 624 by buffering for this with other miRNA in off-target 625 tissues [1].

626
In this work we further the knowledge of which 627 miRNA are involved in creating the phenotypes 628 of cancer, across tissue types, to identify miRNA-629 TSG targets showing exclusive miRNA-mediated 630 suppression. This suggests that a phenomenon simi-631 lar to that of the previously described 'BRCA-ness,' 632 wherein a miRNA, miR-182, has been shown to 633 repress BRCA and confer sensitivity to PARP in-634 hibitors in a subset of tumours [43], may be at work 635 within many cases, and across multiple tumour sup-636 pressor genes. Additionally, recent work has shown 637 how 'epimutations' may result in aberrantly methy-638 lated sites that can recapitulate the phenotype of a 639 mutated tumour suppressor such as DNMT3A in 640 leukaemia [27]. This raises the suggestion that there 641 are tumour suppressor genes for which a mutation 642 is not requisite for inactivation, but rather, inactiva-643 tion is achieved through miRNA-mediated repres-644 sion or methylation-mediated repression alone. For 645 the TSG we have identified, we have also shown (see 646 Online Materials), that the TSG mutations are oc-647 curring independently of MYC amplification status, 648 which has been recently identified as an independent 649 regulator of miRNAs. In addition, we show that 650 such MYC amplification status is indeed associated 651 with miRNA expression for the miRNA found to 652 be negatively associated with each of the TSG in 653 a majority of cases (Supplementary Figure ??, Sec-654 tion S12). Further, we have shown that in partic-655 ular tumours, for PTEN, CDK12, and FAT4, this 656 miRNA or methylation-based suppression happens 657 independently of other gene regulatory factors, such 658 as mutations and copy number changes.

659
Lastly, we show how using generally validated, 660 and specifically quality-controlled, gene signatures 661 describing biologically conserved phenotypes can be 662 used to collate large datasets to derive inference about 663 miRNAs, a species whose signal has been tradition-664 ally hard to detect. The ability of this approach to 665 capture tumour biology is highlighted through the 666 identification of tumour suppressor genes showing 667 miRNA-mediated regulation across tumour types,  Table 3 [34]. We note that 695 while many of these signatures were derived for a 696 particular tumour type, we have applied them across 697 many different tumour types, but before doing so, 698 we have performed an evaluation step (sigQC) to 699 ensure that each signature used is applicable to ev- and limited based on origin of neoplasm and num-711 ber of patients for whom miRNA-sequencing was 712 carried out [55]. The RSEM normalised gene ex-713 pression, mature miRNA normalised expression 714 data, copy number, mutation, and methylation 715 data were accessed from the Firebrowse database at 716 http://www.firebrowse.org. In particular, we con-717 sidered all cancer types which were epithelial or glan-718 dular with respect to histology, and with at least 200 719 samples with miRNA-sequencing data. These two 720 filters limit the cancers considered to a total of 15 721 epithelial or glandular neoplasms, comprising a wide 722 variety of cancer types, enabling the strong detec-723 tion of fundamental biology. Furthermore, among 724 these tumour types, there are 7,738 clinical samples, 725 for which 7,316 have miRNA-sequencing data. The 726 tumour types, along with their sample counts are 727 presented in Table 1. Details of the number of sam-728 ples included for each data type are presented in 729 Table 2, and we note that for any analysis presented, 730 any dataset present with fewer than 9 samples was 731 excluded from analysis. This restriction excluded the 732 analysis of COAD, OV, and UCEC datasets from 733 the analysis of tumour suppressor genes, oncogenes, 734 and exclusivity of regulation. 735 miRNA family database 736 miRNA ranked across different cancer types were 737 further grouped together by microRNA family, as 738 defined by the targetscan database, implemented in 739 R as the targetscan.Hs.eg.db package [12,33].   between cancer types and signatures. 776 We used multivariate penalised linear regres-777 sion, with 10-fold cross validation, as previously de-778 scribed [6] to infer significant relationships between 779 miRNA and gene signatures without overfitting our 780 model. Specifically, first a univariate model filter 781 was applied to the data to select miRNA used for 782 penalised multivariate linear regression. Then, the 783 penalised multivariate linear model with the least 784 predictive error (as assessed on the validating folds) 785 was selected, and coefficients for these miRNA were 786 used for further analysis. All model-fitting, includ-787 ing the initial filtering, was done with 10-fold cross-788 validation, and was carried out using the penalized 789 package in R [21,22]

816
The rank product statistic, as described by Breitling 817 et al., in 2004, for these fractional ranks was then con-818 sidered, and the coefficients were ranked in terms 819 of their significance of rank product test statistic, 820 as implemented by the RankProd R package [4,9]. In order to ensure reproducibility of the approach 826 used to identify gene signature-associated miRNA, 827 we repeated the linear modeling procedure across the 828 independent Metabric matched miRNA and mRNA 829 microarray dataset of 1293 samples [13]. We mapped 830 each gene signature to corresponding Ensembl IDs, 831 and repeated the combined univariate-multivariate 832 linear modeling approach over all miRNA probes. 833 The miRNA probes identified as positive and neg-834 ative coefficients were then identified, and mapped 835 to their corresponding mature miRNA ID. The sta-836 tistical significance of this overlap is shown in Sup-837 plementary Figure ?? [46].

852
TargetScan [19], and miRDB [56], with a minimum 853 source number of 2 were used, and the union of all 854 targets found was taken as the set of targets for a 855 given miRNA.

856
For each of these target-miRNA pairs, the Spear-857 man correlation coefficient was calculated across ev-858 ery cancer type for miRNA versus target mRNA 859 expression, partial to mutation status of the mRNA, 860 and if this value reached statistical significance of 861 p < 0.05, it was recorded, and otherwise was omit-862 ted and recorded as NA. Note that mutational status 863 was reported as a binary variable with a value of 864 1 for any non-silent, non-intronic mutation, and 0 865 otherwise. The target-miRNA pairs with at least 5 866 non-zero entries across cancer types were kept for 867 further analysis, and subsequently were analysed us-868 ing the rank product statistic, to identify those pairs 869 with consistently negative correlations, across cancer 870 types, with respect to all other hallmarks-miRNA 871 15/22 pairs. Partial correlations were done in R using the ppcor package [30]. In analysing the regulation of the TSG identified as 927 related to the hallmarks of cancer and potentially 928 amenable to miRNA regulation, we first limited the 929 samples under consideration to those where copy 930 number data, gene expression data, miRNA expres-931 sion, mutation data, and methylation data were all 932 present. Mutation data was again taken as a bi-933 nary variable, but as opposed to the partial correla-934 tion analysis, mutations were stratified into their re-935 ported types (e.g. missense mutations are all grouped 936 together, etc.). That is, the missense mutation vari-937 able would only contain a value of 1 if the sample 938 had a missense mutation in the gene of interest, and 939 0 otherwise. All variables considered in the linear 940 regression were standardised to a mean of 0, and a 941 standard deviation of 1.

942
L1/2 penalty-based penalised linear regression 943 was then performed, in the same manner as above, 944 for the linear model described in Figure 3a. Sub-945 sequently, coefficients were aggregated across the 946 various cancer types and after the rank product test 947 was applied, those predictors showing statistically 948 consistent positive or negative coefficients were iden-949 tified. Following this, the autocorrelation of each 950 of these predictor variables was considered, for each 951 of the TSG in each cancer type, as depicted by the 952 heatmap in Figure 4a. To determine the exclusivity of gene regulation, we 955 calculated the empiric distributions of the variables 956 Π ρ k as defined graphically in Figure 4. These repre-957 sent the proportion of miRNA-miRNA or miRNA-958