Although cancer genomes are replete with noncoding mutations, the effects of these mutations remain poorly characterized. Here we perform an integrative analysis of 930 tumor whole genomes and matched transcriptomes, identifying a network of 193 noncoding loci in which mutations disrupt target gene expression. These ‘somatic eQTLs’ (expression quantitative trait loci) are frequently mutated in specific cancer tissues, and the majority can be validated in an independent cohort of 3,382 tumors. Among these, we find that the effects of noncoding mutations on DAAM1, MTG2 and HYI transcription are recapitulated in multiple cancer cell lines and that increasing DAAM1 expression leads to invasive cell migration. Collectively, the noncoding loci converge on a set of core pathways, permitting a classification of tumors into pathway-based subtypes. The somatic eQTL network is disrupted in 88% of tumors, suggesting widespread impact of noncoding mutations in cancer.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The results published here are in whole or part based upon data generated by the TCGA Research Network (see URLs). We would also like to acknowledge the clinical contributors and the data producers from the ICGC who have generated the particular datasets and made them available for public analysis. This work was supported by NIH grants to T.I. (U24CA184427, U54CA209891, P50GM085764, P41GM103504 and R01HG009979) and H.C. (DP5OD017937). G.X. is supported by a UCSD CTRI grant (UL1TR001442). S.I.F. and D.O.V. are supported by a Burroughs Wellcome Fund Career Award at the Scientific Interface (1012027), an NSF CAREER Award (1651855), and UCSD CTRI and FISP pilot grants. We would like to thank members of the Ideker laboratory for valuable comments and critical reading of the manuscript. Finally, we wish to thank the patients and their families for their contributions of valuable data without which this project would not have been possible.
Integrated supplementary information
a–e, Distribution (mean ± s.d.) of loci, enhancers and genes in somatic eQTL analysis. a, The sizes of recurrently mutated loci are typically <200 bp (52.0 ± 48.9 bp; n = 8,800 loci). b,c, Most loci contain <10 SNV sites (4.0 ± 2.3; n = 8,800 loci; b) and <40 patients with mutations (12.7 ± 10.4; n = 8,800 loci; c). d, The length of enhancers (1,397 ± 1,934 bp; maximum = 86 kb; n = 247,345 enhancers) from the GeneHancer database (ref. 17). e, Number of loci tested per gene in the somatic eQTL analysis (2.6 ± 3.1; maximum = 53; n = 9,445 genes). f, Enrichment of mutations at each consecutive step of locus selection.
a, Associations passing the significance threshold (20% FDR) define a global network of somatic eQTLs. This network has been deposited in NDEx (http://www.ndexbio.org/; UUID: bd9e210f-dc84-11e7-adc1-0ac135e8bacf). b, The quantile–quantile plot shows the observed P values (F test, n = 783 tumors) versus random expectation for associations between somatic eQTLs and the expression levels of their target genes. FDR is calculated using the Storey approach (ref. 50). c, Power analysis. All locus–gene pairs were plotted, evaluated by the number of patients with mutations (x axis) versus the change in gene expression given the mutation (y axis; one unit represents 1 s.d. of change in residual gene expression). Power was defined as 1 – P(type II error) at a significance level of P(type I error) = 0.0085, which corresponds approximately to a 20% FDR (n = 783 tumors).
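As an illustration of how a 20% FDR threshold can be applied to a list of association P values, the following is a minimal sketch of Storey-style q-values. It is not the authors' pipeline: the function name `storey_qvalues` and the single fixed tuning value `lam = 0.5` are assumptions made here for brevity, whereas full implementations usually estimate the null proportion by smoothing over a range of tuning values.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Return Storey-style q-values for a list of P values.

    Simplifying assumption: the null proportion pi0 is estimated at one
    fixed tuning value `lam` rather than smoothed over a grid. If no
    P value exceeds `lam`, pi0 is 0 and every q-value is 0.
    """
    m = len(pvalues)
    if m == 0:
        return []
    # Estimated fraction of true null hypotheses, capped at 1.
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    # Indices of the P values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[idx] / rank)
        qvalues[idx] = running_min
    return qvalues


# Hypothetical P values, for illustration only.
example_p = [0.001, 0.01, 0.02, 0.6, 0.9]
example_q = storey_qvalues(example_p)
passing = [p for p, q in zip(example_p, example_q) if q <= 0.20]
```

Associations whose q-value is at most 0.20 would be retained under a 20% FDR cutoff of this kind.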
a,b, A somatic eQTL in the promoter of TERT is associated with upregulation of TERT gene expression (a), with three mutation sites leading to the creation of binding motifs for Ets family members (b). c,d, A somatic eQTL in the 5′ UTR of KCNJ5 is associated with downregulation of this gene (c), with a gain of a Smad4 motif by a G-to-C mutation (d). e,f, Another somatic eQTL in the 5′ UTR of IQUB is associated with its transcriptional downregulation (e), by an A-to-G substitution creating an N-Myc binding motif (f). a,c,e, Violin plots were generated using the residual expression, defined as the z-score-standardized RNA-seq data minus the values fitted from all known and hidden covariates. P values were calculated using the F test (without multiple-testing correction; n = 783 tumors). Box plot elements are defined as in Fig. 1b. The 95% confidence intervals for the mean are (0.26, 0.45) and (–0.10, 0.00) (a), (–0.98, –0.26) and (–0.03, 0.06) (c), and (–0.56, –0.05) and (–0.03, 0.05) (e) for mutant and wild-type sequences, respectively.
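The notion of residual expression used in these plots can be sketched as follows. This is a deliberately reduced illustration, not the authors' model: it assumes a single hypothetical covariate fitted by ordinary least squares, whereas the legend describes many known and hidden covariates, and it assumes the population standard deviation for the z-score. The function names are invented for this example.

```python
from statistics import fmean, pstdev


def zscore(values):
    """Standardize values to mean 0 and s.d. 1 (population s.d. assumed).

    Raises ZeroDivisionError if all values are identical.
    """
    mu = fmean(values)
    sd = pstdev(values)
    return [(v - mu) / sd for v in values]


def residual_expression(expression, covariate):
    """Z-score the expression values, then subtract the least-squares fit
    on one covariate (an assumption; the source uses many covariates).

    The covariate must not be constant across samples.
    """
    y = zscore(expression)
    x_mean = fmean(covariate)
    y_mean = fmean(y)
    sxx = sum((x - x_mean) ** 2 for x in covariate)
    sxy = sum((x - x_mean) * (yi - y_mean) for x, yi in zip(covariate, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual = standardized expression minus fitted value.
    return [yi - (intercept + slope * x) for x, yi in zip(covariate, y)]


# Hypothetical expression values and a hypothetical binary covariate.
residuals = residual_expression([1.0, 2.0, 3.0, 4.0, 5.0, 9.0], [0, 1, 0, 1, 0, 1])
```

By construction, the residuals average to zero and are uncorrelated with the fitted covariate, so any remaining difference between mutant and wild-type tumors is not explained by that covariate.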
a, Flow cytometry analysis of MDA-MB-231 breast cancer cells 48 h after transient transfection with the different GFP reporter constructs. A ‘live’ gate was drawn based on forward scatter versus side scatter. The first of three replicate samples is shown here. b, For all three replicates, the polygon delineated by black lines shows the gated region used to define GFP+ cells. The first row corresponds to the samples depicted in a. d,f, Flow cytometry analysis of U2OS osteosarcoma cells (d) and RPMI-7951 metastatic melanoma cells (f). These plots are representative of three independent cell culture experiments. c,e,g, Bar graphs (average ± s.d. across three cell culture replicates; P values are from two-tailed t tests with three wild-type and three mutant samples) showing the percentage of GFP+ cells and the median fluorescence intensity of GFP+ events in MDA-MB-231 (c), U2OS (e) and RPMI-7951 (g). Individual data points are available in Supplementary Table 5.
a, Representative trajectories of breast cancer cells migrating through 3D culture conditions. In low-density collagen (2.5 mg/ml) mimicking the stiffness of normal tissue, both MDA-MB-231 breast cancer cells and HT-1080 fibrosarcoma cells display a less invasive phenotype, whereas in high-density collagen (6 mg/ml) mimicking the stiffness of tumor tissue, cells display a highly invasive phenotype (refs. 26, 27). Trajectories shown are representative of observations made in at least 60 cells from three independent experiments. b, RNA-seq analysis reveals a marked upregulation of DAAM1 in the high-density collagen condition where enhanced invasion is observed. DAAM1 is one of the most upregulated transcripts associated with this invasive migration phenotype across the genome (ref. 28) (mean ± s.e.m. across three independent cell cultures; P values from two-tailed t tests; the 95% confidence intervals of the mean are (–2.0, 39.4) and (13.9, 63.3) in HT-1080 cells and (–0.4, 28.0) and (22.6, 32.9) in MDA-MB-231 cells for lowly and highly invasive phenotypes, respectively).
a, Persistence of the motility of wild-type and DAAM1-overexpressing MDA-MB-231 cells, defined as the ratio between the total invasion distance and the path length (P = 0.008, two-tailed Mann–Whitney U test; the 95% confidence intervals of the mean are (0.07, 0.10) and (0.10, 0.14) for wild-type and DAAM1-overexpressing cells, respectively). b, Length of the trajectories travelled by the cells. c, Single-cell mean velocity. In a–c, 74 and 83 cells were imaged in the wild-type and DAAM1 overexpression groups, respectively. d, Total invasion distance travelled by individual cells with additional Wnt5a signaling, defined as the distance from start to end of the cell’s trajectory (P = 0.0002, two-tailed Mann–Whitney U test; the 95% confidence intervals of the mean are (11.4 µm, 17.5 µm) and (18.8 µm, 60.3 µm) for wild-type and DAAM1-overexpressing cells, respectively). Here 63 and 15 cells were imaged in the wild-type and DAAM1 overexpression groups, respectively. In a–d, box plot elements are defined as in Fig. 1b. e, Uncropped images of the protein electropherogram presented in main Fig. 3e. Bands were cropped to save space in the main figure, as indicated by the dashed boxes.
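The motility measures defined in this legend can be computed directly from a tracked trajectory. The sketch below assumes that each trajectory is an ordered list of (x, y) positions in micrometres; that data layout and the function names are assumptions made for illustration and are not taken from the authors' tracking software.

```python
import math


def path_length(track):
    """Sum of the step lengths along the trajectory."""
    return sum(math.dist(a, b) for a, b in zip(track, track[1:]))


def invasion_distance(track):
    """Straight-line distance from the start to the end of the trajectory."""
    return math.dist(track[0], track[-1])


def persistence(track):
    """Ratio of total invasion distance to path length (0 if the cell never moved)."""
    length = path_length(track)
    return invasion_distance(track) / length if length > 0 else 0.0


# Hypothetical trajectory: 3 units along x, then 4 units along y.
track = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
summary = (path_length(track), invasion_distance(track), persistence(track))
```

A persistence close to 1 indicates nearly straight, directed movement, whereas values near 0 indicate a meandering path with little net displacement.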
a, The number of patients (y axis) is plotted against cancer tissues (x axis). NBS subtypes are represented by colors. b, Pairwise one-sided Fisher’s exact test between cancer tissues and molecular subtypes (n = 810 tumors). c, Percentage of patients shared between the subtypes derived from both coding and noncoding mutations and those derived from coding mutations only.
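For one tissue and subtype pair, a one-sided Fisher's exact test reduces to a hypergeometric tail sum over a 2 × 2 table. The sketch below assumes the alternative hypothesis is enrichment (more tumors in the tissue-and-subtype cell than expected by chance); the legend does not state the direction, so this is an assumption, as are the function name and the example counts.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact P value for the 2 x 2 table [[a, b], [c, d]].

    a: tumors in the tissue and in the subtype
    b: tumors in the tissue, not in the subtype
    c: tumors not in the tissue, in the subtype
    d: tumors in neither
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b
    row2 = c + d
    col1 = a + c
    total = row1 + row2
    k_max = min(row1, col1)
    # comb(n, k) returns 0 when k > n, so impossible tables contribute nothing.
    numerator = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, k_max + 1))
    return numerator / comb(total, col1)


# Hypothetical counts, for illustration only.
p_enrichment = fisher_exact_greater(3, 1, 1, 3)
```

Running this for every tissue and subtype pair gives a matrix of pairwise P values of the kind shown in panel b.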
P values from log-likelihood ratio tests (n = 793 patients; y axis) are plotted against the number of subtypes (x axis). Each blue bar represents the P value of a simple Cox proportional-hazards model, in which survival time is a function of subtype. Each orange bar represents the P value obtained by comparing a complete model, which accounts for both subtypes and tissues, against a null model that includes tissues only.
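The comparison of a complete model against a nested null model can be sketched as a likelihood ratio test. The example assumes that the two fitted log-likelihoods are already available from a survival-modeling package and that the test statistic is referred to a chi-square distribution whose degrees of freedom equal the number of extra parameters in the complete model. The function names and the numbers in the example are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df >= 1.

    Uses the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with y = x / 2.
    """
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                 # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))      # Q(1/2, y)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(loglik_null, loglik_full, extra_params):
    """Return (statistic, P value) for a nested-model likelihood ratio test."""
    statistic = 2.0 * (loglik_full - loglik_null)
    return statistic, chi2_sf(statistic, extra_params)


# Hypothetical log-likelihoods: tissues only versus tissues plus subtypes,
# where the subtype term adds two parameters.
stat, p_value = likelihood_ratio_test(-100.0, -100.0 + math.log(20.0), 2)
```

In this framing, each orange bar corresponds to one such test per candidate number of subtypes, with the number of extra parameters set by the number of subtype indicator terms added to the tissue-only model.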
a, Pathways characterizing the TERT–BRAF–IDH1 subtype, defined as subnetwork regions extracted from ReactomeFI by NBS. b, Mutation matrix of the TERT–BRAF–IDH1 pathway subtype showing individual tumors (columns; ordered by cancer tissues) with indicated types of mutations on signature genes (rows).