Main

Previous large-scale sequencing projects have identified many putative cancer genes, but most efforts have concentrated on mutations and copy-number alterations in protein-coding genes, mainly using whole-exome sequencing and single-nucleotide polymorphism arrays1,2,3,4. Whole-genome sequencing has made it possible to systematically survey non-coding regions for potential driver events, including single-nucleotide variants (SNVs), small insertions and deletions (indels) and larger structural variants. Whole-genome sequencing enables the precise localization of structural variant breakpoints and connections between distinct genomic loci (juxtapositions). Although previous whole-genome sequencing analyses of modestly sized cohorts have revealed candidate non-coding regulatory driver events8,9,10,11,12,13,14,15, the frequency and functional implications of these events remain understudied6,7,13,16,17.

Driver identification remains a far greater challenge in non-coding regions than in coding genes, owing to sequencing and mapping artefacts, poorly understood localized hypermutation processes14,18,19, incomplete annotation of regulatory regions, inaccurate estimation of the background mutation rate and the unknown functional effect of non-coding mutations. The discovery of drivers from structural variants is further complicated by their sparsity, the lack of obvious neutral events to build background models and their complex functional effects. Adequate statistical methods that address these issues are needed to reliably identify non-coding drivers.

The ICGC and TCGA PCAWG effort, which has collected and systematically analysed cancer genome sequences from 2,658 patients across 38 types of cancer5, offers an opportunity to characterize putative non-coding driver events that cannot be found using data from whole-exome sequencing or single-nucleotide polymorphism arrays. Here we describe a comprehensive search for non-coding somatic drivers. For point mutations (SNVs and indels), we combine results from multiple driver-discovery algorithms and, by carefully evaluating the significant hits, reveal that recurrent artefacts and poorly understood mutational processes have led to common false positives among previously reported non-coding drivers. For structural variants, we introduce two new methods for identifying both regions with significantly recurrent breakpoints (SRBs) and with significantly recurrent juxtapositions (SRJs), accounting for genomic heterogeneity in the rates of DNA break and repair and the three-dimensional architecture of the genome. Finally, to assess the potential for future non-coding driver discoveries, we quantify our statistical power in the PCAWG dataset and estimate the overall excess of point mutations in non-coding regulatory regions around known cancer genes.

Hotspot mutations across cancer types

Many protein-coding driver mutations occur in single-site ‘hotspots’. In the PCAWG dataset, only 12 single-nucleotide positions were mutated in >1%, and 106 in >0.5%, of patients (Extended Data Fig. 1a, Methods). Although protein-coding regions span only about 1% of the genome, 15 out of 50 (30%) of the most frequently mutated sites were well-studied hotspots in cancer genes (KRAS, BRAF, PIK3CA, TP53 and IDH1) (Fig. 1a, Extended Data Fig. 1b), along with the two canonical TERT promoter hotspots6,7.
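The hotspot tally above can be sketched as a per-site count of distinct patients. This is a minimal illustration rather than the PCAWG pipeline: the function names, the tuple layout and the toy coordinates (`chrA`, `chrB`) are all hypothetical.

```python
from collections import defaultdict

def hotspot_frequencies(mutations, n_patients):
    """Return {(chrom, pos): fraction of patients mutated at that site}.

    `mutations` is an iterable of (patient_id, chrom, pos) tuples. A patient
    is counted once per site, even if the site is listed several times.
    """
    patients_at_site = defaultdict(set)
    for patient_id, chrom, pos in mutations:
        patients_at_site[(chrom, pos)].add(patient_id)
    return {site: len(ids) / n_patients for site, ids in patients_at_site.items()}

def sites_above(freqs, threshold):
    """Sites mutated in strictly more than `threshold` of patients."""
    return sorted(site for site, f in freqs.items() if f > threshold)

# Toy input (hypothetical coordinates, cohort of 100 patients).
toy_mutations = [
    ("p1", "chrA", 100),
    ("p2", "chrA", 100),
    ("p2", "chrA", 100),  # duplicate call in the same patient, counted once
    ("p3", "chrB", 200),
]
toy_freqs = hotspot_frequencies(toy_mutations, n_patients=100)
toy_hotspots = sites_above(toy_freqs, threshold=0.01)
```

Counting distinct patients rather than raw calls keeps a site from looking recurrent merely because one tumour was called twice.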

Fig. 1: Non-coding point mutations in PCAWG.

a, The bar chart (left) shows the total number of patients across PCAWG with mutations at a particular genomic hotspot (chromosome:position). The top 25 hotspots are grouped as known drivers or induced by mutational processes. The table (middle) shows the frequency of mutations across a subset of PCAWG cohorts. Lymphoid malignancies comprise Lymph–BNHL and Lymph–CLL. The stacked bar chart (right) shows the contribution of mutational processes to the hotspot mutations (Methods). Gene names are given when hotspots overlap functional elements (colour-coded), with amino acid (AA) alterations for protein-coding genes (solidus denotes substitution with any one of the indicated amino acids). Extended Data Fig. 1b shows the top 50 hotspots, and all cohorts. b, Significant non-coding elements (Q < 0.1 of Brown’s combined P values of up to 13 driver discovery methods; Methods) identified before manual review in cohorts with at least one hit. Colour represents significance levels. Details are provided in Supplementary Table 5. *Potential technical artefact; #targets affected by mutational processes. AdenoCA, adenocarcinoma; CNS, central nervous system; Eso, oesophageal; GBM, glioblastoma; HCC, hepatocellular carcinoma; Medullo, medulloblastoma; Panc, pancreatic; Prost, prostate; RCC, renal cell carcinoma; Repr., reproductive organs; SCC, squamous cell carcinoma; TCC, transitional cell carcinoma; Thy, thyroid. HIST1H2AM is also known as H2AC17; Ala.TGC as TRA-TGC3-1; Met.CAT as TRM-CAT1-1; and Gly.GCC as TRG-GCC2-3. PTDSS1/MTERF3 denotes that 5′ UTR mutations in PTDSS1 also overlap the MTERF3 promoter.

The remaining non-coding hotspots could be attributed to the following localized mutational processes associated with passenger events: (i) damage from ultraviolet (UV) light and impaired nucleotide excision repair in melanoma at sites occupied by transcription factors5,18,19,20; (ii) somatic hypermutation by activation-induced cytosine deaminase (AID) in B-cell non-Hodgkin lymphoma (Lymph–BNHL) and chronic lymphocytic leukaemia (Lymph–CLL); (iii) palindromic sequence contexts believed to form hairpin DNA structures targeted by APOBEC enzymes (in an intron of GPR126 (also known as ADGRG6) and the PLEKHS1 promoter)10; and (iv) presumed technical artefacts (Fig. 1a, Supplementary Note 1). These findings suggest that—besides TERT promoter events—non-coding single-site hotspot drivers are infrequent or fall in regions with low sensitivity to detect mutations.

Discovery of point-mutation drivers

To identify recurrently mutated genomic elements, we first analysed somatic SNVs and indels in protein-coding regions, RNA genes (long and short non-coding RNAs and microRNAs (miRNAs)), and regulatory regions (promoters, 5′ untranslated regions (UTRs), 3′ UTRs and enhancers), totalling about 4% of the genome (Extended Data Fig. 2a–c, Methods, Supplementary Table 1). We analysed 2,583 tumours from 27 individual tumour types, and 15 meta-cohorts that grouped cancers by tissue of origin or organ system (Extended Data Fig. 2d, Methods). We identified candidate drivers—that is, cohort–element combinations with Q < 0.1 (10% false discovery rate (FDR))—by integrating 13 discovery algorithms, circumventing biases introduced by any one method (Extended Data Figs. 2e, 11, Supplementary Tables 2, 3, Supplementary Note 2). We benchmarked this approach by evaluating its ability to detect 603 known cancer genes (from the Cancer Gene Census (CGC)21, v.80), and found that combining methods improved performance compared to single algorithms (Extended Data Fig. 3a, b, Methods). Overall, we identified 1,294 significant hits that involved 520 unique candidates (Supplementary Tables 4, 5).
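The Q < 0.1 threshold can be illustrated with a Benjamini–Hochberg adjustment applied to one combined P value per cohort–element pair. This sketch covers only the FDR step and assumes the Benjamini–Hochberg procedure; the combination of P values across methods (Brown's method, per the Fig. 1 legend) is not reproduced, and the element names and P values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Hypothetical combined P values for four cohort-element pairs.
toy_p = {"elementA": 0.001, "elementB": 0.02, "elementC": 0.04, "elementD": 0.5}
toy_q = dict(zip(toy_p, benjamini_hochberg(list(toy_p.values()))))
toy_candidates = [name for name, q in toy_q.items() if q < 0.1]
```

With these toy values, three of the four elements pass the 10% FDR threshold; the fourth does not.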

Filtering the significant hits

Even after conservative FDR control, false-positive ‘driver’ loci can remain, owing to inaccurate background models, sequencing and mapping artefacts, or local increases in mutations due to unaccounted-for mutational processes. We therefore systematically filtered the candidate driver elements on the basis of technical and biological criteria, followed by careful review (Extended Data Fig. 3c, Methods, Supplementary Note 3). Examples of filtered elements include the promoters of PIM1 (lymphoid tumours) and RPL13A (melanoma) because of associations with localized AID and UV-light mutational processes, respectively; PLEKHS1, GPR126, TBC1D12 and LEPROTL1 because of palindromic APOBEC target sequences9,10; and the WDR74 5′ UTR and promoter8,10,14, owing to mapping problems detected in downstream manual review (Supplementary Table 5, Supplementary Note 4). In combination, filtering and reapplying FDR control discarded 589 out of 1,294 (46%) of the original cohort–element hits and 341 out of 520 (66%) unique elements (Extended Data Fig. 3c, Supplementary Tables 4, 5).

Candidate coding and non-coding drivers

Our stringent combination and filtering strategy yielded 705 hits in 179 genomic elements: 602 hits in 143 protein-coding genes and 103 hits in non-coding elements. We observed wide variability across different types of cancer, from one hit in clear-cell renal cancer to 80 in the pan-cancer meta-cohort (Fig. 1b, Supplementary Tables 4,5). Although most candidate drivers gained significance in larger meta-cohorts, some genes—such as DAXX (pancreatic endocrine tumour), NRAS (melanoma), SPOP (prostate adenocarcinoma), FGFR1 (pilocytic astrocytoma) and MIR142 (Lymph–BNHL)—scored higher in individual tumour types (Extended Data Fig. 3d). These results emphasize the trade-off between limiting driver discovery analyses to particular types of tumour and maximizing cohort size.

The candidate coding drivers we identified agreed with previous results: of the 143 genes that were significant in at least 1 cohort, 69% are in the CGC and nearly all have previously been implicated in cancer. In contrast to large whole-exome sequencing datasets, the fewer patients per cancer type in this dataset provided power sufficient only to detect genes with the strongest signal. We found 116 additional hits in 84 unique elements that were ‘near significance’ (0.1 < Q < 0.25). Fifty-one per cent of the 63 unique protein-coding genes in this set are in the CGC, which suggests that they would have been discovered in larger cohorts (Supplementary Table 4).

To nominate a significant non-coding element as a candidate driver, we reviewed the supporting evidence from the mutation calls, additional genomic data (chromosomal breakpoints, copy number, loss-of-heterozygosity and expression data), cancer gene databases and the literature (Methods, Supplementary Tables 6, 10). We describe the key candidates below, and in Supplementary Note 4.

The TERT promoter was the most frequently mutated non-coding driver in this dataset (14 cohorts) (Fig. 1b), and these mutations were strongly associated with higher TERT expression, as has previously been reported9 (Extended Data Fig. 4a, Supplementary Table 10). Mutations in the promoter and/or 5′ UTR of MTG2 (which encodes a GTPase involved in the mitochondrial ribosome) were associated with marginally significantly lower MTG2 expression in both the pan-cancer (P = 0.036, fold difference = 0.8) and carcinoma (P = 0.029, fold difference = 0.8) meta-cohorts (Extended Data Figs. 4a, 5a). Mutations in the 5′ UTR have previously been shown to decrease MTG2 expression in vitro22.

Recurrent somatic events were identified in the 3′ UTRs of TOB1 (carcinoma and pan-cancer meta-cohorts), NFKBIZ (lymphomas) and ALB (liver cancer) (Fig. 1b). TOB1 encodes an anti-proliferation regulator that associates with ERBB2, and also affects migration and invasion in gastric cancer23. TOB1 regulates other mRNAs through binding to their 3′ UTR and promoting deadenylation24. Tumours with 3′ UTR mutations in TOB1 showed a trend towards decreased expression (P = 0.053, fold difference = 0.7). The mutations did not concentrate in known miRNA-binding sites; however, the region is extremely conserved and thus probably functional (Fig. 2a). TOB1 and its neighbouring gene WFIKKN2 are focally amplified in breast cancer and pan-cancer, suggesting a complex role in cancer (Extended Data Fig. 4b). NFKBIZ is a transcription factor that is mutated in diffuse large B cell lymphoma and amplified in primary lymphomas25. Mutations in the 3′ UTR accumulated in a hotspot proximal to the stop codon and upstream of conserved miRNA-binding sites (Extended Data Fig. 5b). The enrichment of indels next to the stop codon suggests that this hotspot is not due to AID off-target activity. Previous functional experiments have associated these mutations with increased NFKBIZ expression25, which we observed in our lymphoma cohort (P = 0.035, fold difference = 3.2; after correction for copy number, P = 0.03) (Extended Data Fig. 5b).

Fig. 2: Newly identified non-coding driver candidates and localized transcription-associated mutational process.

a, Recurrent mutations and associated gene expression in the highly conserved TOB1 3′ UTR. Tracks showing conservation score (PhyloP, grey), miRNA-binding sites (TargetScan (top track) and Ago-Clip (bottom track)), and observed SNVs (blue) and indels (green). Expression of TOB1 in mutated (n = 13) and wild-type (n = 886) cases (right). P value based on two-sided Wilcoxon rank-sum test. Bars represent means. CNA, copy-number alteration. b, Indels and SNVs overlapping the TP53 5′ region and their effect on gene expression. H3K4me3 from the GM12878 cell line (ENCODE). Event numbers match with gene expression in the right panel (red dot, mutated sample; black bar, median). P value represents Fisher’s combination of permutation tests within each tumour type. ChRCC, chromophobe renal cell carcinoma; FPKM, fragments per kilobase of transcript per million mapped reads. c, Overall pan-cancer distribution of indels and SNVs in ALB, NEAT1 and MALAT1 genomic loci (lymphoid tumour samples were excluded owing to AID). d, Quantification of average indel rates for genes with significantly mutated 3′ UTRs. Error bars represent 95% binomial confidence intervals. e, Contribution of indels of different sizes in: all protein-coding and long non-coding RNA genes; ALB; NEAT1; MALAT1; MIR122; and the remaining genes enriched in 2–5-bp indels. f, SNV and indel rates (total events per Mb per patient) in different functional regions of 18 protein-coding genes enriched in 2–5-bp indels (without ALB, which contributed 47% of indels). Red lines indicate background indel and SNV rates estimated from all protein-coding genes. Error bars as in d; raw counts provided in Supplementary Table 18. c–f, Mutations analysed in all unique cases (n = 2,583).

Both the exon and promoter of the non-coding RNA RMRP were significantly mutated in multiple types of cancer (Fig. 1b, Extended Data Fig. 5c). Germline RMRP mutations cause cartilage–hair hypoplasia, and previous in vitro studies have shown that some somatic promoter mutations are functional16. The RMRP locus is also focally amplified in several types of tumour (Extended Data Fig. 4b). The enrichment of mutations in sites that can affect secondary structure suggests that these mutations are functional (P = 0.011, permutation test) (Extended Data Fig. 5c), although caution is required because this locus also appears to be affected by mapping artefacts or increased mutation rates (Supplementary Note 4).

The miR-142 precursor miRNA was significant in Lymph–BNHL and the lymphatic and haematopoietic cohorts (Fig. 1b; Extended Data Fig. 5d). The locus is a known AID off-target region in lymphoma12,26, but 7 out of 8 mutations in the mature miRNA miR-142-3p (for which the largest functional effect is expected) were not assigned to AID, which suggests that these mutations are under selection12.

Unbiased genome-wide driver screen

To test whether we missed drivers by focusing on functionally annotated regions, we applied an unbiased genome-wide survey to all non-overlapping 2-kb windows for excess point mutations. Twenty-two of the resulting 67 significant windows overlap with known protein-coding drivers, and 28 overlap highly transcribed regions with an excess of 2–5-bp indels (described in the ‘Transcription-associated indel signature’ section below) (Extended Data Fig. 5e, Supplementary Table 9, Supplementary Note 5). The remaining 17 windows have no obvious link to cancer, and several appear to be affected by mapping artefacts. A separate analysis of 4,351 ultra-conserved non-coding regions did not yield new candidate drivers (Extended Data Fig. 5e, Supplementary Note 5). Both screens suggest that the paucity of non-coding point-mutation drivers found in this study is not due to the annotation of functional elements.
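A window-based screen of this kind can be sketched by binning mutation positions into non-overlapping 2-kb windows and scoring each window with a Poisson upper-tail probability. The uniform expected count per window, the 0-based coordinates and the function names are assumptions for illustration; the screen actually used is described in Supplementary Note 5 and may differ.

```python
import math
from collections import Counter

WINDOW_SIZE = 2000  # 2-kb windows, as in the text

def window_counts(positions, window=WINDOW_SIZE):
    """Count mutations per non-overlapping window, keyed by window start."""
    return Counter((pos // window) * window for pos in positions)

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def score_windows(positions, expected_per_window):
    """Upper-tail P value per occupied window under a flat background (assumed)."""
    counts = window_counts(positions)
    return {start: poisson_sf(n, expected_per_window) for start, n in counts.items()}

# Toy positions on one chromosome (hypothetical).
toy_positions = [10, 1999, 2000, 4100, 4200, 4300]
toy_scores = score_windows(toy_positions, expected_per_window=0.5)
```

A realistic background would vary along the genome, so a flat rate is best read as a placeholder for a covariate-aware model.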

Increasing power for known cancer genes

Finally, we performed restricted hypothesis testing to boost the statistical power to detect cis-regulatory driver mutations near cancer genes from the CGC21 (Supplementary Table 7). Restricted hypothesis testing of cancer gene promoters revealed a significant recurrence of TP53 promoter mutations (11 patients in pan-cancer, Q = 0.044), mostly comprising SNVs and deletions that affect the transcription start site or donor splice site of the first non-coding exon. In 10 out of 11 cases, the mutation occurred in combination with loss-of-heterozygosity, and all samples with expression data showed decreased mRNA levels (Fig. 2b). None of these patients contained additional coding mutations that could instead be responsible for the downregulation of TP53. To our knowledge, this is the first report of a relatively infrequent—but impactful—form of TP53 inactivation by non-coding mutations.

Focal gains or losses in cancer are selected for modulating expression levels of their target genes. Restricting the hypothesis testing to the non-coding elements of such genes (n = 216,986 cohort–element combinations, representing 5,201 unique elements) (Methods) yielded only one new hit, the 3′ UTR of the oncogene FOXA1 in prostate cancer (Supplementary Table 11).

Transcription-associated indel signature

Several significant non-coding elements (the ALB 3′ UTR, NEAT1, MALAT1 and MIR122) were hit by many indels; all have previously been reported to be mutated in cancer10,15,27 (Figs. 1b, 2c). To explore whether ALB 3′ UTR events are under selection, we calculated indel rates across the functional regions of this gene. The indel rate is notably high throughout the UTRs, introns and exons, and even downstream of the polyadenylation site—a pattern inconsistent with selection (Fig. 2c, d). Similarly, FOXA1 has high indel rates throughout its locus, whereas the indels in NFKBIZ and TOB1 are in their 3′ UTRs, suggesting that these are driver events (Fig. 2d). ALB, NEAT1 and MALAT1 mutations were not associated with changes in gene expression (Extended Data Fig. 4a) and were not associated with high cancer cell fractions or biallelic loss (Extended Data Fig. 6a, b). Likewise, indels in MIR122 were downstream of the mature miRNA, and were not associated with altered expression of the targets of this miRNA (Supplementary Note 5).

If the indels in these genes were due to a mutational process rather than selection, they might exhibit distinct features. Indeed, indels in NEAT1, MALAT1, MIR122 and ALB were strongly enriched in 2–5-bp-long events (Fisher’s P < 6.8 × 10⁻⁵, for all) (Fig. 2e). A systematic search of coding and non-coding genes with significantly (Q < 0.1) increased rates of 2–5-bp indels revealed that this mutational process affects at least 18 additional genes in different types of tumour, most of which are highly expressed and tissue-specific (as has previously been reported for some of these genes15) (Extended Data Fig. 6e, f). Although less enriched, SNVs also occur at high frequencies in these regions (Fig. 2f). Overall, our findings suggest that the indels in MALAT1, NEAT1, ALB and MIR122 are not driver events and are the result of a transcription-associated mutational process. The previously reported oncogenic effect of altered MALAT1 and NEAT1 expression27,28,29 may thus be unrelated to these mutations. Our findings also suggest that although FOXA1 protein-coding indels are drivers, 3′ UTR indels might be passengers30.
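The size-class enrichment can be illustrated with a one-sided Fisher's exact test on a 2 × 2 table (2–5-bp versus other indels, in the gene versus elsewhere). Whether the reported test was one- or two-sided is not stated here, so the one-sided form is an assumption, and the counts in the example are invented.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins, where
    a = 2-5-bp indels in the gene, b = other indels in the gene,
    c = 2-5-bp indels elsewhere, d = other indels elsewhere.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, x) * comb(n - row1, col1 - x) for x in range(a, upper + 1)
    )
    return numerator / comb(n, col1)

# Hypothetical counts: 30 of 40 indels in the gene are 2-5 bp long,
# against 200 of 1,000 indels in the comparison set.
toy_p = fisher_exact_greater(30, 10, 200, 800)
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow in the tail until the final division.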

Breakpoints at driver and fragile sites

Driver structural variants may act by disrupting one or both of their breakpoint loci (for example, deactivating a tumour suppressor), or by generating a novel juxtaposition between loci. We thus searched both for genomic regions with SRBs and for pairs of regions with SRJs (Extended Data Fig. 7).

For SRBs, we first defined a background model to predict breakpoint density, using eight explanatory variables (Methods, Supplementary Table 13) and accounting for unexplained sources of variation31 (Supplementary Note 6). We identified 53 disjoint regions with SRBs (Q < 0.1) (Fig. 3a, Supplementary Table 14), which cleanly divided into two groups on the basis of the variability of the breakpoints at the other side of the rearrangements. Eight SRBs had partner breakpoints that were tightly clustered (had low rearrangement dispersion scores; Methods) and represented known oncogenic fusions. The remaining 45 SRBs had dispersed partner breakpoints (had high rearrangement dispersion scores), and were largely associated with previously identified somatic copy-number alterations (SCNAs) (Fig. 3b).

Fig. 3: Significantly recurrent breakpoints and juxtapositions.

a, Relative enrichment (Fisher’s exact test) for events per tumour type for the 20 most-significant SRBs (circle size). Loci are labelled by the likely driver gene from the CGC21. For gene symbols separated by a solidus, both or either of the genes are intended. b, Rearrangement dispersion score versus mean replication timing of the 53 SRBs. Colours indicate fusion (purple), fragile-like (green), deletion (blue), amplification (red) or copy-neutral (black) events. c, Tumour-to-normal read coverage ratio in an ovarian tumour with a BRD4 microdeletion; red arrow indicates the rearrangement (top). Breakpoint density across PCAWG breast and ovarian cancers (middle). Enhancer locations from breast (BRCA) and ovarian (OV) tissue51 (bottom). d, Somatic copy number at the BRD4 and NOTCH3 locus in breast and ovarian cancers with (SV+) and without (SV−) rearrangements. e, Gene expression per absolute copy number for BRD4 and NOTCH3. f, The 30 most-significant SRJs, with their relative enrichment (circle size) per tumour type, annotated with oncogenic fusions from the Catalogue of Somatic Mutations in Cancer (COSMIC) (left), CGC gene (centre) and protein disruption (right) (Methods). ATP5E is also known as ATP5F1E. g, Expression correlates of rearrangements in SRJs from COSMIC (purple), other SRJs (pink) or not in any SRJ (grey). For each rearrangement (R), the primary locus (left) is defined as the breakpoint within 100 kb of the gene that is most overexpressed in rearranged samples; the secondary locus (right) is the other breakpoint. Expression at the primary locus in samples with the rearrangement relative to samples without the rearrangement is greater for SRJs than for other rearrangements (left). The tissue-specific expression at the secondary locus in wild-type (WT) samples, relative to samples of different tissue types, is greater for SRJs than other rearrangements (right). P values represent comparisons to ‘not in SRJ’. 
d, e, g, Box plots show the interquartile range, median and 95% confidence interval; two-sided t-test. h, TERT promoter mutations and rearrangements across PCAWG melanomas. i, Rearrangements between TERT promoter and BASP1 and MYO10 locus result in focal amplification and relocation of distal enhancers to TERT. AML, acute myeloid leukaemia; Colorect, colorectal; Leiomyo, leiomyosarcoma; MPN, myeloproliferative neoplasm; Osteosarc, osteosarcoma; PiloAstro, pilocytic astrocytoma.

It has been difficult to distinguish recurrent driver SCNAs from passenger events at fragile sites32. At the resolution afforded by whole-genome sequencing, late replication timing predicted fragility-associated SRBs better than existing fragile site annotations (Supplementary Note 7), identifying 12 fragile-like SRBs (Fig. 3b). The remaining 33 SCNA-like SRBs comprised 14 amplifications, 8 deletions and 11 copy-neutral events (Supplementary Table 14).

The different classes of SRB were associated with different effects on neighbouring genes. Five of the eight deletion-associated SRBs were associated with biallelic inactivation of nearby known tumour suppressors, compared to none of the 12 fragile-like SRBs (P = 0.039) (Extended Data Fig. 8a). The fragile-like SRBs were furthest from tissue-matched enhancers and caused the weakest expression changes, consistent with them being passenger events32. By contrast, fusion-like SRBs were closer to tissue-matched enhancers than the other SRBs (P < 0.01) (Extended Data Fig. 8b) and were associated with greater changes in expression than all other SRBs except amplifications (P < 0.05 for all types) (Extended Data Fig. 8c, Methods). Our analyses indicate that SRB driver events can be classified using rearrangement dispersion scores, replication timing and gene expression. Notably, neither rearrangement dispersion scores nor association with replication time can be accurately determined from microarrays or whole-exome sequencing, which highlights the importance of whole-genome sequencing. Altogether, we identified SRBs at 34 sites of known oncogenic fusions and recurrent SCNAs, 5 additional sites that are probably due to DNA fragility and 14 novel driver candidates (Supplementary Note 8).

Novel structural-variant driver candidates

Although most SCNA-like SRBs act by altering gene copy numbers, several appeared to target regulatory elements. We identified three that were significantly (Q < 0.05) associated with expression changes of nearby genes after controlling for copy number (Methods), two of which we discuss here. The first comprised structural variants at 10p15, which were associated with a greater than twofold upregulation of AKR1C1, AKR1C2 and AKR1C3 in seven cases of lung squamous cell carcinoma and two cases of liver hepatocellular carcinoma (Extended Data Fig. 8d). AKR1C proteins are aldo-keto reductases involved in steroid homeostasis. Ectopic expression transforms cell lines, and germline mutations have previously been linked to an increased risk of developing lung cancer33,34. Three-quarters of the breakpoints are near (<10 kb) lineage-specific enhancers, potentially altering promoter–enhancer interactions (and hence gene expression). However, because the highest density of breakpoints lies between two long inverted repeats, the structural variants may have been induced by DNA secondary structure.

The second SRB contains recurrent microdeletions (<50 kb) involving the 5′ end of BRD4 in ovarian (eight cases, P < 10⁻⁷) and breast tumours (six cases, P < 0.04) (Fig. 3c, Extended Data Fig. 8e). These deletions were highly enriched in cancers that amplified a segment that includes BRD4 and NOTCH3 (P < 0.004) (Fig. 3d, Extended Data Fig. 8f) but were not a direct consequence of these amplifications (Supplementary Note 9). BRD4 is a chromatin regulator and a therapeutic target in several types of cancer35,36, including ovarian and triple-negative breast cancer37,38. Given the increased copy number of the full BRD4 gene, we would expect increased gene expression. However, the microdeletions are associated with a lower expression of BRD4 in breast (P = 0.001) and ovarian tumours (P = 0.04), but not of the neighbouring gene NOTCH3 (Fig. 3e). The focal deletions in BRD4 overlap a prominent exon-1 H3K4me3 peak and intron-1 enhancer elements in HMEC (normal breast) and MCF-7 (breast tumour) cells (Extended Data Fig. 8e), which suggests that these deletions disrupt regulatory elements. To our knowledge, this is the first evidence of a recurrent microdeletion limiting expression of an amplified gene.

Recurrent fusions target gene regulation

Motivated by the detection of fusion-like SRBs, we specifically looked for genomic loci that were juxtaposed more often than expected by chance, after controlling for both the rate of breakpoints at each locus and the distance between them (Methods). We identified 90 such SRJs (Fig. 3f, Supplementary Table 15), including 13 known oncogenic fusions (including all 8 fusion-like SRBs) and 77 novel hits—18 of which linked to at least one known cancer gene (Supplementary Note 8). Previously reported oncogenic SRJs were observed more frequently (average 24 patients per fusion, range 2–98) than novel ones (most often 2 patients per fusion, range 2–4). As juxtapositions are unlikely to occur by chance, observing even two becomes highly significant. However, it is possible that some SRJs reflect inaccuracies in our background model rather than true drivers. We therefore further evaluated the SRJs on the basis of (i) a ‘robustness factor’ that indicates how much the background rate could increase before the SRJ would become insignificant, and (ii) the ratio between the observed and expected numbers of events under the current background model (‘effect size’) (Extended Data Fig. 9a). Twenty-six SRJs, including 11 of the 13 known drivers and 15 newly identified SRJs, are robust to tripling the expected background rate, and 22 others would remain significant with a doubled rate.
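The two evaluation quantities can be sketched for a single pair of loci, given an observed count and an expected count under the background model. This is a simplified illustration that assumes a Poisson count model and a fixed P-value threshold in place of the Q-value procedure; the function names and toy numbers are hypothetical.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)

def effect_size(observed, expected):
    """Ratio of observed to expected juxtapositions."""
    return observed / expected

def robustness_factor(observed, expected, alpha, max_factor=1e6, tol=1e-6):
    """Largest multiplier of the expected rate that keeps P(X >= observed) < alpha.

    Returns 0.0 if the pair is not significant at the original rate.
    """
    if poisson_sf(observed, expected) >= alpha:
        return 0.0
    lo, hi = 1.0, max_factor
    if poisson_sf(observed, expected * hi) < alpha:
        return hi
    # Bisect on a log scale; the tail probability increases with the rate.
    while hi / lo > 1.0 + tol:
        mid = math.sqrt(lo * hi)
        if poisson_sf(observed, expected * mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical pair of loci: 2 observed juxtapositions, 0.001 expected.
toy_effect = effect_size(2, 0.001)
toy_robustness = robustness_factor(2, 0.001, alpha=1e-4)
```

Under these assumptions, a robustness factor above 3 corresponds to a hit that survives tripling of the expected background rate.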

Most canonical driver rearrangements have previously been found in single tumour types, often associated with tissue-specific expression39,40. We found that 9 of our top 10 SRJs are tissue-specific, despite searching across 30 different types of tumour. Such tissue specificity is not observed for cancer genes affected by SCNAs, for which the top 10 are altered in 11.9 cancer types (on average), or by point mutations (for which the top 10 are altered in 6.7 cancer types, on average) (Supplementary Table 16).

The tissue specificity of SRJs suggests that they are strongly shaped by epigenetic state, either owing to mechanistic reasons (for example, tissue-specific three-dimensional proximity of the two DNA breakpoints) or to selection that connects tissue-specific regulatory elements with oncogenes13,41,42,43. The latter seems to be more likely because: (i) SRJs are associated with significant overexpression of only one of the rearrangement partners (the ‘primary locus’) relative to randomly selected rearrangements (primary locus, P < 10⁻⁴ (Fig. 3g left); secondary locus, P > 0.05 (Extended Data Fig. 9b left)); (ii) the rearrangement partner, in the secondary locus, tends to be highly expressed in that tissue type relative to others (Fig. 3g right); and (iii) the distance to the nearest tissue-specific enhancer is smaller for SRJs than for rearrangements overall (Extended Data Fig. 9b). These observations suggest that SRJs act in general by bringing regulatory elements to an oncogene that is otherwise expressed at a low level.

In many cases, SRJs generate truncated or chimeric proteins, and breakpoints within introns or exons were indeed overrepresented (68% versus 56% expected, P < 10⁻⁷). However, only 11 of the 30 (37%) most significant SRJs generated novel proteins in all samples, and 6 others sometimes generated novel proteins; the rest were either non-disruptive or contained breakpoints within the first two introns of the disrupted gene, leaving most of the protein intact44 (Fig. 3f). Moreover, SRJs that generate novel proteins exhibited expression changes similar to those that do not (P = 0.4) (Extended Data Fig. 9c). We conclude that altering gene expression is a key function of both classes of SRJs, and that SRJs are akin to non-coding driver point mutations that act on regulatory elements.

We found several SRJs that involve amplified oncogenes, including MDM2, EGFR and TERT (Fig. 3f, h, i, Extended Data Fig. 9d–f, Supplementary Table 15). The TERT promoter region was juxtaposed in four melanomas (P < 10⁻⁷) to a region in the BASP1 gene (both on chromosome 5), and to a region near NDUFC2 (t(5,11)) in two melanomas and one medulloblastoma (P < 10⁻⁸). Both juxtaposed regions were marked with melanocyte enhancers, which suggests that they could drive TERT expression. Among melanomas, these rearrangements are mutually exclusive with the C228T and C250T mutations of the TERT promoter (P < 10⁻³) (Fig. 3h). Because the juxtapositions were always part of complex events that also amplified TERT, increased TERT expression may be due to amplification, the juxtapositions or both.

Paucity of non-coding drivers in cancer

Our analyses of genomic hotspots, functional elements, genomic windows and SRJs all suggest that non-coding drivers are rare compared to protein-coding drivers. This might, in part, be due to a lack of discovery power3. We therefore evaluated the discovery power of mutational-burden tests for recurrent events across the different types of element in our tumour cohorts, focusing first on point mutations3,16. We found that the fraction of mutated patients required for a driver to reach 90% discovery power ranged from <1% in large cohorts with low background-mutation densities to 25% in small cohorts with high background-mutation densities (Fig. 4a). Different types of element were similarly powered, suggesting that the paucity of drivers in non-coding versus coding elements is not due to a lack of power. Similarly, our power to detect SRJs was higher in large cohorts with low rearrangement rates, and for long and interchromosomal rearrangements owing to their lower overall rates (Extended Data Fig. 10a): we were only powered to detect events that recur in 5–20% of samples in most types of cancer (Fig. 4b). Moreover, beginning with about 2,500 tumours, we expect to find a new SRJ with every 25 additional genomes (Fig. 4c).

Fig. 4: Power considerations and paucity of non-coding drivers.

a, Heat map shows the minimal frequency of a driver element with ≥90% discovery power. Power is dependent on the background mutation frequency (above the heat map), the element length (median length depicted in Extended Data Fig. 2c) and the number of patients with mutations (cell numbers). For example, the pan-cancer cohort is powered to discover a protein-coding driver gene (coding sequence (CDS)) present in <1% (18 patients), whereas the Bladder–TCC cohort is only powered to discover drivers present in at least 27% (6 patients). b, Number of samples required to detect 90% of recurrent juxtapositions across 90% of pairs of loci, as a function of the median number of rearrangements per sample and the rate above background at which the fusion recurs (solid lines). The vertical dashed lines represent the median rearrangement rates of each cancer type, and the stars on these lines indicate the numbers of whole genomes analysed for that cancer type. c, Number of SRJs detected after downsampling the data to various sample sizes, separately indicating rearrangements that recur at high (≥12%; red) and low (<12%; black) rates above background, and their sum (blue). d, Left, number of observed mutations (SNVs and indels) in cis-regulatory and coding regions of 603 protein-coding cancer genes, with the expected numbers shown in lighter colours. Right, the number of excess mutations (that is, the estimated number of driver mutations). The grey fraction of promoter mutations indicates TERT events. Error bars show 95% binomial confidence intervals. Only samples with high detection sensitivity were included (n = 936).

Low sequencing coverage (for example, in GC-rich regions45) also limits driver discovery. To measure this effect in the PCAWG data, we quantified our ability to detect mutations (detection sensitivity)16 in cancer gene promoters. Although the mean detection sensitivity in promoters is high (41.9% of genomic positions have mean detection sensitivity >80% across tumours), only 4.1% of the promoters had detection sensitivity >90% in >90% of bases. In particular, the two canonical TERT promoter hotspots had highly variable detection sensitivity among patients and cohorts, from only 3% of patients in the central-nervous-system pilocytic astrocytoma cohort to 100% in the thyroid adenocarcinoma cohort (Extended Data Fig. 10b). From these data, we inferred the expected number of TERT events in each tumour type (Extended Data Fig. 10c) and found that about 263 (95% confidence interval 232–295) TERT hotspot mutations were probably missed owing to a lack of detection sensitivity. Moreover, on average 9.9% (1.3–13.0% interquartile range) of the cancer gene promoter territory in the tumour of each patient was severely underpowered (an average detection sensitivity of <10%). Therefore, the lack of coverage in promoters may contribute to the paucity of non-coding drivers.

To determine whether the paucity of non-coding drivers discovered thus far could be due to the limited statistical power of current datasets, we estimated the overall excess of point mutations above background (that is, the expected number of driver events) in coding and cis-regulatory non-coding sequences in 603 cancer genes46 (Methods, Supplementary Table 7, Supplementary Note 11). To minimize the effect of samples with low detection sensitivity, we included only 936 samples with >90% detection sensitivity at the two TERT promoter hotspots (Extended Data Fig. 10c, d, Supplementary Note 11). Overall, this approach predicted more than 1,475 driver mutations (95% confidence interval 1,410–1,687; 1,069 SNVs and 406 indels) in the protein-coding sequences of these cancer genes (Fig. 4d), compared to only 96 (95% confidence interval 30–190) estimated driver mutations in promoters (73 attributed to TERT), 22 (95% confidence interval 0–88) in 5′ UTRs, and 68 (95% confidence interval 0–178) in 3′ UTRs. Non-coding mutations in cancer-gene promoters were also not generally associated with loss-of-heterozygosity or altered expression, as one would expect if they were enriched with drivers (Supplementary Note 12). These results collectively indicate that, independently of statistical power, non-coding cis-regulatory driver mutations in known cancer genes besides TERT are much less frequent than protein-coding drivers.

Discussion

The accurate and reliable discovery of genomic drivers in tumours may have critical implications for patients with cancer. Our findings and the methods introduced here for the discovery of point-mutation and structural-variant drivers, method integration, vetting of candidates and identification of local hypermutation and fragile sites represent an important contribution to the collective effort towards charting all malignant changes that drive the cancer of each patient5.

Among the most interesting candidate non-coding driver elements we uncovered are the 5′-end mutations in TP53; 3′ UTR mutations in NFKBIZ and TOB1; and rearrangements involving AKR1C genes and BRD4. By careful analysis of the whole-genome sequencing data, we found that several previously reported and frequently altered non-coding elements may not be genuine drivers, including (i) the non-coding RNAs, NEAT1 and MALAT1 (which contain a high density of indels, seemingly owing to a transcription-associated mutational process) and (ii) recurrent structural variants in regions of late replication, indicating DNA fragility.

This study yielded unexpectedly few non-coding driver point mutations and structural variants. SRJs, which appear to act largely through the rearrangement of regulatory elements, are less frequent than SCNA-like SRBs, which directly amplify or delete coding sequences. The results from five analyses (hotspot recurrence, driver-element discovery, structural variants, discovery power and aggregated mutational excess) suggest that this paucity is not caused by a particular analysis strategy, but that regulatory elements truly contribute a much smaller number of recurrent cancer-driving events than protein-coding sequences. This paucity of non-coding drivers contrasts with the distribution of germline polymorphisms associated with heritability of complex traits, which are most frequently located outside of protein-coding genes47.

At least two factors contribute to the relative paucity of non-coding driver mutations in cancer: (i) the differential fitness effects of coding and non-coding mutations and (ii) the target size of functional elements. The paucity of promoter driver mutations in well-established cancer genes suggests that point mutations markedly affect the function of non-coding regulatory elements only rarely. This highlights TERT as a notable exception, perhaps because even a modest increase in TERT expression may suffice to circumvent normal telomere shortening. For other cancer genes, directly mutating protein-coding sequences or altering expression levels by copy-number change may provide larger phenotypic effects. For example, complete loss-of-function by nonsense mutations or deletions may be easier to achieve than by disrupting or translocating regulatory regions.

Technical shortcomings (such as coverage ‘blind spots’ in GC-rich promoters and different filtering strategies) may cause genuine drivers to be missed48. Therefore, the discovery of non-coding drivers will benefit from technical improvements, including even sequence coverage, longer and more accurate reads, and improved variant-calling methods. Moreover, better annotation of functional non-coding elements will increase both the power to discover infrequently mutated driver elements and their interpretability. As datasets grow, yet-unidentified mutational mechanisms targeting particular genomic regions will emerge and require improved background models, including additional covariates and more-sophisticated statistical models. The analysis of structural variants poses greater challenges because (i) accurately modelling their background density is complicated by their lower frequency and larger fraction of drivers (Supplementary Note 6); (ii) their target genes may be far from the breakpoints, as in SCNAs; (iii) the space for modelling SRJs is much larger (the genome squared); and (iv) many structural variants are part of complex events that often involve multiple chromosomes31, so that the resultant topology cannot be deduced without technologies such as long- or linked-read sequencing49,50. For these reasons, experimental validation remains important for all candidate drivers, and especially for non-coding ones.

Our work suggests that larger datasets and technological advances will continue to identify new non-coding drivers, albeit at considerably lower frequencies than protein-coding drivers. We anticipate that the approaches developed here will provide a solid foundation for the incipient era of driver discovery from ever-larger numbers of cancer whole genomes.

Methods

No statistical methods were used to predetermine sample size. The experiments were not randomized and investigators were not blinded to allocation during experiments and outcome assessment.

Detailed methods are provided as Supplementary Methods.

Dataset generation

Out of 2,955 samples, we selected 2,583 unique donor samples for SNV and indel driver-discovery analysis on the basis of SNV quality control (Supplementary Methods). We found that 110 additional myeloid–AML samples had robust structural variant calls despite SNV artefacts; we included these in structural variant analyses, for a total of 2,693 samples. For tumour-type cohort analyses, we used only cohorts with at least 20 patients. Tumour meta-cohorts were defined by cell type of origin or by organ system (for example, lung for lung adenocarcinoma and lung squamous cell carcinoma). A pan-cancer meta-cohort was created by combining all tumour cohorts except for Skin–Melanoma and lymphoid tumours (Supplementary Methods).

Hotspot SNV analysis

We selected the 50 most-frequent SNV hotspots. These were analysed to identify known driver events; mutational signature biases related to sequence palindromes, immunoglobulin loci and so on; and potential artefacts, including regional mapping problems (Supplementary Methods).

Mutational signatures

We performed de novo global-signature discovery and signature attributions with SignatureAnalyzer’s Bayesian non-negative matrix factorization method52, based on 1,697 channels—including 1,536 pentanucleotide sequence contexts for single-base substitutions, 83 indel features, and 78 doublet-nucleotide substitution classes (Supplementary Methods).
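The channel counts quoted above can be checked with a short enumeration. This is a minimal sketch that assumes the common convention of six substitution classes written from the pyrimidine of the mutated base pair, each placed in a context of two flanking bases on either side; the function name is hypothetical, and the indel and doublet counts are simply taken from the text.

```python
from itertools import product

BASES = "ACGT"
# Six substitution classes, written from the pyrimidine of the mutated base pair
# (assumed convention; not spelled out in the text above).
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def pentanucleotide_channels():
    """Enumerate (context, substitution) pairs with two flanking bases per side."""
    channels = []
    for sub in SUBSTITUTIONS:
        ref = sub[0]
        for left in product(BASES, repeat=2):
            for right in product(BASES, repeat=2):
                context = "".join(left) + ref + "".join(right)
                channels.append((context, sub))
    return channels


n_sbs = len(pentanucleotide_channels())  # 6 classes x 4**4 flanking contexts
n_total = n_sbs + 83 + 78                # indel and doublet counts from the text
```

Under this convention the single-base-substitution channels number 6 × 4⁴, which matches the 1,536 contexts stated in the text.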

Definition of genomic elements

GENCODE v.19 (ref.53) and other genomic resources were used to define functional genomic elements, including protein-coding genes (CDS, splice sites, 5′ UTR, 3′ UTR and promoters), long non-coding RNAs (gene body, splice site and promoters), short RNAs, miRNAs and enhancers (Supplementary Methods).

Candidate-driver-mutation identification methods and combination of results

We obtained results (P values) from 13 methods of driver discovery, including ActiveDriverWGS54, CompositeDriver, DriverPower55, dndscv46, ExInAtor56, LARVA57, MutSig tools3, NBR10, ncdDetect58, ncDriver59, OncodriveFML60 and regDriver61. We integrated the results of all these methods using a custom framework based on a previously published method62 (Brown’s method) for combining P values. Results from individual methods that showed large deviations from the expected uniform null distribution of P values were excluded. This approach was evaluated on real and simulated data. We controlled the FDR within each of the sets of tested genomic elements by concatenating all combined Brown’s P values from across all tumour-type cohorts and applying the Benjamini–Hochberg procedure63. Cohort–element combinations with Q values < 0.1 were designated as significant hits, and combinations with 0.1 ≤ Q < 0.25 as ‘near significance’. Extensive details are provided in the Supplementary Methods. In addition, we tested for element-independent recurrence with the NBR method on 2-kb bins spanning the entire genome, and non-coding ultraconserved regions64.
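The final FDR step and the two significance tiers can be illustrated as follows. This is a minimal sketch of the standard Benjamini–Hochberg adjustment applied to an already combined list of P values; it does not implement the Brown-based combination itself, and the function names and example P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg Q values in the same order as the input P values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that Q values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def classify(q):
    """Apply the thresholds quoted in the text to a single Q value."""
    if q < 0.1:
        return "significant"
    if q < 0.25:
        return "near significance"
    return "not significant"


# Hypothetical combined P values for five cohort-element combinations.
combined_p = [0.001, 0.02, 0.12, 0.3, 0.8]
q = benjamini_hochberg(combined_p)
labels = [classify(v) for v in q]
```

Concatenating the P values from all cohorts before adjustment, as described above, means that `m` is the total number of cohort–element tests within an element set rather than the number of tests in one cohort.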

Post-filtering of driver mutation candidates

We applied stringent filters to discern positive selection from technical artefacts and mutational processes. We required at least three mutations to be present in candidate elements, in at least three patients of the tested cohort; more than 50% of mutations in mappable regions; less than 50% of mutations in palindromic DNA; and less than 50% of mutations attributed to APOBEC activity. For lymphoid tumours and skin melanoma, we required that <35% and <50% of mutations were attributed to the AID and UV-light mutational signatures, respectively. The FDR was recalculated after post-filtering.
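The filter criteria listed above can be expressed as a single predicate. This is a minimal sketch under the assumption that the per-element fractions have already been computed upstream; the `ElementStats` container, its field names and the cohort labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ElementStats:
    """Hypothetical per-element summary for one cohort."""
    n_mutations: int
    n_patients: int
    frac_mappable: float      # fraction of mutations in mappable regions
    frac_palindromic: float   # fraction of mutations in palindromic DNA
    frac_apobec: float        # fraction attributed to APOBEC activity
    frac_aid: float = 0.0     # fraction attributed to the AID signature
    frac_uv: float = 0.0      # fraction attributed to the UV-light signature


def passes_post_filter(s, cohort="other"):
    """Return True if a candidate element survives the thresholds quoted in the text."""
    ok = (
        s.n_mutations >= 3
        and s.n_patients >= 3
        and s.frac_mappable > 0.5
        and s.frac_palindromic < 0.5
        and s.frac_apobec < 0.5
    )
    if cohort == "lymphoid":
        ok = ok and s.frac_aid < 0.35
    if cohort == "melanoma":
        ok = ok and s.frac_uv < 0.50
    return ok
```

Candidates that fail the predicate would be dropped before the FDR is recalculated on the remaining set.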

Candidate driver structural-variant analyses

We applied separate analyses to detect recurrent structural variant breakpoints and recurrent juxtapositions. For each analysis, we first binned breakpoints, accepting only one breakpoint per sample per bin. We then determined which bins had more breakpoints than expected by chance (the SRB analysis), and which pairs of bins (or ‘tiles’) were joined by more rearrangements than expected by chance (the SRJ analysis).
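The binning step can be sketched as below. The bin size, the input tuple layouts and the function names are illustrative assumptions, as is the choice to count each sample at most once per tile as well as per bin.

```python
from collections import defaultdict


def count_breakpoints_per_bin(breakpoints, bin_size=100_000):
    """Count samples with a breakpoint in each bin.

    breakpoints: iterable of (sample_id, chrom, pos). A sample contributes at
    most one breakpoint to any bin. The default bin size is an assumption.
    """
    samples_per_bin = defaultdict(set)
    for sample, chrom, pos in breakpoints:
        samples_per_bin[(chrom, pos // bin_size)].add(sample)
    return {b: len(s) for b, s in samples_per_bin.items()}


def count_juxtapositions_per_tile(rearrangements, bin_size=100_000):
    """Count samples that join each pair of bins (a 'tile').

    rearrangements: iterable of (sample_id, chrom1, pos1, chrom2, pos2).
    """
    samples_per_tile = defaultdict(set)
    for sample, c1, p1, c2, p2 in rearrangements:
        a, b = (c1, p1 // bin_size), (c2, p2 // bin_size)
        tile = (a, b) if a <= b else (b, a)  # key does not depend on breakend order
        samples_per_tile[tile].add(sample)
    return {t: len(s) for t, s in samples_per_tile.items()}


# Hypothetical example: two samples with breakpoints in the same bin.
bin_counts = count_breakpoints_per_bin([
    ("S1", "chr5", 1_295_000), ("S1", "chr5", 1_296_000),
    ("S2", "chr5", 1_250_000), ("S2", "chr5", 17_250_000),
])
tile_counts = count_juxtapositions_per_tile([
    ("S1", "chr5", 1_295_000, "chr5", 17_250_000),
    ("S2", "chr5", 17_260_000, "chr5", 1_250_000),
])
```

The resulting per-bin and per-tile counts are the observed values that the SRB and SRJ analyses compare against their respective background models.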

Candidate driver breakpoints

We calculated the background rate of breakpoints per bin using a Gamma–Poisson model15 that took genomic covariates into account, normalized breakpoint counts by the number of bases within each bin that had sufficient mappability to be eligible for breakpoint detection, and accounted for an observed overdispersion of breakpoint counts that probably reflects unaccounted-for covariates (Supplementary Methods). We used the Gamma–Poisson model to calculate the P value for each bin (that is, the probability that each bin would exhibit the observed number of breakpoints (or greater) by chance alone), and applied the Benjamini–Hochberg procedure63 to correct for multiple hypothesis testing.
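The per-bin P value can be sketched with the negative binomial distribution that arises when a Poisson rate is itself Gamma distributed. This is a minimal illustration only: the expected count per bin (`mean`) and the dispersion (`shape`) are taken as given inputs, whereas the model described above derives the expected count from genomic covariates and mappability; both function names are hypothetical.

```python
import math


def gamma_poisson_pmf(k, mean, shape):
    """P(X = k) for a Poisson count whose rate is Gamma distributed.

    The marginal distribution is negative binomial with the given mean and
    variance mean + mean**2 / shape; smaller shape means more overdispersion.
    """
    log_p = (
        math.lgamma(k + shape) - math.lgamma(k + 1) - math.lgamma(shape)
        + shape * math.log(shape / (shape + mean))
        + k * math.log(mean / (shape + mean))
    )
    return math.exp(log_p)


def upper_tail_p(observed, mean, shape):
    """P(X >= observed): chance of seeing at least the observed count in a bin."""
    below = sum(gamma_poisson_pmf(k, mean, shape) for k in range(observed))
    return max(0.0, 1.0 - below)
```

Because the tail is computed as one minus a partial sum, this sketch loses precision for P values near machine epsilon; a production implementation would sum the upper tail directly or work in log space.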

Post-filtering of driver breakpoint candidates

We scored each recurrent breakpoint locus on the basis of the average replication timing of its breakpoints, and filtered those loci with scores >0.5 as probable fragile sites65.

Candidate driver juxtapositions

We developed a background model to indicate the probability that two loci would be joined, taking into account the observed rate at which each locus underwent DNA breaks (from the breakpoint analysis), the distance between them and the propensity for these rearrangements to reflect a break followed by invasion versus two breaks that were then joined. We determined the probability that each tile would contain the observed number of rearrangements using a binomial test, followed by controlling for multiple hypothesis testing using the Benjamini–Hochberg procedure63.
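The tile-level test can be illustrated with a one-sided binomial tail. In this minimal sketch the background probability that a tile is joined is supplied directly, whereas the model above derives it from break rates, distance and repair mechanism; treating the number of samples as the number of trials is an assumption, and the function names and example values are hypothetical.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), valid for 0 < p < 1."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )


def binomial_upper_tail(k_obs, n, p):
    """P(X >= k_obs): chance of at least k_obs rearrangements in a tile."""
    if k_obs <= 0:
        return 1.0
    total = sum(math.exp(log_binom_pmf(k, n, p)) for k in range(k_obs, n + 1))
    return min(1.0, total)


# Hypothetical tile: background probability 1e-4 per sample, 2,693 samples,
# four samples observed with a rearrangement joining the two bins.
p_tile = binomial_upper_tail(4, 2693, 1e-4)
```

The P values from all tiles would then be adjusted with the Benjamini–Hochberg procedure, as in the breakpoint analysis.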

Gene-expression analyses

Gene-expression data were provided by the PCAWG Transcriptome Core Group66, and also generated using the same approach for an extended set of non-coding transcripts (Supplementary Methods).

Additional evidence for selection

In addition to associations between mutations or structural variants and expression, we looked for signals of copy-number-alteration recurrence using the GISTIC2 algorithm67. We also tested whether driver candidates showed significantly higher frequency of loss-of-heterozygosity in mutated samples using Fisher’s exact test. We calculated cancer allelic fractions using ploidy and tumour purity predictions from a previous publication68.
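The loss-of-heterozygosity comparison can be sketched as a one-sided Fisher's exact test on a 2 × 2 table. The one-sided alternative (more loss-of-heterozygosity in mutated samples) and the table layout are assumptions made for illustration; the function name is hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in the top-left cell.

    Table layout (assumed):
                    LOH    no LOH
        mutated      a       b
        unmutated    c       d
    Returns P(top-left count >= a) with all margins fixed.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)
    upper = min(row1, col1)
    numer = sum(comb(row1, x) * comb(n - row1, col1 - x) for x in range(a, upper + 1))
    return numer / denom
```

For example, a table with counts 3, 1, 1, 3 gives (16 + 1)/70 under this test, because only the observed table and the single more extreme table contribute to the tail.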

Mutational process and indel enrichment

For every gene, we calculated the proportion of indels of length 2–5 bp out of the total number of indels. This proportion was compared to the genome background proportion using a binomial test. We also compared the indel rate per gene (not distinguishing by length) to the background. Both sets of P values were corrected with the FDR method.

Power calculations

We estimated our power to discover driver elements mutated at a particular frequency in the population as previously described3,16, but solving for the lowest frequency for a driver element in the patient population that is powered (≥90%) for discovery. The calculation of this lowest frequency takes into account (i) the average background mutation frequencies for each cohort–element combination; (ii) the median length and average detection sensitivity for each element type and patient cohort size; and (iii) a global desired false-positive rate of 10%. The effect of element length is discussed in Supplementary Note 10, and details are provided in Supplementary Methods. Power calculations for detection of recurrent juxtapositions were performed similarly, except over a two-dimensional genomic fusion map divided into 100 × 100-kb tiles (Supplementary Methods). We performed this analysis first as a function of the distance between breakpoints (Extended Data Fig. 10a) and second as a function of the median number of rearrangements per sample, spanning values represented by histologies with more than 15 samples (Fig. 4b).
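The idea of solving for the lowest powered driver frequency can be sketched with a simple binomial model. This is an illustration under stated assumptions, not the published procedure: each patient is treated as mutated or not in the element, the background probability is derived from a per-base rate and the element length, detection sensitivity scales both the callable length and the driver frequency, and the per-element significance threshold (a 10% global rate divided by a hypothetical 20,000 tested elements) is chosen only for the example. All function names and parameter values are hypothetical.

```python
import math


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for j in range(k, n + 1):
        total += math.exp(
            math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * math.log(p) + (n - j) * math.log1p(-p)
        )
    return min(1.0, total)


def critical_count(n, p0, alpha):
    """Smallest number of mutated patients that is significant under background p0."""
    k = 0
    while binom_sf(k, n, p0) > alpha:
        k += 1
    return k


def min_powered_frequency(n_patients, bg_rate_per_base, element_length,
                          sensitivity=1.0, alpha=0.1 / 20_000,
                          target_power=0.9, step=0.001):
    """Lowest driver frequency (fraction of patients) reaching the target power."""
    # Probability that a patient carries at least one background mutation.
    p0 = 1.0 - (1.0 - bg_rate_per_base) ** (element_length * sensitivity)
    k_crit = critical_count(n_patients, p0, alpha)
    for i in range(1, int(round(1.0 / step)) + 1):
        f = i * step
        p1 = p0 + f * sensitivity * (1.0 - p0)  # background plus driver events
        if binom_sf(k_crit, n_patients, p1) >= target_power:
            return f
    return None  # not powered at any frequency


# Hypothetical cohorts: same element and background, different cohort sizes.
f_large = min_powered_frequency(2000, 1e-6, 1000)
f_small = min_powered_frequency(25, 1e-6, 1000)
```

In this sketch the larger cohort reaches the target power at a lower driver frequency than the smaller one, which mirrors the qualitative dependence on cohort size and background density described in the text.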

Estimation of the number of mutations in non-coding regions of known cancer genes

NBR was used to estimate the background mutation rate expected across cancer genes, using a conservative list of 19,082 putative passenger genes as background and including as covariates the local mutation rate, gene expression and averaged copy-number states. The resulting model predicted the number of passenger SNVs and indels expected by chance. By aggregating the expected numbers over 603 known cancer genes from the CGC69 (CGC v.80) (Supplementary Table 7), we compared the observed and expected numbers of mutations. For this analysis, we excluded samples with problems of low detection sensitivity (Supplementary Methods).
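The comparison of observed and expected counts reduces to an aggregated excess. This is a minimal sketch that assumes per-gene expected counts are already available from the background model; the function name and the example gene labels and counts are hypothetical, and clamping a negative excess to zero is an assumption made for illustration.

```python
def aggregate_excess(per_gene):
    """Aggregate observed and expected mutation counts and return the excess.

    per_gene: dict mapping gene name to (observed_count, expected_count).
    Returns (observed, expected, excess, excess_fraction), where the excess is
    the estimated number of driver mutations.
    """
    observed = sum(obs for obs, _ in per_gene.values())
    expected = sum(exp for _, exp in per_gene.values())
    excess = max(0.0, observed - expected)  # clamp: a deficit implies no drivers
    fraction = excess / observed if observed else 0.0
    return observed, expected, excess, fraction


# Hypothetical counts for two genes in one element type.
summary = aggregate_excess({"GENE_A": (30, 10.0), "GENE_B": (5, 7.5)})
```

Aggregating before subtracting, rather than clamping gene by gene, lets genes with fewer mutations than expected offset chance excesses elsewhere in the set.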

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.