Author Correction: Analyses of non-coding somatic drivers in 2,658 cancer whole genomes

The discovery of drivers of cancer has traditionally focused on protein-coding genes1–4. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium5 of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers6,7, raise doubts about others and identify novel candidates, including point mutations in the 5′ region of TP53, in the 3′ untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available. Analyses of 2,658 whole genomes across 38 types of cancer identify the contribution of non-coding point mutations and structural variants to driving cancer.

The discovery of drivers of cancer has traditionally focused on protein-coding genes [1][2][3][4] . Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium 5 of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers 6,7 , raise doubts about others and identify novel candidates, including point mutations in the 5′ region of TP53, in the 3′ untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available. the frequency and functional implications of these events remain understudied 6,7,13,16,17 .
Driver identification remains a far greater challenge in non-coding regions than in coding genes, owing to sequencing and mapping artefacts, poorly understood localized hypermutation processes 14,18,19 , incomplete annotation of regulatory regions, inaccurate estimation of the background mutation rate and the unknown functional effect of non-coding mutations. The discovery of drivers from structural variants is further complicated by their sparsity, the lack of obvious neutral events to build background models and their complex functional effects. Adequate statistical methods that address these issues are needed to reliably identify non-coding drivers.
The ICGC and TCGA PCAWG effort, which has collected and systematically analysed cancer genome sequences from 2,658 patients across 38 types of cancer 5 , offers an opportunity to characterize putative non-coding driver events that cannot be found using data from wholeexome sequencing or single-nucleotide polymorphism arrays. Here we describe a comprehensive search for non-coding somatic drivers. For point mutations (SNVs and indels), we combine results from multiple driver-discovery algorithms and, by carefully evaluating the significant hits, reveal that recurrent artefacts and poorly understood mutational processes have led to common false positives among previously reported non-coding drivers. For structural variants, we introduce two new methods for identifying both regions with significantly recurrent breakpoints (SRBs) and with significantly recurrent juxtapositions (SRJs), accounting for genomic heterogeneity in the rates of DNA break and repair and the three-dimensional architecture of the genome. Finally, to assess the potential for future non-coding driver discoveries, we quantify our statistical power in the PCAWG dataset and estimate the overall excess of point mutations in non-coding regulatory regions around known cancer genes.

Hotspot mutations across cancer types
Many protein-coding driver mutations occur in single-site 'hotspots'. In the PCAWG dataset, only 12 single-nucleotide positions were mutated in >1%, and 106 in >0.5%, of patients (Extended Data Fig. 1a, Methods). Although protein-coding regions span only about 1% of the genome, 15 out of 50 (30%) of the most frequently mutated sites were well-studied hotspots in cancer genes (KRAS, BRAF, PIK3CA, TP53 and IDH1) (Fig. 1a, Extended Data Fig. 1b), along with the two canonical TERT promoter hotspots 6,7 .
The remaining non-coding hotspots could be attributed to the following localized mutational processes associated with passenger events: (i) damage from ultraviolet (UV) light and impaired nucleotide excision repair in melanoma at sites occupied by transcription factors 5,[18][19][20] ; (ii) somatic hypermutation by activation-induced cytosine deaminase (AID) in B-cell non-Hodgkin lymphoma (Lymph-BNHL) and chronic lymphocytic leukaemia (Lymph-CLL); (iii) palindromic sequence contexts believed to form hairpin DNA structures targeted by APOBEC enzymes (in an intron of GPR126 (also known as ADGRG6) and the PLEKHS1 promoter) 10 ; and (iv) presumed technical artefacts (Fig. 1a, Supplementary Note 1). These findings suggest that-besides TERT promoter events-non-coding single-site hotspot drivers are infrequent or fall in regions with low sensitivity to detect mutations.

Discovery of point-mutation drivers
To identify recurrently mutated genomic elements, we first analysed somatic SNVs and indels in protein-coding regions, RNA genes (long and short non-coding RNAs and microRNAs (miRNAs)), and regulatory regions (promoters, 5′ untranslated regions (UTRs), 3′ UTRs and enhancers), totalling about 4% of the genome (Extended Data Fig. 2a-c, Methods, Supplementary Table 1). We analysed 2,583 tumours from 27 individual tumour types, and 15 meta-cohorts that grouped cancers by tissue of origin or organ system (Extended Data Fig. 2d, Methods). We identified candidate drivers-that is, cohort-element combinations with Q < 0.1 (10% false discovery rate (FDR))-by integrating 13 discovery algorithms, circumventing biases introduced by any one method (Extended Data Figs. 2e, 11, Supplementary Tables 2, 3, Supplementary Note 2). We benchmarked this approach by evaluating its ability to detect 603 known cancer genes (from the Cancer Gene Census (CGC) 21 , v.80), and found that combining methods improved performance compared to single algorithms (Extended Data Fig. 3a, b, Methods). Overall, we identified 1,294 significant hits that involved 520 unique candidates (Supplementary Tables 4, 5).

Filtering the significant hits
Even after conservative FDR control, false-positive 'driver' loci can remain, owing to inaccurate background models, sequencing and mapping artefacts, or local increases in mutations due to unaccounted-for mutational processes. We therefore systematically filtered the candidate driver elements on the basis of technical and biological criteria, followed by careful review (Extended Data Fig. 3c, Methods, Supplementary Note 3). Examples of filtered elements include the promoters of PIM1 (lymphoid tumours) and RPL13A (melanoma) because of associations with localized AID and UV-light mutational processes, respectively; PLEKHS1, GPR126, TBC1D12 and LEPROTL1 because of palindromic APOBEC target sequences 9,10 ; and the WDR74 5′ UTR and promoter 8,10,14 , owing to mapping problems detected in downstream manual review (Supplementary Table 5, Supplementary Note 4). In combination, filtering and reapplying FDR control discarded 589 out of 1,294 (46%) of the original cohort-element hits and 341 out of 520 (66%) unique elements (Extended Data Fig. 3c, Supplementary Tables 4, 5).

Candidate coding and non-coding drivers
Our stringent combination and filtering strategy yielded 705 hits in 179 genomic elements: 602 hits in 143 protein-coding genes and 103 hits in non-coding elements. We observed wide variability across different types of cancer, from one hit in clear-cell renal cancer to 80 in the pan-cancer meta-cohort (Fig. 1b, Supplementary Tables 4,5). Although most candidate drivers gained significance in larger metacohorts, some genes-such as DAXX (pancreatic endocrine tumour), NRAS (melanoma), SPOP (prostate adenocarcinoma), FGFR1 (pilocytic astrocytoma) and MIR142 (Lymph-BNHL)-scored higher in individual tumour types (Extended Data Fig. 3d). These results emphasize the trade-off between limiting driver discovery analyses to particular types of tumour and maximizing cohort size.
The candidate coding drivers we identified agreed with previous results: of the 143 genes that were significant in at least 1 cohort, 69% are in the CGC and nearly all have previously been implicated in cancer. In contrast to large whole-exome sequencing datasets, the fewer patients per cancer type in this dataset provided power sufficient only to detect genes with the strongest signal. We found 116 additional hits in 84 unique elements that were 'near significance' (0.1 < Q < 0.25). Fifty-one per cent of the 63 unique protein-coding genes in this set are in the CGC, which suggests that they would have been discovered in larger cohorts (Supplementary Table 4).
To nominate a significant non-coding element as a candidate driver, we reviewed the supporting evidence from the mutation calls, additional genomic data (chromosomal breakpoints, copy number, lossof-heterozygosity and expression data), cancer gene databases and the literature (Methods, Supplementary Tables 6, 10). We describe the key candidates below, and in Supplementary Note 4.
The TERT promoter was the most frequently mutated non-coding driver in this dataset (14 cohorts) (Fig. 1b), and these mutations were strongly associated with higher TERT expression, as has previously been reported 9 (Extended Data Fig. 4a, Supplementary Table 10).

Article
Mutations in the promoter and/or 5′ UTR of MTG2 (which encodes a GTPase involved in the mitochondrial ribosome) were associated with an expression of MTG2 that was marginally significantly lower, in both the pan-cancer (P = 0.036, fold difference = 0.8) and carcinoma (P = 0.029, fold difference = 0.8) meta-cohorts (Extended Data Figs. 4a, 5a). Mutations in the 5′ UTR have previously been shown to decrease MTG2 expression in vitro 22 .
Recurrent somatic events were identified in the 3′ UTRs of TOB1 (carcinoma and pan-cancer meta-cohorts), NFKBIZ (lymphomas) and ALB (liver cancer) (Fig. 1b). TOB1 encodes an anti-proliferation regulator      that associates with ERBB2, and also affects migration and invasion in gastric cancer 23 . TOB1 regulates other mRNAs through binding to their 3′ UTR and promoting deadenylation 24 . Tumours with 3′ UTR mutations in TOB1 showed a trend towards decreased expression (P = 0.053, fold difference = 0.7). The mutations did not concentrate in known miRNAbinding sites; however, the region is extremely conserved and thus probably functional (Fig. 2a). TOB1 and its neighbouring gene WFIKKN2 are focally amplified in breast cancer and pan-cancer, suggesting a complex role in cancer (Extended Data Fig. 4b). NFKBIZ is a transcription factor that is mutated in diffuse large B cell lymphoma and amplified in primary lymphomas 25 . Mutations in the 3′ UTR accumulated in a hotspot proximal to the stop codon and upstream of conserved miRNA-binding sites (Extended Data Fig. 5b). The enrichment of indels next to the stop codon suggests that this hotspot is not due to AID offtarget activity. Previous functional experiments have associated these mutations with increased NFKBIZ expression 25 , which we observed in our lymphoma cohort (P = 0.035, fold difference = 3.2; after correction for copy number, P = 0.03) (Extended Data Fig. 5b).
Both the exon and promoter of the non-coding RNA RMRP were significantly mutated in multiple types of cancer (Fig. 1b, Extended Data  Fig. 5c). Germline RMRP mutations cause cartilage-hair hypoplasia, and previous in vitro studies have shown that some somatic promoter mutations are functional 16 . The RMRP locus is also focally amplified in several types of tumour (Extended Data Fig. 4b). The enrichment of mutations in sites that can affect secondary structure suggests that these mutations are functional (P = 0.011, permutation test) (Extended Data Fig. 5c), although caution is required because this locus also appears to be affected by mapping artefacts or increased mutation rates (Supplementary Note 4).
The miR-142 precursor miRNA was significant in Lymph-BNHL and the lymphatic and haematopoietic cohorts ( Fig. 1b; Extended Data Fig. 5d). The locus is a known AID off-target region in lymphoma 12,26 , but 7 out of 8 mutations in the mature miRNA mir-142-p3--for which the largest functional effect is expected--were not assigned to AID, which suggests that these mutations are under selection 12 .

Unbiased genome-wide driver screen
To test whether we missed drivers by focusing on functionally annotated regions, we applied an unbiased genome-wide survey to all nonoverlapping 2-kb windows for excess point mutations. Twenty-two of the resulting 67 significant windows overlap with known protein-coding drivers, and 28 overlap highly transcribed regions with an excess of 2-5-bp indels (described in the 'Transcription-associated indel signature' section below) (Extended Data Fig. 5e Supplementary  Table 18. c-f, Mutations analysed in all unique cases (n = 2,583).

Increasing power for known cancer genes
Finally, we performed restricted hypothesis testing to boost the statistical power to detect cis-regulatory driver mutations near cancer genes from the CGC 21 (Supplementary Table 7). Restricted hypothesis testing of cancer gene promoters revealed a significant recurrence of TP53 promoter mutations (11 patients in pan-cancer, Q = 0.044), mostly comprising SNVs and deletions that affect the transcription start site or donor splice site of the first non-coding exon. In 10 out of 11 cases, the mutation occurred in combination with loss-of-heterozygosity, and all samples with expression data showed decreased mRNA levels ( Fig. 2b). None of these patients contained additional coding mutations that could instead be responsible for the downregulation of TP53. To our knowledge, this is the first report of a relatively infrequent-but impactful-form of TP53 inactivation by non-coding mutations. Focal gains or losses in cancer are selected for modulating expression levels of their target genes. Restricting the hypothesis testing to the non-coding elements of such genes (n = 216,986 cohort-element combinations, representing 5,201 unique elements) (Methods) yielded only one new hit, the 3′ UTR of the oncogene FOXA1 in prostate cancer (Supplementary Table 11).

Transcription-associated indel signature
Several significant non-coding elements (the ALB 3′ UTR, NEAT1, MALAT1 and MIR122) were hit by many indels; all have previously been reported to be mutated in cancer 10,15,27 (Figs. 1b,2c). To explore whether ALB 3′ UTR events are under selection, we calculated indel rates across the functional regions of this gene. The indel rate is notably high throughout the UTRs, introns and exons, and even downstream of the polyadenylation site-a pattern inconsistent with selection ( Fig. 2c, d).
Similarly, FOXA1 has high indel rates throughout its locus, whereas the indels in NFKBIZ and TOB1 are in their 3′ UTRs, suggesting that these are driver events (Fig. 2d). ALB, NEAT1 and MALAT1 mutations were not associated with changes in gene expression (Extended Data Fig. 4a) and were not associated with high cancer cell fractions or biallelic loss (Extended Data Fig. 6a, b). Likewise, indels in MIR122 were downstream of the mature miRNA, and were not associated with altered expression of the targets of this miRNA (Supplementary Note 5).
If the indels in these genes were due to a mutational process rather than selection, they might exhibit distinct features. Indeed, indels in NEAT1, MALAT1, MIR122 and ALB were strongly enriched in 2-5-bp-long events (Fisher's P < 6.8 × 10 −5 , for all) (Fig. 2e). A systematic search of coding and non-coding genes with significantly (Q < 0.1) increased rates of 2-5-bp indels revealed that this mutational process affects at least 18 additional genes in different types of tumour, most of which are highly expressed and tissue-specific (as has previously been reported for some of these genes 15 ) (Extended Data Fig. 6e, f). Although less enriched, SNVs also occur at high frequencies in these regions (Fig. 2f). Overall, our findings suggest that the indels in MALAT1, NEAT1, ALB and MIR122 are not driver events and are the result of a transcription-associated mutational process. The previously reported oncogenic effect of altered MALAT1 and NEAT1 expression 27-29 may thus be unrelated to these mutations. Our findings also suggest that although FOXA1 protein-coding indels are drivers, 3′ UTR indels might be passengers 30 .

Breakpoints at driver and fragile sites
Driver structural variants may act by disrupting one or both of their breakpoint loci (for example, deactivating a tumour suppressor), or by generating a novel juxtaposition between loci. We thus searched both for genomic regions with SRBs and for pairs of regions with SRJs (Extended Data Fig. 7).
For SRBs, we first defined a background model to predict breakpoint density, using eight explanatory variables (Methods, Supplementary Table 13) and accounting for unexplained sources of variation 31 (Supplementary Note 6). We identified 53 disjoint regions with SRBs (Q < 0.1) (Fig. 3a, Supplementary Table 14), which cleanly divided into two groups on the basis of the variability of the breakpoints at the other side of the rearrangements. Eight SRBs had partner breakpoints that were tightly clustered (had low rearrangement dispersion scores; Methods) and represented known oncogenic fusions. The remaining 45 SRBs had dispersed partner breakpoints (had high rearrangement dispersion scores), and were largely associated with previously identified somatic copy-number alterations (SCNAs) (Fig. 3b).
It has been difficult to distinguish recurrent driver SCNAs from passenger events at fragile sites 32 . At the resolution afforded by whole-genome sequencing, late replication timing predicted fragility-associated SRBs better than existing fragile site annotations (Supplementary Note 7), identifying 12 fragile-like SRBs (Fig. 3b). The remaining 33 SCNA-like SRBs comprised 14 amplifications, 8 deletions and 11 copy-neutral events (Supplementary Table 14).
The different classes of SRB were associated with different effects on neighbouring genes. Five of the eight deletion-associated SRBs were associated with biallelic inactivation of nearby known tumour suppressors, compared to none of the 12 fragile-like SRBs (P = 0.039) (Extended Data Fig. 8a). The fragile-like SRBs were furthest from tissuematched enhancers and caused the weakest expression changes, consistent with them being passenger events 32 . By contrast, fusion-like SRBs were closer to tissue-matched enhancers than the other SRBs (P < 0.01) (Extended Data Fig. 8b) and were associated with greater changes in expression than all other SRBs except amplifications (P < 0.05 for all types) (Extended Data Fig. 8c, Methods). Our analyses indicate that SRB driver events can be classified using rearrangement dispersion scores, replication timing and gene expression. Notably, neither rearrangement dispersion scores nor association with replication time can be accurately determined from microarrays or whole-exome sequencing, which highlights the importance of whole-genome sequencing. Altogether, we identified SRBs at 34 sites of known oncogenic fusions and recurrent SCNAs, 5 additional sites that are probably due to DNA fragility and 14 novel driver candidates (Supplementary Note 8).

Novel structural-variant driver candidates
Although most SCNA-like SRBs act by altering gene copy numbers, several appeared to target regulatory elements. We identified three that were significantly (Q < 0.05) associated with expression changes of nearby genes after controlling for copy number (Methods), two of which we discuss here. The first comprised structural variants at 10p15, which were associated with a greater than twofold upregulation of AKR1C1, AKR1C2 and AKR1C3 in seven cases of lung squamous cell carcinoma and two cases of liver hepatocellular carcinoma (Extended Data Fig. 8d). AKR1C proteins are aldo-keto reductases involved in steroid homeostasis. Ectopic expression transforms cell lines, and germline mutations have previously been linked to an increased risk of developing lung cancer 33,34 . Three-quarters of the breakpoints are near (<10 kb) lineage-specific enhancers, potentially altering promoter-enhancer interactions (and hence gene expression). However, because the highest density of breakpoints lies between two long inverted repeats, the structural variants may have been induced by DNA secondary structure.
The second SRB contains recurrent microdeletions (<50 kb) involving the 5′ end of BRD4 in ovarian (eight cases, P < 10 −7 ) and breast tumours (six cases, P < 0.04) (Fig. 3c, Extended Data Fig. 8e). These deletions were highly enriched in cancers that amplified a segment that includes BRD4 and NOTCH3 (P < 0.004) (Fig. 3d, Extended Data Fig. 8f) but were not a direct consequence of these amplifications (Supplementary Note 9). BRD4 is a chromatin regulator and a therapeutic target in several types of cancer 35,36 , including ovarian and triple-negative breast cancer 37,38 . Given the increased copy number of the full BRD4 gene, we would expect increased gene expression. However, the microdeletions are associated with a lower expression of BRD4 in breast (P = 0.001) and ovarian tumours (P = 0.04), but not of the neighbouring gene NOTCH3 (Fig. 3e). The focal deletions in BRD4 overlap a prominent exon-1 H3K4me3 peak and intron-1 enhancer elements in HMEC (normal breast) and MCF-7 (breast tumour) cells (Extended Data Fig. 8e), which suggests that these deletions disrupt regulatory elements.   Expression at the primary locus in samples with the rearrangement relative to samples without the rearrangement is greater for SRJs than for other rearrangements (left). The tissue-specific expression at the secondary locus in wild-type (WT) samples, relative to samples of different tissue types, is greater for SRJs than other rearrangements (right). P values represent comparisons to 'not in SRJ'. d, e, g, Box plots show the interquartile range, median and 95% confidence interval; two-sided t-test. h, TERT promoter mutations and rearrangements across PCAWG melanomas. i, Rearrangements between TERT promoter and BASP1 and MYO10 locus result in focal amplification and relocation of distal enhancers to TERT. AML, acute myeloid leukaemia; Colorect, colorectal; Leiomyo, leiomyosarcoma; MPN, myeloproliferative neoplasm; Osteosarc, osteosarcoma; PiloAstro, pilocytic astrocytoma.

Article
To our knowledge, this is the first evidence of a recurrent microdeletion limiting expression of an amplified gene.

Recurrent fusions target gene regulation
Motivated by the detection of fusion-like SRBs, we specifically looked for genomic loci that were juxtaposed more often than expected by chance, after controlling for both the rate of breakpoints at each locus and the distance between them (Methods). We identified 90 such SRJs (Fig. 3f, Supplementary Table 15), including 13 known oncogenic fusions (including all 8 fusion-like SRBs) and 77 novel hits-18 of which linked to at least one known cancer gene (Supplementary Note 8). Previously reported oncogenic SRJs were observed more frequently (average 24 patients per fusion, range 2-98) than novel ones (most often 2 patients per fusion, range 2-4). As juxtapositions are unlikely to occur by chance, observing even two becomes highly significant. However, it is possible that some SRJs reflect inaccuracies in our background model rather than true drivers. We therefore further evaluated the SRJs on the basis of (i) a 'robustness factor' that indicates how much the background rate could increase before the SRJ would become insignificant, and (ii) the ratio between the observed and expected numbers of events under the current background model ('effect size') (Extended Data Fig. 9a). Twenty-six SRJs, including 11 of the 13 known drivers and 15 newly identified SRJs, are robust to tripling the expected background rate, and 22 others would remain significant with a doubled rate.
Most canonical driver rearrangements have previously been found in single tumour types, often associated with tissue-specific expression 39,40 . We found that 9 of our top 10 SRJs are tissue-specific, despite searching across 30 different types of tumour. Such tissue specificity is not observed for cancer genes affected by SCNAs, for which the top 10 are altered in 11.9 cancer types (on average), or by point mutations (for which the top 10 are altered in 6.7 cancer types, on average) (Supplementary Table 16).
The tissue specificity of SRJs suggests that they are strongly shaped by epigenetic state, either owing to mechanistic reasons (for example, tissue-specific three-dimensional proximity of the two DNA breakpoints) or to selection that connects tissue-specific regulatory elements with oncogenes 13,41-43 . The latter seems to be more likely because: (i) SRJs are associated with significant overexpression of only one of the rearrangement partners (the 'primary locus') relative to randomly selected rearrangements (primary locus, P < 10 −4 (Fig. 3g left); secondary locus, P > 0.05 (Extended Data Fig. 9b left)); (ii) the rearrangement partner, in the secondary locus, tends to be highly expressed in that tissue type relative to others (Fig. 3g right); and (iii) the distance to the nearest tissue-specific enhancer is smaller for SRJs than for rearrangements overall (Extended Data Fig. 9b). These observations suggest that SRJs act in general by bringing regulatory elements to an oncogene that is otherwise expressed at a low level.
In many cases, SRJs generate truncated or chimeric proteins, and breakpoints within introns or exons were indeed overrepresented (68% versus 56% expected, P < 10 −7 ). However, only 11 of the 30 (37%) most significant SRJs generated novel proteins in all samples, and 6 others sometimes generated novel proteins; the rest were either nondisruptive or contained breakpoints within the first two introns of the disrupted gene, leaving most of the protein intact 44 (Fig. 3f). Moreover, SRJs that generate novel proteins exhibited expression changes similar to those that do not (P = 0.4) (Extended Data Fig. 9c). We conclude that altering gene expression is a key function of both classes of SRJs, and that SRJs are akin to non-coding driver point mutations that act on regulatory elements.
We found several SRJs that involve amplified oncogenes, including MDM2, EGFR and TERT (Fig. 3f, h, i, Extended Data Fig. 9d-f, Supplementary Table 15). The TERT promoter region was juxtaposed in four melanomas (P < 10 −7 ) to a region in the BASP1 gene (both on chromosome 5), and to a region near NDUFC2 (t(5,11)) in two melanomas and one medulloblastoma (P < 10 −8 ). Both juxtaposed regions were marked with melanocyte enhancers, which suggests that they could drive TERT expression. Among melanomas, these rearrangements are mutually exclusive with the C228T and C250T mutations of the TERT promoter (P < 10 −3 ) (Fig. 3h). Because the juxtapositions were always part of complex events that also amplified TERT, increased TERT expression may be due to amplification, the juxtapositions or both.

Paucity of non-coding drivers in cancer
Our analyses of genomic hotspots, functional elements, genomic windows and SRJs all suggest that non-coding drivers are rare compared to protein-coding drivers. This might, in part, be due to a lack of discovery power 3 . We therefore evaluated the discovery power of mutational-burden tests for recurrent events across the different types of element in our tumour cohorts, focusing first on point mutations 3,16 . We found that the fraction of mutated patients required for a driver to reach 90% discovery power ranged from <1% in large cohorts with low background-mutation densities to 25% in small cohorts with high background-mutation densities (Fig. 4a). Different types of element were similarly powered, suggesting that the paucity of drivers in noncoding versus coding elements is not due to a lack of power. Similarly, our power to detect SRJs was higher in large cohorts with low rearrangement rates, and for long and interchromosomal rearrangements owing to their lower overall rates (Extended Data Fig. 10a): we were only powered to detect events that recur in 5-20% of samples in most types of cancer (Fig. 4b). Moreover, beginning with about 2,500 tumours, we expect to find a new SRJ with every 25 additional genomes (Fig. 4c).
Low sequencing coverage (for example, in GC-rich regions 45 ) also limits driver discovery. To measure this effect in the PCAWG data, we quantified our ability to detect mutations (detection sensitivity) 16 in cancer gene promoters. Although the mean detection sensitivity in promoters is high (41.9% of genomic positions have mean detection sensitivity >80% across tumours), only 4.1% of the promoters had detection sensitivity >90% in >90% of bases. In particular, the two canonical TERT promoter hotspots had highly variable detection sensitivity among patients and cohorts, from only 3% of patients in the centralnervous-system pilocytic astrocytoma cohort to 100% in the thyroid adenocarcinoma cohort (Extended Data Fig. 10b). From these data, we inferred the expected number of TERT events in each tumour type (Extended Data Fig. 10c) and found that about 263 (95% confidence interval 232-295) TERT hotspot mutations were probably missed owing to a lack of detection sensitivity. Moreover, on average 9.9% (1.3-13.0% interquartile range) of the cancer gene promoter territory in the tumour of each patient was severely underpowered (an average detection sensitivity of <10%). Therefore, the lack of coverage in promoters may contribute to the paucity of non-coding drivers.
To determine whether the paucity of non-coding drivers discovered thus far could be due to the limited statistical power of current datasets, we estimated the overall excess of point mutations above background (that is, the expected number of driver events) in coding and cis-regulatory non-coding sequences in 603 cancer genes 46 (Methods, Supplementary Table 7, Supplementary Note 11). To minimize the effect of samples with low detection sensitivity, we included only 936 samples with >90% detection sensitivity at the two TERT promoter hotspots (Extended Data Fig. 10c, d, Supplementary Note 11). Overall, this approach predicted more than 1,475 driver mutations (95% confidence interval 1,410-1,687; 1,069 SNVs and 406 indels) in the proteincoding sequences of these cancer genes (Fig. 4d), compared to only 96 (95% confidence interval ) estimated driver mutations in promoters (73 attributed to TERT), 22 (95% confidence interval 0-88) in 5′UTRs, and 68 (95% confidence interval 0-178) in 3′ UTRs. Non-coding mutations in cancer-gene promoters were also not generally associated with loss-of-heterozygosity or altered expression, as one would expect if they were enriched with drivers (Supplementary Note 12).
These results collectively indicate that, independently of statistical power, non-coding cis-regulatory driver mutations in known cancer genes besides TERT are much less frequent than protein-coding drivers.

Discussion
The accurate and reliable discovery of genomic drivers in tumours may have critical implications for patients with cancer. Our findings and the methods introduced here for the discovery of point-mutation and structural-variant drivers, method integration, vetting of candidates and identification of local hypermutation and fragile sites represent an important contribution to the collective effort towards charting all malignant changes that drive the cancer of each patient 5 .
Among the most interesting candidate non-coding driver elements we uncovered are the 5′-end mutations in TP53; 3′ UTR mutations in NFKBIZ and TOB1; and rearrangements involving AKR1C genes and BRD4. By careful analysis of the whole-genome sequencing data, we found that several previously reported and frequently altered noncoding elements may not be genuine drivers, including (i) the noncoding RNAs, NEAT1 and MALAT1 (which contain a high density of indels, seemingly owing to a transcription-associated mutational process) and (ii) recurrent structural variants in regions of late replication, indicating DNA fragility.
This study yielded unexpectedly few non-coding driver point mutations and structural variants. SRJs, which appear to act largely through the rearrangement of regulatory elements, are less frequent than SCNA-like SRBs, which directly amplify or delete coding sequences. The results from five analyses--hotspot recurrence, driver-element discovery, structural variants, discovery power and aggregated mutational excess--suggest that this paucity is not caused by a particular analysis strategy, but that regulatory elements truly contribute a much smaller number of recurrent cancer-driving events than protein-coding sequences. This paucity of non-coding drivers contrasts with the distribution of germline polymorphisms associated with heritability of complex traits, which are most frequently located outside of proteincoding genes 47 .
Article cancer genes suggests that point mutations markedly affect the function of non-coding regulatory elements only rarely. This highlights TERT as a notable exception, perhaps because even a modest increase in TERT expression may suffice to circumvent normal telomere shortening. For other cancer genes, directly mutating protein-coding sequences or altering expression levels by copy-number change may provide larger phenotypic effects. For example, complete loss-of-function by nonsense mutations or deletions may be easier to achieve than by disrupting or translocating regulatory regions.
Technical shortcomings (such as coverage 'blind spots' in GC-rich promoters and different filtering strategies) may cause genuine drivers to be missed 48 . Therefore, the discovery of non-coding drivers will benefit from technical improvements, including even sequence coverage, longer and accurate reads, and improved variant-calling methods. Moreover, better annotation of functional non-coding elements will increase both the power to discover infrequently mutated driver elements and their interpretability. As datasets grow, yet-unidentified mutational mechanisms targeting particular genomic regions will emerge and require improved background models, including additional covariates and more-sophisticated statistical models. The analysis of structural variants has greater challenges because (i) accurately modelling their background density is complicated by their lower frequency and larger fraction of drivers (Supplementary Note 6); (ii) their target genes may be far from the breakpoints, as in SCNAs; (iii) the space for modelling SRJs is much larger (the genome squared); and (iv) many structural variants are part of complex events that often involve multiple chromosomes 31 , so that the resultant topology cannot be deduced without technologies such as long-or linked-read sequencing 49,50 . For these reasons, experimental validation remains important for all-and especially for non-coding-candidate drivers.
Our work suggests that larger datasets and technological advances will continue to identify new non-coding drivers, albeit at considerably lower frequencies than protein-coding drivers. We anticipate that the approaches developed here will provide a solid foundation for the incipient era of driver discovery from ever-larger numbers of cancer whole genomes.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-020-1965-x.

Methods
No statistical methods were used to predetermine sample size. The experiments were not randomized and investigators were not blinded to allocation during experiments and outcome assessment.
Detailed methods are provided as Supplementary Methods.

Dataset generation
Out of 2,955 samples, we selected 2,583 unique donor samples for SNV and indel driver-discovery analysis on the basis of SNV quality control (Supplementary Methods). We found that 110 additional myeloid-AML samples had robust structural variant calls despite SNV artefacts; we included these in structural variant analyses, for a total of 2,693 samples. For tumour-type cohort analyses, we used only cohorts with at least 20 patients. Tumour meta-cohorts were defined by cell type of origin or by organ system (for example, lung for lung adenocarcinoma and lung squamous cell carcinoma). A pan-cancer meta-cohort was created by combining all tumour cohorts except for Skin-Melanoma and lymphoid tumours (Supplementary Methods).

Hotspot SNV analysis
We selected the 50 most-frequent SNV hotspots. These were analysed to identify known driver events; mutational signature biases related to sequence palindromes, immunoglobulin loci and so on; and potential artefacts, including regional mapping problems (Supplementary Methods).

Mutational signatures
We performed de novo global-signature discovery and signature attributions with SignatureAnalyzer's Bayesian non-negative matrix factorization method 52 , based on 1,697 channels-including 1,536 pentanucleotide sequence contexts for single-base substitutions, 83 indel features, and 78 doublet-nucleotide substitution classes (Supplementary Methods).

Definition of genomic elements
GENCODE v.19 (ref. 53 ) and other genomic resources were used to define functional genomic elements, including protein-coding genes (CDS, splice sites, 5′ UTR, 3′ UTR and promoters), long non-coding RNAs (gene body, splice site and promoters), short RNAs, miRNAs and enhancers (Supplementary Methods).

Candidate-driver-mutation identification methods and combination of results
We obtained results (P values) from 13 methods of driver discovery, including ActiveDriverWGS 54 , CompositeDriver, DriverPower 55 , dndscv 46 , ExInAtor 56 , LARVA 57 , MutSig tools 3 , NBR 10 , ncdDetect 58 , ncDriver 59 , OncodriveFML 60 and regDriver 61 . We integrated the results of all these methods using a custom framework based on a previously published method 62 for combining P values. Results from individual methods that showed large deviations from the expected uniform null distribution of P values were excluded. This approach was evaluated on real and simulated data. We controlled the FDR within each of the sets of tested genomic elements by concatenating all combined Brown's P values from across all tumour-type cohorts and applying the Benjamini-Hochberg procedure 63 . Cohort-element combinations with Q values < 0.1 were designated as significant hits, and combinations with 0.1 ≤ Q < 0.25 as 'near significance'. Extensive details are provided in the Supplementary Methods. In addition, we tested for elementindependent recurrence with the NBR method on 2-kb bins spanning the entire genome, and non-coding ultraconserved regions 64 .

Post-filtering of driver mutation candidates
We applied stringent filters to discern positive selection from technical artefacts and mutational processes. We required at least three mutations to be present in candidate elements, in at least three patients of the tested cohort; more than 50% of mutations in mappable regions; less than 50% of mutations in palindromic DNA; and less than 50% of mutations attributed to APOBEC activity. For lymphoid tumours and skin melanoma, we required that <35% and <50% of mutations were attributed to the AID and UV-light mutational signatures, respectively. The FDR was recalculated after post-filtering.

Candidate driver structural-variant analyses
We applied separate analyses to detect recurrent structural variant breakpoints and recurrent juxtapositions. For each analysis, we first binned breakpoints, accepting only one breakpoint per sample per bin. We then determined which bins had more breakpoints than expected by chance (the SRB analysis), and which pairs of bins (or 'tiles') were joined by more rearrangements than expected by chance (the SRJ analysis).

Candidate driver breakpoints
We calculated the background rate of breakpoints per bin based on a Gamma-Poisson model 15 that took into account genomic covariates, breakpoint counts normalized by the number of bases within each bin that had sufficient mappability to be eligible for breakpoint detection and accounted for an observed overdispersion of breakpoint counts that probably reflects unaccounted-for covariates (Supplementary Methods). We used the Gamma-Poisson model to calculate the P value for each bin (that is, the probability that each bin would exhibit the observed number of breakpoints (or greater) by chance alone), applying the Benjamini-Hochberg procedure 63 to correct for multiple hypotheses.

Post-filtering of driver breakpoint candidates
We scored each recurrent breakpoint locus on the basis of the average replication timing of its breakpoints, and filtered those loci with scores >0.5 as probable fragile sites 65 .

Candidate driver juxtapositions
We developed a background model to indicate the probability that two loci would be joined, taking into account the observed rate at which each locus underwent DNA breaks (from the breakpoint analysis), the distance between them and the propensity for these rearrangements to reflect a break followed by invasion versus two breaks that were then joined. We determined the probability that each tile would contain the observed number of rearrangements using a binomial test, followed by controlling for multiple hypothesis testing using the Benjamini-Hochberg procedure 63 .

Gene-expression analyses
Gene-expression data were provided by the PCAWG Transcriptome Core Group 66 , and also generated using the same approach for an extended set of non-coding transcripts (Supplementary Methods).

Additional evidence for selection
In addition to associations between mutations or structural variants and expression, we looked for signals of copy-number-alteration recurrence using the GISTIC2 algorithm 67 . We also tested whether driver candidates showed significantly higher frequency of loss-of-heterozygosity in mutated samples using Fisher's exact test. We calculated cancer allelic fractions using ploidy and tumour purity predictions from a previous publication 68 .

Mutational process and indel enrichment
For every gene, we calculated the proportion of indels of length 2-5 bp out of the total number of indels. This proportion was compared to the genome background proportion using a binomial test. We also compared the indel rate per gene (not distinguishing by length) to the background. Both sets of P values were corrected with the FDR method.

Power calculations
We estimated our power to discover driver elements mutated at a particular frequency in the population as previously described 3,16 , but solving for the lowest frequency for a driver element in the patient population that is powered (≥90%) for discovery. The calculation of this lowest frequency takes into account (i) the average background mutation frequencies for each cohort-element combination; (ii) the median length and average detection sensitivity for each element type and patient cohort size; and (iii) a global desired false-positive rate of 10%. The effect of element length is discussed in Supplementary Note 10, and details are provided in Supplementary Methods. Power calculations for detection of recurrent juxtapositions was performed similarly, except over a two-dimensional genomic fusion map divided into 100 × 100-kb tiles (Supplementary Methods). We performed this analysis first as a function of the distance between breakpoints (Extended Data Fig. 10a) and second as a function of the median number of rearrangements per sample, spanning values represented by histologies with more than 15 samples (Fig. 4b).
Estimation of the number of mutations in non-coding regions of known cancer genes NBR was used to estimate the background mutation rate expected across cancer genes, using a conservative list of 19,082 putative passenger genes as background and including as covariates the local mutation rate, gene expression and averaged copy-number states. The resulting model predicted the number of passenger SNVs and indels expected by chance. By aggregating the expected numbers over 603 known cancer genes from the CGC 69 (CGC v.80) (Supplementary Table 7), we compared the observed and expected numbers of mutations. For this analysis, we excluded samples with problems of low detection sensitivity (Supplementary Methods).

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Data availability
Data associated with this Article are available at https://dcc.icgc.org/ releases/PCAWG/drivers. SRBs and SRJs are available at www.svscape. org. A list of data files used for analyses in this paper is provided in Supplementary Table 20. Somatic and germline variant calls, mutational signatures, subclonal reconstructions, transcript abundance, splice calls and other core data generated by the ICGC and TCGA PCAWG Consortium are described in an accompanying Article 5 , and are available for download at https://dcc.icgc.org/releases/PCAWG. Additional information on accessing the data, including raw read files, can be found at https://docs.icgc.org/pcawg/data/. In accordance with the data access policies of the ICGC and TCGA projects, most molecular, clinical and specimen data are in an open tier that does not require access approval. To access information that could potentially identify participants, such as germline alleles and the underlying sequencing data, researchers will need to apply to the TCGA data access committee via dbGaP (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login) for access to the TCGA portion of the dataset, and to the ICGC data access compliance office (http://icgc.org/daco) for the ICGC portion of the dataset. In addition, to access somatic single-nucleotide variants derived from TCGA donors, researchers will also need to obtain dbGaP authorization.

Code availability
The core computational pipelines used by the PCAWG Consortium for alignment, quality control and variant calling are available to the public  Fig. 3 | Sensitivity of driver-discovery methods and filter statistics. a, Percentage of coding-driver discovery runs (with stable F 1 score, n = 33), across all cohorts, in which the method had the highest F 1 score (Methods). b, F 1 score of different methods of driver discovery, and different combinations evaluated in the four largest cohorts (pan-cancer (n = 2,278), carcinoma (n = 1,856), adenocarcinoma (n = 1,631) and digestive tract (n = 797)).

Extended Data
Only methods that used the same algorithm to call coding and non-coding drivers were evaluated. Vertical lines indicate 95% confidence intervals. Horizontal black lines mark the median in each group. P values were calculated with the two-sided non-parametric Mann-Whitney U test. c, On top, the initial number of hits identified as recurrently mutated for each element type. The element types mature miRNA (n = 2 before filtering) and miRNA promoters (n = 16 before filtering) were omitted from the table. The heat map shows the number of hits filtered at each step in the sequential application of filters and post-filtering re-application of the FDR correction. Background colours indicate the corresponding percentage of input element removed. The final numbers of hits (including those that were later filtered by the comprehensive vetting procedures) are indicated below the heat map. d, Sensitivity versus specificity in individual cohorts versus meta-cohorts for candidate drivers: Q values for the most significant individual cohort (x axis) versus meta cohort (y axis) are shown. Driver elements are coloured by their element type. Q values derived from combination of P values from individual driver-discovery methods (Methods). Fig. 4 | Mutation-to-expression correlation and focal copynumber alterations. a, Expression is compared between mutated and nonmutated samples. For each element, the z score of the expression values for mutated and wild type in the significant cohort is plotted. For copy number, CNA amplification indicates CNA > 10; CNA gain indicates CNA ≥ 3; CNA loss indicates CNA ≤ 1; and no events indicates CNA < 3 and CNA > 1. If a patient is mutated with multiple types of point mutation, indels are indicated over SNVs.

Extended Data
For TERT, only samples powered to call mutation status were used. P values are based on a two-sided Wilcoxon rank-sum test. Bars indicate means. b, Copynumber profiles of 55 of 441 stomach adenocarcinomas from TCGA show copynumber gains around HES1. TOB1 and its gene neighbour WFIKKN2 are focally amplified in cancer (172 of 10,844 total samples from 33 cancer types are shown). RMRP focal amplifications in TCGA cancers (160 of 10,844 total tumours shown). Fig. 5 | Non-coding driver candidates. a, MTG2 promoter locus (left) and associated gene-expression changes in carcinoma tumours (right). Expression of MTG2 in mutated (n = 3) versus the carcinoma meta-cohort wildtype cases (n = 896). Two-sided Wilcoxon rank-sum test. Bars represent means. b, Genomic locus of NFKBIZ 3′ UTR (left) and associated gene-expression changes in Lymph-BNHL (right). Expression of NFKBIZ in mutated (n = 6) versus wild-type cases (n = 98). Test and bars as in b. c, Genomic locus of the RMRP transcript and promoter region (left). RMRP is an RNA component of the endoribonuclease RNase MRP, the function of which depends on its RNA secondary and tertiary structure. The RNA secondary structure, tertiary structure interactions, protein and substrate interactions, and mutations with their predicted structural effect (right) of RMRP; lymphoma and melanoma mutations are excluded. d, MIR142 locus and mutations in patients with lymphoma with the AID signature annotation. e, Manhattan-style plot showing significance of mutation recurrence enrichment for genomic bins (top) and ultraconserved elements (bottom) across cohorts (Methods; Supplementary Table 9). Fig. 11 | P value combination details. a, Quantile-quantile plots of P values reported by various driver-detection algorithms on the three simulated datasets (Broad, DKFZ and Sanger; shown for coding regions (n = 20,172) in the meta-carcinoma cohort; see Methods for details for the statistical background model or test of each algorithm) showed no major enrichment of mutations above the background rate. Results generally followed the expected null (uniform) distribution, and the P values reported on simulated data were subsequently used to assess the covariance of method results. b, Quantile-quantile plots of integrated P values using the Brown and Fisher methods for combining P values across the results from different driverdetection algorithms were generated for a few representative tumour cohorts (shown here for coding regions). Brown combined P values (light blue) generally followed the null distribution as expected, whereas Fisher combined P values were significantly inflated (dark blue), confirming that dependencies existed between the results reported by the various driver-detection algorithms. To simplify the integration procedure, we calculated covariances using P values from the observed data instead of simulated data and found that the integrated results based on the observed covariances (first column of plots) were essentially the same as the results obtained using the simulated covariances (second, third, and fourth columns of plots). c, Triangular heat maps showing the Spearman correlations of P values among the various driverdetection methods in observed versus simulated data (coding regions (n = 20,172), colorectal adenocarcinoma cohort) are highly similar. Differences in the observed and simulated correlation values (shown in the heat maps on the far right) were minimal, and thus the final integration of P values across methods was performed using covariances estimated on observed data. d, Brown combined P values based on observed and simulated covariance estimations (shown on the right, top heat map, for coding regions in glioblastoma) did not differ noticeably. In cases in which individual methods reported results that yielded substantially fewer hits than the median across all methods (bottom heat map, methods in light grey with results in dashed box), removing the methods from the integration did not affect the number of significant genes identified (right column of results in bottom heat map, shown for coding regions in lung adenocarcinoma). Number of coding regions as in c.