Analyses of non-coding somatic drivers in 2,658 cancer whole genomes

The discovery of drivers of cancer has traditionally focused on protein-coding genes1–4. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium5 of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers6,7, raise doubts about others and identify novel candidates, including point mutations in the 5′ region of TP53, in the 3′ untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.



Patient cohorts
Generation of high-quality tumor set (Loris Mularoni)

We selected a total of 2,583 samples to be included in the point mutation driver discovery analyses. This list contains all the samples that were not flagged as problematic by the PCAWG Technical Working Group. A single aliquot was assigned to each sample; in cases where multiple aliquots were present, we selected a single aliquot based on the following criteria, in order of importance:
- we prioritized primary tumors over metastatic or recurrent tumors
- we selected aliquots with an OxoG score higher than 40
- we prioritized aliquots with the highest quality (as indicated by the Stars values)
- we prioritized aliquots with RNA-seq data availability
- we prioritized aliquots with the lowest contamination (as indicated by the ContEst values)
- if a selection could not be made after applying the above filters, we selected an aliquot randomly
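The priority ordering above can be expressed as a lexicographic sort key. The sketch below is illustrative only; the field names on the aliquot records ("is_primary", "oxog_score", "stars", "has_rnaseq", "contest") are hypothetical stand-ins for the PCAWG metadata columns:

```python
# Sketch of the aliquot-ranking logic: criteria are compared in order of
# importance, with a random choice among any remaining ties.
import random

def pick_aliquot(aliquots, seed=0):
    """Return the single preferred aliquot for a sample."""
    rng = random.Random(seed)
    def key(a):
        return (
            a["is_primary"],       # primary over metastatic/recurrent
            a["oxog_score"] > 40,  # OxoG score above 40
            a["stars"],            # higher quality (Stars)
            a["has_rnaseq"],       # RNA-seq availability
            -a["contest"],         # lower contamination (ContEst)
        )
    best = max(aliquots, key=key)
    ties = [a for a in aliquots if key(a) == key(best)]
    return rng.choice(ties)        # random selection among ties
```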

Selection of tumor cohorts for analysis (Esther Rheinbay)
Individual tumor type cohorts from the high-quality tumor set were selected for analysis if they met a minimum size. This size was determined from the cumulative number of patients, such that no more than 2.5% of total patients were excluded. This led to a minimum cohort size criterion of 20 patients, and removed the Bone-Cart (9 donors) and Bone-Epith (11 donors) cohorts.

Signatures discovered in the low-mutation burden samples were also used to explain some of the spectra of hyper-mutated samples.
Only signatures discovered in the low-mutation burden sample set were attributed to mutations in those samples. In contrast, mutations from hyper-mutated samples could be attributed to signatures discovered in either the low- or hyper-mutated sample set. A similar strategy was used for determining signature attributions: we performed a separate attribution process for low- and hyper-mutated samples using the COMPOSITE signatures.
Signature attribution in low-mutation burden samples was performed separately in each tumor type (e.g., Biliary-AdenoCA, Bladder-TCC, Bone-Osteosarc). Attribution was also performed separately in MSI (n = 39), POLE (n = 9) and skin (n = 107) samples, and in a single TMZ-treated tumor. In each separate cohort, the signature activity (i.e., which signatures are active) was primarily inferred through the automatic relevance determination process applied to the activity matrix H only, while fixing the normalized signature matrix W. The attribution in low-mutation burden samples used only signatures found in step 1 of the signature extraction. Two additional rules were applied in SBS signature attribution to enforce biological plausibility and minimize signature bleeding: (i) signature BI_COMPOSITE_SBS4 (smoking signature) was allowed only in lung and head and neck cases; (ii) signature BI_COMPOSITE_SBS11 (TMZ signature) was allowed only in a single GBM sample. This was enforced by introducing a binary, signature-by-sample indicator matrix Z (1 = allowed, 0 = not allowed), which was multiplied element-wise into H in every multiplicative update of H.
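The indicator-matrix constraint can be sketched as a masked multiplicative update. This is a minimal illustration using a plain KL-divergence NMF update with the signature matrix W held fixed; the update rule and shapes are assumptions for illustration, not the exact PCAWG implementation:

```python
# Attribution step: W is fixed, only H is updated, and the binary matrix Z
# zeroes out disallowed signature/sample pairs at every iteration.
import numpy as np

def attribute(V, W, Z, n_iter=500, eps=1e-10):
    """V: contexts x samples counts; W: contexts x signatures (fixed);
    Z: signatures x samples indicator (1 = allowed, 0 = not allowed)."""
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) * Z   # start inside the allowed set
    for _ in range(n_iter):
        WH = W @ H + eps
        # standard KL-divergence multiplicative update for H
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        H *= Z                                     # re-apply the indicator
    return H
```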

Local signatures analysis:
We also performed a de novo local signature analysis in lymphoma samples (n = 197) to identify mutational signatures of activation-induced cytidine deaminase (AID). Since the majority of mutations from AID-related processes cluster near the IgH locus and several known off-target sites, we considered the clustering of mutations as an additional feature in the signature discovery 3 . In brief, we first calculated, for each mutation, the nearest mutation distance (NMD) to all other mutations on the same chromosome in the same patient. We then stratified mutations into two groups: 'clustered' (NMD ≤ 1 kb) and 'non-clustered' (NMD > 1 kb). Next, the clustered and non-clustered mutation sets in each sample set were analyzed as separate entities with a 1,536 × 2N mutation count matrix, which enabled the discovery of mutational signatures unique to the clustered and non-clustered mutations. This analysis yielded ten mutational signatures, including two specific to clustered mutations (W3, W10). The profile of signature W3 was characterized by predominant C>T/G at RCY motifs (R = purine, Y = pyrimidine), resembling the known canonical AID signature, while signature W10 had mostly T>A/C/G at TW motifs (or, equivalently, A>T/G/C at WA motifs), corresponding to the known hotspots of non-canonical AID related to error-prone translesion DNA synthesis. In addition, the signature profile of W7 was most similar to the previously identified AID signature (COSMIC signature 9), but most mutations attributed to W7 were scattered along the genome rather than clustered. We calculated AID activity as the sum of attributions to the three signatures W3, W7 and W10.
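The NMD stratification can be sketched as follows, assuming mutations are given as (chromosome, position) pairs for one patient:

```python
# Compute each mutation's nearest-mutation distance (NMD) on its chromosome
# and split mutations into clustered (NMD <= 1 kb) and non-clustered sets.
from collections import defaultdict

def stratify_by_nmd(mutations, cutoff=1000):
    by_chrom = defaultdict(list)
    for chrom, pos in mutations:
        by_chrom[chrom].append(pos)
    clustered, nonclustered = [], []
    for chrom, positions in by_chrom.items():
        positions.sort()
        for i, pos in enumerate(positions):
            dists = []
            if i > 0:
                dists.append(pos - positions[i - 1])
            if i < len(positions) - 1:
                dists.append(positions[i + 1] - pos)
            # a lone mutation on a chromosome has no neighbor: infinite NMD
            nmd = min(dists) if dists else float("inf")
            (clustered if nmd <= cutoff else nonclustered).append((chrom, pos))
    return clustered, nonclustered
```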

Definition of genomic elements (Morten Muhlig Nielsen; Nasa Sinnott-Armstrong)
Elements pertaining to protein-coding genes (protein-coding promoters, 5'UTRs, protein-coding sequences, protein-coding splice sites and 3'UTRs) were defined based on GENCODE annotations (v.19) 7 . Element definitions with all CDS regions filtered out were used as the input to non-coding driver detection.

Candidate driver identification methods
A summary of approaches used by each method is listed in Supplementary Table 2.

ActiveDriverWGS (Juri Reimand)
Driver analysis with ActiveDriverWGS 16 was performed after discarding hypermutated samples (>90,000 mutations) from the PCAWG cancer cohort. To avoid leakage of signals from known cancer drivers, we removed missense mutations in analyses of non-coding regions.
ActiveDriverWGS is a local mutation enrichment method for genome-wide discovery of cancer driver mutations with increased mutation burden of single nucleotide variants (SNVs) and indels.
ActiveDriverWGS performs a model-based test of whether a given genomic element is significantly more mutated than the adjacent background genomic sequence (+/- 10 kb and introns).

MutSig

The MutSig suite 26 classifies whether genomic features, both coding and non-coding, are highly mutated relative to a predicted background mutation rate (BMR). The BMR varies on a macroscopic level across patients (patient-specific mutation rates can span orders of magnitude across pan-cancer cohorts) and genes (known covariates such as replication timing are strongly correlated with mutation rate), as well as on a microscopic level across sequence contexts (since mutational signatures are heterogeneous across a cohort and highly context-dependent). MutSig accounts for all three of these to compute the joint BMR distribution across genes/patients/contexts, and then convolves across the latter two dimensions to estimate the expected distribution of total background burden for a given gene across a whole cohort. Genes are then scored by how far their total non-background burden exceeds this null distribution. Furthermore, MutSig includes in the BMR calculation the number of bases in each feature that are sufficiently covered for mutation calling. This prevents underestimation of the BMR when a feature is only partially covered.
Because this property sensitized MutSig to the randomizations performed without accounting for coverage, we removed from the analysis shown in Extended Data Fig. 11 those CDS for which coverage was below 90% of the region.
MutSig estimates a gene's BMR by its synonymous mutation rate for coding genes, and by its mutation rate at non-conserved positions for non-coding genes. If the number of background mutations in a given gene is insufficient to provide a confident estimate of its BMR, MutSig will incorporate the background counts from other genes with similar covariate profiles into its estimator.
MutSig (MutSig2CV) was originally designed for coding regions only 26 . For this study, the algorithm was modified to run on non-coding regions, both for the analyses of non-coding regions and for the sensitivity and benchmarking analysis in Extended Data Fig. 3.
This version does not include significance evaluation based on functional impact or positional clustering, and is available from https://github.com/broadinstitute/getzlab-PCAWG-MutSig2CV_NC.
The protein-coding MutSig2CV package, which includes tests for enrichment of functional impact and positional clustering, was run on observed and simulated data for CDS.

NBR (Inigo Martincorena)
NBR (https://github.com/im3sanger/dndscv) is a method that tests for evidence of higher mutation density than expected by chance in a given region of the genome, while accounting for trinucleotide mutational signatures, sequence composition and the local density of mutations around each element. This method has been described in detail in a previous publication 27 , where it was used to identify candidate non-coding driver elements across 560 breast cancer whole genomes. To protect against neutral indel hotspots and indel artifacts, unique indel sites, rather than total indels per element, were counted. To protect against misannotation of a mutation cluster as a set of independent events, a maximum of two mutations per region per sample was considered in the analysis.
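The two safeguards can be sketched as a preprocessing step; the tuple layout of the mutation records is an assumption for illustration:

```python
# Collapse indels to unique sites per element, then cap mutation counts at
# two per element and sample, as described for NBR above.
from collections import Counter

def prepare_counts(muts):
    """muts: list of (sample, element, site, is_indel) tuples."""
    seen_indel_sites = set()
    kept = []
    for sample, element, site, is_indel in muts:
        if is_indel:
            if (element, site) in seen_indel_sites:
                continue                      # count each indel site once
            seen_indel_sites.add((element, site))
        kept.append((sample, element, site, is_indel))
    per_pair = Counter()
    capped = []
    for m in kept:
        key = (m[0], m[1])                    # (sample, element)
        if per_pair[key] < 2:                 # at most two mutations per pair
            per_pair[key] += 1
            capped.append(m)
    return capped
```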

ncDriver (Henrik Hornshøj)
The ncDriver method 30 (http://moma.ki.au.dk/ncDriver/) provides separate evaluations of the significance of two mutation properties: the level of conservation and the level of cancer type specificity. In the ncDriverConservation test, the conservation levels of mutated positions were evaluated locally for being surprisingly high, given the distribution of conservation within the element. The P value of the mean mutation phyloP conservation score for an element was obtained by Monte Carlo simulation of 100,000 mean phyloP scores, each based on the same number of mutations as observed. Each mutated element was also evaluated globally by looking up the rank of the element's mean phyloP conservation score among all elements annotated as the same type. This provided P values for both local and global mutation conservation levels, which were combined into a single conservation P value using Fisher's method 31 . In the ncDriverCancerType test, the distribution of observed mutation counts of an element across cancer types was evaluated for being surprising compared to expected counts estimated from a background null model (as described for the ncdDetect method) that accounts for cancer type specific mutation signatures and other covariates. A goodness-of-fit test with Monte Carlo simulation was used to determine whether the distribution of observed mutation counts across cancer types within the element is surprising given the expected mutation counts based on cancer types, mutation contexts and element type. For indels, the expected mutation counts were estimated solely from the mutation rates calculated from the mutation context, cancer type and element type. P values from SNV and indel runs were combined using Fisher's method 31 .

OncodriveFML (Loris Mularoni)
OncodriveFML 32 (https://bitbucket.org/bbglab/oncodrivefml/src/master/) is a method designed to estimate the accumulated functional-impact bias of tumor somatic mutations in genomic regions of interest, both coding and non-coding, based on a local simulation of the mutational process affecting each region. The rationale behind OncodriveFML is that somatic mutations observed in a genomic element across tumors, whose average impact score is significantly greater than expected for that element, constitute a signal that these mutations have undergone positive selection during tumorigenesis. This, in turn, is taken as a direct indication that the element drives tumorigenesis.
OncodriveFML first computes the average functional impact score of the observed mutations in the element of interest. The functional impact scores of mutations have been calculated using both CADD 33 (coding and non-coding regions) and VEST3 34 (only coding regions). Then, the method randomly samples the same number of observed mutations following the probability of mutation of different tri-nucleotides, computed from the mutations observed in each cohort. The randomization step is repeated many times (1,000,000 in these analyses), and each time an average functional impact score is calculated. Finally, OncodriveFML derives an empirical P value for each element by comparing the average functional impact score observed in the element to its local expected average functional impact score resulting from the random sampling. The empirical P values are then corrected for false discovery rate, and genomic elements that remain significant after the correction are considered candidate drivers.
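The randomization test can be sketched as follows. This is a simplified illustration rather than the OncodriveFML implementation; the score arrays and trinucleotide sampling probabilities are assumed inputs:

```python
# Compare the observed mean impact score of an element's mutations against
# means from random draws of the same number of positions, weighted by
# trinucleotide mutation probabilities derived from the cohort.
import numpy as np

def empirical_pvalue(obs_scores, element_scores, element_tri_probs,
                     n_rand=100000, seed=0):
    """obs_scores: impact scores of observed mutations in the element.
    element_scores: impact scores of all possible changes in the element.
    element_tri_probs: sampling probability of each change (sums to 1)."""
    rng = np.random.default_rng(seed)
    observed_mean = np.mean(obs_scores)
    draws = rng.choice(element_scores, size=(n_rand, len(obs_scores)),
                       p=element_tri_probs)
    null_means = draws.mean(axis=1)
    # pseudocount keeps the empirical P value strictly positive
    return (1 + np.sum(null_means >= observed_mean)) / (n_rand + 1)
```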
regDriver (Husen M. Umer)

regDriver (https://github.com/husensofteng/regDriver) assesses the significance of mutations affecting transcription factor motifs using tissue-specific functional annotations 35 . For each tumor cohort, functional annotations from the cell lines most similar to the respective tumor type are gathered. A functionality score is computed for each mutation based on its overlapping functional annotations. regDriver collects highly scored mutations in each of the defined elements and assesses each element's significance by comparing its cumulative score to a background score distribution obtained from the simulated sets. Therefore, only candidate regulatory mutations are considered when evaluating mutation enrichment per element.

Sanger simulations (Inigo Martincorena)
This simulation aimed to generate data sets of neutral somatic mutations that retain key sources of variation in mutation rates known to exist in cancer genomes, including mutational signatures, and variable mutation rates across the genome and also among individuals and cancer types.
To do so while minimizing the number of assumptions in the simulation, we used a simple local randomization approach. First, all coding mutations, as well as mutations in the TERT promoter, [...]

Combination of P values from multiple driver discovery methods

The classical approach for combining P values obtained from independent tests of a given null hypothesis was described by R. A. Fisher in 1948. He noted that, for a set of k P values, the sum X = -2 Σi ln(pi), where pi is the P value for the ith test, follows a chi-square distribution with 2k degrees of freedom 31 . Thus, to obtain a single combined P value for a set of independent tests, the new test statistic X is computed from the P values obtained from the tests and scored against a chi-square distribution with 2k degrees of freedom. Fisher's test is asymptotically optimal among all methods of combining independent tests 37 ; however, in cases where tests exhibit positive correlation among the ln(pi) values, the Fisher combined P value is generally too small (anti-conservative).
In this study, we combine P values from several driver detection methods, many of which share similar approaches and whose results are therefore not independent. Brown's method addresses this by modeling the Fisher statistic ψ = -2 Σi ln(pi) as a scaled chi-square variable, c·χ²_2f, with c and f chosen to match the first two moments of ψ: c = var(ψ)/(2E[ψ]) and f = (E[ψ])²/var(ψ), where E[ψ] = 2k under the null. Thus, when the pi are independent, var(ψ) = 4k, which gives f = k and c = 1, and the test statistic follows the chi-square distribution with 2k degrees of freedom described by Fisher. However, when the independence condition is relaxed, var(ψ) ≠ 4k, and the test statistic generally follows a different, scaled, chi-square distribution whose scaling parameter c and degrees of freedom 2f are determined by the covariances of the pi. The covariances can be computed via numerical integration over the joint distributions of all pi and pj pairs, but this requires knowledge of the joint distribution; and even in cases where the joint distribution is known, the integration may not be computationally feasible for large and complex data sets 38 .
In this study, following the example of Poole et al. 38 , we computed the empirical covariance of pi and pj using the samples wi and wj, where wi is the set of all reported P values for method i, and used the empirical covariance to approximate the Brown scaled chi-square distribution. The advantage of this approach is that the empirical covariance estimation is non-parametric: it does not assume an underlying joint distribution of pi and pj, and it is thus applicable to complex and interrelated biological data sets where the data are noisy and often non-Gaussian. Poole et al. showed that the empirical covariance estimation approach is accurate, robust and efficient for such data sets.
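A minimal sketch of this empirical Brown combination, assuming a k x n matrix of per-method P values is available for covariance estimation:

```python
# Fit a scaled chi-square c * chi2_{2f} to the Fisher statistic by matching
# its first two moments, with var(psi) taken from the empirical covariance
# of the -2 ln(p) values across elements.
import numpy as np
from scipy.stats import chi2

def brown_combine(method_pvals, elem_pvals, floor=1e-16):
    """method_pvals: k x n array of each method's P values across n elements
    (used only for covariance); elem_pvals: the k P values to combine."""
    method_pvals = np.clip(np.asarray(method_pvals), floor, 1.0)
    elem_pvals = np.clip(np.asarray(elem_pvals), floor, 1.0)
    k = len(elem_pvals)
    w = -2.0 * np.log(method_pvals)   # transformed P values per method
    var_psi = np.cov(w).sum()         # sum of all variances and covariances
    mean_psi = 2.0 * k                # E[psi] under the null
    c = var_psi / (2.0 * mean_psi)    # scale of the fitted chi-square
    f = mean_psi ** 2 / var_psi       # 2f degrees of freedom
    psi = -2.0 * np.log(elem_pvals).sum()
    return chi2.sf(psi / c, df=2.0 * f)
```

With independent uniform P values the fit recovers Fisher's test (c near 1, 2f near 2k); positive correlation inflates var(psi) and deflates the combined significance accordingly.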

Implementing and evaluating the combination method on simulated and observed data
To evaluate the efficacy of the empirical Brown's method of dependent P value combination, we generated three sets of simulated mutation data (see above) and ran the driver detection algorithms on each of the simulated data sets. We checked that the P value results from the various driver detection algorithms followed the expected null (uniform) distribution (Extended Data Fig. 11a). Then, for each simulated data set, we calculated the empirical covariance for each pair of driver algorithm results. We then used these covariance values over simulated data sets to compute the combined Brown P values on observed data: for each gene in the observed PCAWG somatic mutation data set, we computed the Brown test statistic from the set of P values reported by the various driver detection algorithms. The Brown test statistic was then evaluated against the appropriate chi-square distribution, whose scale and degree parameters were approximated by the covariance values calculated on the simulated data (see above).
We ran this procedure, as well as the Fisher method, for six representative cohorts (three of which are shown: ColoRect-AdenoCa, Lung-AdenoCa, Uterus-AdenoCa) and found that the Brown combined P values generally followed the null distribution as expected (Extended Data Fig. 11b). The Fisher combined P values were significantly inflated (Extended Data Fig. 11b), confirming that dependencies existed between the results reported by the various driver detection algorithms.
To make this process more straightforward and reduce computation time, we explored whether computing the covariance values on observed data instead of simulated data would yield similar results. In each of the six representative cohorts, we calculated the empirical covariances on the observed data only and then computed the integrated Brown P values on the observed data using the observed covariances. Significant genes identified using only observed covariances remained mostly unchanged from the significant genes identified using the simulated covariances (Extended Data Fig. 11d), and examination of the differences in the covariance values between the simulated estimations and the observed estimations revealed only minor differences in values (Extended Data Fig. 11c). The significant drivers presented in this study were identified using this final approach, i.e., by computing integrated Brown P values using covariances estimated on observed data only.
Combination of P values from observed data was performed for our 42 individual tumor-type and meta-cohorts and 13 target element types. Methods were selected for each given data set (see below), and raw P values smaller than 10^-16 were trimmed to that value before proceeding with the combination. Methods with missing data for a given element (i.e., ones that failed to report a P value for that element) were excluded from the calculation for that element; therefore, in some cases the integrated Brown P value was computed from P values reported by only a subset of all the driver detection algorithms contributing results for that data set.

Selecting methods to include in the combination of observed P values
In some cases, individual driver detection algorithms reported P values for a given data set that deviated strongly from the expected uniform null distribution; for these methods, the quantile-quantile (QQ) plots demonstrated considerable inflation. We removed results that reported an unusual number of significant hits by calculating, for each set of results, the number of significant elements found by each individual method using the Benjamini-Hochberg FDR 21 with Q < 0.1 as the significance threshold. Any single method that reported more than four times the median number of significant elements across individual methods was discarded from the combination. In a separate analysis, we found that removing methods that yielded fewer hits than the median (i.e., methods with deflated QQ plots) did not affect the number of significant genes identified through the combination of the reported P values (Extended Data Fig. 11d); hence, we did not remove such methods. Post-filtering of significant hits was performed to remove those with accumulation of mutations caused by sequencing problems or mutational processes. In particular, we required that: (i) at least three mutations are present in the element; (ii) mutations are present in at least three patients of the tested cohort; (iii) less than 50% of mutations are located in palindromic DNA sequence 27 ; (iv) more than 50% of mutations are located in mappable genomic regions (CRG alignability, DAC blacklisted regions and DUKE uniqueness 39 ); and (v) the candidate passes manual review of sequence evidence for novel drivers. For lymphoid tumors, which contain regions of somatic hypermutation caused by AID enzyme activity, we (vi) further required that less than 35% of mutations are contributed by this process (signatures W3, W7 and W10; Supplementary Figs. 1 and 3) 6 .
For each of the signature-dependent filters, we "rescued" elements that were found significant at the FDR < 0.1 level and passed all filters in at least one other cohort.
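Filters (i)-(iv) amount to a simple predicate over summary statistics of each candidate. A sketch with illustrative field names; the signature-based rule (vi) and manual review (v) are omitted:

```python
# Post-filtering predicate for significant hits, mirroring rules (i)-(iv).
def passes_filters(hit):
    return (
        hit["n_mutations"] >= 3            # (i) at least three mutations
        and hit["n_patients"] >= 3         # (ii) in at least three patients
        and hit["frac_palindromic"] < 0.5  # (iii) <50% in palindromic DNA
        and hit["frac_mappable"] > 0.5     # (iv) >50% in mappable regions
    )
```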

DNA palindromes
We define a palindrome as a DNA sequence followed by its reverse complement, with a spacer sequence of variable length in between. It is hypothesized that these palindromes can temporarily form DNA hairpins 43 .

Evaluation of method sensitivity and precision

To evaluate the sensitivity and precision of different methods, and particularly of our approach for P value combination, we compared their relative performance in detecting known protein-coding cancer genes (603 genes from the manually curated Cancer Gene Census v80 database 45 ). As the true set of cancer drivers to be discovered in any cohort is unknown, we defined the truth set as the union of significant CGC genes successfully identified by any single method or combination of methods. For the negative set, we used all genes not in the truth set, although this set almost certainly contains yet-undiscovered cancer genes. Then, for each combination of methods, we intersected the predicted significant genes with the positive and negative sets to calculate an F1 score for each combination of methods. We ranked individual methods based on their F1 scores in all protein-coding cohort runs, after removing cohorts with fewer than 5 true positives. In addition, we only included runs with stable F1 scores, i.e., with 95% confidence intervals < 0.5 (see below), leaving 33 eligible cohorts. We used the unpaired two-sample Wilcoxon rank-sum test to compare the F1 scores between combinations with different numbers of methods in the four largest cohorts, which have the highest statistical power to discover drivers (Adenocarcinoma, Carcinoma, Digestive tract tumors and Pancan-no-skin-melanoma-lymph).
Confidence intervals (95%) on F1 scores were empirically estimated using 1,000 simulations of precision and recall values drawn from beta distributions. Methods were ranked based on cohorts with at least five genes in the positive truth set (see above), and F1 scores with confidence interval size < 0.5.
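A sketch of this confidence-interval estimate; the Beta(x+1, n-x+1) posteriors for precision and recall are an assumed parameterization:

```python
# Draw precision and recall from beta posteriors and take percentiles of the
# resulting F1 values to obtain an empirical 95% confidence interval.
import numpy as np

def f1_confidence_interval(tp, fp, fn, n_sim=1000, seed=0):
    rng = np.random.default_rng(seed)
    precision = rng.beta(tp + 1, fp + 1, size=n_sim)
    recall = rng.beta(tp + 1, fn + 1, size=n_sim)
    f1 = 2 * precision * recall / (precision + recall)
    return np.percentile(f1, [2.5, 97.5])
```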

Genome-wide driver discovery (Federico Abascal, Iñigo Martincorena)
To ensure that no highly recurrent genomic regions outside the defined functional elements were missed by our focused analysis, we searched for an excess of mutations in two additional element types: 1) non-overlapping 2-kb bins spanning the entire genome; and 2) 4,351 ultraconserved regions 46 , of which only 36% overlapped with our functional element regions. We used the NBR negative binomial regression model described above, but without local mutation rate covariates, because of problems arising in bins close to unmappable regions (e.g., pericentromeric regions).

Normalization for copy number variation (Henrik Tobias Madsen, Morten Muhlig Nielsen, Jakob Skou Pedersen)
To account for the effects of somatic copy number alterations (CNAs) on expression, we used two different approaches to create two additional versions of the expression profiles. First, we used a conservative approach, wherein we removed all samples that did not have the normal bi-allelic copy number for the gene in question. Second, we used a less conservative approach, wherein we first built a regression model of expression based on copy number (CN) and then tested for an effect of somatic mutations on the residual (i.e., the expression that is not explained by copy number).
Generally, the higher the copy number of a particular gene, the higher its expression. The relationship between copy number and gene expression is not strictly linear, as various feedback mechanisms in the cell try to compensate for the mostly deleterious effects of CNAs.
This is known as dosage compensation and has been studied extensively in the context of mammalian sex chromosomes, but also in the evolution of yeast and in diseases caused by aneuploidy 51-54 . We therefore fit a linear regression model between the logarithm of expression and the logarithm of CN, which effectively amounts to a power-law regression model.
A number of factors make it difficult to learn the regression parameters for each gene and cancer type in isolation: (i) for some cancer types, we have only a limited number of samples; (ii) for some genes, there is not much variation in CN; and (iii) the variation in expression between samples is generally high. We overcome these problems by employing a mixed model strategy that allows sharing of information between genes, effectively regularizing the parameter estimates for gene/cancer-type combinations that carry little information on their own. Let y_{g,c,s} and x_{g,c,s} denote the log expression and log CN measurements, respectively, for gene g in cancer type c and sample s. We then define:

y_{g,c,s} = (β0 + b0,g + b0,g,c) + (β1 + b1,g + b1,g,c) · x_{g,c,s} + ε_{g,c,s},

where β0 and β1 are fixed effects, whereas b0,g, b1,g, b0,g,c and b1,g,c are random effects, with b·,g ~ N(0, σ²_gene) and b·,g,c ~ N(0, σ²_gene/type). Finally, the residual is ε_{g,c,s} ~ N(0, σ²).
Using this model, we infer a global CN-to-expression regression slope, β1, but allow some regularized gene-specific and gene/cancer-type-specific variation: b1,g and b1,g,c.
Thus, we exploit the similarity across genes and similarity within genes across cancer types.
Since the variance increases with the absolute value of the explanatory variable associated with a random slope, this kind of mixed model displays heteroskedasticity. Furthermore, the model is not invariant under scaling of the explanatory variable, in this case CN. We therefore center log(CN) such that normal diploid regions have the least variance.
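The core power-law fit can be sketched as an ordinary least-squares regression on log-transformed values; the full approach adds the gene- and gene/cancer-type-level random effects described above:

```python
# Fit log expression against centered log copy number for one gene and
# cancer type, and return the residual expression not explained by CN.
import numpy as np

def cn_expression_residuals(expr, cn):
    """expr, cn: positive arrays over samples."""
    x = np.log2(cn) - np.log2(2)       # center: normal diploid (CN = 2) maps to 0
    y = np.log2(expr)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    return slope, residuals
```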

Mutation-to-expression association (Morten Muhlig Nielsen, Henrik Tobias Madsen, Jakob Skou Pedersen)
The fraction of patients with accompanying RNA-seq data varies between cohorts, and thus the power to detect whether mutations are associated with expression also varies. Overall, 1,190 patients (46%) have mRNA expression measurements. We report raw P values for mutation-to-expression associations throughout the manuscript, since some of the tests have significant sample overlap for meta-cohorts. The false discovery rate was controlled with the Benjamini-Hochberg procedure 21 , and Q values, together with RNA-seq sample counts for each of the tests presented in the manuscript, can be found in Supplementary Table 10.

Copy number analyses (Esther Rheinbay)
We surveyed significant focal copy number alterations for candidate driver genes as orthogonal evidence for their "driverness". Significant copy number alterations were obtained from the TCGA Copy Number Portal (http://portals.broadinstitute.org/tcga/home), analysis "2015-06-01stddata-2015_04_02 regular peel-off", a database of recurrent copy number alterations calculated by the GISTIC2 algorithm 55 across > 10,000 samples and 33 tumor types from TCGA. GISTIC2 results were included for candidate drivers if a gene was significant (residual Q < 0.1) and was located within a peak with ≤ 10 genes. Visualization was performed with the Integrative Genomics Viewer (IGV) 56 .

Estimation of the total number of TERT promoter hotspot mutations

Detection sensitivity (d.s.) for all patients was calculated for the two most recurrent TERT promoter hotspot sites.

Calculation of the minimum powered mutation frequency in a population.
Power calculations for discovering driver elements mutated at a certain frequency in the population were conducted as described before 26,59 , but solving for the lowest frequency of a driver element in the patient population that is powered (≥ 90%) for discovery. The calculation of this lowest frequency takes into account (i) the average background mutation frequency for each cohort/element combination; (ii) the median length and average detection sensitivity for each element type, and the patient cohort size; and (iii) a global desired false positive rate of 10%. The effect of element length is discussed in Supplementary Note 10.
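A simplified sketch of this calculation under an illustrative binomial model: first find the smallest mutation count that controls the false positive rate under the background model, then the smallest population frequency whose signal reaches that count with 90% probability. The background probability, significance threshold and detection-sensitivity handling here are all simplifying assumptions, not the published model:

```python
# Solve for the lowest driver frequency powered (>= 90%) for discovery.
from scipy.stats import binom

def min_powered_frequency(n_patients, bg_prob, alpha=0.1 / 20000,
                          power=0.9, ds=1.0):
    """bg_prob: probability a patient carries a background mutation
    in the element; ds: detection sensitivity."""
    # smallest count m with P(X >= m | background) <= alpha
    m = binom.isf(alpha, n_patients, bg_prob) + 1
    # smallest frequency f with P(X >= m | f) >= power
    for i in range(1, 10000):
        f = i / 10000
        p = 1 - (1 - ds * f) * (1 - bg_prob)  # driver or background hit
        if binom.sf(m - 1, n_patients, p) >= power:
            return f
    return None
```

As expected, larger cohorts are powered down to lower driver frequencies.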

Association between mutation and loss of heterozygosity
When a tumor carries a driver mutation in one allele of a given gene, a second hit on the other allele may confer a growth advantage and be positively selected. When one of the events involves the loss of one of the alleles, the process is referred to as loss of heterozygosity (LOH). Such biallelic losses are typical of, but not exclusive to, tumor suppressor genes (TSGs).
For each gene, we build a 2x2 contingency table indicating the number of cases in which the gene was mutated or not and the number of cases in which the gene was subject to LOH or not.
We applied Fisher's exact test to identify which genes showed an excess of LOH associated with mutation. P values were corrected with the Benjamini-Hochberg FDR method to account for multiple hypothesis testing. This analysis was applied to each cohort separately and proved very successful in identifying TSGs as well as some oncogenes (OGs). In the cancer allelic fraction (CAF) calculation below, Lp corresponds to the local ploidy of the mutated locus and Pt denotes the tumor purity.
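The per-gene test can be sketched with SciPy's Fisher's exact test; a one-sided alternative is assumed here since the question is an excess of LOH among mutated cases:

```python
# 2x2 table of mutated vs. non-mutated against LOH vs. no-LOH, tested with
# Fisher's exact test for an excess of LOH associated with mutation.
from scipy.stats import fisher_exact

def loh_association(n_mut_loh, n_mut_noloh, n_wt_loh, n_wt_noloh):
    table = [[n_mut_loh, n_mut_noloh],
             [n_wt_loh, n_wt_noloh]]
    odds_ratio, p = fisher_exact(table, alternative="greater")
    return odds_ratio, p
```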

Association between mutation and cancer allelic fractions
Ploidy and tumor purity predictions were obtained from Gerstung et al 57 .
To determine whether CAFs for a given gene or element were higher than expected, we compared them to the CAFs observed in flanking regions. To define flanking regions, we took 2 kb on each side of the gene/element, excluding any overlapping coding exons but including introns (if present). The two sets of CAFs associated with each gene/element, i.e., those lying within the gene/element and those flanking it, were compared with a t-test to detect significant deviations. P values were corrected with the FDR method. This approach identified most known TSGs and OGs.
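The flank construction above (2 kb on each side, with overlapping coding exons clipped out) can be sketched with simple interval arithmetic; coordinates are half-open [start, end) on a single chromosome, and the function name is ours.

```python
def flanking_regions(element, coding_exons, flank=2000):
    """Two 2 kb flanks around an element, with any overlapping coding exons
    clipped out, as in the flank definition above. Illustrative sketch."""
    start, end = element
    raw = [(max(0, start - flank), start), (end, end + flank)]
    flanks = []
    for fs, fe in raw:
        pieces = [(fs, fe)]
        for es, ee in sorted(coding_exons):
            clipped = []
            for ps, pe in pieces:
                if ee <= ps or es >= pe:      # exon does not overlap this piece
                    clipped.append((ps, pe))
                else:                          # cut the exon out of the piece
                    if ps < es:
                        clipped.append((ps, es))
                    if ee < pe:
                        clipped.append((ee, pe))
            pieces = clipped
        flanks.extend(pieces)
    return flanks
```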

Estimation of the number of driver mutations in non-coding regions of known cancer genes (Federico Abascal, Iñigo Martincorena)
We conducted a series of analyses on regions combined across genes to determine whether the paucity of driver mutations found in non-coding regions was related to lack of statistical power in single-gene analyses. For protein-coding sequences, the number of driver mutations was estimated using dN/dS ratios as described in (Ref 22). For non-coding, regulatory regions of protein-coding genes (promoters and UTRs), we relied on a modified version of the NBR negative binomial regression model described above to quantify the overall excess of driver mutations. We applied a second approach to determine whether there was an enrichment of LOH associated with mutations in the different types of non-coding regions of protein-coding genes.

Observed vs. expected numbers of mutations based on the NBR mutation model
NBR was used to estimate the background mutation rate expected across cancer genes, using a conservative list of 19,082 putative passenger genes as background. The resulting model is described in Supplementary Note 11. We also note that this test can underestimate the number of non-coding drivers, since some driver mutations may be present in the list of putative passenger genes, although this effect is expected to be quantitatively small if the density of driver mutations in regulatory regions of known cancer genes is higher than in those of putative passenger genes.

Mutation-LOH association for aggregates of genes
For this analysis, we combined data across known cancer genes, including 603 genes in the CGC and 154 additional significantly mutated genes found by exome studies 22,26. To estimate whether there was an excess of LOH associated with mutation in regulatory and coding regions of cancer genes, we calculated the fold change in LOH for the aggregate of cancer genes and normalized it by dividing by the fold change observed in passenger genes. Confidence intervals were estimated using parametric bootstrapping (100,000 pseudoreplicates) for both cancer and passenger genes. The r_min scores were then transformed as 1 − ((r_min + 1)/2) to range between 0 and 1, where 1 indicates a high structural impact score.

Mutational process and indel enrichment
Further, we followed the steps of oncodriveFML (see above) with 1,000,000 randomizations, using per-sample mutational signatures (i.e., the probability of observing a mutation in a particular trinucleotide context in a given sample) to compute P values at the cohort and sample level. Two different overlapping deletion calls in the RMRP gene body were observed in the same thyroid cancer patient. After manual inspection of the tumor and normal BAM files, these calls were found to represent the same mutational event, and only one was included in the above analysis.

Cancer associated germline variant distance to non-coding driver candidates (… Muhlig Nielsen)
We used a set of genome-wide significant cancer-associated germline SNPs (n = 650) from the NHGRI-EBI GWAS catalog 67 as collected by Sud et al. 68. We evaluated the genomic distance from candidate non-coding drivers to the closest germline variant. All distances were above 50 kb, with the exception of the TERT promoter, which was 1 kb away from a coding variant (rs2736098) in the TERT gene.

Assessing the significance of somatic rearrangement breakpoints
To model the background rate of somatic breakpoints, we first established a discrete coordinate system on which to evaluate genomic covariates and breakpoint counts. We binned the genome into 50 kb bins, with 1 kb of overlap between bins to reduce edge effects, which produced 61,920 loci. Complex events with many tightly clustered breakpoints could dominate the breakpoint count at a single bin and cause an overestimation of the prevalence of breakpoints at those loci. To account for this, we considered only one breakpoint per sample per locus. After removing locus-sample duplicates, 336,496 breakpoints (55% of all breakpoints) were counted within our model (Supplementary Table 17).
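The coordinate system and per-sample deduplication described above can be sketched as follows; a single chromosome stands in for the whole genome, and the function names are ours.

```python
def make_bins(chrom_length, bin_size=50_000, overlap=1_000):
    """50 kb bins overlapping by 1 kb along one chromosome (a sketch of the
    genome-wide coordinate system; the real analysis spans all chromosomes)."""
    step = bin_size - overlap
    return [(s, min(s + bin_size, chrom_length))
            for s in range(0, chrom_length, step)]

def count_breakpoints(bins, breakpoints):
    """Breakpoint count per bin, allowing at most one breakpoint per sample
    per bin; `breakpoints` is a list of (sample_id, position) pairs."""
    return [len({sid for sid, pos in breakpoints if s <= pos < e})
            for s, e in bins]
```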
The detected rate of breakpoints across the genome is also confounded by the mapping quality within a locus. Rearrangements in regions that are difficult to align to (e.g., alpha-satellite repeats) were rejected by our variant callers, leading to a relative depletion of events in regions of low mappability. To control for this effect, we used the concept of "eligible territory" from Imielinski et al. 69.

Genomic covariates that predict breakpoint frequencies
We hypothesized that local sequence features (e.g., density of repetitive elements), replication timing, chromatin state, epigenetic modifications, and other genomic features could be predictive of breakpoint rates within our GP model. We therefore fit our GP model using both "interval" covariates that indicate genomic regions (e.g., SINE elements) and "numeric" tracks that associate values (e.g., GC content) with genomic regions. The complete list of genomic covariates and their coefficients is given in Supplementary Table 13.
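The two covariate types can be summarized per bin as sketched below: interval tracks become coverage fractions and numeric tracks become bin averages. This is an illustrative stand-in for the covariate preparation, not the model-fitting code itself.

```python
def interval_covariate(bin_, intervals):
    """Fraction of a bin covered by an interval track (e.g., SINE elements)."""
    s, e = bin_
    covered = sum(max(0, min(e, ie) - max(s, is_)) for is_, ie in intervals)
    return covered / (e - s)

def numeric_covariate(bin_, track):
    """Mean of a numeric track (e.g., GC content) over the bin; `track` is a
    list of (position, value) samples along the chromosome."""
    s, e = bin_
    vals = [v for pos, v in track if s <= pos < e]
    return sum(vals) / len(vals) if vals else 0.0
```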
Assessing the significance of loci with high breakpoint rates
We used the full GP model to estimate the background rate for each locus and to calculate the probability that ci or more events would be observed at locus i. Because the count data ci are restricted to non-negative integers, these probabilities slightly overestimate the true values. To correct for this, we used the procedure employed in Imielinski et al. 69, selecting a random probability from a uniform distribution between the probability of observing ci or more breakpoints and the probability of observing ci + 1 or more breakpoints. To correct for multiple hypothesis testing, we calculated the false discovery rate (FDR) using the Benjamini-Hochberg method 21 and defined significant loci as those with an FDR < 10%. We created a final list of significant loci by joining significant loci and their intervening regions if they were separated by fewer than 1 Mbp. Analysis of the P values and the quantile-quantile plot (Supp. Fig. 8c) shows a uniform distribution without apparent biases.
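The tie-breaking correction for discrete counts can be sketched as below; a Poisson background with rate `lam` stands in for the locus-specific rates from the fitted GP model, which is an assumption of this sketch.

```python
import math
import random

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))

def randomized_pvalue(count, lam, rng=random):
    """Draw the P value uniformly between P(X >= c) and P(X >= c + 1): the
    randomization that removes the conservativeness of discrete tail
    probabilities described above."""
    hi = poisson_sf(count, lam)
    lo = poisson_sf(count + 1, lam)
    return rng.uniform(lo, hi)
```

Under the null, the randomized P values are uniformly distributed, which is what the quantile-quantile plot referenced above checks.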
We next attempted to determine which breakpoints at each significantly recurrent locus were themselves likely driver rearrangements. We noted that the breakpoint counts at many loci were dominated by rearrangements from a small subset of tumor types, suggesting that the rearrangements in these tumor types were drivers. Rearrangements from other tumor types, however, would often also be seen at the background rates expected for those tumor types. We therefore calculated an enrichment P value (binomial test) that tumor type T was enriched at that locus:

p = Σ_{i=k}^{n} C(n, i) r_T^i (1 − r_T)^(n−i)

where k is the number of breakpoints from tumor type T intersecting the locus, n is the total number of breakpoints intersecting the locus, and rT is the fraction of breakpoints from tumor type T within the entire PCAWG cohort. Using this enrichment score, we considered as driver rearrangements only rearrangements from the most enriched tumor type and from any tumor type x with log(px/ptop) < 3.
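The enrichment test and the tumor-type selection rule can be sketched directly from the definitions above; the dictionary-based interface is our own simplification.

```python
import math

def binom_sf(k, n, p):
    """Upper-tail binomial probability P(X >= k): the enrichment P value
    above (note the n - i exponent in each term)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def driver_tumor_types(locus_counts, cohort_fracs):
    """Retain the most enriched tumor type plus any type x whose enrichment
    P value satisfies log(p_x / p_top) < 3. `locus_counts`: type -> breakpoint
    count at the locus; `cohort_fracs`: type -> cohort-wide fraction r_T."""
    n = sum(locus_counts.values())
    pvals = {t: binom_sf(k, n, cohort_fracs[t]) for t, k in locus_counts.items()}
    p_top = min(pvals.values())
    return [t for t, p in pvals.items() if math.log(p / p_top) < 3]
```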

Comparison of recurrent breakpoint loci with significantly recurrent SCNAs and known fusions
We compared the significantly recurrent breakpoint loci with sites of significantly recurrent SCNAs obtained from GISTIC2 55.

To construct the two background models (one of which is the double-break join model), we divided the genome into bins containing a target of 100 rearrangements per bin. To avoid cases in which a cluster of rearrangements is divided into two bins, we imposed a minimal distance between breakpoints of 2 kb; if a bin boundary falls between two breakpoints not meeting this condition, the bin is extended until the condition is met. The normalized distribution of the number of breakpoints is the parameter ri used to construct the two models. After binning the genome, we constructed the rearrangement matrix kij by assigning each rearrangement in our dataset to a tile. Each sample was allowed to contribute at most one rearrangement per tile.
The overall background rate of events is therefore represented as a linear combination of the two models, taken over a set of weighting parameters. We chose the distance between breakpoints (span) as a natural basis for the weighting parameters in this two-dimensional genomic representation, dividing the 2D space into short (≤1 Mbp) and long (>1 Mbp) intrachromosomal rearrangements and inter-chromosomal translocations, and obtained the weight values by minimizing the Bayesian information criterion (BIC). A list of recurrent rearrangements for the long subset was then generated by calculating a P value in each tile with a binomial test against kij, followed by control for multiple hypotheses using the Benjamini-Hochberg FDR procedure at a threshold of 0.1 and a minimum count of two rearrangements per tile.
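The adaptive tiling described above (a target count per bin, extended until a 2 kb gap separates consecutive breakpoints) can be sketched as follows; the function name and small parameters are illustrative.

```python
def adaptive_bins(positions, target=100, min_gap=2000):
    """Partition sorted breakpoint positions into bins of about `target`
    events, extending a bin while the boundary would fall between breakpoints
    closer than `min_gap`, so that tight clusters are not split across bins."""
    positions = sorted(positions)
    bins, start = [], 0
    i = target
    while start < len(positions):
        while i < len(positions) and positions[i] - positions[i - 1] < min_gap:
            i += 1                      # extend until the gap condition holds
        bins.append(positions[start:i])
        start, i = i, i + target
    return bins
```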

Annotation of potential functional effects of rearrangements (Jeremiah Wala, Ofer Shapira, Nikos Sidiropoulos, Joachim Weischenfeldt, Rameen Beroukhim)
We annotated the potential functional effects of each rearrangement based on the locations and orientations of its breakpoints. Gene definitions for genome build hg19 were obtained from the UCSC Table Browser 72. Rearrangements were evaluated for whether they could produce a possible in-frame sense fusion transcript. The CCDS database 73 for hg19 was obtained from the UCSC Table Browser. Using the CCDS intervals, breakpoints contained within a gene were annotated by the intron or exon they overlapped and by the coding frame (1, 2, or 3) of the first exon opposite the direction of the breakpoint. Candidate fusions were called as in-frame and sense if 1) the relative orientations of the breakpoints and the directionality of the genes could result in a sense fusion and 2) the two breakpoints were in the same coding frame.
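A minimal sketch of the frame annotation: the coding frame at an intronic breakpoint follows from the cumulative coding length upstream. This assumes a plus-strand gene and ignores orientation bookkeeping, which the full annotation handles; the two-argument fusion check mirrors the two conditions stated above.

```python
def frame_at_breakpoint(exons, bp):
    """Coding frame (1, 2, or 3) of the first exon downstream of an intronic
    breakpoint, from the cumulative coding length upstream. Exons are
    (start, end) in transcription order for a plus-strand gene."""
    coding = 0
    for s, e in exons:
        if s > bp:
            return coding % 3 + 1   # frame of the next exon
        coding += e - s
    return None                     # breakpoint lies past the last exon

def is_inframe_sense_fusion(frames, sense_compatible):
    """A candidate fusion is in-frame and sense if the breakpoint orientations
    and gene directionality permit a sense transcript and both breakpoints
    share the same coding frame."""
    fa, fb = frames
    return sense_compatible and fa is not None and fa == fb
```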
We used a classification scheme described in our companion paper 60 to segregate rearrangements according to likely shared mechanisms of formation. We considered five major classes of rearrangements: isolated deletions, inversions, tandem duplications, interchromosomal translocations, and complex rearrangements.

Breakpoint association with gene expression in cis
To identify SRBs associated with nearby gene-expression changes, we applied CESAM, which integrates rearrangement-derived breakpoints with RNA-seq data (FPKM-UQ) to identify expression changes associated with breakpoints in cis, as previously described 74. In brief, normalized RNA-seq expression is regressed on a rearrangement breakpoint matrix, using tissue type, total number of rearrangements, and the first principal components of the breakpoint matrix as covariates. Expression data were dosage-adjusted prior to the analysis by normalizing the expression level of each gene (FPKM-UQ) to the copy-number level of the gene in each tumor sample, to remove copy-number dosage effects, i.e., effects not attributable to cis-acting rearrangements. Only breakpoint bins with at least three tumors having associated RNA-seq data were evaluated. To assess whether SRB-CESAM hits were associated with juxtaposition of normally distant enhancer elements, the distal breakpoint of a rearrangement (relative to the breakpoint closest to the SRB centroid) was intersected with tissue-matched enhancer regions 15 using a window of +/- 20 kb. Significance was assessed by random shuffling of breakpoint positions on the mappable genome (alpha < 0.1).
The pan-cancer association between copy-number-adjusted gene-expression changes of CESAM hits and breakpoints was assessed by averaging the copy-number-adjusted gene-expression fold changes within each histology type, to alleviate cell-type-specific biases in gene expression.

Rearrangement types and effect on expression, enhancer-distance, and TSGs
For each cluster of rearrangements, the genomic centroid of the breakpoints was used to identify the most deregulated gene within a window of +/- 1 Mb of the centroid. Fold-change expression was calculated as the ratio between the medians of gene expression for tumor samples with (SV+) versus without (SV-) a breakpoint at the cluster. A randomized background set was calculated for each cluster by randomly sampling (n = 1,000) a breakpoint from the complete set of rearrangements and computing the fold change as above with the same sets of SV+ and SV- samples.
This was done to remove sample-specific biases in gene expression levels.
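The fold-change and background computation can be sketched as follows; representing each randomly sampled breakpoint by its SV+ sample set is our simplification of the resampling step described above.

```python
import random
import statistics

def fold_change(expr, sv_samples):
    """Ratio of median expression in SV+ versus SV- samples; `expr` maps
    sample id -> expression value (e.g., copy-number-adjusted FPKM-UQ)."""
    pos = [v for s, v in expr.items() if s in sv_samples]
    neg = [v for s, v in expr.items() if s not in sv_samples]
    return statistics.median(pos) / statistics.median(neg)

def background_fold_changes(expr, candidate_sv_sets, n=1000, rng=random):
    """Background distribution: repeatedly sample a random breakpoint (here
    represented by its SV+ sample set) and recompute the fold change."""
    return [fold_change(expr, rng.choice(candidate_sv_sets)) for _ in range(n)]
```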
The distance to the nearest tissue-specific enhancer was computed as described above. Briefly, when available, we matched the tissue of origin of the tumor to the cell type of the enhancer track; where no match was available, we compared the breakpoints with the complete enhancer set across cell types.

Biallelic inactivation events
To identify tumor suppressor two-hit events, we defined biallelic inactivation as a gene locus GA/B, where alleles A and B are genetically altered, leading to a genetic Gmut/mut state. The biallelic inactivation assessment includes three genetic inactivation event types: somatic or germline deletions ("Loss"), somatic or germline SVs ("Break"), and somatic or germline SNVs ("Mutation"). Given a heterozygous GA/B locus, we required a loss of the A allele of the gene, leading to a hemizygous G-/B state, and genetic inactivation of the remaining B allele, specifically requiring the second event to overlap the loss on the A allele, leading to biallelic inactivation. We considered four classes of biallelic inactivation: i) Loss/Mutation, nonsynonymous driver mutation of the B allele; ii) Loss/Loss, two deletion events that overlap an exon, with a copy-number-derived allele count of 0 for both the A and B alleles; iii) Loss/Break, SVs where one or both breakpoints are situated in an exon of the B allele; and iv) Mutation/Mutation, a nonsynonymous germline SNV and a nonsynonymous driver somatic SNV in the same gene. We inferred the germline mutation to occur on the A allele and the somatic mutation on the B allele, under the assumption that two independent driver mutation events are highly unlikely to occur on the same allele (https://bitbucket.org/weischenfeldt/biallelic_inactivation). Only curated tumor-suppressor genes were assessed. Enrichment of biallelic inactivation for each rearrangement cluster type was assessed by comparing the frequencies to a permuted set (Fisher's exact test, n = 1,000), showing enrichment of biallelic inactivation at deletion-type (P < 0.005), neutral-type (P < 0.001) and fragile-type (P < 0.001) rearrangement clusters, and depletion at amplification-type (P < 0.001) and fusion-type (P < 0.001) clusters.
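The four-class assignment can be sketched as a simple rule cascade over the event types on each allele; this simplified version ignores the requirement that the second hit overlap the loss on the A allele, which the full pipeline enforces.

```python
def classify_biallelic(allele_a_events, allele_b_events):
    """Classify a two-hit inactivation from the inactivating event types on
    each allele ('loss', 'break', 'mutation'), following the four classes
    defined above. Returns the class name or None."""
    a, b = set(allele_a_events), set(allele_b_events)
    if "loss" in a and "mutation" in b:
        return "Loss/Mutation"
    if "loss" in a and "loss" in b:
        return "Loss/Loss"
    if "loss" in a and "break" in b:
        return "Loss/Break"
    if "mutation" in a and "mutation" in b:
        return "Mutation/Mutation"
    return None
```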

SV portal for visualisation of SV recurrence
SVscape is an interactive R Shiny server built for browsing structural-variant and breakpoint distributions along user-defined genes or genomic loci. 1D breakpoint-enrichment P values are estimated by Fisher's exact test, and asterisks denote significant 2D rearrangements involving the given locus. SVscape is available as a public instance at www.svscape.org and can be downloaded and deployed locally from www.bitbucket.org/weischenfeldt/svscape.git.

Power calculations for rearrangements (Ofer Shapira, Kiran Kumar)
To analyze the number of tumour-normal pairs needed to reach saturation in the detection of fusions, we employed a binomial power model 26. We first performed the analysis as a function of the distance between breakpoints, using the median number of rearrangements per sample across the entire cohort (Extended Data Fig. 10a). The second analysis was performed as a function of the median number of rearrangements per sample, spanning the values represented by the ICGC histologies with more than 15 samples (Fig. 4b).
For each total number of tumour-normal pairs, N, the general procedure involved: 1) finding the minimal number of patients needed to reach a significance level of P < 0.1/(# of tiles) based on Hnull; 2) using this value, calculating the minimal rate above background, r, that yields 90% power under the alternative distribution, Halt ~ Binomial(N, pnull + r); and 3) calculating contour lines of constant rate above background.
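Steps 1 and 2 of this procedure can be sketched directly; the parameter values used in the example below (pairs, background probability, tile count) are illustrative, not the cohort's.

```python
import math

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def min_rate_above_background(n_pairs, p_null, n_tiles, power=0.9):
    """Step 1: the smallest patient count reaching P < 0.1 / n_tiles under
    Hnull ~ Binomial(N, p_null). Step 2: the minimal rate r above background
    for which Halt ~ Binomial(N, p_null + r) yields the desired power."""
    alpha = 0.1 / n_tiles
    k_min = next(k for k in range(n_pairs + 1)
                 if binom_sf(k, n_pairs, p_null) <= alpha)
    r = 1e-4
    while p_null + r <= 1.0:            # scan rates on a fine grid
        if binom_sf(k_min, n_pairs, p_null + r) >= power:
            return r
        r += 1e-4
    return None
```

Sweeping this over N and over per-sample rearrangement rates traces the constant-r contour lines described in step 3.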