Date: 2022-06-20

Sequencing data curation

PCAWG dataset

We obtained somatic SNVs and indels from whole-genome sequencing of 2,583 unique tumors that previously passed quality control5 from the International Cancer Genome Consortium (ICGC) data portal (https://dcc.icgc.org/) and the database of Genotypes and Phenotypes (dbGaP) (project code: phs000178). The somatic mutation calls in this dataset have previously been stringently filtered to remove possible germline calls, false-positive calls due to oxidative DNA damage and calls with high strand bias12. Following procedures described in Rheinbay et al.5, we grouped samples into 38 individual cancer types and 14 meta-cohorts that combined similar tumor types, including a pan-cancer cohort that included all samples except melanoma and lymphoma tumors (consistent with Rheinbay et al.5). We removed samples with reported high microsatellite instability from all cohorts except the pan-cancer cohort and annotated autosomal coding SNVs and indels with their predicted functional impact using a custom annotation method. (We excluded sex chromosomes because the number of observed mutations on the X chromosome depends on the sex composition of a cohort.) For the creation of somatic mutation maps and driver element analysis, we considered cohorts with at least 20 samples and >\(10^5\) SNVs (Supplementary Table 1). This resulted in a set of 23 individual cancer types and 14 meta-cohorts.

Dietlein et al. dataset

We obtained somatic SNVs and indels from whole-exome sequencing of 11,873 tumors from 28 cancer types that had previously been curated in Dietlein et al.10 from http://www.cancer-genes.org/; the dataset previously underwent filtering to remove germline calls and false-positive calls due to oxidative DNA damage, as described in Dietlein et al.10. We restricted the analysis to a set of 8,617 tumor samples from 17 cancer types for which we had mutation rate models trained on the PCAWG dataset (Supplementary Table 28). We additionally constructed a pan-cancer dataset by merging somatic mutations from all samples excluding melanoma and hematopoietic malignancies, as in PCAWG5. Coding mutations were annotated for their predicted functional impact as above.

Targeted sequencing datasets

We obtained somatic SNVs from targeted sequencing of ten types of solid cancers performed using the IMPACT protocol at Memorial Sloan Kettering Cancer Center from cBioPortal53 (https://www.cbioportal.org/) (Supplementary Table 19). Possible germline calls were previously excluded from these datasets. We removed duplicate patients and hypermutated samples with >100 coding mutations in 221 genes common to all whole-exome and targeted sequenced samples (removal of hypermutated samples is common in driver gene detection and has been shown to improve accuracy4). Coding SNVs were then annotated for their predicted functional impact in coding sequence as above and merged with SNVs from the whole-exome datasets (after removing hypermutated samples) of the corresponding cancer type to form mega-cohorts with an aggregate sample size of 14,018 tumors in ten cancer types.

Additional filtering of germline mutations

Any mutation occurring in an element with a nominally significant burden of mutations (FDR < 0.1) was cross-referenced with the Genome Aggregation Database (gnomAD) version 2.1.1 (ref. 73) and excluded if it occurred in gnomAD with an allele count of five or more in any population, unless the mutation occurred primarily in a single population and the carrier was not of that population (this occurred only once; the mutation 1:43804317-C>T was observed in a carrier of European ancestry but is reported in gnomAD as occurring in Latino/admixed American populations). If the mutational burden of the element did not remain significant at FDR < 0.1 after exclusion of these possible germline mutations, the element was removed from further analysis. This filter was applied to all datasets.
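
The exclusion rule above can be sketched as a small predicate. This is a minimal illustration, not the pipeline used in the study: the function name, the dictionary layout of per-population allele counts and the `primary_fraction` cut-off used to decide that a mutation occurs "primarily in a single population" are all hypothetical assumptions, because the text does not state how that judgment was made.

```python
from typing import Dict, Optional


def is_possible_germline(
    pop_allele_counts: Dict[str, int],
    carrier_population: Optional[str],
    min_allele_count: int = 5,
    primary_fraction: float = 0.9,  # hypothetical cut-off for "primarily one population"
) -> bool:
    """Return True if a mutation should be excluded as a possible germline variant."""
    # Rule 1: the variant must reach the allele-count threshold in at least one population.
    if not any(count >= min_allele_count for count in pop_allele_counts.values()):
        return False

    # Rule 2 (exception): keep the mutation if it is concentrated in one population
    # and the tumor carrier is known not to belong to that population.
    total = sum(pop_allele_counts.values())
    top_population, top_count = max(pop_allele_counts.items(), key=lambda kv: kv[1])
    concentrated = top_count / total >= primary_fraction
    if concentrated and carrier_population is not None and carrier_population != top_population:
        return False

    return True
```

For instance, `is_possible_germline({"amr": 12, "nfe": 0}, "nfe")` returns `False`, which mirrors the single exception reported above (the population labels are illustrative).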

Identification of mutational excess with probabilistic deep learning

Dig consists of two components: (1) a deep learning module that models approximately constant somatic mutation rates within kilobase-scale regions (for example, 10–50 kb) due to epigenetic features (for example, chromatin compactness) that vary at this scale5; and (2) a generative probabilistic model that captures the likelihood that a given position is mutated in a cancer cohort, conditioned on its sequence context10,29,30,34 and the kilobase-scale mutation rate of that cancer type. Intuitively, the kilobase-scale model provides information about how many neutral mutations should be present in a region, whereas the nucleotide context model determines how those mutations should be distributed among individual positions.

Modeling kilobase-scale mutation rates with deep learning

Model architecture

The purpose of the deep learning model is to (1) predict the mutation rate \(\mu _R\) and (2) quantify prediction uncertainty \(\sigma _R^2\) conditioned on the epigenetic organization of the region R. The architecture was previously described31. In brief, the network consists of a convolutional neural network (CNN) that takes as input a high-dimensional matrix of epigenetic assays (see ‘Model input and output’ section) and projects the matrix into a 16-dimensional vector. Optionally, the CNN also embeds into the 16-dimensional vector the mutation counts observed in the 100-kb regions flanking the region of interest. The low-dimensional embedding is then provided as input to a Gaussian process (GP) that predicts the mean and variance of the number of mutations in the region. Technical details are provided in Supplementary Methods.

Model input and output

The CNN and GP were trained sequentially to predict somatic SNV counts in non-overlapping 10-kb regions by minimizing mean squared error loss between predicted values and observed counts from the PCAWG dataset for each of 37 cancer types. The network received as input matrices of size 735 × 100, where each row was an epigenetic feature track and each column was the average track value in non-overlapping 100-bp windows. In total, 723 rows were uniformly processed \(-\log _{10}P\) values for peaks of chromatin markers from 111 tissues; ten rows were replication timings of ten cell lines from ENCODE33; and two were the average nucleotide content and average GC content of the human reference genome (Supplementary Table 3). The network additionally received as input somatic SNV counts in 100-kb regions flanking each 10-kb region of interest from the relevant cancer in the PCAWG dataset. However, the accuracy of the method over 1-Mb regions was benchmarked using networks trained without flanking region counts to avoid any leakage of information between train and test sets.
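
To make the input dimensions concrete, the sketch below shows how per-base track values over one 10-kb region could be reduced to the 735 × 100 matrix described above (723 + 10 + 2 = 735 rows; 10,000 bp / 100 bp = 100 columns). The function names and the list-of-lists representation are illustrative assumptions; the actual preprocessing is described in Supplementary Methods.

```python
from typing import List, Sequence

N_CHROMATIN, N_REPLICATION, N_SEQUENCE = 723, 10, 2  # row counts stated in the text
REGION_BP, WINDOW_BP = 10_000, 100                   # 10-kb region, 100-bp windows


def window_means(track: Sequence[float], window: int = WINDOW_BP) -> List[float]:
    """Average a per-base track over consecutive non-overlapping windows."""
    if len(track) % window != 0:
        raise ValueError("track length must be a multiple of the window size")
    return [sum(track[i:i + window]) / window for i in range(0, len(track), window)]


def build_input_matrix(tracks: Sequence[Sequence[float]]) -> List[List[float]]:
    """Stack one row of window averages per feature track (rows x columns)."""
    expected_rows = N_CHROMATIN + N_REPLICATION + N_SEQUENCE
    if len(tracks) != expected_rows:
        raise ValueError(f"expected {expected_rows} tracks, got {len(tracks)}")
    return [window_means(track) for track in tracks]
```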

Model training

For each cancer, predictions in each non-overlapping 10-kb region R of the autosomes were obtained via the following five-fold cross-validation strategy. Bins that passed quality control (Supplementary Methods) were randomly divided into five equal-size folds, each containing 20% of the bins. Sequentially, each fold was withheld, and a deep learning model was trained using 80% of the remaining bins and validated over the other 20% of the remaining bins to avoid overfitting (Supplementary Methods). Prediction was then performed over the held-out fold (20% of the genome) and over regions filtered by quality checks. Additional technical details of model training are described in Supplementary Methods.
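
The splitting scheme above can be sketched as follows. This is only an illustration of the fold bookkeeping under assumed names (`nested_cv_splits`, `val_fraction`, `seed`); it does not train a model and does not handle the additional prediction over bins that failed quality control.

```python
import random
from typing import Iterable, Iterator, List, Tuple


def nested_cv_splits(
    bin_ids: Iterable[int],
    n_folds: int = 5,
    val_fraction: float = 0.2,
    seed: int = 0,
) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (train, validation, held_out) bin lists, one triple per fold."""
    rng = random.Random(seed)
    ids = list(bin_ids)
    rng.shuffle(ids)
    # Deal the shuffled bins into n_folds folds of (near-)equal size.
    folds = [ids[k::n_folds] for k in range(n_folds)]
    for k, held_out in enumerate(folds):
        remaining = [b for j, fold in enumerate(folds) if j != k for b in fold]
        rng.shuffle(remaining)
        # Of the remaining bins, 20% are used for validation and 80% for training.
        n_val = int(round(val_fraction * len(remaining)))
        yield remaining[n_val:], remaining[:n_val], held_out
```

With 100 bins, each of the five splits contains 64 training bins, 16 validation bins and 20 held-out bins, and every bin is held out exactly once.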

Testing mutational burden with a graphical model

Genome-wide likelihood of mutation from sequence context

For each cancer, maximum likelihood estimation was used to estimate the genome-wide probability of a mutation in each of 192 possible trinucleotide contexts using SNV counts from the PCAWG dataset. The statistical procedure is described in Supplementary Methods.
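
As a rough illustration, the 192 contexts correspond to 64 reference trinucleotides times three alternate alleles at the central base. The estimator below is a simplifying assumption (observed SNV count divided by the number of genomic sites carrying that trinucleotide, which is the maximum likelihood estimate under an independent Bernoulli model per site); the procedure actually used is the one in Supplementary Methods, and the function and argument names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

BASES = "ACGT"
Context = Tuple[str, str]  # (reference trinucleotide, alternate central base)


def trinucleotide_snv_contexts() -> List[Context]:
    """Enumerate all (trinucleotide, alternate allele) pairs: 4 * 4 * 4 * 3 = 192."""
    return [
        (left + ref + right, alt)
        for left, ref, right in product(BASES, repeat=3)
        for alt in BASES
        if alt != ref
    ]


def estimate_context_probabilities(
    snv_counts: Dict[Context, int],
    site_counts: Dict[str, int],
) -> Dict[Context, float]:
    """Per-site mutation probability for each context: observed SNVs / available sites."""
    probabilities = {}
    for trinucleotide, alt in trinucleotide_snv_contexts():
        sites = site_counts.get(trinucleotide, 0)
        observed = snv_counts.get((trinucleotide, alt), 0)
        probabilities[(trinucleotide, alt)] = observed / sites if sites else 0.0
    return probabilities
```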

Modeling mutation counts over an arbitrary set of positions

We conceptualized that mutations arise in a region R with an unknown rate whose possible values are drawn from a distribution defined by the mean and variance predicted by the deep learning network. As mutations arise, they are distributed to individual positions based on the probability that each position in R is mutated based on its sequence context. Let \(M_{i,aX \to Yb}\) be the number of SNVs of the form \(aX \to Yb\) at position i in region R in some cancer cohort of interest. Then, under a probabilistic graphical model described in Supplementary Methods, the marginal distribution over a set of possible SNVs, I, in a region is31:

$$\mathop {\sum}\limits_I {M_{i,aX \to Yb}} \sim {\mathrm{NegativeBinomial}}\left( {\alpha _R,\frac{1}{{1 + C_{\mathrm{SNV}} \cdot \theta _R \cdot \mathop {\sum}\nolimits_I {p_{R,aX \to Yb}} }}} \right),$$

where \(\alpha _R = \mu _R^2/\sigma _R^2\) and \(\theta _R = \sigma _R^2/\mu _R\) (recall that \(\mu _R\) and \(\sigma _R^2\) are the mean and variance of the mutation rate in region R estimated by the deep learning model); \(p_{R,aX \to Yb}\) is the genome-wide probability of a mutation of the form \(aX \to Yb\), normalized such that the probability of all possible mutations in R sums to 1; and \(C_{\mathrm{SNV}}\) is a constant scaling factor that accounts for the difference in sample size between the cohort of interest and the training cohort.
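
The following sketch evaluates this distribution numerically. It assumes the negative binomial is parameterized by a shape \(\alpha _R\) and a probability \(1/(1 + C_{\mathrm{SNV}} \cdot \theta _R \cdot \sum _I p)\), with probability mass \(\Gamma (k + \alpha )/(\Gamma (\alpha )\,k!) \cdot \mathrm{prob}^\alpha (1 - \mathrm{prob})^k\), so that the expected count is \(C_{\mathrm{SNV}} \cdot \mu _R \cdot \sum _I p\). The function names are illustrative and are not taken from the Dig software.

```python
import math
from typing import Tuple


def nb_params(mu_r: float, sigma2_r: float, c_snv: float, p_sum: float) -> Tuple[float, float]:
    """Map (mu_R, sigma_R^2, C_SNV, sum of p over I) to (alpha_R, success probability)."""
    alpha = mu_r ** 2 / sigma2_r
    theta = sigma2_r / mu_r
    prob = 1.0 / (1.0 + c_snv * theta * p_sum)
    return alpha, prob


def nb_log_pmf(k: int, alpha: float, prob: float) -> float:
    """Log-probability of observing exactly k mutations."""
    return (
        math.lgamma(k + alpha) - math.lgamma(alpha) - math.lgamma(k + 1)
        + alpha * math.log(prob) + k * math.log1p(-prob)
    )


def nb_upper_tail(k_obs: int, alpha: float, prob: float) -> float:
    """P(count >= k_obs), computed as 1 - P(count < k_obs).

    This loses precision for very small tail probabilities; a production
    implementation would use a dedicated survival function.
    """
    lower = sum(math.exp(nb_log_pmf(k, alpha, prob)) for k in range(k_obs))
    return max(0.0, 1.0 - lower)
```

For example, with \(\mu _R = 4\), \(\sigma _R^2 = 2\), \(C_{\mathrm{SNV}} = 1\) and \(\sum _I p = 0.5\), `nb_params` gives \(\alpha _R = 8\) and a probability of 0.8, corresponding to an expected count of 2 mutations.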

All parameters in the distribution except \(C_{\mathrm{SNV}}\) are already estimated from the training cohort. By default, \(C_{\mathrm{SNV}}\) is calculated as the ratio of the number of observed synonymous SNVs in the target dataset to the number of expected synonymous SNVs in the training cohort across all genes excluding TP53 (in which some synonymous mutations are under positive selection4). Thus, once the model has been trained on the training cohort, calculating the distribution over any set of mutations in a target cohort of interest is essentially reduced to a constant-time look-up of parameters. More details on the graphical model, including its extension to indels, multi-allelic variants and sets of variants that span multiple regions, are described in Supplementary Methods.
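
The default scaling factor amounts to a single ratio, sketched below. The per-gene dictionaries and the function name are hypothetical; only the rule itself (observed over expected synonymous SNVs, summed over all genes except TP53) comes from the text.

```python
from typing import Dict, FrozenSet


def snv_scaling_factor(
    observed_synonymous: Dict[str, int],
    expected_synonymous: Dict[str, float],
    excluded_genes: FrozenSet[str] = frozenset({"TP53"}),
) -> float:
    """Ratio of observed (target cohort) to expected (training cohort) synonymous SNVs."""
    genes = [gene for gene in expected_synonymous if gene not in excluded_genes]
    observed = sum(observed_synonymous.get(gene, 0) for gene in genes)
    expected = sum(expected_synonymous[gene] for gene in genes)
    if expected <= 0:
        raise ValueError("expected synonymous count must be positive")
    return observed / expected
```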

Comparison to existing driver detection methods

We compared Dig’s performance to that of six existing methods (NBR34, dNdScv4, MutSigCV21, Larva18, DriverPower19 and ActiveDriverWGS20) over two benchmarks: accuracy of the background mutation rate models and accuracy of driver detection. The six comparison methods were chosen because they are state-of-the-art methods that (1) identify putative driver candidates by searching for mutational excess and (2) are designed to model diverse regions of the genome: tiled regions (NBR), coding sequence (dNdScv and MutSigCV) and non-coding elements such as enhancers (Larva, ActiveDriverWGS and DriverPower). All methods were run with default parameters.

Comparing background mutation rate models

We compared the variance explained of observed SNV counts between models. Variance explained is the proportion to which a mathematical model accounts for variation in a dataset, which we calculated as the square of the Pearson correlation coefficient between predicted and observed SNV counts, as in previous work16. To ensure sufficient benchmarking power, we restricted comparisons to 16 cancer types in the PCAWG dataset with >1 million mutations because the variance-explained statistic becomes deflated when observed counts are low in a discrete system (Supplementary Methods). Comparisons were performed over non-overlapping 10-kb regions of the genome (Dig versus NBR), non-synonymous SNVs in coding sequences (Dig versus dNdScv versus MutSigCV) and the non-coding elements enhancers and long and short non-coding RNAs (Dig versus Larva versus DriverPower) (ActiveDriverWGS was not included because it does not output its internal estimates of mutation counts). We chose enhancers and non-coding RNAs because they are non-coding elements that all three methods could analyze and are sufficiently far from coding sequence that synonymous mutations cannot be used in general to estimate the neutral mutation rate. To control for confounding from element length (longer elements have more mutations on average than shorter elements), we restricted the analysis to genes 1–1.5 kb in length (n = 3,740) and non-coding elements 0.5–1 kb in length (n = 7,412). Additional details of region selection are described in Supplementary Methods.
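
The variance-explained statistic defined above, the squared Pearson correlation between predicted and observed counts, can be computed as in this minimal sketch (the function name is illustrative):

```python
from typing import Sequence


def variance_explained(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Squared Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    covariance = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    ss_o = sum((o - mean_o) ** 2 for o in observed)
    if ss_p == 0 or ss_o == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance ** 2 / (ss_p * ss_o)
```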

Comparing driver element identification accuracy

Coding models

We compared the sensitivity, specificity and F1-score (harmonic mean of sensitivity and specificity) for driver gene detection from coding sequence mutations among Dig, MutSigCV and dNdScv across the 32 PCAWG cohorts (melanomas and hematopoietic cancers were excluded as in previous comparisons19). We additionally compared power over the 16 whole-exome sequenced cohorts from Dietlein et al.10 (excluding hematopoietic cancers as above). Details of both comparisons are provided in Supplementary Methods.
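
The three metrics can be sketched from confusion-matrix counts as below. The summary score follows the definition given in the text (harmonic mean of sensitivity and specificity); note that the F1-score is more commonly defined from precision and recall. How true and false positives were determined is described in Supplementary Methods and is not modeled here; the function names are illustrative.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of true drivers that were detected."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-drivers that were correctly not called."""
    return true_neg / (true_neg + false_pos)


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative rates (0 if both are 0)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)
```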

Non-coding models

We compared the sensitivity, specificity and F1-score for driver non-coding element identification from non-coding SNVs among Dig, DriverPower, Larva and ActiveDriverWGS20 across the 32 PCAWG cohorts (excluding melanoma and hematopoietic cancers as above). We chose to compare to these three methods because they are recently introduced methods for non-coding driver element identification that rely on neutral mutation models to test for selection. Details are provided in Supplementary Methods.

Power analysis

We conservatively simulated the power of Dig to detect driver SNVs at different carrier frequencies across enhancers and non-coding cryptic splice sites under the pan-cancer mutation map using a Monte Carlo approach described in Supplementary Methods.

Quantifying selection on cryptic splice SNVs

Curation of predicted splice SNVs

From SpliceAI40, we obtained a list of every possible SNV in the body of 17,816 autosomal genes with predicted impact on splicing (that is, SpliceAI Δ score) >0.2. Predicted splice-altering SNVs were separated into canonical splice SNVs (altering positions 1 bp or 2 bp 5′ or 3′ to an exon boundary) and cryptic splice SNVs (all other SNVs, excluding sites 5 bp 3′ to an exon boundary that had been included in the definition of ‘essential splice sites’ considered by Martincorena et al.4; these were excluded to ensure that any enrichment we observed was independent of enrichment reported in that work). SNV positions were assigned based on the GENCODE V24 list of basic transcripts. Cryptic splice SNVs were further divided into coding SNVs (defined as synonymous SNVs common to each transcript of a gene) and intronic SNVs (defined as SNVs not falling within any coding sequence of any transcript).

Enrichment of coding mutations and splice SNVs in PCAWG

Dig was applied with default settings to the following sets of mutations from the PCAWG cohort in each of 17,815 genes for which we had predicted splice SNVs: synonymous SNVs, missense SNVs, nonsense (stop-gained) SNVs, coding indels, canonical splice SNVs and cryptic splice SNVs. Mutation enrichment was defined as the ratio of the observed mutations to expected mutations (this statistic is conceptually similar to the selection coefficient reported for coding mutations by dNdScv). P values for a gene set and mutation type were exactly calculated by convolving the mutation-type-specific negative binomial distributions for each gene in the gene set and summing the upper-tail probability that at least the number of observed mutations occurred by chance. We used a Monte Carlo simulation approach to estimate the 95% CIs of enrichment within a set of genes and given mutation type (Supplementary Methods). To further assess mutational enrichment, we directly compared the rate of mutations in TSGs and oncogenes to the rate in genes not in the CGC (Supplementary Methods). The excess of SNVs in TSGs in the CGC stratified by function (missense, nonsense, canonical splice and non-coding canonical splice) was calculated as the difference between the number of mutations observed and the number expected. The relative contribution for each functional category was defined as the excess for that category normalized by the sum of the excess across all categories. The 95% CI for the contribution of each category was calculated using a Monte Carlo approach (Supplementary Methods).
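
The gene-set test described above can be sketched as a direct convolution of per-gene negative binomial probability mass functions. The sketch assumes each gene is summarized by a (shape, probability) pair in the same parameterization as the marginal distribution given earlier; the names `gene_set_p_value` and `max_count` are illustrative. Truncating each distribution at `max_count` leaves the result exact as long as the observed total does not exceed `max_count + 1`, because the probability of a total s depends only on per-gene counts no larger than s.

```python
import math
from typing import List, Sequence, Tuple


def nb_pmf(k: int, alpha: float, prob: float) -> float:
    """Negative binomial probability of exactly k mutations."""
    return math.exp(
        math.lgamma(k + alpha) - math.lgamma(alpha) - math.lgamma(k + 1)
        + alpha * math.log(prob) + k * math.log1p(-prob)
    )


def convolve(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Distribution of the sum of two independent counts."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def gene_set_p_value(
    observed_total: int,
    gene_params: Sequence[Tuple[float, float]],
    max_count: int = 200,
) -> float:
    """P(total mutations across the gene set >= observed_total) under the null."""
    if observed_total > max_count + 1:
        raise ValueError("increase max_count to cover the observed total")
    total = [1.0]  # distribution of an empty sum: all mass at zero
    for alpha, prob in gene_params:
        pmf = [nb_pmf(k, alpha, prob) for k in range(max_count + 1)]
        total = convolve(total, pmf)[: max_count + 1]
    return max(0.0, 1.0 - sum(total[:observed_total]))


def enrichment(observed_total: int, gene_params: Sequence[Tuple[float, float]]) -> float:
    """Observed mutations divided by the summed expected counts of the gene set."""
    expected = sum(alpha * (1.0 - prob) / prob for alpha, prob in gene_params)
    return observed_total / expected
```

A quick consistency check is that two genes sharing the same probability parameter convolve to a single negative binomial whose shape is the sum of the two shapes.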

Genes enriched for non-canonical cryptic splice SNVs

In each of the 37 PCAWG cohorts, we identified genes with a significant burden of non-canonical cryptic splice SNVs as quantified by Dig. We considered two sets of genes: (1) all TSGs in the CGC (n = 283) and (2) all autosomal genes with predicted splice SNVs (n = 17,815). The significance threshold was defined per cancer as FDR q < 0.1 corrected for the number of tests (n = 283 or n = 17,815). We excluded genes where multiple SNVs contributing to the burden were observed in a single sample. We used a bootstrap method to determine whether predicted cryptic splice SNVs observed in TSGs with a significant burden were enriched for high predicted impact on splicing (Supplementary Methods).
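
The text specifies an FDR threshold of q < 0.1 but does not name the correction procedure here. The sketch below assumes the widely used Benjamini–Hochberg procedure, applied once per cancer type to the P values of all tested genes; that choice, and the function name, are assumptions.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg q-values in the same order as the input P values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


def significant_genes(p_by_gene: dict, q_threshold: float = 0.1) -> List[str]:
    """Genes whose q-value falls below the threshold (0.1 as in the text)."""
    genes = list(p_by_gene)
    q_values = benjamini_hochberg([p_by_gene[g] for g in genes])
    return [g for g, q in zip(genes, q_values) if q < q_threshold]
```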

Analysis of alternative splicing events in RNA-seq data

We obtained RNA-seq data for eight samples carrying deep intronic predicted cryptic splice SNVs (that is, distance to nearest exon boundary >20 bp) in TSGs with a significant burden of predicted non-coding cryptic splice SNVs and 41 control samples without a cryptic splice SNV. For each carrier–control pair of the same cancer type, we performed differential splicing analysis using LeafCutter as described by Li et al.41. Further details of the analysis are provided in Supplementary Methods.

Quantifying mutational excess in promoters and 5′ UTRs

Discovery of elements with a burden of mutations

Dig with default parameters was used to evaluate the PCAWG cohort (excluding hypermutated samples with >3,000 coding mutations) for mutational excess within two sets of regions: (1) indel excess within promoters previously defined by the PCAWG consortium5 (n = 19,251) and (2) SNV and indel excess within 5′ UTRs of TSGs (n = 106) and oncogenes (n = 95) in the CGC that spanned multiple exons of the canonical transcripts of genes (as defined by the UCSC genome browser for GRCh37); we additionally included the splice regions of the 5′ UTRs in our analysis, defined as the 20 bp bordering the start or end of an exon. The significance threshold was defined per cancer as FDR q < 0.1 corrected for the number of tests (n = 19,251 or n = 201).

ELF3 5′ UTR mutations in the Hartwig Medical Foundation cohort

We downloaded somatic mutations observed in the Hartwig Medical Foundation metastasis cohort50 from their online data portal (https://database.hartwigmedicalfoundation.nl/), excluding skin and hematopoietic tumors. Because we could only download mutations specific to a gene, we did not quantify burden with Dig. Rather, we directly compared the rate of SNVs in the 5′ UTR, first intron and 1-kb upstream region of ELF3 to the rate of synonymous mutations in ELF3 using a two-sided Fisher’s exact test.
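
A two-sided Fisher's exact test on a 2 × 2 table can be sketched with the hypergeometric distribution as follows. How the table was constructed from the ELF3 region and synonymous counts is not stated here, so any counts passed to this function are purely illustrative. The two-sided P value uses the common convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def table_probability(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = table_probability(a)
    tolerance = 1.0 + 1e-9  # guards against floating-point ties
    p_value = sum(
        table_probability(x)
        for x in range(lowest, highest + 1)
        if table_probability(x) <= p_observed * tolerance
    )
    return min(1.0, p_value)
```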

Analysis of expression levels

We obtained gene expression levels (FPKM) and gene-level copy number estimates from the PCAWG data portal for all tumors for which RNA sequencing was performed. For a gene of interest, we applied a fixed-effects linear regression model to residualize the expression values for gene-level copy number per sample and the interaction between gene-level copy number and the cancer project that originally generated the RNA-seq data. We then normalized the residual expression values to have mean zero and unit variance across all samples and compared the normalized values between mutation carriers and non-carriers using a two-sided Mann–Whitney U-test.

Driver gene prediction in whole-exome and targeted sequenced samples

Mutational excess in ‘long-tail’ driver genes

For each of the ten cancer types for which we compiled SNVs from whole-exome and targeted sequenced cohorts, we assembled a list of known driver genes identified in any of three recent pan-cancer driver gene discovery efforts7,10,11 (we required genes be discovered with FDR < 0.1, the significance threshold common across the driver element detection literature) that were also common to all whole-exome and targeted sequenced samples (n = 69 oncogenes and n = 56 TSGs). For a given cancer, we considered ‘long-tail’ genes to be driver genes that were not on the list of known driver genes for the given cancer (that is, they were driver genes associated with other cancers). Dig was then used to quantify mutational excess in those long-tail genes. Because synonymous mutations were not available from the targeted sequenced samples, we instead used missense mutations with CADD phred score <15 to estimate the scaling factor that adapted the somatic mutation maps trained on the PCAWG cohort to the meta-cohorts (details in Supplementary Methods). We directly estimated the P value of the mutational burden of long-tail genes by convolving the neutral mutation distributions for each individual gene and calculating the upper-tail probability of at least the number of observed mutations across all genes occurring by chance under the null distribution. We calculated 95% CIs of excess mutations using the same Monte Carlo approach as in our analysis of cryptic splice SNVs. Excess rate per sample was calculated as the number of excess SNVs divided by the number of samples in the cohort for a given cancer type.

Identification of putative driver genes

We used Dig to identify individual genes with an excess of mutations in two cases: (1) in our meta-cohorts, testing 69 oncogenes for an excess of activating SNVs and 56 TSGs for an excess of pLoF SNVs (these were the set of known driver genes common to all whole-exome and targeted sequenced cohorts); and (2) in the exome-sequenced cohorts alone, testing 19,210 autosomal genes for an excess of pLoF SNVs. In each case, significance was defined as FDR q < 0.1 for the number of genes tested.

Box plot elements

All box plots have the following elements: center line, median; box limits, upper and lower quartiles; and whiskers, 1.5× interquartile range. Where shown, points depict all data points used to construct the box plot.
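
For reference, these elements can be computed as in the sketch below. Two details are assumptions because the text does not specify them: the quartile interpolation method (the `inclusive` method of the Python standard library is used here) and the convention that each whisker ends at the most extreme data point within 1.5× the interquartile range of the box.

```python
import statistics
from typing import Dict, Sequence


def box_plot_elements(values: Sequence[float]) -> Dict[str, float]:
    """Median, quartiles and whisker ends for a box plot of the given values."""
    lower_q, median, upper_q = statistics.quantiles(values, n=4, method="inclusive")
    iqr = upper_q - lower_q
    lower_fence = lower_q - 1.5 * iqr
    upper_fence = upper_q + 1.5 * iqr
    return {
        "median": median,
        "lower_quartile": lower_q,
        "upper_quartile": upper_q,
        # Whiskers stop at the furthest observations still inside the fences.
        "whisker_low": min(v for v in values if v >= lower_fence),
        "whisker_high": max(v for v in values if v <= upper_fence),
    }
```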

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.