Model-based analysis of positive selection significantly expands the list of cancer driver genes, including RNA methyltransferases

Identifying driver genes is a central problem in cancer biology, and many methods have been developed to identify driver genes from somatic mutation data. However, existing methods either lack explicit statistical models, or rely on very simple models that do not capture complex features in somatic mutations of driver genes. Here, we present driverMAPS (Model-based Analysis of Positive Selection), a more comprehensive model-based approach to driver gene identification. This new method explicitly models, at the single-base level, the effects of positive selection in cancer driver genes as well as highly heterogeneous background mutational process. Its selection model captures elevated mutation rates in functionally important sites using multiple external annotations, as well as spatial clustering of mutations. Its background mutation model accounts for both known covariates and unexplained local variation. Simulations under realistic evolutionary models demonstrate that driverMAPS greatly improves the power of driver gene detection over state-of-the-art approaches. Applying driverMAPS to TCGA data across 20 tumor types identified 159 new potential driver genes. Cross-referencing this list with data from external sources strongly supports these findings. The novel genes include the mRNA methytransferases METTL3-METTL14, and we experimentally validated METTL3 as a potential tumor suppressor gene in bladder cancer. Our results thus provide strong support to the emerging hypothesis that mRNA modification is an important biological process underlying tumorigenesis.


Introduction
capture potential spatial clustering of somatic mutations into "hotspots". Our approach also 69 introduces other innovative features: a detailed model of the background mutation processes, which 70 accounts for known genomic features and variation across genes not captured by these features; and 71 the use of a Bayesian hierarchical model to combine information across cancer types and hence 72 improve parameter estimates. 73 Both simulations and application on TCGA data demonstrate the power of our approach. 74 The explicit statistical models of driver and non-driver genes allow us to perform realistic 75 simulations to assess methods, which was largely impossible in the past. We found that not all 76 existing methods properly control the False Discovery Rate (FDR) for driver gene discovery, and 77 among those with reasonable FDR control, driverMAPS has significantly higher power than 78 existing ones. We applied driverMAPS to TCGA exome sequencing data from 20 cancer types. The 79 results suggest that driverMAPS is better able to detect previously known driver genes than existing 80 methods, without excessive false positives. In addition, driverMAPS identified 159 new potential 81 driver genes not identified by other methods. Both literature survey and extensive computational 82 validation suggest that many of these genes are likely to be true driver genes. The novel potential 83 driver genes included both METTL3 and METTL14, which together form a key enzyme for RNA 84 methylation. We experimentally validated the functional relevance of somatic mutations in 85 METTL3, providing further support for both the effectiveness of our method, and for the potential 86 importance of RNA methylation in cancer. We believe that our methods and results will facilitate 87 the future discovery and validation of many more driver genes from cancer sequencing data. 88

driverMAPS: a probabilistic model of somatic mutation selection patterns 90
Our approach is outlined in Figure 1. In brief, we model aggregated exonic somatic 91 mutation counts from many tumor samples (e.g. as obtained from a normal-tumor paired 92 sequencing cohort). Let Y g denote the mutation count data in gene g. We develop models for Y g 93 under three different hypotheses: that the gene is a "non-driver gene" (H 0 ), an "oncogene" (H OG ) or 94 a "tumor suppressor gene" (H TSG ). Each model has two parts, a background mutation model 95 (BMM), which models the background mutation process, and a selection mutation model (SMM), 96 which models how selection acts on functional mutations. The rate of observed mutation at a 97 position is the product of the background mutation rate (from BMM) and a coefficient reflecting the 98 effect of position-specific selection (from SMM). We note that the coefficient can be related to the 99 selection coefficient of the mutation and effective population size under a simplified population 100 genetic model 12 . If the coefficient is greater than 1, it indicates positive selection and if it is less 101 than 1, negative selection. The BMM parameters are shared by all three hypotheses, reflecting the 102 assumption that background mutation processes are the same for cancer driver and non-driver 103 genes. In contrast the SMM parameters are hypothesis-specific, to capture the different selection 104 pressures in oncogenes vs tumor suppressor genes vs non-driver genes. We fit the hypothesis-105 specific parameters using training sets of known oncogenes 1 (H OG ), known TSGs 1 (H TSG ), and all 106 other genes (H 0 ). (This last set will contain some --as yet unidentified --driver genes, which will 107 tend to make our methods conservative in terms of identifying new driver genes.) To combine 108 information across tumor types we first estimate parameters separately in each tumor type, and then 109 stabilize these estimates using Empirical Bayes shrinkage 20 . 110 Having fit these models, we use them to identify genes whose mutation data are most 111 consistent with the driver genes models (H OG and H TSG ). Specifically, for each gene g, we measure 112 the overall evidence for g to be a driver gene by the Bayes Factor (likelihood ratio), BF g , defined as: 113 BF g := 0.5 [Pr(Y g | H OG ) + Pr(Y g | H TSG )] / Pr(Y g | H 0 ). 114 Large values of BF g indicate strong evidence for g being a driver gene, and at any given threshold We used a total of 734,754 somatic mutations from 20 tumor types in the TCGA project as 120 our input data 21 . We focused on single nucleotide somatic variations and extensively filtered input 121 mutation lists to ensure data quality (see Methods). Figure S1 summarizes mutation counts and 122 cohort sizes. 123 The first step of our method is to estimate parameters of the Background Mutation Model 124 (BMM) using data on synonymous mutations. These parameters capture how mutation rates depend 125 on various "background features" (Table S1), which include mutation type (C>T, A>G, etc), CpG 126 dinucleotide context, expression level, replication timing and chromatin conformation (HiC 127 sequencing) 4 . The signs and values of estimated parameters were generally similar across tumor 128 types, and consistent with previous evidence for each feature's effect on somatic mutation rate. For 129 example, the estimated effect of the feature "expression level" was negative for almost all tumors, 130 consistent with transcriptional coupled repair mechanisms effectively reducing mutation rate 131 ( Figure S2). 132 Our BMM also estimates gene-specific effects, using synonymous mutations of a gene, to 133 allow for local variation in somatic mutation rate not captured by measured features. Intuitively, the 134 gene-specific effect adjusts a gene's estimated mutation rate downward if the gene has fewer 135 synonymous mutations than expected based on its known features, and upwards if it has more 136 synonymous mutations than expected. A challenge here is that the small number of mutations per 137 gene (particularly in small genes) could make these estimates inaccurate. Here we address this using 138 Empirical Bayes methods to improve accuracy, and avoid outlying estimates at short genes that 139 have few potential synonymous mutations ( Figure 2a). Effectively, this adjusts a gene's rate only 140 when the gene provides sufficient information to do so reliably (sufficiently many potential 141 synonymous mutations). To demonstrate the reliability of the resulting estimates we use a 142 procedure similar to cross-validation: we estimated each gene's gene-specific effect using its 143 synonymous mutations, and then test the accuracy of the estimate (compared to no gene-specific 144 effect) in predicting the number of nonsynonymous mutations. We assume that for the vast majority 145 of genes, their mutational counts are dominated by background mutation processes, rather than 146 selection. Figure 2b shows results for SKCM tumors: without gene-specific effect the correlation of 147 observed and expected number of nonsynonymous mutations across genes was 0.56; with gene-148 specific adjustment the correlation increased to 0.88. Similar improvements were seen for other 149 tumors ( Figure S3). 150 The next step is to estimate parameters of the Selection Mutation Models (SMM), using data 151 on non-synonymous mutations. These parameters capture how the rate of non-synonymous somatic 152 mutations depend on various "functional features" (Table S2- suggesting that somatic mutations are enriched in both types of cancer driver genes. 159 The final step is to estimate parameters of the spatial model (HMM, Figure 1), which are 160 designed to capture how somatic mutations may cluster together in "hotspots" in driver genes. 161 Preliminary investigations showed that spatial clustering is generally stronger in known OGs than 162 in known TSGs, and so we fit the spatial model separately for OGs and TSGs in each tumor type 163 (Table S5). Our model identified some tumor types (e.g. BLCA and LUSC, Figure 2d) with strong 164 spatial clustering. In BLCA, the estimated hotspots are very short (mean 1.4bp) and are primarily 165 capturing an excess in recurrent mutations (independent mutations at the same base) compared with 166 expectations (Figure 2d). In LUSC, the clustering extends over slightly longer regions (mean 167 5.6bp), but still the primary signal is an excess of recurrent mutations (Figure 2d). 168 169

Simulations demonstrate that driverMAPS improves detection of driver genes 170
While many methods have been developed for driver gene identification, it is difficult to 171 compare them on real data where the true status of each gene is often unknown. Simulations are 172 extremely valuable in such situations, and have been used in many fields, including population 173 genetics 22 , statistical genetics 23 and single-cell transcriptomics 24 . Here we exploit our explicit 174 statistical model to perform realistic simulations based on parameters inferred from real data (here, 175 the TCGA UCS cohort). 176 We first assess a common strategy used in the field: Fisher's method to combine p-values of 177 a gene, each capturing a single feature of positive selection. We simulated somatic mutations in a 178 positively selected gene with both increased nonsynonymous mutation rates and mutational 179 hotspots. We ran two simple tests --a dN/dS test to detect enrichment of functional mutations and 180 another to detect spatial clustering (see Methods) --and then combined p values using Fisher's 181 method. Perhaps unexpectedly, the combined test has lower power than the dN/dS test alone 182 ( Figure 3a). We believe that this is because spatial clustering is a relatively weak feature in our 183 simulations (as in real data) and so the spatial test has much less power than the dN/dS test. 184 Consequently the spatial test adds more noise than signal, decreasing power. This result highlights a 185 weakness of methods based on combining p values; model-based approaches, such as ours, avoid 186 this problem by automatically weighting different features of the data based on their 187

informativeness. 188
We next used simulations to compare driverMAPS with six existing algorithms: MutSigCV, 189 OncodriveFML 9 , OncodriveFM 10 , OncodriveCLUST 8 , dNdScv 16  We next compared results from driverMAPS and other algorithms for predicting driver gene 201 using TCGA data (see Methods). Besides the full implementation of driverMAPS, we also tried a 202 "basic" version that looks only for an excess of nonsynonymous somatic mutations (without any 203 functional features or spatial model), and a "+feature" version with functional features but not the 204 spatial model. We applied all methods to the same somatic mutation data and compared the genes 205 they identified with a list of "known driver genes" (713 genes) compiled as the union of COSMIC 206  Table S7). Almost half of these putative novel genes 227 were not called by MutSigCV, OncodriveFML or dNdScv. Ten novel genes were found 228 independently in at least two tumor types (Table 1). This is unlikely to happen by chance 229 (permutation test, p < 1e -4 ), so these genes seem especially good candidates for being genuine 230 driver genes. 231 Since it is impractical to functionally validate all 170 putative novel genes, we sought other 232 data to support these genes likely being involved in cancer. We first selected three common cancers 233 --breast, lung and prostate --and conducted an extensive literature survey for each novel gene 234 identified in these tumor types. Among a total of 22 novel genes, we found clear support in the 235 literature for 20 being involved with cancer biology, either directly implicated as oncogenes or 236 tumor suppressor genes (but not in the list of "known driver genes") or linked to well-established 237 cancer pathways (Table S8). 238 We next assessed whether the novel genes were enriched for features often associated with 239 driver genes. Previous studies reported that driver genes tend to be highly expressed 4 compared 240 with other genes, and indeed we found that, collectively, the novel genes showed significantly 241 higher expression than randomly sampled genes in the corresponding tissues 21 (p<1e -4 ) ( Figure 5b). 242 Previous studies have also reported that driver genes tend to show enrichment and depletion 243 for different copy-number-variation (CNV) events, depending on their role in cancer. Specifically, 244 OGs are enriched for CNV gains and depleted for CNV loss, whereas TSGs show enrichment for 245 loss and depletion for gains. Consistent with this, we found novel genes identified as OGs are 246 enriched for CNV gain events (p<1e -4 ) while novel TSGs are depleted (p=3e -3 ). CNV loss events 247 for novel OGs are depleted compared to novel TSGs and to other genes (p= 0.04) (Figure 5c). 248 We also compared our novel genes with a "cancer dependency map" of 769 genes identified 249 from a large-scale RNAi screening study across 501 human cancer cell lines 26 . These are genes 250 whose knockdown affects cell growth differently across cancer cell lines, thus likely representing 251 genes that are critical for tumorigenesis, but not universally essential genes. We found 16 novel 252 driver genes overlapped with this gene list, a significant enrichment compared with random 253 sampling (odds ratio 2.9, p=3.7e -4 ) ( Figure 5d and Table S9). 254 To test whether our novel genes are functionally related to known cancer driver genes we 255 examined the connectivity of these two sets of genes in the HumanNet 27 gene network, which is 256 built from multiple data sources including protein-protein interactions and gene co-expression. On 257 average, each novel gene is connected to 3.8 known driver genes, significantly higher than expected 258 by chance (p = 0.001). We obtained a similarly significant result using a different gene network, 259 GeneMania 28 , which is constructed primarily from co-expression (p = 0.008) (Figure 5e). 260 Finally, we identified enriched functional categories in our novel genes using GO 261 enrichment 29,30 analysis (by geneSCF 31 ). Significant GO terms (FDR < 0.1, Figure 5f) include many 262 molecular processes directly implicated in cancer, such as transcription initiation and regulation. 263 The significant terms also include several that have not been previously implicated in cancer. Genes  35 . This form of RNA modification may influence RNA stability, export and 271 translation, and has been shown to be important for important biological processes such as stem cell 272 differentiation. Our results suggest that this RNA methylation pathway may also play a key role in 273 tumorigenesis, and so we examined the results for these genes in more detail. 274

METTL3 is a potential TSG in bladder cancer 276
driverMAPS identified the genes METTL3 and METTL14 as driver genes in the cohorts 277 BLCA (bladder cancer) and UCEC (uterine cancer) respectively. These two genes had relatively 278 low mutation frequencies (4% and 2%) and were not detected by MutSigCV, dNdScv or 279 OncodriveFML (those with reasonable FDR control). Inspecting the mutations in these two genes, 280 we found many to be "functional" as predicted by annotations, and showed spatial clustering 281 patterns in the MTase domain ( Figure 6a). Furthermore METTL3 contained a single synonymous 282 mutation, and METTL14 contained none, suggesting low baseline mutation rates at the two genes. 283 While this manuscript was in preparation, METTL14 was independently identified as a novel TSG 284 in endometrial cancer (Chuan He, to appear). We thus focused on METTL3 in bladder cancer. 285 To gain further insights into the potential impact of the somatic mutations in METTL3, we 286 performed structural analysis. By mapping mutations in the MTase domain of METTL3 to its 287 crystal structure 36 , we found them to be concentrated in two regions: one close to the binding site of When we tried to rescue this phenotype by transfection of METTL3 mutants, all of the mutations, 305 E532K/Q, E516K and P514T failed to restore methyltransferase activity to original levels ( Figure  306 6c, Figure S6a), suggesting that they are loss-of-function mutations. 307 We next examined whether disruption of METTL3 is associated with tumor progression. 308

315
We have developed an integrated statistical model-based method, driverMAPS, to identify 316 driver genes from patterns of somatic mutation. By applying this method to data from multiple 317 tumor types from TCGA, we detected 159 novel potential driver genes. We experimentally 318 validated the function of mutations in one gene, METTL3. The remaining genes (Table 1, Table S8-319 9) are enriched for many biological features relevant to cancer, and appear promising candidates for 320 further investigation. 321 Compared with previous methods for detecting driver genes, a key feature of driverMAPS is 322 that it models mutation rates at the base-pair level. This allows us to explicitly model how selection incorporating additional functional annotations and spatial modeling, allows that some non-330 synonymous mutations may be more informative than others in identifying driver genes. 331 Furthermore, by estimating parameters in a single integrated model, our approach learns how to 332 weigh and combine the many different sources of information. The results in Figures 3 and 4  333 demonstrate the increased power that comes from these extensions. 334 Our statistical and experimental results for the mRNA methyltransferase METTL3 add to 335 the growing evidence of links between mRNA methylation and cancer. Indeed, a recent study in 336 myeloid leukemia cell lines 38 , found that depletion of METTL3 also leads to a cancer-related 337 phenotype. And extensive functional studies of METTL14 in uterine cancer (Chuan He, to appear) 338 support a role for this gene in cancer etiology. However, intriguingly, our results on METTL3 in 339 bladder cancer, and the METTL14 results in uterine cancer suggest that they act as tumor 340 suppressor genes, whereas the data on METTL3 in myeloid leukemia cell lines are more consistent 341 with an oncogenic role, with depletion inducing cell differentiation and apoptosis 38 . Further studies 342 in multiple tumor types therefore seem necessary to properly characterize the role of mRNA 343 methylation in cancer. 344 Although our model incorporates many features not considered by existing methods, it 345 would likely benefit from incorporating still more features. For example, it may be useful to 346 incorporate data on protein structure, which affects the functional importance of amino acid 347 residues. Further, whereas we currently use the same mutation model for all individuals, it could be 348 helpful to incorporate individual-specific effects such as smoking-induced mutational signatures. 349 Finally, it could be useful to extend the model to incorporate information on non-coding variation, 350 which has been shown to be important for many human diseases including cancer. Although

Background Mutation model 402
For synonymous mutations we assume the following "background mutation model": 403 We allow the BMRs to depend on mutational features (e.g. the expression level of the gene) 408 using a log-linear model: 409

414
We assume that the gene-specific effects λ g have a gamma distribution across genes: 415 where α is a hyperparameter to be estimated. 417

Selection Mutation model 418
For the null model, H 0 , we assume no selection or spatial effect: For other models, m = OG,TSG , we allow the selection effect to depend on functional features 424 (e.g. the assessed deleteriousness of the mutation), using a log-linear model: 425

431
To model the spatial effects, we use a Hidden Markov Model (HMM) with parameters Θ m , 432 In brief, this HMM allows for the presence of mutation "hotspots" --contiguous base-pairs with a 434 higher rate of mutation --and the parameters include the average hotspot length and intensity of 435 selection (ρ). See Supplementary note for details. 436

Parameter estimation 437
Background mutation model 438 To simplify inference we took a sequential approach to parameter estimation. First we infer 439 parameters β b ,α of the BMM using the synonymous mutation data at all genes. Let S g denote the 440 subset of synonymous mutations S in gene g , and Y S g denote the corresponding observed counts: 441 (7) 442 Based on the synonymous mutation data, the likelihood for gene g is: 443

(9) 447
We maximize this likelihood, using numerical optimization, to obtain estimates β b ! ,α for β b ,α . 448 By ignoring the non-synonymous mutation data when fitting the BMM we may lose some 449 efficiency in principle, but we gain considerable simplification in practice. 450

Selection mutation model 451
We next estimate the model-specific parameters β f ,m . For m = OG,TSG . During this step 452 we ignore the HMM model (i.e. we set θ i m = 1 ), motivated by the fact that spatially-clustered 453 mutations are relatively rare and so should not significantly impact the estimates of β f ,m 454 For m = OG we estimate β f ,m using the non-synonymous mutation data from a curated list 455 G OG of 53 OGs. Estimation for β f ,TSG is identical except that we replace this list with a curated list 456 G TSG of 71 TSGs. Let G m denote these sets of training genes. Let Y NS g denote the counts of non-457 synonymous mutations in gene g . 458 Assuming independence across genes, the likelihood for β f ,m is: 459 where the second line follows because P(Y S g | β f ,m ) does not depend on β f ,m . The term in this 461 likelihood for gene g is given by: 462

(11) 463
It can be shown that 464 where µ g S and y g S are, respectively, the expected (considering only mutational features) and 466 observed number of synonymous mutations in gene g (see Supplementary notes). The conditional 467 mean of this distribution is α + y g Ŝ α + µ g S , so if y g S > µ g S , then E(λ g | Y S g ,α ) > 1. 468 We obtained the MLE of β f ,m by maximizing the likelihood (Equation 10) numerically, and 469 obtain corresponding estimated standard errors using the curvature of the likelihood (see 470 Supplementary notes). In tumor types with low mutation rates or sample sizes, these standard errors 471 can be relatively large, so we borrow information from other tumor types to ''stabilize'' these 472 estimates. Specifically we use the adaptive shrinkage method 20 to "shrink" estimated values of 473 β f ,m in each tumor type towards the mean across all tumor types . This shrinkage effect is strongest 474 for tumor types with large standard errors ( Figure S7). 475

HMM parameters 476
Having estimated β b ,α and β f ,m , we fix their values and estimate the HMM parameters 477 Θ m for m = OG,TSG . The likelihood function involves marginalization of the hidden states of the 478 Markov chain, which can be performed efficiently using standard methods for HMMs. We estimate 479 Θ m by maximizing this likelihood numerically. See Supplementary note for details. 480

Gene classification 481
Having estimated the model parameters as above, for each gene g , we compute its Bayes 482 Factor for being a driver gene as: 483 The equal weights in the numerator of this BF assume that OGs and TSGs are equally common. 485 This BF simplifies to 486

Supplementary notes). 490
After obtaining the BFs, we can compute the posterior probability of being a driver gene 491 (either OG or TSG ) for every gene, and estimate the Bayesian FDR 41 for any given BF threshold. 492 This step requires estimation of the proportion of driver genes, which we do by maximum 493

Simulations 495
For power analysis shown in Figure 3(a), we randomly picked a gene (ERBB3) and for a 496 given number of samples, we simulated mutations under positive selection and assessed the power 497 of detecting this gene as positively selected using different methods. We simulate synonymous 498 mutations at predefined background mutation rates (BMRs); we simulate positively selected 499 mutations at elevated mutation rates for nonsynonymous sites and hotspot sites (generated by a 500 Markov chain). This simulation procedure was performed many times and each time we obtained p 501 value for each method. Power is defined as the fraction of simulations with significant p values (p < 502 0.05). The test statistics for "dN/dS" method is the likelihood ratio of between Poisson models 503 under elevated mutation rates and BMRs. The test statistics for "cluster" method is the maximum 504 number of mutations within 3bp windows normalized by overall mutation rates. Null distributions 505 of test statistics are obtained by simulations with mutation rates for all sites equal to BMRs. p value 506 for "combined" method is obtained by combining p values of "dN/dS" and "cluster" using Fisher's 507

method. 508
For simulation performed in Figure 3 For all simulations, the predefined BMRs, effect sizes for functional annotations and spatial 520 clustering hotspot rated parameters were estimates by driverMAPS using UCS data (Table S1-S5, 521 UCS parameters). We re-estimated these parameters when running driverMAPS. 522

Comparison of gene prediction results from different methods 523
When comparing methods, we used the same mutation data (after filtering) and the same 524 nominal FDR threshold (0.1) for each method. Because driverMAPS used 124 known cancer genes 525 as a training set, to avoid bias towards this subset of genes when computing precision or power for 526 driverMAPS, we ran MAPs using a leave-one-out strategy. We perform 124 runs, each time 527 omitting one TSG/OG from the training set and estimating model parameters from the remaining 528 genes, and then count the omitted gene as "significant" only if this TSG/OG is significant 529 (FDR<0.1) in this run. We then calculate precision as the percentage of significant known cancer 530 genes of all significant genes. All data related to driverMAPS (basic, +feature and full version) 531 presented in Figure 3 were obtained in this way. In fact, estimated model parameters are quite stable 532 across runs, and so the overall result is similar to a single run not using this "leave-one-out" 533 strategy. 534

In vitro assay for m 6 A methyltransferase activity 546
The recombinant, His-tagged proteins METTL14 with wildtype or mutant METTL3 were 547 expressed in 1 LB Ecoli expression system and purified through Ni-NTA affinity column according 548 to a previously published procedure 42 . Protein purity was assessed by SDS-PAGE, and protein 549 concentration was determined by UV absorbance at 280 nm. We performed an in vitro 550 methyltransferase activity assay in a 50 µ L reaction mixture containing the following components:

RNA isolation 559
Total RNA was isolated with TRIZOL reagent (Invitrogen). mRNA was extracted from the 560 total RNA using the Dynabeads ® mRNA Purification Kit (Invitrogen), followed by removal of 561 contaminating rRNA with the RiboMinus transcriptome isolation kit (Invitrogen). mRNA 562 concentration was measured by UV absorbance at 260 nm.                 We use "/" to separate data obtained from different tumor types as indicated in the "Tumor" 809 column. A brief description of the gene's function and its known role in cancer is provided in the 810 "Function" column. Reference PMIDs are given in parentheses. 811