Detailed modeling of positive selection improves detection of cancer driver genes

Identifying driver genes from somatic mutations is a central problem in cancer biology. Existing methods, however, either lack explicit statistical models, or use models based on simplistic assumptions. Here, we present driverMAPS (Model-based Analysis of Positive Selection), a model-based approach to driver gene identification. This method explicitly models positive selection at the single-base level, as well as highly heterogeneous background mutational processes. In particular, the selection model captures elevated mutation rates in functionally important sites using multiple external annotations, and spatial clustering of mutations. Simulations under realistic evolutionary models demonstrate the increased power of driverMAPS over current approaches. Applying driverMAPS to TCGA data of 20 tumor types, we identified 159 new potential driver genes, including the mRNA methyltransferase METTL3-METTL14. We experimentally validated METTL3 as a tumor suppressor gene in bladder cancer, providing support to the important role mRNA modification plays in tumorigenesis.

4. It is not convincing for figure 5b-c since it is not surprising to see the statistical significance when comparing novel significant genes with random genes, especially these genes are expressed at a higher level than non-driver genes. Did these significant genes show higher expression level/CNV alterations than gene lists identified by other algorithms? 5. The algorithm mentioned the single-base level, while it seems unnecessary to identify driver genes. Could the authors clarify? If this is really the functions of the algorithm, could the author use this directly to predict which mutation on METTL3 is a driver?
Minor comments: 1. It is difficult to interpret the results for figure 6c/d without appropriate label. The color does not match from legend to the figure.
2. Could the authors also include the paper for METTL14 in endometrial cancer (Chuan He, to appear) in the revision for reference?
Reviewer #2 (Remarks to the Author): Zhao et al present a novel method, driverMAPS, to nominate driver genes in exome SNV data. The approach is based on a Bayesian hypothesis testing approach that classifies genes according to being an oncogene or TSG driver or null. They benchmark the method against state of the art approaches (dndscv, MutSigCV, oncodrive suite) to demonstrate increased power and adequate false discovery control with driverMAPS. They validate one of the novel hits derived from their method using functional experimentation.
Overall, the method is conceptually sound, clearly presented (though with numerous typos), and assiduously benchmarked against state of the art approaches. Functional mutations in an RNA methyltransferase (METTL3) previously not associated with cancer are validated in cell lines using RNAi and transgene over-expression in combination with an RNA methylation readout.
DriverMAPS employs a base resolution modeling of background mutation rates and explicit modeling of alternative hypotheses (oncogene, tumor suppressor) to derive a Bayes Factor of a gene being a cancer driver. The background likelihood models context-specific mutation counts at each position of the exome as an overdispersed Poisson whose log mean is a linear combination (defined by "Beta_b" coefficients) of gene level log mutation rate and regional covariates (gene expression, hi-c). The "selection model" likelihood adds additional "functional" terms to this log mean parameter. These comprise "Beta_f" coefficients that specify linear combinations of "functional features", e.g. SIFT, PhyloP, as well as an HMM derived "hotspot term" theta.
Model fitting involves finding maximum likelihood assignments for these Beta coefficients and the theta coefficients, among others, against real data of mutation counts. Background models are fit genome-wide against synonymous counts only. The two separate selection models (oncogene and tumor suppressor) are fit against non-synonymous counts using small curated sets of (<100) oncogenes and tumor suppressor genes. The resulting models are then applied genome-wide to yield Bayes factors for each genes, which are then combined using a Bayesian FDR procedure to yield a list of driver genes for the given tumor type.
Major critiques: -Justify / Clarify model fitting and necessity of curated "training sets" of oncogenes and tumor suppressors The selection model is fit to curated tumor suppressors and oncogenes -this reliance makes the method seem a little flimsy and circular, since many of these "lists" have been derived from similar analyses (eg MutSig) applied to TCGA These genes are presumably chosen from a "pan-cancer" list but the models fit by tumor type, where the majority of these genes are not relevant -e.g. the majority of EGFR mutations in melanoma are passengers. This makes me wonder how essential this gene choice is for the model fitting. Indeed the correlation of TSG with CNV loss frequency ( Figure 4) is poor, so not clear how much additional signal the TSG / OG training is picking up.
How well would a single "driver" model trained on all genes perform compared to this model? Similarly, how well does this OG + TSG driver model perform as an OG / TSG "classifier" on cross validation eg if you leave half of the genes out of the training? My guess it's not spectacular.
The background model / B^b parameters is fit using synonymous mutations only. Are these B^b parameters used in the selection model or re-fitted in the selection model?
What happens when nonsynonymous mutations are used instead? Since the background model lacks functional and hotspot features, it should still show a difference vs the selection model.
B^f parameters (SIFT, CONS, PhyloP) are reported in Figure 2c for the background model (B^f_0) -but the BMM definition does not include a selection term so not clear how these are fit. eg using the "selection model" on synonymous data? -Please describe and justify simulation approach used for power analysis The simulations are core to the arguments that driverMAPS has increased power and adequate FDR control, however they are not rigorously specified. "We simulated mutations under positive selection" and "we simulate synonymous mutations at predefined background mutation rates" are quite vague. One guess is that the authors are using the inference model with some specified parameters as a generative model from which to draw mutation count data. If so, then the simulations seem somewhat "rigged" to favor driverMAPS. This is especially true if the generative model has the exact same structure (eg same set of background and functional covariates) as the inference. What would happen if you had 10 unknown functional covariates generating the positively selected data, or only considered a subset of the generative functional / background covariates in the driverMAPS inference. ie how does incomplete knowledge of these factors influence the power and precision of the inference.
It would be ideal if the authors could use a more objective benchmark e.g. a "third party" cancer mutation simulation software or analysis of subsampled "real" data. There is no third party software that I'm aware of, and analysis of subsampled data vs a gold standard (eg COSMIC) may provide a decent analysis of specificity but sensitivity is hard to quantify. However, the authors should either pursue something in this direction or at the very least present a rigorous description of the simulations used for benchmarking.
Minor critiques: -Please clarify the HMM model and its inference, especially the theta term I've read the main and supplementary methods. It's not clear what theta represents, since it depends on p_m, which is not explicitly defined in the supplement via a formula but only described as the "average increase of mutation rate in hotspots under model m". Conceptually theta should be a variable whose value is >1 at hotspots and is 1 otherwise ... eg like a relative risk for the binary variable of "is hotspot" vs "is not hotspot".
Please provide an explicit formula for theta and rho_m. There is also a parameter rho_0, and rho mentioned in the HMM model fitting (page 7 supplement) that is not defined in the HMM spec (page 6). My guess is that rho is some odds ratio of being a hotspot, but this is not clear.
The initial HMM parameters (p_0, p_1) btw are strangely defined -these params don't have a subscript i but yet they are defined in terms of Z_i. Are these just supposed to be the probabilities of P(Z_0^m = 1).