Abstract
Methods are needed to reliably prioritize biologically active driver mutations over inactive passengers in high-throughput sequencing cancer data sets. We present ParsSNP, an unsupervised functional impact predictor that is guided by parsimony. ParsSNP uses an expectation–maximization framework to find mutations that explain tumor incidence broadly, without using predefined training labels that can introduce biases. We compare ParsSNP to five existing tools (CanDrA, CHASM, FATHMM Cancer, TransFIC, and Condel) across five distinct benchmarks. ParsSNP outperformed the existing tools in 24 of 25 comparisons. To investigate the real-world benefit of these improvements, we applied ParsSNP to an independent data set of 30 patients with diffuse-type gastric cancer. ParsSNP identified many known and likely driver mutations that other methods did not detect, including truncation mutations in known tumor suppressors and the recurrent driver substitution RHOA p.Tyr42Cys. In conclusion, ParsSNP uses an innovative, parsimony-based approach to prioritize cancer driver mutations and provides dramatic improvements over existing methods.
Acknowledgements
We thank O.L. Griffith for critically reading the manuscript. Our work was supported by the Alvin J. Siteman Cancer Center, the Ohana Breast Cancer Research Fund, the Foundation for the Barnes-Jewish Hospital (to R.B.), the National Library of Medicine of the National Institutes of Health (R01LM012222 to S.J.S.), and the Canadian Institutes of Health Research (DFS-134967 to R.D.K.).
Author information
Contributions
R.D.K. and S.J.S. designed the study. R.D.K. wrote software and performed the analysis. R.D.K., S.J.S. and R.B. wrote the manuscript. R.B. supervised the project.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 ParsSNP convergence and reproducibility.
(a) The EM portion of ParsSNP consistently converges in 15–20 iterations. Lines are offset slightly to aid visualization. (b) The pan-cancer training set was partitioned randomly into two equally sized, independent halves. ParsSNP produces highly correlated scores when trained on independent but comparable data sets (n = 566,223).
Supplementary Figure 2 Comparison of default parameters and parameter variations during learning.
The results of various alternative parameter settings are plotted against the reference labels produced using default settings in the training data set (n = 566,223; see Supplementary Table 2 for summary statistics). Most alternative settings produce predictions that are highly correlated with the reference. ELOG1.1, the E step uses a logarithmic upper bound with base of 1.1 (default = 2); ELOG10, the logarithm base is 10; ECONSTANT3, the E step uses a constant upper bound set to 3 (the default upper bound scales logarithmically in base 2); ECONSTANT10, constant upper bound of 10; ECONSTANT20, constant upper bound of 20; EFLOOR0, the E-step lower bound is set to 0 (default = 1); EFLOOR5, the lower bound is set to 5; E3to10, the E step uses lower and upper bounds of 3 and 10 for all samples; ESTEP0.8, the E-step sliding bound is calculated as 80% of current belief (default = 90%); ESTEP0.95, the sliding bound is calculated as 95% of current belief; MCV2, the M step uses 2-fold cross-validation (default = 5); LOGISTIC, the M step uses logistic regression (default is a tuned neural network); NODES6, the M step uses a neural network with only 6 hidden nodes (default is tuned, can use more than 6 nodes); DECAY0.1, the M step uses a neural network with weight decay of 0.1 (default is tuned, can use less stringent decay); DECAY0.1; NODES6, the M step enforces use of a simpler neural network than default settings require.
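The E-step bounds varied in this figure can be pictured with a minimal sketch. The caption does not say what the logarithmic upper bound is computed over, so the sketch assumes it is the number of mutations in a sample. It also assumes the upper bound never falls below the lower bound and neither bound exceeds the mutation count. The function name and arguments are hypothetical, and this is an illustration rather than the authors' implementation.

```python
import math


def driver_bounds(n_mutations, log_base=2.0, floor=1, constant_upper=None):
    """Return (lower, upper) bounds on the driver count for one sample.

    Assumptions (not stated in the source): the logarithm is taken over
    the sample's mutation count, and bounds are clipped to that count.
    """
    if n_mutations < 1:
        raise ValueError("a sample needs at least one mutation")
    if constant_upper is not None:
        # Constant upper bound, as in the ECONSTANT3/10/20 variations.
        upper = constant_upper
    else:
        # Logarithmic upper bound; base 2 is the stated default.
        upper = math.log2(n_mutations) / math.log2(log_base)
    # Keep the interval non-empty and never above the mutation count.
    upper = max(upper, floor)
    return min(floor, n_mutations), min(upper, n_mutations)


# Hypothetical sample with 64 mutations under three of the listed settings.
default_bounds = driver_bounds(64)                  # default: floor 1, log base 2
constant_bounds = driver_bounds(64, constant_upper=3)  # ECONSTANT3
floor0_bounds = driver_bounds(64, floor=0)          # EFLOOR0
```

Under these assumptions, a smaller logarithm base (ELOG1.1) permits more drivers per sample and a larger base (ELOG10) permits fewer.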
Supplementary Figure 3 Detecting recurrent missense mutations in the pan-cancer test set.
Control ROC curves, related to Supplementary Table 3, column 1. AUROCs are depicted.
Supplementary Figure 4 Detecting non-recurrent mutations in CGC members in the pan-cancer test set.
Control ROC curves, related to Supplementary Table 3, column 2. AUROCs are depicted.
Supplementary Figure 5 Detecting driver mutations in the driver–dbSNP data set.
Control ROC curves, related to Supplementary Table 3, column 3. AUROCs are depicted.
Supplementary Figure 6 Detecting disruptive mutations in the IARC p53 data set.
Control ROC curves, related to Supplementary Table 3, column 4. AUROCs are depicted.
Supplementary Figure 7 Detecting functional mutations in the functional–neutral data set.
Control ROC curves, related to Supplementary Table 3, column 5. AUROCs are depicted.
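The AUROC reported for the control ROC curves in Supplementary Figures 3–7 can be read as the probability that a randomly chosen positive mutation scores higher than a randomly chosen negative one. The sketch below computes it directly from that definition, with ties counted as one half. The function name and the example scores are hypothetical, and the quadratic loop is chosen for clarity rather than speed. It is not the software used in the paper.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    item has the higher score; tied pairs contribute one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores: two labelled drivers (1) and two passengers (0).
example = auroc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.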
Supplementary Figure 8 Box plots of ParsSNP score by mutation and gene type.
(a) Truncation rate is a gene-level descriptor that assigns low P values to genes enriched in truncations (TSG-like) and assigns high P values to genes that are depleted in truncations (oncogene-like). ‘Truncation’ events include frameshift, premature stop, and non-stop changes. ‘Missense’ mutations include missense substitutions as well as in-frame indels. ‘Silent’ changes include synonymous nucleotide substitutions as well as noncoding variants. Truncations receive higher median scores in TSG-like genes, while missense mutations receive higher scores in both TSG-like and oncogene-like genes. This represents a potential nonlinear two-way interaction between ParsSNP descriptors (truncation rate and mutation type). Boxes enclose the interquartile range. (b) ParsLR uses logistic regression rather than a neural network model and does not exhibit the same properties as the full ParsSNP model.
Supplementary Figure 9 Identification of putative driver genes and mutations.
Genes are plotted by the average ParsSNP score of their mutations and their single highest score in the entire pan-cancer data set (training + test + hypermutator). The top ParsSNP scoring mutations are generally found in members of the CGC. Two genes not belonging to the CGC have multiple exceptional mutations (arrows): TATA-box-binding protein (TBP) and the calcium-activated potassium channel KCNN3. Both have significantly higher median ParsSNP scores than expected by chance (Bonferroni-corrected one-sample Wilcoxon P < 0.05) and multiple mutations with exceptionally high ParsSNP scores, including TBP A191T (ParsSNP = 0.75) and R168Q (0.67), as well as KCNN3 R435C (0.60), L413Q (0.59), and S517Y (0.53).
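The Bonferroni correction mentioned above multiplies each gene-level P value by the number of genes tested and caps the result at 1. The sketch below applies that rule to a dictionary of P values. The gene names and P values are placeholders, not results from the paper, and the function name is hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return genes whose Bonferroni-adjusted P value is below alpha.

    p_values maps gene name -> raw P value from a per-gene test.
    """
    m = len(p_values)  # number of tests performed
    adjusted = {gene: min(1.0, p * m) for gene, p in p_values.items()}
    return {gene: p_adj for gene, p_adj in adjusted.items() if p_adj < alpha}


# Placeholder inputs for illustration only.
hits = bonferroni_significant({"GENE_A": 0.001, "GENE_B": 0.02, "GENE_C": 0.5})
```

With three tests, a raw P value of 0.02 adjusts to 0.06 and no longer passes the 0.05 cutoff, which shows how the correction guards against false positives when many genes are tested.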
Supplementary Figure 10 Differential functionality between hypermutated and non-hypermutated samples.
(a) A one-sample Wilcoxon test was performed on each gene in both the hypermutated and non-hypermutated (training + test) portions of the data set using internal null distributions. The –log10 P values of these tests are shown. As expected, many well-known cancer genes were more easily detected in the non-hypermutators. No genes were observed with elevated ParsSNP scores exclusively in the hypermutators. (b) A two-sample Wilcoxon test was performed for each gene, comparing the ParsSNP scores assigned to it in the hypermutated and non-hypermutated segments. Genes are plotted by the magnitude of median shift (negative values indicate lower scores in the hypermutated samples) and the –log10 P value. This analysis indicates that mutations in RNF43 and UPF3A have modestly but significantly elevated scores when observed in hypermutators. This suggests that these genes may be involved in the unique biology of these tumors.
Supplementary Figure 11 ParsSNP performance and data set size.
ParsSNP models were trained on progressively smaller subsets of the pan-cancer training data (n = 566,223), and performance (AUROC) was assessed for each classification task. Points represent average performance from five replicates.
Supplementary Figure 12 Criteria for thresholding ParsSNP scores.
(a) The E-step constraints are one possible objective criterion for thresholding ParsSNP scores. The value to be optimized is the percentage of samples receiving a number of driver mutations that is compatible with the E-step upper and lower bounds under the proposed threshold. (b) Another approach is to select a threshold that optimizes accuracy (correct classification rate) in the classification tasks.
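Both thresholding criteria can be sketched in a few lines. The sketch assumes a mutation is called a driver when its score is at or above the threshold, and that per-sample lower and upper bounds are supplied by the caller. The function names and the example scores are hypothetical, and this is not the authors' code.

```python
def compatible_fraction(sample_scores, bounds, threshold):
    """Criterion (a): fraction of samples whose driver count is within bounds.

    sample_scores: one list of mutation scores per sample.
    bounds: one (lower, upper) pair per sample.
    """
    ok = 0
    for scores, (lower, upper) in zip(sample_scores, bounds):
        n_drivers = sum(1 for s in scores if s >= threshold)
        if lower <= n_drivers <= upper:
            ok += 1
    return ok / len(sample_scores)


def accuracy(scores, labels, threshold):
    """Criterion (b): correct classification rate at a given threshold."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(scores)


def best_threshold(candidates, criterion):
    """Pick the candidate threshold that maximizes the given criterion."""
    return max(candidates, key=criterion)


# Hypothetical scores for three samples, each allowed 1 to 2 drivers.
samples = [[0.9, 0.2, 0.1], [0.6, 0.55, 0.05], [0.3, 0.2]]
bounds = [(1, 2)] * 3
chosen = best_threshold(
    [0.1, 0.25, 0.5],
    lambda t: compatible_fraction(samples, bounds, t),
)
```

The same `best_threshold` helper can be reused for criterion (b) by passing a function that evaluates `accuracy` on a labelled classification task.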
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–12. (PDF 1738 kb)
Supplementary Tables 1–7
Supplementary Tables 1–7. (XLSX 5336 kb)
About this article
Cite this article
Kumar, R., Swamidass, S. & Bose, R. Unsupervised detection of cancer driver mutations with parsimony-guided learning. Nat Genet 48, 1288–1294 (2016). https://doi.org/10.1038/ng.3658
DOI: https://doi.org/10.1038/ng.3658