  • Technical Report

Unsupervised detection of cancer driver mutations with parsimony-guided learning

Abstract

Methods are needed to reliably prioritize biologically active driver mutations over inactive passengers in high-throughput sequencing cancer data sets. We present ParsSNP, an unsupervised functional impact predictor that is guided by parsimony. ParsSNP uses an expectation–maximization framework to find mutations that explain tumor incidence broadly, without using predefined training labels that can introduce biases. We compare ParsSNP to five existing tools (CanDrA, CHASM, FATHMM Cancer, TransFIC, and Condel) across five distinct benchmarks. ParsSNP outperformed the existing tools in 24 of 25 comparisons. To investigate the real-world benefit of these improvements, we applied ParsSNP to an independent data set of 30 patients with diffuse-type gastric cancer. ParsSNP identified many known and likely driver mutations that other methods did not detect, including truncation mutations in known tumor suppressors and the recurrent driver substitution RHOA p.Tyr42Cys. In conclusion, ParsSNP uses an innovative, parsimony-based approach to prioritize cancer driver mutations and provides dramatic improvements over existing methods.

Figure 1: Overview of ParsSNP and label learning.
Figure 2: ParsSNP detects recurrent mutations and mutations in known cancer-related genes in the pan-cancer test set.
Figure 3: ParsSNP identifies experimentally validated mutations in external data sets.
Figure 4: Comparison of candidate driver mutations in an independent data set identifies known and likely drivers that are only identified by ParsSNP.

Acknowledgements

We thank O.L. Griffith for critically reading the manuscript. Our work was supported by the Alvin J. Siteman Cancer Center, the Ohana Breast Cancer Research Fund, the Foundation for the Barnes-Jewish Hospital (to R.B.), the National Library of Medicine of the National Institutes of Health (R01LM012222 to S.J.S.), and the Canadian Institutes of Health Research (DFS-134967 to R.D.K.).

Author information

Contributions

R.D.K. and S.J.S. designed the study. R.D.K. wrote software and performed the analysis. R.D.K., S.J.S. and R.B. wrote the manuscript. R.B. supervised the project.

Corresponding authors

Correspondence to S Joshua Swamidass or Ron Bose.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 ParsSNP convergence and reproducibility.

(a) The EM portion of ParsSNP consistently converges in 15–20 iterations. Lines are offset slightly to aid visualization. (b) The pan-cancer training set was partitioned randomly into two equally sized, independent halves. ParsSNP produces highly correlated scores when trained on independent but comparable data sets (n = 566,223).

Supplementary Figure 2 Comparison of default parameters and parameter variations during learning.

The results of various alternative parameter settings are plotted against the reference labels produced using default settings in the training data set (n = 566,223; see Supplementary Table 2 for summary statistics). Most alternative settings produce predictions that are highly correlated with the reference. The settings are:

  • ELOG1.1: the E step uses a logarithmic upper bound with a base of 1.1 (default = 2).
  • ELOG10: the logarithm base is 10.
  • ECONSTANT3: the E step uses a constant upper bound set to 3 (the default upper bound scales logarithmically in base 2).
  • ECONSTANT10: constant upper bound of 10.
  • ECONSTANT20: constant upper bound of 20.
  • EFLOOR0: the E-step lower bound is set to 0 (default = 1).
  • EFLOOR5: the lower bound is set to 5.
  • E3to10: the E step uses lower and upper bounds of 3 and 10 for all samples.
  • ESTEP0.8: the E-step sliding bound is calculated as 80% of current belief (default = 90%).
  • ESTEP0.95: the sliding bound is calculated as 95% of current belief.
  • MCV2: the M step uses 2-fold cross-validation (default = 5).
  • LOGISTIC: the M step uses logistic regression (default is a tuned neural network).
  • NODES6: the M step uses a neural network with only 6 hidden nodes (default is tuned and can use more than 6 nodes).
  • DECAY0.1: the M step uses a neural network with weight decay of 0.1 (default is tuned and can use less stringent decay).
  • DECAY0.1; NODES6: both of the preceding settings combined, so the M step enforces use of a simpler neural network than default settings require.
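The lower and upper E-step bounds described in this legend can be made concrete with a small sketch. It is illustrative only and rests on assumptions the legend does not state: that the logarithmic upper bound is taken over a sample's mutation count, that it is rounded up, and that labels are assigned to the top-scoring mutations. The function names `driver_count_bounds` and `label_drivers` are hypothetical, the sliding bound is left out because the legend does not say how it combines with the other limits, and none of this should be read as the authors' implementation.

```python
import math


def driver_count_bounds(n_mutations, log_base=2.0, floor=1):
    """Return (lower, upper) limits on the number of drivers in one sample.

    Assumption: the upper bound is the log_base logarithm of the sample's
    mutation count, rounded up, never below the floor and never above
    the number of mutations actually present.
    """
    if n_mutations < 1:
        return 0, 0
    if n_mutations > 1:
        # round() guards against floating-point noise such as 2.9999999999999996
        upper = math.ceil(round(math.log(n_mutations, log_base), 9))
    else:
        upper = 0
    upper = min(max(upper, floor), n_mutations)
    lower = min(floor, n_mutations)
    return lower, upper


def label_drivers(scores, lower, upper):
    """Label the top-scoring mutations as drivers, keeping the count in bounds."""
    expected = round(sum(scores))            # current belief about the driver count
    k = max(lower, min(upper, expected))     # clip the belief to [lower, upper]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [1 if i in chosen else 0 for i in range(len(scores))]


# Hypothetical sample with four mutations and their current scores.
lower, upper = driver_count_bounds(4)
labels = label_drivers([0.9, 0.1, 0.8, 0.05], lower, upper)
```

Under these assumptions, changing `log_base` or `floor` corresponds to the ELOG and EFLOOR variations, and passing fixed bounds to `label_drivers` corresponds to the ECONSTANT and E3to10 variations.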

Supplementary Figure 3 Detecting recurrent missense mutations in the pan-cancer test set.

Control ROC curves, related to Supplementary Table 3, column 1. AUROCs are depicted.

Supplementary Figure 4 Detecting non-recurrent mutations in CGC members in the pan-cancer test set.

Control ROC curves, related to Supplementary Table 3, column 2. AUROCs are depicted.

Supplementary Figure 5 Detecting driver mutations in the driver–dbSNP data set.

Control ROC curves, related to Supplementary Table 3, column 3. AUROCs are depicted.

Supplementary Figure 6 Detecting disruptive mutations in the IARC p53 data set.

Control ROC curves, related to Supplementary Table 3, column 4. AUROCs are depicted.

Supplementary Figure 7 Detecting functional mutations in the functional–neutral data set.

Control ROC curves, related to Supplementary Table 3, column 5. AUROCs are depicted.
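Each of the five control comparisons above reports an AUROC. As a reminder of what that number measures, the following minimal sketch computes it as the probability that a randomly chosen positive mutation receives a higher score than a randomly chosen negative one, with ties counted as one half. The scores and labels are made-up illustrative values, not data from the paper, and the function name `auroc` is an assumption of this sketch.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical predictor scores; 1 marks a driver, 0 a passenger.
example = auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

The pairwise form is quadratic in the number of mutations, so it suits small illustrations; a rank-based formulation gives the same value far more cheaply on large test sets.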

Supplementary Figure 8 Box plots of ParsSNP score by mutation and gene type.

(a) Truncation rate is a gene-level descriptor that assigns low P values to genes enriched in truncations (TSG-like) and assigns high P values to genes that are depleted in truncations (oncogene-like). ‘Truncation’ events include frameshift, premature stop, and non-stop changes. ‘Missense’ mutations include missense substitutions as well as in-frame indels. ‘Silent’ changes include synonymous nucleotide substitutions as well as noncoding variants. Truncations receive higher median scores in TSG-like genes, while missense mutations receive higher scores in both TSG-like and oncogene-like genes. This represents a potential nonlinear two-way interaction between ParsSNP descriptors (truncation rate and mutation type). Boxes enclose the interquartile range. (b) ParsLR uses logistic regression rather than a neural network model and does not exhibit the same properties as the full ParsSNP model.

Supplementary Figure 9 Identification of putative driver genes and mutations.

Genes are plotted by the average ParsSNP score of their mutations and their single highest score in the entire pan-cancer data set (training + test + hypermutator). The top ParsSNP scoring mutations are generally found in members of the CGC. Two genes not belonging to the CGC have multiple exceptional mutations (arrows): TATA-box-binding protein (TBP) and the calcium-activated potassium channel KCNN3. Both have significantly higher median ParsSNP scores than expected by chance (Bonferroni-corrected one-sample Wilcoxon P < 0.05) and multiple mutations with exceptionally high ParsSNP scores, including TBP A191T (ParsSNP = 0.75) and R168Q (0.67), as well as KCNN3 R435C (0.60), L413Q (0.59), and S517Y (0.53).
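The gene-level test above is Bonferroni-corrected. A minimal sketch of that correction follows: each raw P value is multiplied by the number of genes tested and capped at 1. The gene names and P values are placeholders chosen for illustration, not results from the paper, and the function name `bonferroni` is an assumption of this sketch.

```python
def bonferroni(p_values):
    """Multiply each raw P value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return {gene: min(1.0, p * m) for gene, p in p_values.items()}


# Placeholder raw P values for three hypothetical genes.
raw = {"GENE_A": 0.001, "GENE_B": 0.02, "GENE_C": 0.6}
adjusted = bonferroni(raw)
significant = [gene for gene, p in adjusted.items() if p < 0.05]
```

Because the multiplier is the number of genes tested, a gene must have a very small raw P value to remain below 0.05 in a genome-wide analysis.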

Supplementary Figure 10 Differential functionality between hypermutated and non-hypermutated samples.

(a) A one-sample Wilcoxon test was performed on each gene in both the hypermutated and non-hypermutated (training + test) portions of the data set using internal null distributions. The –log10 P values of these tests are shown. As expected, many well-known cancer genes were more easily detected in the non-hypermutators. No genes were observed with elevated ParsSNP scores exclusively in the hypermutators. (b) A two-sample Wilcoxon test was performed for each gene, comparing the ParsSNP scores assigned to it in the hypermutated and non-hypermutated segments. Genes are plotted by the magnitude of median shift (negative values indicate lower scores in the hypermutated samples) and the –log10 P value. This analysis indicates that mutations in RNF43 and UPF3A have modestly but significantly elevated scores when observed in hypermutators. This suggests that these genes may be involved in the unique biology of these tumors.

Supplementary Figure 11 ParsSNP performance and data set size.

ParsSNP models were trained on progressively smaller subsets of the pan-cancer training data (n = 566,223), and performance (AUROC) was assessed for each classification task. Points represent average performance from five replicates.

Supplementary Figure 12 Criteria for thresholding ParsSNP scores.

(a) The E-step constraints are one possible objective criterion for thresholding ParsSNP scores. The value to be optimized is the percentage of samples receiving a number of driver mutations that is compatible with the E-step upper and lower bounds under the proposed threshold. (b) Another approach is to select a threshold that optimizes accuracy (correct classification rate) in the classification tasks.
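Both thresholding criteria can be sketched in a few lines. The example below scores a candidate threshold either by the share of samples whose count of above-threshold mutations falls within the E-step bounds, or by classification accuracy, and then picks the best candidate. It simplifies by using one fixed pair of bounds for all samples, whereas the default ParsSNP upper bound varies logarithmically; all function names and numbers are hypothetical.

```python
def fraction_within_bounds(samples, threshold, lower, upper):
    """Share of samples whose count of above-threshold mutations lies in [lower, upper]."""
    ok = sum(
        1 for scores in samples
        if lower <= sum(s >= threshold for s in scores) <= upper
    )
    return ok / len(samples)


def accuracy(scores, labels, threshold):
    """Correct classification rate when scores at or above the threshold are called drivers."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(scores)


def best_threshold(candidates, criterion):
    """Return the candidate with the highest criterion value (first one wins ties)."""
    return max(candidates, key=criterion)


# Criterion (a): hypothetical per-sample score lists, one driver allowed per sample.
samples = [[0.9, 0.2, 0.1], [0.6, 0.55, 0.3], [0.4, 0.1]]
t_bounds = best_threshold(
    [0.1, 0.3, 0.5, 0.7],
    lambda t: fraction_within_bounds(samples, t, lower=1, upper=1),
)

# Criterion (b): hypothetical labelled mutations from a classification task.
scores, labels = [0.9, 0.6, 0.4, 0.2], [1, 1, 0, 0]
t_accuracy = best_threshold([0.3, 0.5, 0.7], lambda t: accuracy(scores, labels, t))
```

The two criteria need not agree: the first uses only the model's own parsimony constraints, while the second depends on labelled benchmark data.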

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–12. (PDF 1738 kb)

Supplementary Tables 1–7

Supplementary Tables 1–7. (XLSX 5336 kb)

About this article

Cite this article

Kumar, R., Swamidass, S. & Bose, R. Unsupervised detection of cancer driver mutations with parsimony-guided learning. Nat Genet 48, 1288–1294 (2016). https://doi.org/10.1038/ng.3658

  • DOI: https://doi.org/10.1038/ng.3658
