Unsupervised detection of cancer driver mutations with parsimony-guided learning

Kumar, Runjun D; Swamidass, S Joshua; Bose, Ron

doi:10.1038/ng.3658

Technical Report
Published: 12 September 2016

Runjun D Kumar^1,2,3,
S Joshua Swamidass^2,4 &
Ron Bose^1

Nature Genetics volume 48, pages 1288–1294 (2016)

Abstract

Methods are needed to reliably prioritize biologically active driver mutations over inactive passengers in high-throughput sequencing cancer data sets. We present ParsSNP, an unsupervised functional impact predictor that is guided by parsimony. ParsSNP uses an expectation–maximization framework to find mutations that explain tumor incidence broadly, without using predefined training labels that can introduce biases. We compare ParsSNP to five existing tools (CanDrA, CHASM, FATHMM Cancer, TransFIC, and Condel) across five distinct benchmarks. ParsSNP outperformed the existing tools in 24 of 25 comparisons. To investigate the real-world benefit of these improvements, we applied ParsSNP to an independent data set of 30 patients with diffuse-type gastric cancer. ParsSNP identified many known and likely driver mutations that other methods did not detect, including truncation mutations in known tumor suppressors and the recurrent driver substitution RHOA p.Tyr42Cys. In conclusion, ParsSNP uses an innovative, parsimony-based approach to prioritize cancer driver mutations and provides dramatic improvements over existing methods.

**Figure 1: Overview of ParsSNP and label learning.**

**Figure 2: ParsSNP detects recurrent mutations and mutations in known cancer-related genes in the pan-cancer test set.**

**Figure 3: ParsSNP identifies experimentally validated mutations in external data sets.**

**Figure 4: Comparison of candidate driver mutations in an independent data set identifies known and likely drivers that are only identified by ParsSNP.**

Acknowledgements

We thank O.L. Griffith for critically reading the manuscript. Our work was supported by the Alvin J. Siteman Cancer Center, the Ohana Breast Cancer Research Fund, the Foundation for the Barnes-Jewish Hospital (to R.B.), the National Library of Medicine of the National Institutes of Health (R01LM012222 to S.J.S.), and the Canadian Institutes of Health Research (DFS-134967 to R.D.K.).

Author information

Authors and Affiliations

Division of Oncology, Department of Medicine, Washington University School of Medicine, St. Louis, Missouri, USA
Runjun D Kumar & Ron Bose
Computational and Systems Biology Program, Washington University in St. Louis, St. Louis, Missouri, USA
Runjun D Kumar & S Joshua Swamidass
Medical Scientist Training Program, Washington University School of Medicine, St. Louis, Missouri, USA
Runjun D Kumar
Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, Missouri, USA
S Joshua Swamidass

Contributions

R.D.K. and S.J.S. designed the study. R.D.K. wrote software and performed the analysis. R.D.K., S.J.S. and R.B. wrote the manuscript. R.B. supervised the project.

Corresponding authors

Correspondence to S Joshua Swamidass or Ron Bose.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 ParsSNP convergence and reproducibility.

(a) The EM portion of ParsSNP consistently converges in 15–20 iterations. Lines are offset slightly to aid visualization. (b) The pan-cancer training set was partitioned randomly into two equally sized, independent halves. ParsSNP produces highly correlated scores when trained on independent but comparable data sets (n = 566,223).
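The reproducibility check in panel (b) amounts to correlating two score vectors computed for the same mutations. The following is a minimal sketch, assuming a Pearson coefficient (the caption does not say which coefficient was used) and using made-up scores rather than values from the study; the names `pearson`, `scores_half_a`, and `scores_half_b` are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Hypothetical scores for the same five mutations, one list per model,
# where each model was trained on a different half of the training set.
scores_half_a = [0.05, 0.10, 0.40, 0.75, 0.90]
scores_half_b = [0.07, 0.12, 0.35, 0.70, 0.95]
r = pearson(scores_half_a, scores_half_b)
```

A value of `r` close to 1 would indicate that the two independently trained models rank and scale the mutations similarly.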

Supplementary Figure 2 Comparison of default parameters and parameter variations during learning.

The results of various alternative parameter settings are plotted against the reference labels produced using default settings in the training data set (n = 566,223; see Supplementary Table 2 for summary statistics). Most alternative settings produce predictions that are highly correlated with the reference. ELOG1.1, the E step uses a logarithmic upper bound with a base of 1.1 (default = 2); ELOG10, the logarithm base is 10; ECONSTANT3, the E step uses a constant upper bound of 3 (the default upper bound scales logarithmically in base 2); ECONSTANT10, constant upper bound of 10; ECONSTANT20, constant upper bound of 20; EFLOOR0, the E-step lower bound is set to 0 (default = 1); EFLOOR5, the lower bound is set to 5; E3to10, the E step uses lower and upper bounds of 3 and 10 for all samples; ESTEP0.8, the E-step sliding bound is calculated as 80% of current belief (default = 90%); ESTEP0.95, the sliding bound is calculated as 95% of current belief; MCV2, the M step uses 2-fold cross-validation (default = 5); LOGISTIC, the M step uses logistic regression (default is a tuned neural network); NODES6, the M step uses a neural network with only 6 hidden nodes (default is tuned and can use more than 6 nodes); DECAY0.1, the M step uses a neural network with weight decay of 0.1 (default is tuned and can use less stringent decay); DECAY0.1; NODES6 (the two settings combined), the M step enforces use of a simpler neural network than default settings require.
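To make the E-step bound variants concrete, here is a rough sketch of how a per-sample lower and upper bound on the number of drivers could be computed. It rests on assumptions the caption does not confirm: that the logarithm is taken of the sample's mutation count, and that the upper bound is never allowed to fall below the lower bound. The function name `driver_count_bounds` and its parameters are hypothetical and do not come from the ParsSNP software.

```python
import math

def driver_count_bounds(n_mutations, log_base=2.0, floor=1.0, constant_upper=None):
    """Return (lower, upper) bounds on the number of drivers in one sample.

    Assumption: the logarithmic upper bound is log_base(n_mutations).
    """
    if n_mutations < 1:
        raise ValueError("a sample needs at least one mutation")
    if constant_upper is not None:
        upper = float(constant_upper)            # ECONSTANT-style variants
    else:
        upper = math.log(n_mutations, log_base)  # default and ELOG-style variants
    # Assumption: keep the interval non-empty by lifting the upper bound if needed.
    return float(floor), max(upper, float(floor))

default_bounds = driver_count_bounds(64)                        # base 2, floor 1
elog10_bounds = driver_count_bounds(1000, log_base=10.0)        # like ELOG10
econstant3_bounds = driver_count_bounds(64, constant_upper=3)   # like ECONSTANT3
efloor0_bounds = driver_count_bounds(64, floor=0.0)             # like EFLOOR0
```

Under these assumptions a sample with 64 mutations would be allowed between 1 and about 6 drivers by default, which illustrates why a logarithmic bound grows slowly even for heavily mutated samples.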

Supplementary Figure 3 Detecting recurrent missense mutations in the pan-cancer test set.

Control ROC curves, related to Supplementary Table 3, column 1. AUROCs are depicted.
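For readers unfamiliar with the AUROC values reported in Supplementary Figures 3–7, the sketch below computes the statistic directly from its definition: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The scores and labels are invented for illustration, and the quadratic loop is chosen for clarity rather than speed.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical scores; label 1 marks a recurrent mutation, 0 a non-recurrent one.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
area = auroc(scores, labels)
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.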

Supplementary Figure 4 Detecting non-recurrent mutations in CGC members in the pan-cancer test set.

Control ROC curves, related to Supplementary Table 3, column 2. AUROCs are depicted.

Supplementary Figure 5 Detecting driver mutations in the driver–dbSNP data set.

Control ROC curves, related to Supplementary Table 3, column 3. AUROCs are depicted.

Supplementary Figure 6 Detecting disruptive mutations in the IARC p53 data set.

Control ROC curves, related to Supplementary Table 3, column 4. AUROCs are depicted.

Supplementary Figure 7 Detecting functional mutations in the functional–neutral data set.

Control ROC curves, related to Supplementary Table 3, column 5. AUROCs are depicted.

Supplementary Figure 8 Box plots of ParsSNP score by mutation and gene type.

(a) Truncation rate is a gene-level descriptor that assigns low P values to genes enriched in truncations (TSG-like) and assigns high P values to genes that are depleted in truncations (oncogene-like). ‘Truncation’ events include frameshift, premature stop, and non-stop changes. ‘Missense’ mutations include missense substitutions as well as in-frame indels. ‘Silent’ changes include synonymous nucleotide substitutions as well as noncoding variants. Truncations receive higher median scores in TSG-like genes, while missense mutations receive higher scores in both TSG-like and oncogene-like genes. This represents a potential nonlinear two-way interaction between ParsSNP descriptors (truncation rate and mutation type). Boxes enclose the interquartile range. (b) ParsLR uses logistic regression rather than a neural network model and does not exhibit the same properties as the full ParsSNP model.

Supplementary Figure 9 Identification of putative driver genes and mutations.

Genes are plotted by the average ParsSNP score of their mutations and their single highest score in the entire pan-cancer data set (training + test + hypermutator). The top ParsSNP scoring mutations are generally found in members of the CGC. Two genes not belonging to the CGC have multiple exceptional mutations (arrows): TATA-box-binding protein (TBP) and the calcium-activated potassium channel KCNN3. Both have significantly higher median ParsSNP scores than expected by chance (Bonferroni-corrected one-sample Wilcoxon P < 0.05) and multiple mutations with exceptionally high ParsSNP scores, including TBP A191T (ParsSNP = 0.75) and R168Q (0.67), as well as KCNN3 R435C (0.60), L413Q (0.59), and S517Y (0.53).
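The Bonferroni correction mentioned above multiplies each raw P value by the number of tests performed and caps the result at 1. The sketch below shows only that correction step, on invented gene names and P values; the Wilcoxon test itself is assumed to have been run elsewhere.

```python
def bonferroni(p_values):
    """Multiply each P value by the number of tests, capping at 1."""
    m = len(p_values)
    return {gene: min(1.0, p * m) for gene, p in p_values.items()}

# Hypothetical raw P values for three genes (not taken from the paper).
raw = {"GENE_A": 0.0001, "GENE_B": 0.02, "GENE_C": 0.6}
adjusted = bonferroni(raw)

# Genes that remain significant at the 0.05 level after correction.
significant = sorted(g for g, p in adjusted.items() if p < 0.05)
```

In this toy example only `GENE_A` survives, because a raw P value of 0.02 becomes 0.06 once three tests are accounted for.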

Supplementary Figure 10 Differential functionality between hypermutated and non-hypermutated samples.

(a) A one-sample Wilcoxon test was performed on each gene in both the hypermutated and non-hypermutated (training + test) portions of the data set using internal null distributions. The –log₁₀ P values of these tests are shown. As expected, many well-known cancer genes were more easily detected in the non-hypermutators. No genes were observed with elevated ParsSNP scores exclusively in the hypermutators. (b) A two-sample Wilcoxon test was performed for each gene, comparing the ParsSNP scores assigned to it in the hypermutated and non-hypermutated segments. Genes are plotted by the magnitude of median shift (negative values indicate lower scores in the hypermutated samples) and the –log₁₀ P value. This analysis indicates that mutations in RNF43 and UPF3A have modestly but significantly elevated scores when observed in hypermutators. This suggests that these genes may be involved in the unique biology of these tumors.
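The two axes of panel (b) can be illustrated with a small helper that turns one gene's scores and test result into a plot coordinate. This is only a sketch: it uses the difference of medians as a simple stand-in for the median shift (the caption does not say how the shift was estimated), and it takes the P value as given rather than computing the Wilcoxon test. The name `volcano_point` and the example values are hypothetical.

```python
import math
from statistics import median

def volcano_point(hyper_scores, non_hyper_scores, p_value):
    """Return (median shift, -log10 P) for one gene.

    Negative shifts mean lower scores in the hypermutated samples.
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    shift = median(hyper_scores) - median(non_hyper_scores)
    return shift, -math.log10(p_value)

# Hypothetical scores for one gene in each segment of the data set.
shift, neg_log_p = volcano_point([0.2, 0.4, 0.6], [0.1, 0.2, 0.3], 0.001)
```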

Supplementary Figure 11 ParsSNP performance and data set size.

ParsSNP models were trained on progressively smaller subsets of the pan-cancer training data (n = 566,223), and performance (AUROC) was assessed for each classification task. Points represent average performance from five replicates.
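The subsampling experiment can be pictured as a simple loop over training-set fractions with repeated random draws. The sketch below is a generic harness under stated assumptions: `evaluate` is a hypothetical callable that would train a model on the given subset and return its AUROC, and the toy `evaluate` used here merely returns the subset's relative size so the example runs without any model.

```python
import random
from statistics import mean

def learning_curve(data, fractions, evaluate, replicates=5, seed=0):
    """Average evaluate(subset) over random subsets for each fraction."""
    rng = random.Random(seed)
    curve = {}
    for frac in fractions:
        size = max(1, int(len(data) * frac))
        # Each replicate draws a fresh random subset without replacement.
        curve[frac] = mean(evaluate(rng.sample(data, size)) for _ in range(replicates))
    return curve

# Toy stand-ins: ten "mutations" and an evaluate function that ignores content.
data = list(range(10))
curve = learning_curve(data, [0.2, 0.5, 1.0], lambda subset: len(subset) / len(data))
```

With a real training routine in place of the toy function, each entry of `curve` would correspond to one point in the figure, averaged over five replicates.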

Supplementary Figure 12 Criteria for thresholding ParsSNP scores.

(a) The E-step constraints are one possible objective criterion for thresholding ParsSNP scores. The value to be optimized is the percentage of samples receiving a number of driver mutations that is compatible with the E-step upper and lower bounds under the proposed threshold. (b) Another approach is to select a threshold that optimizes accuracy (correct classification rate) in the classification tasks.
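Criterion (a) can be sketched as a search over candidate thresholds, scoring each by the fraction of samples whose number of above-threshold mutations falls within that sample's bounds. The data layout, function names, scores, and bounds below are all assumptions made for illustration and are not taken from the ParsSNP implementation.

```python
def fraction_compatible(samples, threshold):
    """Fraction of samples whose driver count lies within [lower, upper].

    samples: list of (scores, lower, upper) tuples, one per tumor sample.
    """
    ok = 0
    for scores, lower, upper in samples:
        n_drivers = sum(1 for s in scores if s >= threshold)
        if lower <= n_drivers <= upper:
            ok += 1
    return ok / len(samples)

def best_threshold(samples, candidates):
    """Pick the candidate threshold with the highest compatible fraction."""
    return max(candidates, key=lambda t: fraction_compatible(samples, t))

# Hypothetical samples: per-mutation scores plus bounds of 1 to 2 drivers.
samples = [
    ([0.9, 0.2, 0.1, 0.05], 1, 2),
    ([0.6, 0.55, 0.3, 0.1], 1, 2),
    ([0.55, 0.1, 0.05, 0.02], 1, 2),
]
candidates = [0.1, 0.3, 0.5, 0.7]
chosen = best_threshold(samples, candidates)
```

In this toy data a threshold of 0.5 leaves every sample with one or two called drivers, whereas lower thresholds call too many drivers in some samples and higher ones call none in others.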

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–12. (PDF 1738 kb)

Supplementary Tables 1–7

Supplementary Tables 1–7. (XLSX 5336 kb)

About this article

Cite this article

Kumar, R., Swamidass, S. & Bose, R. Unsupervised detection of cancer driver mutations with parsimony-guided learning. Nat Genet 48, 1288–1294 (2016). https://doi.org/10.1038/ng.3658

Received: 06 June 2016
Accepted: 05 August 2016
Published: 12 September 2016
Issue Date: October 2016
DOI: https://doi.org/10.1038/ng.3658
