Identification of cancer driver genes based on nucleotide context

Dietlein, Felix; Weghorn, Donate; Taylor-Weiner, Amaro; Richters, André; Reardon, Brendan; Liu, David; Lander, Eric S.; Van Allen, Eliezer M.; Sunyaev, Shamil R.

doi:10.1038/s41588-019-0572-y

Article
Published: 03 February 2020

Nature Genetics volume 52, pages 208–218 (2020)

Abstract

Cancer genomes contain large numbers of somatic mutations but few of these mutations drive tumor development. Current approaches either identify driver genes on the basis of mutational recurrence or approximate the functional consequences of nonsynonymous mutations by using bioinformatic scores. Passenger mutations are enriched in characteristic nucleotide contexts, whereas driver mutations occur in functional positions, which are not necessarily surrounded by a particular nucleotide context. We observed that mutations in contexts that deviate from the characteristic contexts around passenger mutations provide a signal in favor of driver genes. We therefore developed a method that combines this feature with the signals traditionally used for driver-gene identification. We applied our method to whole-exome sequencing data from 11,873 tumor–normal pairs and identified 460 driver genes that clustered into 21 cancer-related pathways. Our study provides a resource of driver genes across 28 tumor types with additional driver genes identified according to mutations in unusual nucleotide contexts.

**Fig. 1: Dependency of mutations on extended nucleotide contexts.**

**Fig. 2: Mutations in unusual contexts provide a signal in favor of driver genes.**

**Fig. 3: Comparison of different methods to identify driver genes.**

**Fig. 4: A catalog of driver genes in human cancer.**

**Fig. 5: Stratification of driver genes based on literature support.**

**Fig. 6: Characterization of driver genes based on physical interactions.**

Data availability

A complete MAF of the sequencing data used in this study is available on www.cancer-genes.org and in the Supplementary Information.

Code availability

MutPanning can be downloaded as an interactive software package from www.cancer-genes.org and from the Supplementary Information (including Supplementary Data 1–4). MutPanning can be run on a local computer with at least one CPU, 8 GB of memory and 2.5 GB of hard drive space. In addition, an online version of MutPanning is available through the GenePattern platform (http://www.genepattern.org/modules/docs/MutPanning and http://bit.ly/mutpanning-gp). The MutPanning source code is available on GitHub (https://github.com/vanallenlab/MutPanningV2). MutPanning is distributed under the BSD-3-Clause open source license.



Acknowledgements

We thank G. Getz and C. Cotsapas for their valuable comments and suggestions. We thank M. Reich and T. Liefeld for adding MutPanning as a module to the GenePattern platform. The results presented in this study are in part based on data generated by the TCGA Research Network: https://www.cancer.gov/tcga. F.D. was supported by the EMBO Long-Term Fellowship Program (grant no. ALTF 502-2016), the Claudia Adams Barr Program for Innovative Cancer Research and the AWS Cloud Credits for Research Program. E.M.V.A. and S.R.S. received funding from the National Institutes of Health (grant nos. K08 CA188615, R01 CA227388 and R21 CA242861 to E.M.V.A. and grant nos. R01 MH101244, R35 GM127131 and U01 HG009088 to S.R.S.). E.M.V.A. acknowledges support through the Phillip A. Sharp Innovation in Collaboration Award. F.D. and E.M.V.A. were further supported through the ASPIRE Award of The Mark Foundation for Cancer Research.

Author information

These authors contributed equally: Felix Dietlein, Donate Weghorn.
These authors jointly supervised this work: Eliezer M. Van Allen, Shamil R. Sunyaev.

Authors and Affiliations

Department of Medical Oncology, Dana–Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
Felix Dietlein, Amaro Taylor-Weiner, Brendan Reardon, David Liu & Eliezer M. Van Allen
Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA, USA
Felix Dietlein, Amaro Taylor-Weiner, André Richters, Brendan Reardon, David Liu, Eric S. Lander & Eliezer M. Van Allen
Division of Genetics, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
Donate Weghorn & Shamil R. Sunyaev
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Donate Weghorn & Shamil R. Sunyaev
Centre for Genomic Regulation, Barcelona, Spain
Donate Weghorn
Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA
André Richters

Contributions

F.D., D.W., A.R., E.S.L., E.M.V.A. and S.R.S. wrote the manuscript and prepared the figures, which all authors reviewed. F.D., D.W., B.R., D.L., E.M.V.A. and S.R.S. designed and performed the bioinformatics analyses for driver-gene identification, and designed and performed the bioinformatics analyses for method comparison and stratification of the driver-gene catalog. F.D., D.W., A.T.-W., A.R., B.R., D.L., E.S.L., E.M.V.A. and S.R.S. performed a review of the findings and biological follow-up analyses. F.D., D.W., A.T.-W., B.R., D.L., E.S.L., E.M.V.A. and S.R.S. contributed to the development of the method and its implementation.

Corresponding authors

Correspondence to Felix Dietlein, Eliezer M. Van Allen or Shamil R. Sunyaev.

Ethics declarations

Competing interests

E.M.V.A. is a consultant for Tango Therapeutics, Genome Medical, Invitae, Foresite Capital, Dynamo and Illumina. E.M.V.A. received research support from Novartis and BMS as well as travel support from Roche and Genentech. E.M.V.A. is an equity holder of Syapse, Tango Therapeutics and Genome Medical.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Modeling of mutation probabilities based on extended nucleotide contexts.

a, We applied the composite likelihood model to COSMIC mutation signatures. For each trinucleotide context, we compared the original mutation frequency against the mutation frequency returned by the composite likelihood model based on Pearson correlation. Dot colors reflect base substitution types. b, For six base substitution types, we plotted the original mutation probability (based on 11,873 samples) against the prediction of the composite likelihood model, which we derived as the product of the mutational likelihood of its reference nucleotide and its substitution type. Each dot represents a cancer type. Pearson correlations are annotated at the bottom right. The number of samples per cancer type can be found in Extended Data Fig. 5. c, For three cancer types (bladder, n = 317 samples; endometrium, n = 327; skin, n = 582) we examined whether nucleotides outside the trinucleotide context affected mutation probabilities. For this purpose, we compared mutation probabilities, modeled based on tri- (blue) and 7-nucleotide contexts (yellow), with original mutation probabilities based on context-specific mutation counts. Data points are sorted according to the modeled mutation rates, derived from the 7-nucleotide context (x-axis). Black circles indicate ratios between the observed probabilities and the corresponding trinucleotide-specific likelihoods (y-axis). Similarly, the orange line displays the ratio between the likelihoods, derived from the 7-nucleotide and trinucleotide contexts, respectively (y-axis). Local mutation probabilities vary across positions surrounded by the same trinucleotide context. Accounting for extended nucleotide contexts reduces this heterogeneity.
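The factorization described in this caption can be illustrated with a minimal sketch. It assumes the likelihood of a mutation is the product of one factor for the substitution type and one factor per flanking position. The function name, the table layout and every number below are hypothetical placeholders chosen for illustration; they are not parameters or code from MutPanning.

```python
from math import prod

def composite_likelihood(substitution, context, type_weight, position_weight):
    """Multiply a substitution-type factor by one factor per flanking position.

    `context` maps a relative offset (for example -1 or +1) to the
    nucleotide observed at that offset.
    """
    factors = [type_weight[substitution]]
    for offset, base in context.items():
        factors.append(position_weight[(substitution, offset, base)])
    return prod(factors)

# Hypothetical toy parameters for a single substitution type.
type_weight = {"C>T": 0.4}
position_weight = {
    ("C>T", -1, "T"): 2.0,  # enriched 5' neighbor (made-up value)
    ("C>T", +1, "G"): 1.5,  # enriched 3' neighbor (made-up value)
    ("C>T", -1, "A"): 0.5,  # depleted 5' neighbor (made-up value)
}

usual = composite_likelihood("C>T", {-1: "T", +1: "G"}, type_weight, position_weight)
unusual = composite_likelihood("C>T", {-1: "A", +1: "G"}, type_weight, position_weight)
print(usual, unusual)
```

Under these toy numbers the second context receives a lower likelihood than the first, which is the sense in which a context can be called unusual for a given mutational process.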

Extended Data Fig. 2 Evaluation of the composite likelihood model applied to extended nucleotide contexts.

To test the independence assumption of the composite likelihood model, we examined the interaction between any two positions (25 possible combinations) in the 11-nucleotide context around mutations of eight cancer types (bladder, n = 317 samples; breast, n = 1,443; colorectal, n = 223; endometrium, n = 327; gastroesophageal, n = 833; head and neck, n = 425; lung adeno, n = 446; skin, n = 582). For any two positions, there are 96 possible nucleotide contexts and we plotted the observed mutation count of each nucleotide context (x-axis) against the predictions of the composite likelihood model (y-axis). Pearson correlation coefficients between observed and predicted data served as a measure of interaction. Each position pair is visualized in a separate correlation plot, and positions are annotated at the bottom right of the plot. For instance, pair (-1,1) refers to the trinucleotide context. Dot colors indicate the base substitution types.
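The count of 96 contexts per position pair follows from six substitution types combined with four possible nucleotides at each of the two positions (6 × 4 × 4). The sketch below enumerates them and computes a Pearson correlation between observed and predicted counts. The substitution labels use the common pyrimidine-based convention, which is an assumption about labeling, and the counts are made up.

```python
from itertools import product
from math import sqrt

SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
BASES = "ACGT"

def pair_contexts():
    """All (substitution, base at first position, base at second position) tuples."""
    return list(product(SUBSTITUTIONS, BASES, BASES))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

contexts = pair_contexts()
# Made-up counts: predictions exactly proportional to the observations.
observed = [float(i + 1) for i in range(len(contexts))]
predicted = [2.0 * value for value in observed]
r = pearson(observed, predicted)
print(len(contexts), r)
```

A correlation near 1 for a position pair would indicate that the product of independent per-position factors reproduces the joint counts well for that pair.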

Extended Data Fig. 3 Generalization of the composite likelihood model to extended nucleotide contexts.

We counted the number of mutations in each possible nucleotide context of length ≤7 based on the sequencing data of 11,873 samples. The exact number of samples per cancer type included in this analysis is shown in Extended Data Fig. 5. We compared these counts with the mutability scores returned by the composite likelihood model (218,448 different nucleotide contexts). Since the number of possible nucleotide contexts was too large to be visualized directly, we plotted the data point density. The Pearson correlation coefficient (R) of each plot is annotated at the bottom right.

Extended Data Fig. 4 Extended nucleotide contexts contribute to the performance of the composite likelihood model.

We examined whether accounting for extended contexts beyond trinucleotide contexts improved the fit of the composite likelihood model. To this end, we varied the number of nucleotides in the composite likelihood model between 0 (that is, only substitution types) and 6 (that is, 7-nucleotide contexts). We computed the residual sum of squared differences between observed mutation counts and the predictions of the composite likelihood model. As a negative control, we determined the residual sum of squares for a uniform distribution. This baseline was used to normalize the residual sum of squares for each cancer type. For some cancer types with ‘flat’ mutation signatures, nucleotide contexts had only a minor impact on the fit of the model, but did not decrease the performance of the model (for example, lung adeno., n = 446 samples). For other cancer types, the fit of the model largely depended on the trinucleotide context, but not on the extended nucleotide context (for example, prostate cancer, n = 880). For most cancer types with high background mutation rates, the fit of the composite likelihood model strongly depended on the extended nucleotide context (for example, bladder, n = 317; breast, n = 1,443; cervical, n = 192; colorectal, n = 223; endometrial cancer, n = 327; melanoma, n = 582).
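The normalization described here can be sketched as a ratio of two residual sums of squares. The sketch assumes that the uniform baseline assigns the same expected count to every context (the mean of the observed counts); the caption does not spell this out, so treat that choice, the function names and the toy counts as illustrative assumptions.

```python
def residual_sum_of_squares(observed, predicted):
    """Sum of squared differences between observed and predicted counts."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))

def normalized_rss(observed, predicted):
    """RSS of a model divided by the RSS of a uniform baseline.

    Assumption: the uniform baseline predicts the mean observed count
    for every context.
    """
    mean_count = sum(observed) / len(observed)
    uniform = [mean_count] * len(observed)
    baseline = residual_sum_of_squares(observed, uniform)
    return residual_sum_of_squares(observed, predicted) / baseline

# Made-up counts for four contexts.
observed = [10, 40, 10, 40]
flat_model = [25, 25, 25, 25]       # ignores context entirely
context_model = [12, 38, 11, 39]    # tracks the context dependence

flat_score = normalized_rss(observed, flat_model)
context_score = normalized_rss(observed, context_model)
print(flat_score, context_score)
```

With this convention a value of 1 means the model fits no better than the uniform baseline, and values closer to 0 mean a better fit.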

Extended Data Fig. 5 A large-scale cohort of whole-exome sequencing data to identify rare cancer genes.

To systematically identify candidate cancer genes, we analyzed sequencing data from 11,873 individual tumor samples using the statistical framework that we had developed in this study. Our study cohort contained whole-exome sequencing data from 32 TCGA-related (orange) and 55 TCGA-independent (blue) projects.

Extended Data Fig. 6 Benchmarking of the performance of MutPanning for cancer gene identification.

We benchmarked the performance of our method against 7 previously published methods for cancer gene identification based on the sequencing data of 11,873 samples spanning 28 different cancer types. The exact number of samples per cancer type can be found in Extended Data Fig. 5. To benchmark the performance of a method, we sorted genes according to the significance values (adjusted for multiple testing) returned by the method. As a conservative approximation of the true-positive rate we used Cancer Gene Census (CGC) genes (a, b, c) and OncoKB genes (d, e, f) to derive ROC and precision-recall curves. We quantified the performance of each method as the area under the ROC curve (AUC) for the top 150 (a, d) or 1000 (b, e) non-CGC/OncoKB genes, respectively. Further, we determined the precision at 5% recall for each method (c, f). We normalized these measures to the maximum within each cancer type.
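One possible reading of these performance measures is sketched below: genes are ranked by significance, a reference gene set stands in for true positives, and the ROC area is accumulated only until a fixed number of non-reference genes has been seen. The function names, the rescaling of the truncated area to the range 0 to 1, and the toy gene labels are assumptions for illustration; the authors' exact computation may differ.

```python
def precision_at_recall(ranked_genes, reference, recall_level):
    """Precision at the first rank where recall reaches `recall_level`."""
    hits = 0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in reference:
            hits += 1
            if hits / len(reference) >= recall_level:
                return hits / rank
    return 0.0  # the requested recall is never reached

def truncated_auc(ranked_genes, reference, max_false_positives):
    """ROC area up to `max_false_positives` non-reference genes, rescaled to [0, 1]."""
    hits = 0
    false_positives = 0
    area = 0
    for gene in ranked_genes:
        if gene in reference:
            hits += 1
        else:
            false_positives += 1
            area += hits  # one unit step along the false-positive axis
            if false_positives == max_false_positives:
                break
    return area / (len(reference) * max_false_positives)

# Toy ranking, most significant first; labels are placeholders, not real genes.
ranked = ["known_1", "known_2", "other_1", "known_3", "other_2", "other_3"]
reference = {"known_1", "known_2", "known_3"}

auc_top2 = truncated_auc(ranked, reference, max_false_positives=2)
prec_half = precision_at_recall(ranked, reference, recall_level=0.5)
prec_full = precision_at_recall(ranked, reference, recall_level=1.0)
print(auc_top2, prec_half, prec_full)
```

Truncating at a fixed number of non-reference genes focuses the comparison on the top of each ranking, where candidate driver genes would be reported.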

Extended Data Fig. 7 Comparison of different methods for cancer-gene identification.

We benchmarked the performance of our method against 7 previously published methods for cancer gene identification based on the sequencing data of 11,873 samples spanning 28 different cancer types. To benchmark the performance of a method, we sorted genes according to the significance values (adjusted for multiple testing) returned by the method. As a conservative approximation of the true-positive rate, we used Cancer Gene Census (CGC) genes (a, c, e) and OncoKB genes (b, d, f) to derive ROC and precision-recall curves. We quantified the performance of each method as the area under the ROC curve (AUC) for the top 150 (a, b) or 1,000 (c, d) non-CGC/OncoKB genes, respectively. Further, we determined the precision at 5% recall for each method (e, f). Box plots indicate the distribution of these performance measures for each method across cancer types. Each cancer type is represented by a dot. Boxes indicate the 25%/75% interquartile range; whiskers extend to the 5%/95% quantile range. The median of each distribution is indicated as a vertical line.

Extended Data Fig. 8 Comparison of performance measures derived from CGC versus OncoKB.

We benchmarked the performance of our method against 7 previously published methods for cancer gene identification based on the sequencing data of 11,873 samples spanning 28 different cancer types. To benchmark the performance of a method, we sorted genes according to the significance values (adjusted for multiple testing) returned by the method. As a conservative approximation of the true-positive rate, we used Cancer Gene Census (CGC) genes and OncoKB genes to derive ROC and precision-recall curves. We quantified the performance of each method as the area under the ROC curve (AUC) for the top 150 (a) or 1,000 (b) non-CGC/OncoKB genes, respectively. Further, we determined the precision at 5% recall for each method (c). This figure compares the performance measures derived from the CGC (x-axis) and OncoKB (y-axis) databases. Each dot represents the AUC/precision of a different method (dot color) for an individual cancer type. The concordance between CGC and OncoKB measures suggests that our measure of performance does not entirely depend on the dataset used to approximate the true-positive rate.

Extended Data Fig. 9 Comparison of methods in two homogeneously processed datasets.

We compared the performance of MutPanning with 7 other methods on two independently processed datasets (TCGA subcohort (a-c, g-i), n = 7,060 samples; MC3 dataset (d-f, j-l), n = 9,079). We used the Cancer Gene Census (CGC) (a-f) and OncoKB (g-l) for benchmarking. We quantified the performance by the AUC of the ROC curve of the top 1,000 non-CGC/OncoKB genes returned by each method. a, d, g, j, Box plots indicate the distribution of performance measures for each method. Boxes indicate the 25%/75% interquartile range; whiskers extend to the 5%/95% quantile range. Distribution medians are indicated as vertical lines. Each dot represents an AUC for one of the 27 cancer types in the TCGA and MC3 datasets. b, e, h, k, We normalized AUCs by the maximum AUC within each tumor type. We then compared these normalized AUCs between methods across cancer types. c, f, i, l, We compared the AUCs obtained from our original study cohort with the AUCs from TCGA and MC3 based on Pearson correlation. Each dot reflects a cancer type/method. Cohort sizes for TCGA/MC3 datasets: bladder: 130/386; blood: 197/139; brain: 576/821; breast: 975/779; cervix: 192/274; cholangio: 35/34; colorectal: 223/316; endometrium: 305/451; gastroesophageal: 467/529; head&neck: 279/502; kidney clear: 417/368; kidney non-clear: 227/340; liver: 194/354; lung adenocarcinoma: 230/431; lung squamous: 173/464; lymph: 48/37; ovarian: 316/408; pancreas: 149/155; pheochromocytoma: 179/179; pleura: 82/81; prostate: 323/477; sarcoma: 247/204; skin: 342/422; testicular: 149/145; thymus: 123/121; thyroid: 402/492; uveal melanoma: 80/80.
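
The per-tumor-type normalization and the Pearson comparison between cohorts can be sketched as follows. This is an illustrative example only: the data layout (a nested dictionary keyed by cancer type and method), the name `method_B`, and all AUC values are made up for demonstration and are not taken from the study.

```python
import math


def normalize_by_cancer_type(auc):
    """Divide each method's AUC by the best AUC within the same cancer type.

    auc: {cancer_type: {method: auc_value}} (hypothetical layout).
    """
    normalized = {}
    for cancer_type, by_method in auc.items():
        best = max(by_method.values())
        normalized[cancer_type] = {m: v / best for m, v in by_method.items()}
    return normalized


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up AUCs for two methods in two cancer types.
auc = {
    "bladder": {"MutPanning": 0.8, "method_B": 0.4},
    "breast": {"MutPanning": 0.6, "method_B": 0.9},
}
normalized = normalize_by_cancer_type(auc)

# Made-up AUCs for the same cancer type/method pairs in two cohorts.
original_cohort = [0.80, 0.40, 0.60, 0.90]
replication_cohort = [0.78, 0.45, 0.55, 0.88]
r = pearson(original_cohort, replication_cohort)
```

Dividing by the per-type maximum places the best method at 1 in every cancer type, which makes methods comparable across tumor types with very different absolute AUCs.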

Extended Data Fig. 10 Recurrent mutations in domains of protein–DNA interaction.

Significance values in this figure legend were computed using MutPanning and adjusted for multiple testing (false discovery rate, FDR). Recurrent SOX17 mutations in endometrial cancer (n = 327 samples, FDR = 8.77 × 10⁻³) are located in the high-mobility-group box domain at the SOX17–DNA interface (PDB: 4A3N superposed with 3F27). POLR2A harbors recurrent mutations in lung adenocarcinoma (n = 446, FDR = 9.28 × 10⁻⁶) at the end of an alpha-helical segment that points directly at the major groove of the double-stranded DNA (PDB: 5IYB). The visualization shows the open complex of a cryo-EM multicomponent structure, in which the melted single-stranded template DNA is inserted into the active site and RNA polymerase II locates the transcription start site. CEBPA harbors recurrent mutations in hematological malignancies (n = 1,018, FDR = 1.16 × 10⁻⁷) at the cross-over interface of the two CEBPA homodimers (PDB: 1NWQ). GATA3 (PDB: 4HCA) harbors recurrent mutations in breast cancer (n = 1,443, FDR < 10⁻²⁰) at Asn334, which is located in the GATA-type 2 zinc finger (res317–res341), as well as at Met294, which is located peripheral to the GATA-type 1 zinc finger domain (res263–res287). RUNX1 harbors recurrent mutations in breast cancer (n = 1,443, FDR = 2.22 × 10⁻⁴) and hematological malignancies (n = 1,018, FDR = 1.94 × 10⁻⁵). Arg174 plays an important role in DNA recognition and facilitates the formation of hydrogen bond interactions with a guanosine base from the consensus DNA binding sequence of RUNX1 (PDB: 1H9D).


Dietlein, F., Weghorn, D., Taylor-Weiner, A. et al. Identification of cancer driver genes based on nucleotide context. Nat Genet 52, 208–218 (2020). https://doi.org/10.1038/s41588-019-0572-y

Received: 17 May 2019
Accepted: 16 December 2019
Published: 03 February 2020
Issue Date: February 2020
DOI: https://doi.org/10.1038/s41588-019-0572-y
