Identification of cancer driver genes based on nucleotide context

Abstract

Cancer genomes contain large numbers of somatic mutations but few of these mutations drive tumor development. Current approaches either identify driver genes on the basis of mutational recurrence or approximate the functional consequences of nonsynonymous mutations by using bioinformatic scores. Passenger mutations are enriched in characteristic nucleotide contexts, whereas driver mutations occur in functional positions, which are not necessarily surrounded by a particular nucleotide context. We observed that mutations in contexts that deviate from the characteristic contexts around passenger mutations provide a signal in favor of driver genes. We therefore developed a method that combines this feature with the signals traditionally used for driver-gene identification. We applied our method to whole-exome sequencing data from 11,873 tumor–normal pairs and identified 460 driver genes that clustered into 21 cancer-related pathways. Our study provides a resource of driver genes across 28 tumor types with additional driver genes identified according to mutations in unusual nucleotide contexts.


Fig. 1: Dependency of mutations on extended nucleotide contexts.
Fig. 2: Mutations in unusual contexts provide a signal in favor of driver genes.
Fig. 3: Comparison of different methods to identify driver genes.
Fig. 4: A catalog of driver genes in human cancer.
Fig. 5: Stratification of driver genes based on literature support.
Fig. 6: Characterization of driver genes based on physical interactions.

Data availability

A complete MAF of the sequencing data used in this study is available on www.cancer-genes.org and in the Supplementary Information.

Code availability

MutPanning can be downloaded as an interactive software package from www.cancer-genes.org and from the Supplementary Information (including Supplementary Data 14). MutPanning can be run on a local computer with at least one CPU, 8 GB of memory and 2.5 GB of hard-drive space. In addition, an online version of MutPanning is available through the GenePattern platform (http://www.genepattern.org/modules/docs/MutPanning and http://bit.ly/mutpanning-gp). The MutPanning source code is available on GitHub (https://github.com/vanallenlab/MutPanningV2). MutPanning is distributed under the BSD-3-Clause open source license.


Acknowledgements

We thank G. Getz and C. Cotsapas for their valuable comments and suggestions. We thank M. Reich and T. Liefeld for adding MutPanning as a module to the GenePattern platform. The results presented in this study are in part based on data generated by the TCGA Research Network: https://www.cancer.gov/tcga. F.D. was supported by the EMBO Long-Term Fellowship Program (grant no. ALTF 502-2016), the Claudia Adams Barr Program for Innovative Cancer Research and the AWS Cloud Credits for Research Program. E.M.V.A. and S.R.S. received funding from the National Institutes of Health (grant nos. K08 CA188615, R01 CA227388 and R21 CA242861 to E.M.V.A. and grant nos. R01 MH101244, R35 GM127131 and U01 HG009088 to S.R.S.). E.M.V.A. acknowledges support through the Phillip A. Sharp Innovation in Collaboration Award. F.D. and E.M.V.A. were further supported through the ASPIRE Award of The Mark Foundation for Cancer Research.

Author information

F.D., D.W., A.R., E.S.L., E.M.V.A. and S.R.S. wrote the manuscript and prepared the figures, which all authors reviewed. F.D., D.W., B.R., D.L., E.M.V.A. and S.R.S. designed and performed the bioinformatics analyses for driver-gene identification, and designed and performed the bioinformatics analyses for method comparison and stratification of the driver-gene catalog. F.D., D.W., A.T.-W., A.R., B.R., D.L., E.S.L., E.M.V.A. and S.R.S. performed a review of the findings and biological follow-up analyses. F.D., D.W., A.T.-W., B.R., D.L., E.S.L., E.M.V.A. and S.R.S. contributed to the development of the method and its implementation.

Correspondence to Felix Dietlein or Eliezer M. Van Allen or Shamil R. Sunyaev.

Ethics declarations

Competing interests

E.M.V.A. is a consultant for Tango Therapeutics, Genome Medical, Invitae, Foresite Capital, Dynamo and Illumina. E.M.V.A. received research support from Novartis and BMS as well as travel support from Roche and Genentech. E.M.V.A. is an equity holder of Syapse, Tango Therapeutics and Genome Medical.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Modeling of mutation probabilities based on extended nucleotide contexts.

a, We applied the composite likelihood model to COSMIC mutation signatures. For each trinucleotide context, we compared the original mutation frequency against the mutation frequency returned by the composite likelihood model based on Pearson correlation. Dot colors reflect base substitution types. b, For six base substitution types, we plotted the original mutation probability (based on 11,873 samples) against the prediction of the composite likelihood model, which we derived as the product of the mutational likelihood of its reference nucleotide and its substitution type. Each dot represents a cancer type. Pearson correlations are annotated at the bottom right. The number of samples per cancer type can be found in Extended Data Fig. 5. c, For three cancer types (bladder, n = 317 samples; endometrium, n = 327; skin, n = 582) we examined whether nucleotides outside the trinucleotide context affected mutation probabilities. For this purpose, we compared mutation probabilities, modeled based on tri- (blue) and 7-nucleotide contexts (yellow), with original mutation probabilities based on context-specific mutation counts. Data points are sorted according to the modeled mutation rates, derived from the 7-nucleotide context (x-axis). Black circles indicate ratios between the observed probabilities and the corresponding trinucleotide-specific likelihoods (y-axis). Similarly, the orange line displays the ratio between the likelihoods derived from the 7-nucleotide and trinucleotide contexts, respectively (y-axis). Local mutation probabilities vary across positions surrounded by the same trinucleotide context. Accounting for extended nucleotide contexts reduces this heterogeneity.
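The product structure described in this legend can be illustrated with a minimal Python sketch. It is not the MutPanning implementation: the function name, the weight tables and all numeric values are hypothetical, and it assumes only that each flanking position and the substitution type contribute independent multiplicative factors.

```python
from math import prod


def composite_likelihood(context, alt, position_weights, substitution_weights):
    """Multiply independent per-position factors with a substitution-type factor.

    context: odd-length string centred on the mutated base, e.g. "TCA".
    alt: the alternate (mutated) base, e.g. "T".
    position_weights: {offset: {nucleotide: weight}} for flanking offsets
        (hypothetical values, not estimated from real data).
    substitution_weights: {(ref, alt): weight} (hypothetical values).
    """
    if len(context) % 2 == 0:
        raise ValueError("context must have odd length")
    flank = len(context) // 2
    ref = context[flank]
    # One factor per flanking position; the central base enters only
    # through the substitution-type factor.
    flank_factor = prod(
        position_weights[offset][context[flank + offset]]
        for offset in range(-flank, flank + 1)
        if offset != 0
    )
    return substitution_weights[(ref, alt)] * flank_factor


# Toy weights, invented for illustration only.
position_weights = {
    -1: {"A": 0.5, "C": 0.8, "G": 0.7, "T": 2.0},
    1: {"A": 1.5, "C": 0.5, "G": 1.0, "T": 1.0},
}
substitution_weights = {("C", "T"): 0.4, ("C", "G"): 0.1}

score_tca = composite_likelihood("TCA", "T", position_weights, substitution_weights)
score_gcc = composite_likelihood("GCC", "T", position_weights, substitution_weights)
```

Extending the sketch to 7-nucleotide contexts only requires adding weight tables for offsets -3, -2, 2 and 3; the number of parameters grows linearly with context length rather than exponentially.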

Extended Data Fig. 2 Evaluation of the composite likelihood model applied to extended nucleotide contexts.

To test the independence assumption of the composite likelihood model, we examined the interaction between any two positions (25 possible combinations) in the 11-nucleotide context around mutations of eight cancer types (bladder, n = 317 samples; breast, n = 1443; colorectal, n = 223; endometrium, n = 327; gastroesophageal, n = 833; head and neck, n = 425; lung adeno, n = 446; skin, n = 582). For any two positions, there are 96 possible nucleotide contexts and we plotted the observed mutation count of each nucleotide context (x-axis) against the predictions of the composite likelihood model (y-axis). Pearson correlation coefficients between observed and predicted data served as a measure of interaction. Each position pair is visualized in a separate correlation plot, and positions are annotated at the bottom right of the plot. For instance, pair (-1,1) refers to the trinucleotide context. Dot colors indicate the base substitution types.
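A simplified sketch of such a pairwise check, in Python, is given here. It is an illustration under stated assumptions rather than the authors' code: it ignores substitution types (so it uses 16 nucleotide pairs instead of 96 contexts), predicts joint counts as the product of the two marginal frequencies, and all counts are invented.

```python
from collections import Counter
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def independence_correlation(pair_counts):
    """Correlate observed pair counts with an independence-based prediction.

    pair_counts: {(nucleotide_at_position_i, nucleotide_at_position_j): count}
    """
    total = sum(pair_counts.values())
    left, right = Counter(), Counter()
    for (a, b), count in pair_counts.items():
        left[a] += count
        right[b] += count
    keys = list(product("ACGT", repeat=2))
    observed = [pair_counts.get(key, 0) for key in keys]
    # Under independence, the expected joint count is total * f_i(a) * f_j(b).
    predicted = [left[a] * right[b] / total for a, b in keys]
    return pearson(observed, predicted)


# Toy table in which the two positions are exactly independent.
weights = {"A": 1, "C": 2, "G": 3, "T": 4}
independent = {(a, b): weights[a] * weights[b] for a, b in product("ACGT", repeat=2)}

# Toy table in which the two positions tend to carry the same nucleotide.
dependent = {("A", "A"): 20, ("A", "C"): 5, ("C", "A"): 5, ("C", "C"): 20,
             ("G", "G"): 10, ("T", "T"): 2}

r_independent = independence_correlation(independent)
r_dependent = independence_correlation(dependent)
```

A correlation close to 1 is what the independence assumption predicts; lower values point to an interaction between the two positions.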

Extended Data Fig. 3 Generalization of the composite likelihood model to extended nucleotide contexts.

We counted the number of mutations in each possible nucleotide context of length ≤7 based on the sequencing data of 11,873 samples. The exact number of samples per cancer type included in this analysis is shown in Extended Data Fig. 5. We compared these counts with the mutability scores returned by the composite likelihood model (218,448 different nucleotide contexts). Since the number of possible nucleotide contexts was too large to be visualized directly, we plotted the data point density. The Pearson correlation coefficient (R) of each plot is annotated at the bottom right.

Extended Data Fig. 4 Extended nucleotide contexts contribute to the performance of the composite likelihood model.

We examined whether accounting for extended contexts beyond trinucleotide contexts improved the fit of the composite likelihood model. To this end, we varied the number of nucleotides in the composite likelihood model between 0 (i.e. only substitution types) and 6 (i.e. 7-nucleotide contexts). We computed the residual sum of squared differences between observed mutation counts and the predictions of the composite likelihood model. As a negative control, we determined the residual sum of squares for a uniform distribution. This baseline was used to normalize the residual sum of squares for each cancer type. For some cancer types with ‘flat’ mutation signatures, nucleotide contexts only had minor impact on the fit of the model, but did not decrease the performance of the model (for example, lung adeno., n = 446 samples). For other cancer types, the fit of the model largely depended on the trinucleotide context, but not on the extended nucleotide context (e.g., prostate cancer, n = 880). For most cancer types with high background mutation rates, the fit of the composite likelihood model strongly depended on the extended nucleotide context (e.g., bladder, n = 317; breast, n = 1443; cervical, n = 192; colorectal, n = 223; endometrial cancer, n = 327; melanoma, n = 582).
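The normalization against a uniform baseline can be sketched in a few lines of Python. This is a minimal illustration with invented counts; it assumes that the uniform baseline assigns the same expected count to every context and that model predictions are on the same scale as the observed counts.

```python
def residual_sum_of_squares(observed, predicted):
    """Sum of squared differences between observed and predicted counts."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def normalized_rss(observed, predicted):
    """RSS of the model divided by the RSS of a uniform distribution."""
    uniform_count = sum(observed) / len(observed)
    baseline = residual_sum_of_squares(observed, [uniform_count] * len(observed))
    if baseline == 0:
        raise ValueError("observed counts are uniform; baseline RSS is zero")
    return residual_sum_of_squares(observed, predicted) / baseline


# Toy counts for four contexts, invented for illustration only.
observed = [10, 30, 20, 40]
predicted = [12, 28, 22, 38]

ratio = normalized_rss(observed, predicted)
```

Values near 0 indicate that the model explains most of the context dependence, while values near 1 indicate that it does no better than the uniform baseline, as expected for 'flat' signatures.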

Extended Data Fig. 5 A large-scale cohort of whole-exome sequencing data to identify rare cancer genes.

To systematically identify candidate cancer genes, we analyzed sequencing data from 11,873 individual tumor samples using the statistical framework that we had developed in this study. Our study cohort contained whole-exome sequencing data from 32 TCGA-related (orange) and 55 TCGA-independent (blue) projects.

Extended Data Fig. 6 Benchmarking of the performance of MutPanning for cancer gene identification.

We benchmarked the performance of our method against 7 previously published methods for cancer gene identification based on the sequencing data of 11,873 samples spanning 28 different cancer types. The exact number of samples per cancer type can be found in Extended Data Fig. 5. To benchmark the performance of a method, we sorted genes according to the significance values (adjusted for multiple testing) returned by the method. As a conservative approximation of the true-positive rate we used Cancer Gene Census (CGC) genes (a, b, c) and OncoKB genes (d, e, f) to derive ROC and precision-recall curves. We quantified the performance of each method as the area under the ROC curve (AUC) for the top 150 (a, d) or 1000 (b, e) non-CGC/OncoKB genes, respectively. Further, we determined the precision at 5% recall for each method (c, f). We normalized these measures to the maximum within each cancer type.
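One possible reading of these performance measures is sketched in Python below. The function names, the normalization of the truncated AUC and the toy ranking are assumptions made for illustration; the legend does not specify these details, and the gene symbols serve only as labels.

```python
from math import ceil


def precision_at_recall(ranked_genes, reference_genes, recall=0.05):
    """Precision at the first rank where the requested recall is reached."""
    reference = set(reference_genes)
    needed = max(1, ceil(recall * len(reference)))
    hits = 0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in reference:
            hits += 1
            if hits >= needed:
                return hits / rank
    return 0.0  # requested recall is never reached


def truncated_roc_auc(ranked_genes, reference_genes, max_negatives):
    """Area under the ROC curve, stopping after max_negatives non-reference genes.

    Assumption: the area is normalized by the number of reference genes times
    the number of non-reference genes seen, so a perfect ranking scores 1.
    """
    reference = set(reference_genes)
    hits = negatives = area = 0
    for gene in ranked_genes:
        if gene in reference:
            hits += 1
        else:
            negatives += 1
            area += hits  # each negative adds a column of height `hits`
            if negatives == max_negatives:
                break
    if not reference or negatives == 0:
        return 0.0
    return area / (len(reference) * negatives)


# Toy ranking (most significant first) and toy reference set.
ranked = ["TP53", "GENE_A", "KRAS", "GENE_B", "PTEN", "GENE_C"]
reference = {"TP53", "KRAS", "PTEN", "BRAF"}

precision = precision_at_recall(ranked, reference, recall=0.5)
auc = truncated_roc_auc(ranked, reference, max_negatives=2)
```

Restricting the curve to the first 150 or 1,000 non-reference genes focuses the comparison on the top of each ranking, which is where candidate driver genes would be reported.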

Extended Data Fig. 7 Comparison of different methods for cancer-gene identification.

We benchmarked the performance of our method against 7 previously published methods for cancer gene identification based on the sequencing data of 11,873 samples spanning 28 different cancer types. To benchmark the performance of a method, we sorted genes according to the significance values (adjusted for multiple testing) returned by the method. As a conservative approximation of the true-positive rate we used Cancer Gene Census (CGC) genes (a, c, e) and OncoKB genes (b, d, f) to derive ROC and precision-recall curves. We quantified the performance of each method as the area under the ROC curve (AUC) for the top 150 (a, b) or 1000 (c, d) non-CGC/OncoKB genes, respectively. Further, we determined the precision at 5% recall for each method (e, f). Box plots indicate the distribution of these performance measures for each method across cancer types. Each cancer type is represented by a dot. Boxes indicate the 25%/75% interquartile range, whiskers extend to the 5%/95%-quantile range. The median of each distribution is indicated as a vertical line.

Extended Data Fig. 8 Comparison of performance measures derived from CGC versus OncoKB.

We benchmarked the performance of our method against 7 previously published methods for cancer gene identification based on the sequencing data of 11,873 samples spanning 28 different cancer types. To benchmark the performance of a method, we sorted genes according to the significance values (adjusted for multiple testing) returned by the method. As a conservative approximation of the true-positive rate we used Cancer Gene Census (CGC) genes and OncoKB genes to derive ROC and precision-recall curves. We quantified the performance of each method as the area under the ROC curve (AUC) for the top 150 (a) or 1000 (b) non-CGC/OncoKB genes, respectively. Further, we determined the precision at 5% recall for each method (c). This figure compares the performance measures derived from the CGC (x-axis) and OncoKB (y-axis) databases. Each dot represents the AUC/precision of a different method (dot color) for an individual cancer type. The concordance between CGC and OncoKB measures suggests that our measure of performance does not entirely depend on the dataset used to approximate the true-positive rate.

Extended Data Fig. 9 Comparison of methods in two homogeneously processed datasets.

We compared the performance of MutPanning with 7 other methods on two independently processed datasets (TCGA subcohort (a-c, g-i), n = 7060 samples; MC3 dataset (d-f, j-l), n = 9079). We used the Cancer Gene Census (CGC) (a-f) and OncoKB (g-l) for benchmarking. We quantified the performance by the AUC of the ROC curve of the top 1,000 non-CGC/OncoKB genes returned by each method. a, d, g, j, Box plots indicate the distribution of performance measures for each method. Boxes indicate the 25%/75% interquartile range, whiskers extend to the 5%/95%-quantile range. Distribution medians are indicated as vertical lines. Each dot represents an AUC for one of the 27 cancer types in the TCGA and MC3 datasets. b, e, h, k, We normalized AUCs by the maximum AUC within each tumor type. We then compared these normalized AUCs between methods across cancer types. c, f, i, l, We compared the AUCs obtained from our original study cohort with the AUCs from TCGA and MC3 based on Pearson correlation. Each dot reflects a cancer type/method. Cohort sizes for TCGA/MC3 datasets: bladder: 130/386; blood: 197/139; brain: 576/821; breast: 975/779; cervix: 192/274; cholangio: 35/34; colorectal: 223/316; endometrium: 305/451; gastroesophageal: 467/529; head&neck: 279/502; kidney clear: 417/368; kidney non-clear: 227/340; liver: 194/354; lung adenocarcinoma: 230/431; lung squamous: 173/464; lymph: 48/37; ovarian: 316/408; pancreas: 149/155; pheochromocytoma: 179/179; pleura: 82/81; prostate: 323/477; sarcoma: 247/204; skin: 342/422; testicular: 149/145; thymus: 123/121; thyroid: 402/492; uveal melanoma: 80/80.

Extended Data Fig. 10 Recurrent mutations in domains of protein–DNA interaction.

Significance values in this figure legend were computed using MutPanning and adjusted for multiple testing (false discovery rate, FDR). Recurrent SOX17 mutations in endometrial cancer (n = 327 samples, FDR = 8.77 × 10⁻³) are located in the high-mobility-group box domain at the SOX17–DNA interface (PDB: 4A3N superposed with 3F27). POLR2A harbors recurrent mutations in lung adenocarcinoma (n = 446, FDR = 9.28 × 10⁻⁶) at the end of an alpha helical segment that is directly pointed at the major groove of the double stranded DNA (PDB: 5IYB). The open complex of a cryo-EM multicomponent structure where the melted single-stranded template DNA is inserted into the active site and RNA polymerase II locates the transcription start site is visualized. CEBPA harbors recurrent mutations in hematological malignancies (n = 1,018, FDR = 1.16 × 10⁻⁷) at the cross-over interface of the two CEBPA homodimers (PDB: 1NWQ). GATA3 (PDB: 4HCA) harbors recurrent mutations in breast cancer (n = 1,443, FDR < 10⁻²⁰) at Asn334, which is located in the GATA-type 2 zinc finger (res317–res341), as well as the residue Met294, which is located peripheral to the GATA-type 1 zinc finger domain (res263–res287). RUNX1 harbors recurrent mutations in breast cancer (n = 1,443, FDR = 2.22 × 10⁻⁴) and hematological malignancies (n = 1,018, FDR = 1.94 × 10⁻⁵). Arg174 plays an important role for DNA recognition and facilitates the formation of hydrogen bond interactions to a guanosine base from the consensus DNA binding sequence of RUNX1 (PDB: 1H9D).

Supplementary information

Supplementary Information

Supplementary Figs. 1–64 and Note

Reporting Summary

Supplementary Tables

Supplementary Tables 1–5

Supplementary Data 1

Mutation annotation file of the whole-exome sequencing data used in this study to identify driver genes.

Supplementary Data 2

MutPanning software as an executable app file for MacOS users.

Supplementary Data 3

MutPanning software as an executable exe file for Windows users.

Supplementary Data 4

MutPanning software as an executable jar file for users of other operating systems.


Cite this article

Dietlein, F., Weghorn, D., Taylor-Weiner, A. et al. Identification of cancer driver genes based on nucleotide context. Nat Genet (2020). https://doi.org/10.1038/s41588-019-0572-y
