CancerMine: a literature-mined resource for drivers, oncogenes and tumor suppressors in cancer


Tumors from individuals with cancer are frequently genetically profiled for information about the driving forces behind the disease. We present the CancerMine resource, a text-mined and routinely updated database of drivers, oncogenes and tumor suppressors in different types of cancer. All data are available online and downloadable under a Creative Commons Zero license for ease of use.


Fig. 1: Performance of the relation classifier and frequently extracted genes and cancers.
Fig. 2: Comparison to other resources and cancer-type clustering by gene role.

Data availability

The data can be viewed and downloaded through the online viewer. The February 2019 CancerMine release was used for this analysis. All releases are available online.

Code availability

All code for text mining and the analysis in this paper is available in the GitHub repository. The specific code release is archived in Zenodo.




J.L. was supported by a Vanier Canada Graduate Scholarship. Funding for S.J.M.J. and M.R.J. was provided through the Personalized Oncogenomics (POG) program, which is generously supported by the BC Cancer Foundation and Genome British Columbia (project B20POG). The authors would like to thank Compute Canada for the use of computational infrastructure for this research.

Author information




J.L., M.R.J. and S.J.M.J. conceived the idea. J.L. implemented the software and carried out the analysis. J.L., E.Y.Z. and J.G. annotated the sentence data. All authors contributed to the writing of the manuscript.

Corresponding author

Correspondence to Steven J. M. Jones.

Ethics declarations

Competing interests

The authors declare no competing interests.


Integrated supplementary information

Supplementary Fig. 1 Sources of the text-mined associations from within full-text articles.

The relationship between the novelty of a cancer gene association to CancerMine and the section of the paper in which it is found is shown. Mentions in the Introduction are much less likely to be novel than mentions in other sections.

Supplementary Fig. 2 Sources of text-mined associations across years.

Mentions of drivers, oncogenes and tumor suppressors are increasingly being extracted from the main sections of full-text articles. Note that the minor dip in 2018 (compared with 2017) arises because many 2018 papers only became accessible for text mining later in 2019.

Supplementary Fig. 3 Comparison against other resources using a lower-threshold classifier.

To explore the effect of the high-precision, low-recall classifier, we show a comparison without the strict thresholding. All classifiers use a threshold of 0.5 instead of the higher thresholds. There is still no substantial overlap with the CGC and IntOGen resources, suggesting that many of the gene associations in these databases are not mentioned in the literature in a form that can be extracted by CancerMine.
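The effect of the threshold choice described in this legend can be illustrated with a minimal sketch. The scores and labels below are invented toy values, not CancerMine data, and the function name `precision_recall` is a hypothetical helper introduced only for illustration. It shows how raising the decision threshold from 0.5 to a stricter value typically increases precision at the cost of recall.

```python
# Toy illustration only: scores and labels are made up, not CancerMine data.
def precision_recall(scores, labels, threshold):
    """Precision and recall when predictions with score >= threshold are kept."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))        # true positives
    fp = sum(p and not l for p, l in zip(predicted, labels))    # false positives
    fn = sum((not p) and l for p, l in zip(predicted, labels))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical classifier confidences and whether each candidate is a true relation.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30]
labels = [True, True, True, False, True, False, True, False]

# 0.5 is the relaxed threshold named in the legend; 0.75 is an assumed stricter value.
for threshold in (0.5, 0.75):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

In this toy case the stricter threshold removes both false positives but also discards one true relation, which mirrors the trade-off the comparison is meant to probe.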

Supplementary Fig. 4 Validation of cancer profiles using TCGA somatic mutation data.

All samples in seven TCGA projects are analyzed for likely loss-of-function mutations, compared with the CancerMine tumor suppressor profiles, and matched with the closest profile. The percentage in each cell is the proportion of samples labeled with a given CancerMine profile that come from each TCGA project. Samples that match no tumor suppressor in these profiles, or whose match is ambiguous, are assigned to none. The TCGA projects are breast cancer (BRCA), colorectal adenocarcinoma (COAD), liver hepatocellular carcinoma (LIHC), prostate adenocarcinoma (PRAD), low grade glioma (LGG), lung adenocarcinoma (LUAD) and stomach adenocarcinoma (STAD).
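The matching step in this legend can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the legend does not specify the similarity measure, so a plain count of shared genes is assumed here, and the function name `closest_profile`, the profile names and the gene identifiers are all hypothetical placeholders.

```python
def closest_profile(mutated_genes, profiles):
    """Return the profile sharing the most genes with the sample, or "none".

    Assumption: similarity is the number of shared genes. A sample with no
    overlap, or with a tie for the best overlap, is labeled "none".
    """
    overlaps = {name: len(mutated_genes & genes) for name, genes in profiles.items()}
    best = max(overlaps.values(), default=0)
    winners = [name for name, count in overlaps.items() if count == best]
    if best == 0 or len(winners) > 1:
        return "none"
    return winners[0]


# Placeholder tumor suppressor profiles (not real CancerMine profiles).
profiles = {
    "profile_1": {"GENE_A", "GENE_B"},
    "profile_2": {"GENE_C", "GENE_D"},
}

# Placeholder samples: sets of genes with likely loss-of-function mutations.
samples = {
    "sample_1": {"GENE_A", "GENE_B", "GENE_C"},  # most overlap with profile_1
    "sample_2": {"GENE_X"},                      # matches no tumor suppressor
    "sample_3": {"GENE_A", "GENE_C"},            # tie, so ambiguous
}

for sample_id, genes in samples.items():
    print(sample_id, closest_profile(genes, profiles))
```

Treating ties and empty overlaps the same way keeps the "none" category consistent with the legend, which groups unmatched and ambiguous samples together.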

Supplementary Information


Supplementary Figs. 1–4 and Supplementary Tables 1–7

Reporting Summary

Source data


Cite this article

Lever, J., Zhao, E.Y., Grewal, J. et al. CancerMine: a literature-mined resource for drivers, oncogenes and tumor suppressors in cancer. Nat Methods 16, 505–507 (2019).
