Article series: Computational tools

Expanding the computational toolbox for mining cancer genomes

Journal name:
Nature Reviews Genetics


High-throughput DNA sequencing has revolutionized the study of cancer genomics, yielding numerous discoveries that are relevant to cancer diagnosis and treatment. The latest sequencing and analysis methods have successfully identified somatic alterations, including single-nucleotide variants, insertions and deletions, copy-number aberrations, structural variants and gene fusions. Additional computational techniques have proved useful for defining the mutations, genes and molecular networks that drive diverse cancer phenotypes and that determine clonal architectures in tumour samples. Collectively, these tools have advanced the study of genomic, transcriptomic and epigenomic alterations in cancer, and their association with clinical properties. Here, we review cancer genomics software and the insights that have been gained from its application.

At a glance


  1. Figure 1: Sample procurement, sequencing and analysis roadmap.

    a | Most cancer genomic investigations sequence the genome of a tumour sample from a primary or metastatic lesion, starting with a nonspecific 'global' sample pooled from a biopsy specimen or resection. As the spatial distribution of any resident subclones is not known a priori, it will become increasingly common to sequence specific regions from a tumour section separately. In the limit, single-cell sequencing can also be carried out on nuclei sorted by flow cytometry to assess cellular diversity. b | Tumour and adjacent healthy tissue samples are sequenced using high-throughput methods, such as whole-genome sequencing (WGS), exome sequencing and RNA sequencing (RNA-seq). After alignment, a range of detection tools identifies both small alterations (such as single-nucleotide variants (SNVs), and insertions and deletions (indels)) and large alterations (such as copy-number aberrations (CNAs), structural variants (SVs) and gene fusions). These alterations are then annotated and analysed individually (Level I; for example, for likely functional implications) and collectively (Level II; for example, to identify relevant gene pathways and networks). CHASM, Cancer-Specific High-throughput Annotation of Somatic Mutations; CREST, clipping reveals structure; Dendrix, De Novo Driver Exclusivity; GASV, geometric analysis of structural variants; GATK, Genome Analysis Toolkit; Genome STRiP, Genome STRucture In Populations; MEMo, Mutual Exclusivity Modules in cancer; SIFT, sorting intolerant from tolerant; SNP, single-nucleotide polymorphism; TieDIE, Tied Diffusion Through Interacting Events; TIGRA, targeted iterative graph routing assembler; VEP, Variant Effect Predictor.

  2. Figure 2: Biological factors relevant to assessing significantly mutated genes in cancer.

    Genomic analyses establish mutation frequencies of genes and help to characterize background mutation rates (BMRs). Specific mutation hot spots have been found in various cancer types. Other factors, such as gene length, expression level and replication timing, have also been shown to affect the BMR of a gene. As gene expression level and replication timing are correlated, both are shown on the x axis. State-of-the-art tools, such as MuSiC and MutSig, take these and many other factors (for example, transition versus transversion frequency) into account in determining the significantly mutated genes (SMGs) that contribute substantially to cancer initiation and progression.

  3. Figure 3: Significantly mutated genes, pathways and networks.

    Given the mutational status of genes across several patients, one can distinguish driver mutations from passenger mutations using several strategies. Single-gene tests determine whether the observed number of samples having a mutation in the gene is significantly greater than that expected under an appropriate null model. Pathway or gene-set approaches examine whether multiple genes in pre-defined sets, obtained for example from curated databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), Gene Ontology (GO) and the Molecular Signatures Database (MSigDB), have more mutations than expected. These tests are biased towards the prior knowledge of gene sets in these databases, but the number of tests is fairly small, and the risks associated with type I error therefore tend to be manageable. Conversely, network approaches rely only on known protein–protein or protein–DNA interactions, such as those in the iRefIndex, high-quality interactomes (HINT), BioGRID and search tool for the retrieval of interacting genes/proteins (STRING) databases, in examining combinations of mutations on whole-genome interaction networks, for example, using the heat diffusion process. As these approaches do not depend on pre-defined gene sets, it is possible to infer novel combinations of genes that are relevant to cancer, but the larger number of hypothesis tests means that greater care must be taken with multiple-testing correction. Indel, insertion and deletion; SNV, single-nucleotide variant; SV, structural variant.

  4. Figure 4: A conceptual example of clonal evolution model and clonality analyses.

    a | The founding clone (yellow) persists during the course of the disease. Another clone (green) that is present at time point 1 becomes extinct before time point 2, but new subclones (blue and orange) emerge during disease progression. b | The SciClone algorithm detects the presence of three mutation clusters at time point 3.
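
The single-gene test described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes a deliberately simple null model: one flat background mutation rate per base, with samples treated as independent binomial trials. This is not the method used by MuSiC or MutSig, which adjust for many more factors (see the Figure 2 caption). All function names, gene names and numbers below are hypothetical, and the Benjamini-Hochberg step stands in for whichever multiple-testing correction is appropriate.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def per_sample_mutation_prob(bmr: float, gene_length: int) -> float:
    """Chance that one sample carries at least one background mutation
    in a gene of the given length (flat per-base rate: an assumption)."""
    return 1.0 - (1.0 - bmr) ** gene_length


def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def single_gene_test(genes: dict[str, tuple[int, int]],
                     n_samples: int, bmr: float) -> dict[str, float]:
    """Map each gene to an adjusted p-value.

    genes: name -> (number of mutated samples, gene length in bases).
    """
    names = list(genes)
    pvals = [
        binomial_upper_tail(genes[g][0], n_samples,
                            per_sample_mutation_prob(bmr, genes[g][1]))
        for g in names
    ]
    return dict(zip(names, benjamini_hochberg(pvals)))


# Toy cohort: 100 samples, hypothetical background rate of 1e-6 per base.
toy_genes = {
    "GENE_A": (12, 1500),   # short gene, mutated in many samples
    "GENE_B": (1, 1500),    # short gene, mutated once
    "GENE_C": (3, 30000),   # long gene, a few mutations expected by chance
}
q_values = single_gene_test(toy_genes, n_samples=100, bmr=1e-6)
```

In this toy cohort, GENE_C is mutated in more samples than GENE_B, yet neither stands out once gene length is taken into account; only GENE_A exceeds what the background rate predicts.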


  1. Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating inhibitors. Proc. Natl Acad. Sci. USA 74, 54635467 (1977).
  2. Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating inhibitors. 1977. Biotechnology 24, 104108 (1992).
  3. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860921 (2001).
  4. Ley, T. J. et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 6672 (2008).
  5. Shendure, J. & Lieberman Aiden, E. The expanding scope of DNA sequencing. Nature Biotech. 30, 10841094 (2012).
  6. Majewski, J., Schwartzentruber, J., Lalonde, E., Montpetit, A. & Jabado, N. What can exome sequencing do for you? J. Med. Genet. 48, 580589 (2011).
  7. Ozsolak, F. & Milos, P. M. RNA sequencing: advances, challenges and opportunities. Nature Rev. Genet. 12, 8798 (2011).
  8. Krueger, F., Kreck, B., Franke, A. & Andrews, S. R. DNA methylome analysis using short bisulfite sequencing data. Nature Methods 9, 145151 (2012).
  9. Ding, L. et al. Genome remodelling in a basal-like breast cancer metastasis and xenograft. Nature 464, 9991005 (2010).
  10. Nowell, P. C. The clonal evolution of tumor cell populations. Science 194, 2328 (1976).
  11. Ding, L. et al. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 481, 506510 (2012).
  12. Gerlinger, M. et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N. Engl. J. Med. 366, 883892 (2012).
  13. Navin, N. et al. Inferring tumor progression from genomic heterogeneity. Genome Res. 20, 6880 (2010).
  14. Navin, N. E. & Hicks, J. Tracing the tumor lineage. Mol. Oncol. 4, 267283 (2010).
  15. Navin, N. et al. Tumour evolution inferred by single-cell sequencing. Nature 472, 9094 (2011).
  16. Hou, Y. et al. Single-cell exome sequencing and monoclonal evolution of a JAK2-negative myeloproliferative neoplasm. Cell 148, 873885 (2012).
  17. Xu, X. et al. Single-cell exome sequencing reveals single-nucleotide mutation characteristics of a kidney tumor. Cell 148, 886895 (2012).
  18. Gundry, M., Li, W., Maqbool, S. B. & Vijg, J. Direct, genome-wide assessment of DNA mutations in single cells. Nucleic Acids Res. 40, 20322040 (2012).
  19. Baslan, T. et al. Genome-wide copy number analysis of single cells. Nature Protoc. 7, 10241041 (2012).
  20. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455477 (2012).
  21. Kim, S. Y. & Speed, T. P. Comparing somatic mutation-callers: beyond Venn diagrams. BMC Bioinformatics 14, 189 (2013).
  22. Goode, D. L. et al. A simple consensus approach improves somatic mutation prediction accuracy. Genome Med. 5, 90 (2013).
  23. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 12971303 (2010).
    The GATK is a broad and widely used toolkit for variant discovery and data processing.
  24. Koboldt, D. C. et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25, 22832285 (2009).
  25. Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568576 (2012).
    VarScan (described in references 24 and 25) is one of the early programs for somatic SNV detection and has since added additional capability for germline, copy-number and indel events.
  26. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 20782079 (2009).
    SAMtools is a broad set of utilities for processing sequence data in the standardized SAM/BAM format, including variant calling.
  27. Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28, 311317 (2012).
  28. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotech. 31, 213219 (2013).
    MuTect is a widely used program for identifying somatic SNVs in tumour–normal pair sequencing data.
  29. Saunders, C. T. et al. Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs. Bioinformatics 28, 18111817 (2012).
  30. Goya, R. et al. SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors. Bioinformatics 26, 730736 (2010).
  31. Roth, A. et al. JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics 28, 907913 (2012).
  32. Lunter, G. Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics 23, i289i296 (2007).
  33. Cartwright, R. A. Problems and solutions for estimating indel rates and length distributions. Mol. Biol. Evol. 26, 473480 (2009).
  34. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 18511858 (2008).
  35. Smith, C. C. et al. Validation of ITD mutations in FLT3 as a therapeutic target in human acute myeloid leukaemia. Nature 485, 260263 (2012).
  36. Spencer, D. H. et al. Detection of FLT3 internal tandem duplication in targeted, short-read-length, next-generation sequencing data. J. Mol. Diagn. 15, 8193 (2013).
  37. Albers, C. A. et al. Dindel: accurate indel calls from short-read data. Genome Res. 21, 961973 (2011).
  38. Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 28652871 (2009).
    Pindel is focused on identifying breakpoints at single-base-resolution of indels, inversions and tandem duplications.
  39. Ye, K., Kosters, W. A. & Ijzerman, A. P. An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences. Bioinformatics 23, 687693 (2007).
  40. Rausch, T. et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333i339 (2012).
  41. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv 1303.3997 [q-bio. GN] (2013).
  42. Chen, K. et al. TIGRA: a targeted iterative graph routing assembler for breakpoint assembly. Genome Res. 24, 310317 (2014).
  43. Bignell, G. R. et al. Signatures of mutation and selection in the cancer genome. Nature 463, 893898 (2010).
  44. Beroukhim, R. et al. The landscape of somatic copy-number alteration across human cancers. Nature 463, 899905 (2010).
  45. Yoon, S., Xuan, Z., Makarov, V., Ye, K. & Sebat, J. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 19, 15861592 (2009).
  46. Campbell, P. J. et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature Genet. 40, 722729 (2008).
  47. Beroukhim, R. et al. Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proc. Natl Acad. Sci. USA 104, 2000720012 (2007).
    GISTIC is one of the standard tools for finding genes that are affected by CNAs which have a bearing on cancer initiation or progression.
  48. Zhang, Q. et al. CMDS: a population-based method for identifying recurrent DNA copy number aberrations in cancer from high-resolution data. Bioinformatics 26, 464469 (2010).
  49. Raphael, B. J., Volik, S., Collins, C. & Pevzner, P. A. Reconstructing tumor genome architectures. Bioinformatics 19 (Suppl. 2), ii162ii171 (2003).
  50. Raphael, B. J. et al. A sequence-based survey of the complex structural organization of tumor genomes. Genome Biol. 9, R59 (2008).
  51. Volik, S. et al. Decoding the fine-scale structure of a breast cancer genome and transcriptome. Genome Res. 16, 394404 (2006).
  52. Volik, S. et al. End-sequence profiling: sequence-based analysis of aberrant genomes. Proc. Natl Acad. Sci. USA 100, 76967701 (2003).
  53. Bignell, G. R. et al. Architectures of somatic genomic rearrangement in human cancer amplicons at sequence-level resolution. Genome Res. 17, 12961303 (2007).
  54. Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature Methods 6, 677681 (2009).
    BreakDancer is a general tool for identifying structural variations (including insertions, deletions, inversions and translocations) using the concept of discordant read pairs.
  55. Wang, J. et al. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nature Methods 8, 652654 (2011).
  56. Hormozdiari, F., Alkan, C., Eichler, E. E. & Sahinalp, S. C. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res. 19, 12701278 (2009).
  57. Sindi, S., Helman, E., Bashir, A. & Raphael, B. J. A geometric approach for classification and comparison of structural variants. Bioinformatics 25, i222i230 (2009).
  58. Sindi, S. S., Onal, S., Peng, L. C., Wu, H. T. & Raphael, B. J. An integrative probabilistic model for identification of structural variation in sequencing data. Genome Biol. 13, R22 (2012).
  59. Handsaker, R. E., Korn, J. M., Nemesh, J. & McCarroll, S. A. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nature Genet. 43, 269276 (2011).
  60. Rowley, J. D. A new consistent chromosomal abnormality in chronic myelogenous leukaemia identified by quinacrine fluorescence and Giemsa staining. Nature 243, 290293 (1973).
  61. Huang, M. E. et al. Use of all-trans retinoic acid in the treatment of acute promyelocytic leukemia. Blood 72, 567572 (1988).
  62. Huang, M. E. [Treatment of acute promyelocytic leukemia with all-trans retinoic acid]. Zhonghua Yi Xue Za Zhi 68, 131133, 10 (in Chinese) (1988).
  63. Tomlins, S. A. et al. Integrative molecular concept modeling of prostate cancer progression. Nature Genet. 39, 4151 (2007).
  64. Kim, Y. K. et al. Cooperation of H2O2-mediated ERK activation with Smad pathway in TGF-β1 induction of p21WAF1/Cip1. Cell. Signall. 18, 236243 (2006).
  65. McPherson, A. et al. deFuse: an algorithm for gene fusion discovery in tumor RNA-seq data. PLoS Comput. Biol. 7, e1001138 (2011).
  66. Wang, K. et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 38, e178 (2010).
  67. Iyer, M. K., Chinnaiyan, A. M. & Maher, C. A. ChimeraScan: a tool for identifying chimeric transcription in sequencing data. Bioinformatics 27, 29032904 (2011).
  68. Chen, K. et al. BreakFusion: targeted assembly-based identification of gene fusions in whole transcriptome paired-end sequencing data. Bioinformatics 28, 19231924 (2012).
  69. Berger, M. F. et al. The genomic complexity of primary human prostate cancer. Nature 470, 214220 (2011).
  70. Stephens, P. J. et al. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell 144, 2740 (2011).
  71. McPherson, A. et al. Comrad: detection of expressed rearrangements by integrated analysis of RNA-seq and low coverage genome sequence data. Bioinformatics 27, 14811488 (2011).
  72. McPherson, A. et al. nFuse: discovery of complex genomic rearrangements in cancer using high-throughput sequencing. Genome Res. 22, 22502261 (2012).
  73. Chen, K. et al. BreakTrans: uncovering the genomic architecture of gene fusions. Genome Biol. 14, R87 (2013).
  74. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
    ANNOVAR is a versatile and widely used tool for functional annotation of variants. It is often accessed through its web interface wANNOVAR.
  75. Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SNPeff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6, 8092 (2012).
  76. Woolfe, A., Mullikin, J. C. & Elnitski, L. Genomic features defining exonic variants that modulate splicing. Genome Biol. 11, R20 (2010).
  77. Khurana, E. et al. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 342, 1235587 (2013).
  78. Chelala, C., Khan, A. & Lemoine, N. R. SNPnexus: a web database for functional annotation of newly discovered and public domain single nucleotide polymorphisms. Bioinformatics 25, 655661 (2009).
  79. Yandell, M. et al. A probabilistic disease-gene finder for personal genomes. Genome Res. 21, 15291542 (2011).
  80. Paila, U., Chapman, B. A., Kirchner, R. & Quinlan, A. R. GEMINI: integrative exploration of genetic variation and genome annotations. PLoS Comput. Biol. 9, e1003153 (2013).
  81. Nakken, S., Alseth, I. & Rognes, T. Computational prediction of the effects of non-synonymous single nucleotide polymorphisms in human DNA repair genes. Neuroscience 145, 12731279 (2007).
    PolyPhen is a concatenation of 'polymorphism phenotyping' and predicts the impact of amino acid changes on proteins. It is often used in conjunction with SIFT.
  82. Ng, P. C. & Henikoff, S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 31, 38123814 (2003).
    SIFT infers whether amino acid substitution has an effect on subsequent functioning of proteins and is often used in conjunction with PolyPhen.
  83. Reva, B., Antipin, Y. & Sander, C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res. 39, e118 (2011).
  84. Gonzalez-Perez, A. & Lopez-Bigas, N. Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am. J. Hum. Genet. 88, 440449 (2011).
  85. Wong, W. C. et al. CHASM and SNVBox: toolkit for detecting biologically important single nucleotide mutations in cancer. Bioinformatics 27, 21472148 (2011).
  86. Carter, H. et al. Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Res. 69, 66606667 (2009).
    CHASM (described in references 85 and 86) is a popular tool for assessing functional impact of somatic missense mutations on the basis of whether they confer selective advantage on cancerous cells.
  87. Gonzalez-Perez, A., Deu-Pons, J. & Lopez-Bigas, N. Improving the prediction of the functional impact of cancer mutations by baseline tolerance transformation. Genome Med. 4, 89 (2012).
  88. Gonzalez-Perez, A. & Lopez-Bigas, N. Functional impact bias reveals cancer drivers. Nucleic Acids Res. 40, e169 (2012).
  89. Reimand, J. & Bader, G. D. Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers. Mol. Systems Biol. 9, 637 (2013).
  90. Greenman, C., Wooster, R., Futreal, P. A., Stratton, M. R. & Easton, D. F. Statistical analysis of pathogenicity of somatic mutations in cancer. Genetics 173, 21872198 (2006).
  91. Getz, G. et al. Comment on “The consensus coding sequences of human breast and colorectal cancers”. Science 317, 1500 (2007).
  92. Dees, N. D. et al. MuSiC: Identifying mutational significance in cancer genomes. Genome Res. 22, 15891598 (2012).
  93. Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214218 (2013).
  94. Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 10611068 (2008).
  95. Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609615 (2011).
  96. Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 6170 (2012).
  97. Cancer Genome Atlas Research Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N. Engl. J. Med. 368, 20592074 (2013).
  98. Ding, L. et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature 455, 10691075 (2008).
  99. Vogelstein, B. et al. Cancer genome landscapes. Science 339, 15461558 (2013).
  100. Davoli, T. et al. Cumulative haploinsufficiency and triplosensitivity drive aneuploidy patterns and shape the cancer genome. Cell 155, 948962 (2013).
  101. Ye, J., Pavlicek, A., Lunney, E. A., Rejto, P. A. & Teng, C. H. Statistical method on nonrandom clustering with application to somatic mutations in cancer. BMC Bioinformatics 11, 11 (2010).
  102. Ryslik, G. A., Cheng, Y., Cheung, K. H., Modis, Y. & Zhao, H. Utilizing protein structure to identify non-random somatic mutations. BMC Bioinformatics 14, 190 (2013).
  103. Lawrence, M. S. et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505, 495501 (2014).
  104. Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M. & Hirakawa, M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 38, D355D360 (2010).
  105. Ashburner, M. et al. Gene ontology: tool for the unification of biology. Gene Ontol. Consort. Nature Genet. 25, 2529 (2000).
  106. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 1554515550 (2005).
  107. Lin, J. et al. A multidimensional analysis of genes mutated in breast and colorectal cancers. Genome Res. 17, 13041318 (2007).
  108. Huang da, W., Sherman, B. T. & Lempicki, R. A. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 113 (2009).
  109. Wendl, M. C. et al. PathScan: a tool for discerning mutational significance in groups of putative cancer genes. Bioinformatics 27, 15951602 (2011).
  110. Boca, S. M., Kinzler, K. W., Velculescu, V. E., Vogelstein, B. & Parmigiani, G. Patient-oriented gene set analysis for cancer mutation data. Genome Biol. 11, R112 (2010).
  111. Peri, S. et al. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 13, 23632371 (2003).
  112. Croft, D. et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 39, D691D697 (2011).
  113. Chatr-Aryamontri, A. et al. The BioGRID interaction database: 2013 update. Nucleic Acids Res. 41, D816D823 (2013).
  114. Franceschini, A. et al. STRING v9.1: protein–protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 41, D808D815 (2013).
  115. Das, J. & Yu, H. HINT: high-quality protein interactomes and their applications in understanding human disease. BMC Systems Biol. 6, 92 (2012).
  116. Razick, S., Magklaras, G. & Donaldson, I. M. iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinformatics 9, 405 (2008).
  117. Bernstein, B. E. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 5774 (2012).
  118. Khurana, E., Fu, Y., Chen, J. & Gerstein, M. Interpretation of genomic variants using a unified biological network approach. PLoS Comput. Biol. 9, e1002886 (2013).
  119. Vandin, F., Upfal, E. & Raphael, B. J. Algorithms for detecting significantly mutated pathways in cancer. J. Comput. Biol. 18, 507522 (2011).
  120. Cancer Genome Atlas Research Network. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 499, 4349 (2013).
  121. Hofree, M., Shen, J. P., Carter, H., Gross, A. & Ideker, T. Network-based stratification of tumor mutations. Nature Methods 10, 11081115 (2013).
  122. Ciriello, G., Cerami, E., Sander, C. & Schultz, N. Mutual exclusivity analysis identifies oncogenic network modules. Genome Res. 22, 398406 (2012).
  123. Vogelstein, B. & Kinzler, K. W. Cancer genes and the pathways they control. Nature Med. 10, 789799 (2004).
  124. Yeang, C. H., McCormick, F. & Levine, A. Combinatorial patterns of somatic gene mutations in cancer. Faseb J. 22, 26052622 (2008).
  125. Paull, E. O. et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 29, 27572764 (2013).
  126. Vaske, C. J. et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 26, i237i245 (2010).
  127. Saal, L. H. et al. PIK3CA mutations correlate with hormone receptors, node metastasis, and ERBB2, and are mutually exclusive with PTEN loss in human breast carcinoma. Cancer Res. 65, 25542559 (2005).
  128. Vandin, F., Upfal, E. & Raphael, B. J. De novo discovery of mutated driver pathways in cancer. Genome Res. 22, 375385 (2012).
  129. Leiserson, M. D., Blokh, D., Sharan, R. & Raphael, B. J. Simultaneous identification of multiple driver pathways in cancer. PLoS Comput. Biol. 9, e1003054 (2013).
  130. Miller, C. A., Settle, S. H., Sulman, E. P., Aldape, K. D. & Milosavljevic, A. Discovering functional modules by identifying recurrent and mutually exclusive mutational patterns in tumors. BMC Med. Genom. 4, 34 (2011).
  131. Kandoth, C. et al. Mutational landscape and significance across 12 major cancer types. Nature 502, 333339 (2013).
  132. Nik-Zainal, S. et al. Mutational processes molding the genomes of 21 breast cancers. Cell 149, 979993 (2012).
  133. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415421 (2013).
  134. Albertson, D. G., Collins, C., McCormick, F. & Gray, J. W. Chromosome aberrations in solid tumors. Nature Genet. 34, 369376 (2003).
  135. Rausch, T. et al. Genome sequencing of pediatric medulloblastoma links catastrophic DNA rearrangements with TP53 mutations. Cell 148, 5971 (2012).
  136. Maher, C. A. & Wilson, R. K. Chromothripsis and human disease: piecing together the shattering process. Cell 148, 2932 (2012).
  137. Forment, J. V., Kaidi, A. & Jackson, S. P. Chromothripsis and cancer: causes and consequences of chromosome shattering. Nature Rev. Cancer 12, 663670 (2012).
  138. Baca, S. C. et al. Punctuated evolution of prostate cancer genomes. Cell 153, 666677 (2013).
  139. Malhotra, A. et al. Breakpoint profiling of 64 cancer genomes reveals numerous complex rearrangements spawned by homology-independent mechanisms. Genome Res. 23, 762776 (2013).
  140. Sorzano, C. O., Pascual-Montano, A., Sanchez de Diego, A., Martinez, A. C. & van Wely, K. H. Chromothripsis: breakage–fusion–bridge over and over again. Cell Cycle 12, 20162023 (2013).
  141. Korbel, J. O. & Campbell, P. J. Criteria for inference of chromothripsis in cancer genomes. Cell 152, 12261236 (2013).
  142. Oesper, L., Ritz, A., Aerni, S. J., Drebin, R. & Raphael, B. J. Reconstructing cancer genomes from paired-end sequencing data. BMC Bioinformatics 13 (Suppl. 6), S10 (2012).
  143. Landau, D. A. et al. Evolution and impact of subclonal mutations in chronic lymphocytic leukemia. Cell 152, 714726 (2013).
  144. Keats, J. J. et al. Clonal competition with alternating dominance in multiple myeloma. Blood 120, 10671076 (2012).
  145. Turke, A. B. et al. Preexistence and clonal selection of MET amplification in EGFR mutant NSCLC. Cancer Cell 17, 7788 (2010).
  146. Parzen, E. On estimation of a probability density function and mode. Ann. Math. Statist. 33, 10651076 (1962).
  147. Rosenblatt, M. Remarks on some non-parametric estimates of a density function. Ann. Math. Statist. 27, 832837 (1956).
  148. Carter, S. L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nature Biotech. 30, 413421 (2012).
  149. Shah, S. P. et al. The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature 486, 395399 (2012).
  150. Nik-Zainal, S. et al. The life history of 21 breast cancers. Cell 149, 9941007 (2012).
  151. Oesper, L., Mahmoody, A. & Raphael, B. J. THetA: inferring intra-tumor heterogeneity from high-throughput DNA sequencing data. Genome Biol. 14, R80 (2013).
  152. Gonzalez-Perez, A. et al. Computational approaches to identify functional genetic variants in cancer genomes. Nature Methods 10, 723729 (2013).
  153. Raphael, B. J., Dobson, J. R., Oesper, L. & Vandin, F. Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine. Genome Med. 6, 5 (2014).
  154. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603607 (2012).
  155. Kolata, G. In Treatment for Leukemia, Glimpses of the Future. The New York Times A1 (7 July 2012).
  156. Lander, E. S. & Waterman, M. S. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2, 231239 (1988).
  157. Wendl, M. C. & Wilson, R. K. Aspects of coverage in medical DNA sequencing. BMC Bioinformatics 9, 239 (2008).
  158. Bashir, A., Volik, S., Collins, C., Bafna, V. & Raphael, B. J. Evaluation of paired-end sequencing strategies for detection of genome rearrangements in cancer. PLoS Comput. Biol. 4, e1000051 (2008).
  159. Wendl, M. C. & Wilson, R. K. Statistical aspects of discerning indel-type structural variation via DNA sequence alignment. BMC Genomics 10, 359 (2009).
  160. Boffetta, P. & Nyberg, F. Contribution of environmental factors to cancer risk. Br. Med. Bull. 68, 71–94 (2003).
  161. Cerwenka, A. & Lanier, L. L. Natural killer cells, viruses and cancer. Nature Rev. Immunol. 1, 41–49 (2001).
  162. Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012).
  163. Stransky, N. et al. The mutational landscape of head and neck squamous cell carcinoma. Science 333, 1157–1160 (2011).
  164. Parkin, D. M. The global health burden of infection-associated cancers in the year 2002. Int. J. Cancer 118, 3030–3044 (2006).
  165. Kostic, A. D. et al. PathSeq: software to identify or discover microbes by deep sequencing of human tissue. Nature Biotech. 29, 393–396 (2011).
  166. Bhaduri, A., Qu, K., Lee, C. S., Ungewickell, A. & Khavari, P. A. Rapid identification of non-human sequences in high-throughput sequencing datasets. Bioinformatics 28, 1174–1175 (2012).
  167. Layer, R. M., Chiang, C., Quinlan, A. R. & Hall, I. M. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 15, R84 (2014).
  168. Van Loo, P. et al. Allele-specific copy number analysis of tumors. Proc. Natl Acad. Sci. USA 107, 16910–16915 (2010).
  169. Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nature Methods 7, 248–249 (2010).
  170. McLaren, W. et al. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26, 2069–2070 (2010).
  171. Tamborero, D., Lopez-Bigas, N. & Gonzalez-Perez, A. Oncodrive-CIS: a method to reveal likely driver genes based on the impact of their copy number changes on expression. PLoS ONE 8, e55489 (2013).
  172. Tamborero, D., Gonzalez-Perez, A. & Lopez-Bigas, N. OncodriveCLUST: exploiting the positional clustering of somatic mutations to identify cancer genes. Bioinformatics 29, 2238–2244 (2013).

Author information


  1. The Genome Institute, Washington University in St. Louis, 4444 Forest Park Ave., St. Louis, Missouri 63108, USA.

    • Li Ding,
    • Michael C. Wendl &
    • Joshua F. McMichael
  2. Department of Medicine, Washington University in St. Louis, 660 S. Euclid Ave., St. Louis, Missouri 63110, USA.

    • Li Ding
  3. Department of Genetics, Washington University in St. Louis, 660 S. Euclid Ave., St. Louis, Missouri 63110, USA.

    • Li Ding &
    • Michael C. Wendl
  4. Siteman Cancer Center, Washington University in St. Louis, 4921 Parkview Place, St. Louis, Missouri 63110, USA.

    • Li Ding
  5. Department of Mathematics, Washington University in St. Louis, 1 Brookings Drive, St. Louis, Missouri 63130, USA.

    • Michael C. Wendl
  6. Department of Computer Science and Center for Computational Molecular Biology, Brown University, 115 Waterman Street, Providence, Rhode Island 02912, USA.

    • Benjamin J. Raphael

Competing interests statement

The authors declare no competing interests.

Author details

  • Li Ding

    Li Ding has concentrated her research on understanding somatic and germline genetic changes that are relevant to cancer initiation and progression, as well as to drug response. Her recent efforts include the discovery of 127 cancer genes across more than 3,000 tumours from 12 major cancer types. She is the principal investigator for the National Human Genome Research Institute (NHGRI)-sponsored Genome Sequencing Informatics (GS-IT) Center at Washington University in St. Louis, Missouri, USA, an assistant director at the Genome Institute, and an assistant professor of Medicine and Genetics at Washington University in St. Louis.

  • Michael C. Wendl

    Michael C. Wendl focuses on applying mathematics and computational methods to pressing problems in the biomedical sciences. He developed much of DNA sequencing theory and co-wrote the PHRED trace analyser used for processing Sanger sequencing data, including those in the Human Genome Project. He now concentrates on problems in cancer genomics, including detection of somatic mutations, pathway analyses and modelling of clonal evolution.

  • Joshua F. McMichael

    Joshua F. McMichael creates user interfaces and data visualizations for bioinformatics, and specializes in cancer genomics. He worked on the genome modelling system for high-throughput sequencing data analyses and has produced many of the visualizations for cancer genomic discoveries, including clonal evolution in acute myeloid leukaemia. He currently works as a software developer at the Genome Institute at Washington University in St. Louis, Missouri, USA.

  • Benjamin J. Raphael

    Benjamin J. Raphael develops novel combinatorial and statistical algorithms for the interpretation of genomes. His recent work focuses on structural variation in human and cancer genomes, as well as on network and pathway analyses of somatic mutations in cancer. He is an associate professor in the Department of Computer Science and Director of the Center for Computational Molecular Biology at Brown University, Providence, Rhode Island, USA.

Additional data