The detection of somatic mutations from cancer genome sequences is key to understanding the genetic basis of disease progression, patient survival and response to therapy. Benchmarking is needed for tool assessment and improvement but is complicated by a lack of gold standards, by extensive resource requirements and by difficulties in sharing personal genomic information. To resolve these issues, we launched the ICGC-TCGA DREAM Somatic Mutation Calling Challenge, a crowdsourced benchmark of somatic mutation detection algorithms. Here we report the BAMSurgeon tool for simulating cancer genomes and the results of 248 analyses of three in silico tumors created with it. Different algorithms exhibit characteristic error profiles, and, intriguingly, false positives show a trinucleotide profile very similar to one found in human tumors. Although the three simulated tumors differ in sequence contamination (deviation from normal cell sequence) and in subclonality, an ensemble of pipelines outperforms the best individual pipeline in all cases. BAMSurgeon is available at https://github.com/adamewing/bamsurgeon/.
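As a rough illustration of the ensemble idea, the sketch below combines the call sets of several pipelines by simple majority vote. The voting rule, the function name `majority_vote` and the toy calls are assumptions made for illustration; the abstract does not specify how the ensemble was constructed.

```python
from collections import Counter


def majority_vote(call_sets, min_votes=None):
    """Return the sites called by at least `min_votes` pipelines.

    call_sets: iterable of collections of hashable sites, e.g. (chrom, pos).
    min_votes: vote threshold; defaults to a strict majority of pipelines.
    """
    call_sets = [set(calls) for calls in call_sets]
    if min_votes is None:
        min_votes = len(call_sets) // 2 + 1
    # Each pipeline contributes at most one vote per site.
    votes = Counter(site for calls in call_sets for site in calls)
    return {site for site, n in votes.items() if n >= min_votes}


# Hypothetical calls from three pipelines (toy data, not challenge results).
pipeline_a = {("chr1", 100), ("chr1", 200), ("chr2", 50)}
pipeline_b = {("chr1", 200), ("chr2", 50), ("chr2", 75)}
pipeline_c = {("chr2", 50), ("chr2", 75), ("chr3", 10)}

ensemble = majority_vote([pipeline_a, pipeline_b, pipeline_c])
print(sorted(ensemble))
```

Sites supported by only one pipeline are dropped, which is one plausible way for an ensemble to suppress pipeline-specific false positives.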
- Supplementary Text and Figures (20,871 KB)
Supplementary Figures 1–33 and Supplementary Notes 1 and 2
- Supplementary Table 1: Characteristics of Tumour and Normal BAM Files (4 KB)
A summary of the characteristics of the tumour and normal BAM files, including coverage, number of reads, and percent of positions with greater than 20x coverage.
- Supplementary Table 2: Performance of Submitted Algorithms (25 KB)
List of all entries to the SMC Challenge along with the team name, number of predicted SNVs, number of true positives, number of false positives, recall, precision and F-score.
- Supplementary Table 3: Effect and Significance of each Chromosome on F-score, Precision and Recall (9 KB)
Effect, confidence interval, p-value and FDR-adjusted p-value from a two-way ANOVA, run separately on F-score, precision and recall.
- Supplementary Table 4: Methods to Generate Values for Twelve Genomic Variables (109 KB)
List of methods used to generate reference allele counts, non-reference allele counts, base quality, tumour coverage, normal coverage, mapping quality, median read position, homopolymer rate, GC content, trinucleotide sequence, genomic element and distance to nearest germline SNP.
- Supplementary Table 5: Comparison of Genomic Variables across Chromosomes (196 KB)
Median and standard deviation of all eleven continuous genomic variables for each chromosome, along with the Bonferroni-adjusted p-value comparing the values of each chromosome to all other chromosomes for true positives. Median and standard deviation of genomic variables on chromosome 21, and the Bonferroni-adjusted p-value comparing the values on chromosome 21 to the rest of the genome for false positives. A bias towards germline SNPs was detected in submission 2319000; the false positives called by this algorithm were therefore omitted, for the purposes of this analysis only.
- Supplementary Table 6: Number of Observations used in RandomForest Model (9 KB)
The number of observations used for each submission and call type (false positives vs false negatives) in individual RandomForest models.
- Supplementary Table 7: Correlation of Genomic Variables and Trinucleotide Abundances (6 KB)
Spearman correlation values of ten genomic variables against trinucleotide abundances in false positives.
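The recall, precision and F-score listed in Supplementary Table 2 can be computed from a set of predicted SNVs and a set of known true SNVs. The following is a minimal sketch, assuming that the F-score is the harmonic mean of precision and recall (F1) and that a call is identified by a (chromosome, position, ref, alt) tuple. The function name `score_calls` and the toy data are hypothetical and are not taken from the challenge's scoring code.

```python
def score_calls(predicted, truth):
    """Score predicted SNV calls against a truth set.

    Both arguments are collections of hashable calls, here assumed to be
    (chrom, pos, ref, alt) tuples.
    """
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # called and true
    fp = len(predicted - truth)   # called but not true
    fn = len(truth - predicted)   # true but missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    denom = precision + recall
    # Assumption: F-score is F1, the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f_score": f_score}


# Toy example: four spiked-in mutations, four calls, three of them correct.
truth = {("chr1", 100, "A", "T"), ("chr1", 200, "C", "G"),
         ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C")}
predicted = {("chr1", 100, "A", "T"), ("chr1", 200, "C", "G"),
             ("chr2", 50, "G", "A"), ("chr3", 10, "A", "C")}

result = score_calls(predicted, truth)
print(result)
```

Because the truth set is known exactly for a simulated tumor, false negatives can be counted directly, which is not possible when only a real tumor without a gold standard is available.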