Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection

Journal name: Nature Methods
Volume: 12
Pages: 623–630
DOI: 10.1038/nmeth.3407

Abstract

The detection of somatic mutations from cancer genome sequences is key to understanding the genetic basis of disease progression, patient survival and response to therapy. Benchmarking is needed for tool assessment and improvement but is complicated by a lack of gold standards, by extensive resource requirements and by difficulties in sharing personal genomic information. To resolve these issues, we launched the ICGC-TCGA DREAM Somatic Mutation Calling Challenge, a crowdsourced benchmark of somatic mutation detection algorithms. Here we report the BAMSurgeon tool for simulating cancer genomes and the results of 248 analyses of three in silico tumors created with it. Different algorithms exhibit characteristic error profiles, and, intriguingly, false positives show a trinucleotide profile very similar to one found in human tumors. Although the three simulated tumors differ in sequence contamination (deviation from normal cell sequence) and in subclonality, an ensemble of pipelines outperforms the best individual pipeline in all cases. BAMSurgeon is available at https://github.com/adamewing/bamsurgeon/.

Figures

  1. BAMSurgeon simulates tumor genome sequences.

    (a) Overview of SNV spike-in. (1) A list of positions is selected in a BAM alignment. (2) The desired base change is made at a user-specified variant allele fraction (VAF) in reads overlapping the chosen positions. (3) Altered reads are remapped to the reference genome. (4) Realigned reads replace corresponding unmodified reads in the original BAM. (b) Overview of workflow for creating synthetic tumor-normal pairs. Starting with a high-depth mate-pair BAM alignment, SNVs and structural variants (SVs) are spiked in to yield a 'burn-in' BAM. Paired reads from this BAM are randomly partitioned into a normal BAM and a pre-tumor BAM that receives spike-ins via BAMSurgeon to yield the synthetic tumor and a 'truth' VCF file containing spiked-in positions. Mutation predictions are evaluated against this ground truth. (c,d) To test the robustness of BAMSurgeon with respect to changes in aligner (c) and cell line (d), we compared the rank of RADIA, MuTect, SomaticSniper and Strelka on two new tumor-normal data sets: one with an alternative aligner, NovoAlign, and the other on an alternative cell line, HCC1954. RADIA and SomaticSniper retained the top two positions, whereas MuTect and Strelka remained third and fourth, independently of aligner and cell line. (e) Summary of the three in silico tumors described here.

  2. Overview of the SMC-DNA Challenge data set.

    (a) Precision-recall plot for all IS1 entries. Colors represent individual teams, and the best submission (top F-score) from each team is circled. The inset highlights top-ranking submissions. (b) Performance of an ensemble somatic SNV predictor. The ensemble was generated by taking the majority vote of calls made by a subset of the top-performing IS1 submissions. At each rank k, the gray dot indicates the performance of the ensemble of the algorithms ranked 1 to k, and the colored dot indicates the performance of the individual algorithm at that rank.

  3. Effects of algorithm tuning.

    (a) Performance of groups on the training data set and on the held-out portion of the genome (~10%) is tightly correlated (Spearman's ρ = 0.98) and falls near the plotted y = x line for all three tumors. (b) F-score, precision and recall of all submissions made by each team on IS1 are plotted in the order they were submitted. Teams were ranked by the F-score of their best submissions. Color coding as in a. The horizontal red lines give the F-score, precision and recall of the best-scoring algorithm submitted by the Challenge administrators, SomaticSniper. A clear improvement in recall, precision and F-score can be seen as participants adjusted parameters over the course of the challenge. Bar width corresponds to the number of submissions made by each team. (c) For each tumor, each team's initial (“naive”) and final (“optimized”) submissions are shown, with dot size and color indicating overall ranking within these two groups. An “X” indicates that a team did not submit to a specific tumor (or changed the team name). Algorithm rankings were moderately changed by parameterization. (d) For each tumor, we assessed how much each team was able to improve its performance. The color scale represents bins of F-score improvement.

  4. Effects of genomic localization.

    (a) Box plots show the median (line), interquartile range (IQR; box) and ±1.5× IQR (whiskers). For IS1, F-scores were highest in coding and untranslated regions and lowest in introns and intergenic regions (P = 6.61 × 10⁻⁷; Friedman rank-sum test). (b) Rows show individual submissions to IS1; columns show genes with nonsynonymous SNV calls. Green shading means a call was made. The upper bar plot indicates the fraction of submissions agreeing on these calls, and the color indicates whether these are FPs or TPs. The bar plot on the right gives the F-score of the submission over the whole genome. The right-hand side covariate shows the submitting team. All TPs are shown, along with a subset of FPs.

  5. Characteristics of prediction errors.

    (a–j) Random Forests assess the importance of 12 genomic variables on SNV prediction accuracy (Online Methods). Random Forest analysis of FPs (a,c,e,g,i) and FNs (b,d,f,h,j) for IS1 (a,b) and IS2 (c,d) as well as for all three tumors using default settings with widely used algorithms MuTect (e,f), SomaticSniper (g,h) and Strelka (i,j). Dot size reflects mean change in accuracy caused by removing this variable from the model. Color reflects the directional effect of each variable (red for increasing metric values associated with increased error; blue for decreasing values associated with increased error; black for factors). Background shading indicates the accuracy of the model fit (see bar at bottom for scale). Each row represents a single set of predictions for a given in silico tumor, and each column shows a genomic variable. SNP, single-nucleotide polymorphism.

  6. Trinucleotide error profiles.

    Proportions of FP SNVs are normalized to the number observed in the entire genome (top) binned by trinucleotide context (bottom) for IS1–IS3.
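
The majority-vote ensemble described for Figure 2b can be illustrated with a minimal Python sketch. This is not the Challenge's own code: the function names, the representation of a call as a (chromosome, position, reference, alternate) tuple, the toy call sets and the strict-majority tie rule (a call must appear in more than half of the submissions) are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(call_sets):
    """Return the calls present in a strict majority of the given call sets.

    Assumption: ties (exactly half of the submissions) do not count as a majority.
    """
    if not call_sets:
        return set()
    # Count, for every variant, how many submissions called it.
    counts = Counter(variant for calls in call_sets for variant in set(calls))
    threshold = len(call_sets) / 2
    return {variant for variant, n in counts.items() if n > threshold}


def ensemble_at_rank(ranked_call_sets, k):
    """Ensemble of the submissions ranked 1 to k (the list is ordered best first)."""
    return majority_vote(ranked_call_sets[:k])


# Hypothetical toy submissions, ordered from best to worst F-score.
ranked = [
    {("1", 100, "A", "T"), ("1", 200, "C", "G")},
    {("1", 100, "A", "T"), ("2", 50, "G", "A")},
    {("1", 100, "A", "T"), ("1", 200, "C", "G"), ("3", 7, "T", "C")},
]

ensemble = ensemble_at_rank(ranked, 3)
```

With these toy inputs, the rank-3 ensemble keeps the two calls supported by at least two of the three submissions and drops the calls made by a single submission only.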

Author information

  1. These authors contributed equally to this work.

    • Adam D Ewing,
    • Kathleen E Houlahan &
    • Yin Hu
  2. These authors jointly supervised this work.

    • Adam A Margolin,
    • Joshua M Stuart &
    • Paul C Boutros

Affiliations

  1. Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, California, USA.

    • Adam D Ewing,
    • Kyle Ellrott,
    • Amie Radenbaugh,
    • David Haussler &
    • Joshua M Stuart
  2. Mater Research Institute, University of Queensland, Woolloongabba, Queensland, Australia.

    • Adam D Ewing
  3. Informatics and Biocomputing Program, Ontario Institute for Cancer Research, Toronto, Ontario, Canada.

    • Kathleen E Houlahan,
    • Cristian Caloian,
    • Takafumi N Yamaguchi,
    • Christine P'ng,
    • Daryl Waggott,
    • Veronica Y Sabelnykova,
    • Cheryl C K Lau &
    • Paul C Boutros
  4. Sage Bionetworks, Seattle, Washington, USA.

    • Yin Hu,
    • J Christopher Bare,
    • Justin Guinney,
    • Michael R Kellen,
    • Thea C Norman,
    • Stephen H Friend &
    • Adam A Margolin
  5. IBM Computational Biology Center, T.J. Watson Research Center, Yorktown Heights, New York, USA.

    • Gustavo Stolovitzky
  6. Computational Biology Program, Oregon Health & Science University, Portland, Oregon, USA.

    • Adam A Margolin
  7. Department of Biomedical Engineering, Oregon Health & Science University, Portland, Oregon, USA.

    • Adam A Margolin
  8. Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada.

    • Paul C Boutros
  9. Department of Pharmacology & Toxicology, University of Toronto, Toronto, Ontario, Canada.

    • Paul C Boutros
  10. Team Wang Wheeler HGSC, Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA.

    • Liu Xi,
    • Ninad Dewal &
    • David Wheeler
  11. Team Wang Wheeler HGSC, Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA.

    • Yu Fan &
    • Wenyi Wang
  12. Team Wang Wheeler HGSC, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA.

    • David Wheeler
  13. Team LoFreq Somatic GIS, Genome Institute of Singapore, Computational and Systems Biology, Singapore.

    • Andreas Wilm,
    • Grace Hui Ting,
    • Chenhao Li,
    • Denis Bertrand &
    • Niranjan Nagarajan
  14. Team DMUT, Center for Biomedical Informatics and Information Technology, National Cancer Institute, National Institutes of Health (NIH), Bethesda, Maryland, USA.

    • Qing-Rong Chen,
    • Chih-Hao Hsu,
    • Ying Hu,
    • Chunhua Yan,
    • Warren Kibbe &
    • Daoud Meerzaman
  15. Team Broad, The Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA.

    • Kristian Cibulskis,
    • Mara Rosenberg,
    • Louis Bergelson &
    • Adam Kiezun
  16. Team SLC Platform, Synergie Lyon Cancer Foundation, Centre Léon Bérard, Lyon, France.

    • Anne-Sophie Sertier,
    • Anthony Ferrari &
    • Laurie Tonton
  17. Team Virmid, Bioinformatics and Systems Biology, University of California, San Diego, La Jolla, California, USA.

    • Kunal Bhutani
  18. Team Shimmer, National Human Genome Research Institute, NIH, Bethesda, Maryland, USA.

    • Nancy F Hansen
  19. Team 2014, Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, DC, USA.

    • Difei Wang
  20. Team 2014, Innovation Center for Biomedical Informatics, Georgetown University Medical Center, Washington, DC, USA.

    • Difei Wang &
    • Lei Song
  21. Team AstraZeneca, AstraZeneca, Waltham, Massachusetts, USA.

    • Zhongwu Lai
  22. Team WEHI-Subread, Department of Medical Biology, The University of Melbourne, Melbourne, Victoria, Australia.

    • Yang Liao
  23. Team WEHI-Subread, Department of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia.

    • Wei Shi
  24. Team Germmatic, Functional Genomics Node (INB) at Príncipe Felipe Research Center (CIPF), Valencia, Spain.

    • José Carbonell-Caballero &
    • Joaquín Dopazo

Consortia

  1. ICGC-TCGA DREAM Somatic Mutation Calling Challenge participants

    • Liu Xi,
    • Ninad Dewal,
    • Yu Fan,
    • Wenyi Wang,
    • David Wheeler,
    • Andreas Wilm,
    • Grace Hui Ting,
    • Chenhao Li,
    • Denis Bertrand,
    • Niranjan Nagarajan,
    • Qing-Rong Chen,
    • Chih-Hao Hsu,
    • Ying Hu,
    • Chunhua Yan,
    • Warren Kibbe,
    • Daoud Meerzaman,
    • Kristian Cibulskis,
    • Mara Rosenberg,
    • Louis Bergelson,
    • Adam Kiezun,
    • Amie Radenbaugh,
    • Anne-Sophie Sertier,
    • Anthony Ferrari,
    • Laurie Tonton,
    • Kunal Bhutani,
    • Nancy F Hansen,
    • Difei Wang,
    • Lei Song,
    • Zhongwu Lai,
    • Yang Liao,
    • Wei Shi,
    • José Carbonell-Caballero,
    • Joaquín Dopazo,
    • Cheryl C K Lau &
    • Justin Guinney

Contributions

P.C.B., J.M.S. and A.A.M. initiated the project. A.D.E. created BAMSurgeon. A.D.E., K.E.H., Y.H., K.E., C.C., J.C.B., C.P., M.R.K., T.C.N., G.S., A.A.M., J.M.S. and P.C.B. created the ICGC-TCGA DREAM Somatic Mutation Calling Challenge. A.D.E., K.E.H., Y.H., C.C. and T.N.Y. created data sets and analyzed sequencing data. A.D.E., K.E.H., Y.H., D.W., V.Y.S. and P.C.B. were responsible for statistical modeling. Research was supervised by D.H., S.H.F., G.S., A.A.M., J.M.S. and P.C.B. The first draft of the manuscript was written by A.D.E., K.E.H., Y.H. and P.C.B., extensively edited by A.A.M. and J.M.S., and approved by all authors.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (20,871 KB)

    Supplementary Figures 1–33 and Supplementary Notes 1 and 2

Excel files

  1. Supplementary Table 1: Characteristics of Tumour and Normal BAM Files (4 KB)

    A summary of the characteristics of the tumour and normal BAM files, including coverage, number of reads, and percent of positions with greater than 20x coverage.

  2. Supplementary Table 2: Performance of Submitted Algorithms (25 KB)

    List of all entries to the SMC Challenge along with the team name, number of predicted SNVs, number of true positives, number of false positives, recall, precision and F-score.

  3. Supplementary Table 3: Effect and Significance of each Chromosome on F-score, Precision and Recall (9 KB)

    Effect, confidence interval, p-value and FDR adjusted p-value from two-way ANOVA on F-score, precision and recall, separately.

  4. Supplementary Table 4: Methods to Generate Values for Twelve Genomic Variables (109 KB)

    List of methods used to generate reference allele counts, non-reference allele counts, base quality, tumour coverage, normal coverage, mapping quality, median read position, homopolymer rate, GC content, trinucleotide sequence, genomic element and distance to nearest germline SNP.

  5. Supplementary Table 5: Comparison of Genomic Variables across Chromosomes (196 KB)

    Median and standard deviation of all eleven continuous genomic variables for each chromosome along with the Bonferroni adjusted p-value comparing the values of each chromosome to all other chromosomes for true positives. Median and standard deviation of genomic variables on chromosome 21 and Bonferroni adjusted p-value comparing the values on chromosome 21 to the rest of the genome for false positives. A bias in submission 2319000 towards germline SNPs was detected; therefore, false-positive calls made by this algorithm were omitted for the purpose of this analysis only.

  6. Supplementary Table 6: Number of Observations used in RandomForest Model (9 KB)

    The number of observations used for each submission and call type (false positives vs false negatives) in individual RandomForest models.

  7. Supplementary Table 7: Correlation of Genomic Variables and Trinucleotide Abundances (6 KB)

    Spearman correlation values of ten genomic variables against trinucleotide abundances in false positives.
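
The recall, precision and F-score columns listed for Supplementary Table 2 can be derived from a set of predicted SNVs and a set of spiked-in truth positions. The Python sketch below shows one way to do so. It assumes that the F-score is the balanced F1 (harmonic mean of precision and recall), which the text here does not state explicitly; the function name and the toy positions are hypothetical.

```python
def score_calls(predicted, truth):
    """Score a set of predicted SNVs against a set of true (spiked-in) SNVs."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # true positives: predicted and spiked in
    fp = len(predicted - truth)   # false positives: predicted but not spiked in
    fn = len(truth - predicted)   # false negatives: spiked in but missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    # Assumption: F-score taken as the harmonic mean of precision and recall (F1).
    if precision + recall > 0:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f_score": f_score}


# Hypothetical toy data: (chromosome, position) pairs.
truth = {("1", 100), ("1", 200), ("2", 50), ("3", 7)}
predicted = {("1", 100), ("2", 50), ("4", 999)}

scores = score_calls(predicted, truth)
```

In this toy case two of the three predictions are correct and two of the four true positions are recovered, so precision is 2/3 and recall is 1/2.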

Zip files

  1. Supplementary Data 1 (1,967 KB)

    Tool Chains of Submitted Algorithms

  2. Supplementary Data 2 (437 KB)

    Archive containing positions and alleles of spiked-in mutations for IS1-3 synthetic tumours in VCF format. Retrieved from https://www.synapse.org/#!Synapse:syn2177211.
