A deep learning approach to automate refinement of somatic variant calling from cancer sequencing data

Cancer genomic analysis requires accurate identification of somatic variants in sequencing data. Manual review to refine somatic variant calls is required as a final step after automated processing. However, manual variant refinement is time-consuming, costly, poorly standardized, and non-reproducible. Here, we systematized and standardized somatic variant refinement using a machine learning approach. The final model incorporates 41,000 variants from 440 sequencing cases. This model accurately recapitulated manual refinement labels for three independent testing sets (13,579 variants) and accurately predicted somatic variants confirmed by orthogonal validation sequencing data (212,158 variants). The model improves on manual somatic refinement by reducing bias on calls otherwise subject to high inter-reviewer variability.

Fig. 1: Deep learning and random forest models achieved very high manual review classification performance during tenfold cross-validation.
Fig. 2: Machine learning models and manual reviewers use similar features when making manual review classification decisions.
Fig. 3: Model confidence closely parallels reviewer confidence.
Fig. 4: Machine learning models accurately predict orthogonal validation sequencing results.
Fig. 5: Deep learning model performance on three independent, manually reviewed test sets, with subsequent batch-effect correction by re-training on 5% increments of the test data.
Fig. 6: Manual review misclassifications recovered by the deep learning model.

Data availability

All analysis and preprocessing code, readcount training data, manual review calls, and trained deep learning and random forest models are available in the DeepSVR GitHub repository (https://github.com/griffithlab/DeepSVR). The raw sequencing data are publicly available for most projects included in this study (Supplementary Table 8). Users can access the classifier command line interface via our open-source GitHub repository and can install the package through Bioconda49. After installation, the tool can be used to (1) train and save a deep learning classifier, (2) prepare data for training or classification, and (3) classify data using either the provided deep learning model or a custom model. A walkthrough of this process is available on the DeepSVR GitHub Wiki.


References

  1. Griffith, M. et al. Genome modeling system: a knowledge management platform for genomics. PLoS Comput. Biol. 11, e1004274 (2015).

  2. Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).

  3. Robinson, J. T., Thorvaldsdóttir, H., Wenger, A. M., Zehir, A. & Mesirov, J. P. Variant review with the integrative genomics viewer. Cancer Res. 77, e31–e34 (2017).

  4. Li, M. M. et al. Standards and guidelines for the interpretation and reporting of sequence variants in cancer: a joint consensus recommendation of the Association for Molecular Pathology, American Society of Clinical Oncology, and College of American Pathologists. J. Mol. Diagn. 19, 4–23 (2017).

  5. Roy, S. et al. Standards and guidelines for validating next-generation sequencing bioinformatics pipelines: a joint recommendation of the Association for Molecular Pathology and the College of American Pathologists. J. Mol. Diagn. 20, 4–27 (2017).

  6. Rheinbay, E. et al. Recurrent and functional regulatory mutations in breast cancer. Nature 547, 55–60 (2017).

  7. Ott, P. A. et al. An immunogenic personal neoantigen vaccine for patients with melanoma. Nature 547, 217–221 (2017).

  8. Ma, C. X. et al. A phase I trial of BKM120 (Buparlisib) in combination with fulvestrant in postmenopausal women with estrogen receptor-positive metastatic breast cancer. Clin. Cancer Res. 22, 1583–1591 (2016).

  9. The Cancer Genome Atlas Research Network. Comprehensive, integrative genomic analysis of diffuse lower-grade gliomas. N. Engl. J. Med. 372, 2481–2498 (2015).

  10. Rasche, L. et al. Spatial genomic heterogeneity in multiple myeloma revealed by multi-region sequencing. Nat. Commun. 8, 268 (2017).

  11. Barnell, E. K. et al. Standard operating procedure for somatic variant refinement of sequencing data with paired tumor and normal samples. Genet. Med. https://doi.org/10.1038/s41436-018-0278-z (2018).

  12. Griffith, O. L. et al. Truncating prolactin receptor mutations promote tumor growth in murine estrogen receptor-alpha mammary carcinomas. Cell Rep. 17, 249–260 (2016).

  13. Koboldt, D. C. et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25, 2283–2285 (2009).

  14. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  15. Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–2871 (2009).

  16. Simola, D. F. & Kim, J. Sniper: improved SNP discovery by multiply mapping deep sequenced reads. Genome Biol. 12, R55 (2011).

  17. Saunders, C. T. et al. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 28, 1811–1817 (2012).

  18. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).

  19. Ding, J. et al. Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics 28, 167–175 (2012).

  20. Spinella, J.-F. et al. SNooPer: a machine learning-based method for somatic variant identification from low-pass next-generation sequencing. BMC Genomics 17, 912 (2016).

  21. Strom, S. P. Current practices and guidelines for clinical next-generation sequencing oncology testing. Cancer Biol. Med. 13, 3–11 (2016).

  22. Bettegowda, C. et al. Detection of circulating tumor DNA in early- and late-stage human malignancies. Sci. Transl. Med. 6, 224ra24 (2014).

  23. McHugh, M. L. Interrater reliability: the kappa statistic. Biochem. Med. 22, 276–282 (2012).

  24. Griffith, M. et al. Optimizing cancer genome sequencing and analysis. Cell Syst. 1, 210–223 (2015).

  25. Ellrott, K. et al. Scalable open science approach for mutation calling of tumor exomes using multiple genomic pipelines. Cell Syst. 6, 271–281.e7 (2018).

  26. Griffith, M. et al. CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nat. Genet. 49, 170–174 (2017).

  27. Costello, M. et al. Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res. 41, e67 (2013).

  28. Swamidass, S. J., Bittker, J. A., Bodycombe, N. E., Ryder, S. P. & Clemons, P. A. An economic framework to prioritize confirmatory tests after a high-throughput screen. J. Biomol. Screen. 15, 680–686 (2010).

  29. Settles, B. & Craven, M. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’08), https://doi.org/10.3115/1613715.1613855 (Association for Computational Linguistics, Stroudsburg, PA, USA, 2008).

  30. Settles, B. in Synthesis Lectures on Artificial Intelligence and Machine Learning Vol. 6, 1–114 (Morgan & Claypool, San Rafael, 2012).

  31. Griffith, M. et al. Comprehensive genomic analysis reveals FLT3 activation and a therapeutic strategy for a patient with relapsed adult B-lymphoblastic leukemia. Exp. Hematol. 44, 603–613 (2016).

  32. Krysiak, K. et al. Recurrent somatic mutations affecting B-cell receptor signaling pathway genes in follicular lymphoma. Blood 129, 473–483 (2017).

  33. Klco, J. M. et al. Association between mutation clearance after induction therapy and outcomes in acute myeloid leukemia. JAMA 314, 811–822 (2015).

  34. Uy, G. L. et al. Dynamic changes in the clonal structure of MDS and AML in response to epigenetic therapy. Leukemia 31, 872–881 (2017).

  35. Lesurf, R. et al. Genomic characterization of HER2-positive breast cancer and response to neoadjuvant trastuzumab and chemotherapy: results from the ACOSOG Z1041 (Alliance) trial. Ann. Oncol. 28, 1070–1077 (2017).

  36. Welch, J. S. et al. TP53 and decitabine in acute myeloid leukemia and myelodysplastic syndromes. N. Engl. J. Med. 375, 2023–2036 (2016).

  37. Rohan, T. E. et al. Somatic mutations in benign breast disease tissue and risk of subsequent invasive breast cancer. Br. J. Cancer 118, 1662–1664 (2018).

  38. Mahlokozera, T. et al. Biological and therapeutic implications of multisector sequencing in newly diagnosed glioblastoma. Neuro Oncol. 20, 472–483 (2018).

  39. Wagner, A. H. et al. Recurrent WNT pathway alterations are frequent in relapsed small cell lung cancer. Nat. Commun. 9, 3787 (2018).

  40. Duncavage, E. J. et al. Mutation clearance after transplantation for myelodysplastic syndrome. N. Engl. J. Med. 379, 1028–1041 (2018).

  41. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).

  42. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).

  43. Picard Tools (Broad Institute, 2018); http://broadinstitute.github.io/picard/

  44. Varoquaux, G. et al. Scikit-learn: machine learning without learning the machinery. GetMobile 19, 29–33 (2015).

  45. Nelli, F. Machine Learning with scikit-learn. In Python Data Analytics 2nd edn, Ch. 7, 237–264 (Apress, New York, 2015).

  46. Oliphant, T. E. Python for scientific computing. Comput. Sci. Eng. 9, 10–20 (2007).

  47. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).

  48. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).

  49. Grüning, B. et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat. Methods 15, 475–476 (2018).



Acknowledgements

The authors thank A. Petti, G. Chang, T. Li, C. Miller, L. Trani, R. Lesurf, Z. Skidmore, K. Krysiak, A. Ramu, and F. Gomez for assisting in data assembly. We also acknowledge L. Trani for performing manual review and for valuable discussion on the project. We gratefully acknowledge L. Wartman, J. DiPersio, M. Jacoby, B. Van Tine, R. Fields, B. Tan, S. Chi, D. Gutmann, and T. Ley for sharing genomic data that made this project possible. The authors also thank the patients and their families for their selfless contribution to the advancement of science. Part of this work was performed as part of the Washington University School of Medicine Genomics Tumor Board, which was funded with private research support from the Division of Oncology and the McDonnell Genome Institute. E.K.B. was supported by the National Cancer Institute (T32GM007200 and U01CA209936). T.E.R. received support from the National Institutes of Health/National Cancer Institute (NIH/NCI) (R01CA142942) and the Breast Cancer Research Foundation. Select sample data were funded by the Genomics of AML PPG (T. Ley, PI; P01 CA101937). A.H.W. was supported by the NCI (NIH NCI F32CA206247). B.J.A. was supported by the Siteman Cancer Center. S.J.S. is funded by the National Library of Medicine (NIH NLM R01LM012222 and NIH NLM R01LM012482) and acknowledges support from the Institute for Informatics at Washington University School of Medicine. M.G. is funded by the National Human Genome Research Institute (NIH NHGRI R00HG007940). O.L.G. is funded by the National Cancer Institute (NIH NCI K22CA188163 and NIH NCI U01CA209936).

Author information

B.J.A. designed the study, assembled and cleaned training data, performed feature engineering, designed model architecture, tuned hyperparameters, performed model training and analysis, performed manual review, assembled validation data, wrote code, created figures, and wrote the manuscript. E.K.B. designed the study, performed manual review, performed model training and analysis, performed clinical data analysis, assembled validation data, wrote code, created figures, and wrote the manuscript. P.R. and K.M.C. wrote code, performed manual review, and edited the manuscript. A.H.W. wrote code. T.E.R., R.G., R.U., G.P.D., and T.A.F. shared genomic data that were used in training the model and revised the paper. M.G., E.R.M., S.J.S., and O.L.G. designed the study, supervised the project, and revised the paper.

Correspondence to S. Joshua Swamidass or Obi L. Griffith.

Ethics declarations

Competing interests

R.G. consults for Eli Lilly and Genentech. R.G. has served on boards for, or received honoraria from, EMD Serono, Bristol-Myers Squibb, Genentech, Pfizer, Nektar, Merck, Celgene, Adaptimmune, GlaxoSmithKline, and Phillips Gilmore. All remaining authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 The deep learning model performs well on the held-out test set (n = 13,530 variants), on tenfold cross-validation with a simplified disease feature (n = 27,470 variants), and on tenfold cross-validation with the reviewer feature removed (n = 27,470 variants).

a, ROC curve and reliability diagram for the deep learning model on the held-out test set with all 71 described features. b, ROC curve and reliability diagram for the deep learning model on the tenfold cross-validation set with the cancer type simplified to solid versus liquid tumor status. c, ROC curve and reliability diagram for the deep learning model on the tenfold cross-validation set with the reviewer feature removed.
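The ROC analyses above treat each of the three call classes one-versus-rest. As an illustrative sketch (not the authors' code), the empirical one-vs-rest ROC AUC for a single class can be computed directly as the probability that a positive example receives a higher score than a negative one:

```python
def roc_auc(scores, labels):
    """Empirical one-vs-rest ROC AUC: the probability that a positive
    example outranks a negative example (ties count as 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A model that perfectly separates one class from the rest scores AUC = 1.0;
# a model no better than chance scores ~0.5.
```

This rank-statistic formulation is equivalent to integrating the ROC curve and makes clear why AUC is insensitive to any monotone rescaling of the model outputs.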

Supplementary Figure 2 Deep learning model outputs from the held-out test set (n = 13,530 variants) are well scaled across all predicted classes (ambiguous, fail, and somatic).

The correlation between the model output and the manual review call was assessed for all three classes of calls (ambiguous, fail, and somatic). For each class, model outputs were binned into ten groups ranging from 0.00 to 1.00. For each bin, the numbers of manual review calls that agreed and disagreed with that class were tallied. The ratio of agreement to disagreement was plotted for each bin and compared to the identity line (x = y) using Pearson's correlation coefficient (r).
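The binning-and-correlation procedure described above can be sketched with stdlib-only code (an illustration of the idea, not the authors' implementation; here agreement is summarized as a per-bin fraction):

```python
def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def binned_agreement(probs, agrees, n_bins=10):
    """Bin one class's model outputs into n_bins groups; return the bin
    midpoints and the observed fraction of manual review calls agreeing
    with that class in each non-empty bin."""
    mids, rates = [], []
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        hits = [a for p, a in zip(probs, agrees)
                if lo <= p < hi or (b == n_bins - 1 and p == 1.0)]
        if hits:
            mids.append((lo + hi) / 2)
            rates.append(sum(hits) / len(hits))
    return mids, rates

# For a well-scaled class, the (midpoint, agreement rate) pairs lie near
# the identity line, so pearson_r(mids, rates) approaches 1.
```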

Supplementary Figure 3 The deep learning model performs better than the random forest model on independent sequencing data with manual review labels (n = 4 small-cell lung cancer cases with 2,686 total variants).

a, ROC curves outlining deep learning and random forest model performances on independent sequencing data with manual review labels (n = 4 small-cell lung cancer cases with 2,686 total variants). b, Curves showing batch effect correction after re-training machine learning models with incremental subsets of variants from the independent sequencing data. Independent sequencing data were partitioned in random stratified increments of 5% (from 0–75%) and used to train a new model (increments = 179 variants). The x axis outlines the number of independent variants included in training. The y axis plots the resulting model’s ROC AUC. The ambiguous class shows significant stochasticity due to low representation in the test dataset (n = 15 variants).
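The stratified 5% re-training increments described in panel b can be sketched as follows (a stdlib-only illustration under stated assumptions, not the authors' code; class names and sizes are hypothetical):

```python
import random

def stratified_increments(labels, step=0.05, max_frac=0.75, seed=0):
    """Yield (fraction, index subset) pairs, where each subset is a
    cumulative, class-stratified random sample of the independent test
    variants. In the analysis described above, each subset would be added
    to the original training data, the model re-trained, and ROC AUC
    measured on the held-out remainder to trace batch-effect correction."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for idx in by_class.values():
        rng.shuffle(idx)  # one fixed shuffle, so successive subsets are nested
    n_steps = int(round(max_frac / step))
    for k in range(1, n_steps + 1):
        frac = k * step
        subset = []
        for idx in by_class.values():
            subset.extend(idx[: round(frac * len(idx))])
        yield frac, sorted(subset)
```

Because the shuffle is fixed and each step takes a longer prefix per class, every 5% subset contains the previous one, mirroring incremental re-training rather than independent resampling.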

Supplementary Figure 4 IGV snapshots of clinically relevant variants that were originally labeled as somatic by manual reviewers but were subsequently identified as fail using the deep learning model and manual re-review.

a, Failure due to short inserts and directional artifacts. b, Failure due to multiple mismatches across variant-supporting reads. c, Failure due to multiple variant artifacts. d, Failure due to ends of reads artifact.

Supplementary Figure 5 IGV snapshots of clinically relevant variants that were originally labeled as fail or ambiguous by manual reviewers but were subsequently identified as somatic using the deep learning model and manual re-review.

For each snapshot, the normal tracks and the tumor tracks show aligned reads that were obtained from normal tissue and the tumor tissue, respectively. Variant summaries obtained from CIViC show gene name, variant type, variant coordinates, clinical summary, and relevant clinical action items. a, The original reviewer conservatively labeled both PIK3CA variants as ambiguous owing to multiple mismatches in reads; however, both variants appear to be somatic and occur at known cancer driver hotspots (E542K/E545K). b, The original reviewer failed this variant owing to high levels of variant reads in the normal track; however, given that this variant was derived from a hematologic malignancy, this level of tumor in normal is permissible.

Supplementary Figure 6 Variants that show disagreement between the classifier and the original manual review demonstrate high levels of inter-reviewer variability.

Of the 10.7% of variants that disagree with the original manual review call, 179 variants were sampled to conduct manual re-review. When comparing the classifier call to the re-review consensus call, 41.9% of variants showed high inter-reviewer variability and/or inability to determine a consensus.
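Inter-reviewer variability of the kind described above is conventionally quantified with Cohen's kappa (ref. 23). A minimal stdlib-only sketch for two reviewers' calls on the same variants (illustrative, not the authors' code):

```python
def cohens_kappa(calls_a, calls_b):
    """Cohen's kappa: observed agreement between two reviewers, corrected
    for the agreement expected by chance given each reviewer's call
    frequencies."""
    n = len(calls_a)
    p_obs = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    p_exp = sum((calls_a.count(c) / n) * (calls_b.count(c) / n)
                for c in set(calls_a) | set(calls_b))
    return (p_obs - p_exp) / (1 - p_exp)

# Identical call sets give kappa = 1; agreement at chance level gives ~0.
```

Extending this pairwise statistic across seven reviewers (for example, by averaging over reviewer pairs or using Fleiss' kappa) gives one way to summarize the consensus difficulty reported above.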

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–6

Reporting Summary

Supplementary Table 1

List and description of the 71 features used to train the original machine learning models

Supplementary Table 2

Cross-tabulation performance on the held-out test set parsed by reviewer, disease, normal sequencing depth, and tumor sequencing depth

Supplementary Table 3

Distribution of orthogonal validation calls from the AML31 case and 106 tumor/normal pairs from The Cancer Genome Atlas (TCGA) used to assess model performance

Supplementary Table 4

Distribution of manual review calls from the 37 cases used to assess model performance by independent sequencing data with manual review

Supplementary Table 5

Overlap between discrepant variants and CIViC annotations

Supplementary Table 6

Manual re-review of 179 variants by seven reviewers to develop a consensus call

Supplementary Table 7

Guide to development of a consensus label based on manual review calls from seven reviewers

Supplementary Table 8

Data availability of sequencing results for all cases used in model development

Supplementary Table 9

Sequence Ontology IDs and descriptions used to identify variants within the CIViC database that can be analyzed on DNA sequencing platforms
