  • Technical Report

A deep learning approach to automate refinement of somatic variant calling from cancer sequencing data

Abstract

Cancer genomic analysis requires accurate identification of somatic variants in sequencing data. Manual review to refine somatic variant calls is required as a final step after automated processing. However, manual variant refinement is time-consuming, costly, poorly standardized, and non-reproducible. Here, we systematized and standardized somatic variant refinement using a machine learning approach. The final model incorporates 41,000 variants from 440 sequencing cases. This model accurately recapitulated manual refinement labels for three independent testing sets (13,579 variants) and accurately predicted somatic variants confirmed by orthogonal validation sequencing data (212,158 variants). The model improves on manual somatic refinement by reducing bias on calls otherwise subject to high inter-reviewer variability.

Fig. 1: Deep learning and random forest models achieved very high manual review classification performance during tenfold cross-validation.
Fig. 2: Machine learning models and manual reviewers use similar features when making manual review classification decisions.
Fig. 3: Model confidence closely parallels reviewer confidence.
Fig. 4: Machine learning models accurately predict orthogonal validation sequencing results.
Fig. 5: Deep learning model performance on three independent test sets validated with manual review, and subsequent batch-effect correction by re-training with 5% increments of the test data.
Fig. 6: Manual review misclassifications recovered by the deep learning model.

Data availability

All analysis and preprocessing code, readcount training data, manual review calls, and trained deep learning and random forest models are available in the DeepSVR GitHub repository (https://github.com/griffithlab/DeepSVR). The raw sequencing data are publicly available for most projects included in this study (Supplementary Table 8). Users can access the classifier's command-line interface via our open-source GitHub repository and can install the package through Bioconda (ref. 49). After installation, the tool can be used to (1) train and save a deep learning classifier, (2) prepare data for training a classifier or for classification, and (3) classify data using either the provided deep learning model or a custom model. A walkthrough of this process is available on the DeepSVR GitHub Wiki.
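
As a rough illustration of the train/prepare/classify workflow described above, the sketch below fits a small three-class model (somatic, ambiguous, fail) to a tabular file of readcount-derived features with scikit-learn. The file name, column names, and network size are hypothetical placeholders and do not reflect the actual DeepSVR command-line interface or model architecture; consult the GitHub Wiki for the supported commands.

```python
# Illustrative sketch only; not the DeepSVR interface.
# Assumes a hypothetical tab-separated table with numeric readcount
# features and a manual review label column named "call".
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("training_variants.tsv", sep="\t")   # hypothetical path
X = df.drop(columns=["call"]).to_numpy(dtype=float)   # readcount-derived features
y = df["call"].to_numpy()                             # 'somatic', 'ambiguous', 'fail'

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# A small feed-forward network stands in for the published deep learning model.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```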

References

  1. Griffith, M. et al. Genome modeling system: a knowledge management platform for genomics. PLoS Comput. Biol. 11, e1004274 (2015).

  2. Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).

  3. Robinson, J. T., Thorvaldsdóttir, H., Wenger, A. M., Zehir, A. & Mesirov, J. P. Variant review with the integrative genomics viewer. Cancer Res. 77, e31–e34 (2017).

  4. Li, M. M. et al. Standards and guidelines for the interpretation and reporting of sequence variants in cancer: a joint consensus recommendation of the Association for Molecular Pathology, American Society of Clinical Oncology, and College of American Pathologists. J. Mol. Diagn. 19, 4–23 (2017).

  5. Roy, S. et al. Standards and guidelines for validating next-generation sequencing bioinformatics pipelines: a joint recommendation of the Association for Molecular Pathology and the College of American Pathologists. J. Mol. Diagn. 20, 4–27 (2017).

  6. Rheinbay, E. et al. Recurrent and functional regulatory mutations in breast cancer. Nature 547, 55–60 (2017).

  7. Ott, P. A. et al. An immunogenic personal neoantigen vaccine for patients with melanoma. Nature 547, 217–221 (2017).

  8. Ma, C. X. et al. A phase I trial of BKM120 (Buparlisib) in combination with fulvestrant in postmenopausal women with estrogen receptor-positive metastatic breast cancer. Clin. Cancer Res. 22, 1583–1591 (2016).

  9. The Cancer Genome Atlas Research Network. Comprehensive, integrative genomic analysis of diffuse lower-grade gliomas. N. Engl. J. Med. 372, 2481–2498 (2015).

  10. Rasche, L. et al. Spatial genomic heterogeneity in multiple myeloma revealed by multi-region sequencing. Nat. Commun. 8, 268 (2017).

  11. Barnell, E. K. et al. Standard operating procedure for somatic variant refinement of sequencing data with paired tumor and normal samples. Genet. Med. https://doi.org/10.1038/s41436-018-0278-z (2018).

  12. Griffith, O. L. et al. Truncating prolactin receptor mutations promote tumor growth in murine estrogen receptor-alpha mammary carcinomas. Cell Rep. 17, 249–260 (2016).

  13. Koboldt, D. C. et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25, 2283–2285 (2009).

  14. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  15. Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–2871 (2009).

  16. Simola, D. F. & Kim, J. Sniper: improved SNP discovery by multiply mapping deep sequenced reads. Genome Biol. 12, R55 (2011).

  17. Saunders, C. T. et al. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 28, 1811–1817 (2012).

  18. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).

  19. Ding, J. et al. Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics 28, 167–175 (2012).

  20. Spinella, J.-F. et al. SNooPer: a machine learning-based method for somatic variant identification from low-pass next-generation sequencing. BMC Genomics 17, 912 (2016).

  21. Strom, S. P. Current practices and guidelines for clinical next-generation sequencing oncology testing. Cancer Biol. Med. 13, 3–11 (2016).

  22. Bettegowda, C. et al. Detection of circulating tumor DNA in early- and late-stage human malignancies. Sci. Transl. Med. 6, 224ra24 (2014).

  23. McHugh, M. L. Interrater reliability: the kappa statistic. Biochem. Med. 22, 276–282 (2012).

  24. Griffith, M. et al. Optimizing cancer genome sequencing and analysis. Cell Syst. 1, 210–223 (2015).

  25. Ellrott, K. et al. Scalable open science approach for mutation calling of tumor exomes using multiple genomic pipelines. Cell Syst. 6, 271–281.e7 (2018).

  26. Griffith, M. et al. CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nat. Genet. 49, 170–174 (2017).

  27. Costello, M. et al. Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res. 41, e67 (2013).

  28. Swamidass, S. J., Bittker, J. A., Bodycombe, N. E., Ryder, S. P. & Clemons, P. A. An economic framework to prioritize confirmatory tests after a high-throughput screen. J. Biomol. Screen. 15, 680–686 (2010).

  29. Settles, B. & Craven, M. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing - EMNLP ’08, https://doi.org/10.3115/1613715.1613855 (Association for Computational Linguistics, Stroudsburg, PA, USA; 2008).

  30. Settles, B. in Synthesis Lectures on Artificial Intelligence and Machine Learning Vol. 6 1–114 (Morgan & Claypool, San Rafael, 2012).

  31. Griffith, M. et al. Comprehensive genomic analysis reveals FLT3 activation and a therapeutic strategy for a patient with relapsed adult B-lymphoblastic leukemia. Exp. Hematol. 44, 603–613 (2016).

  32. Krysiak, K. et al. Recurrent somatic mutations affecting B-cell receptor signaling pathway genes in follicular lymphoma. Blood 129, 473–483 (2017).

  33. Klco, J. M. et al. Association between mutation clearance after induction therapy and outcomes in acute myeloid leukemia. JAMA 314, 811–822 (2015).

  34. Uy, G. L. et al. Dynamic changes in the clonal structure of MDS and AML in response to epigenetic therapy. Leukemia 31, 872–881 (2017).

  35. Lesurf, R. et al. Genomic characterization of HER2-positive breast cancer and response to neoadjuvant trastuzumab and chemotherapy-results from the ACOSOG Z1041 (Alliance) trial. Ann. Oncol. 28, 1070–1077 (2017).

  36. Welch, J. S. et al. TP53 and decitabine in acute myeloid leukemia and myelodysplastic syndromes. N. Engl. J. Med. 375, 2023–2036 (2016).

  37. Rohan, T. E. et al. Somatic mutations in benign breast disease tissue and risk of subsequent invasive breast cancer. Br. J. Cancer 118, 1662–1664 (2018).

  38. Mahlokozera, T. et al. Biological and therapeutic implications of multisector sequencing in newly diagnosed glioblastoma. Neuro. Oncol. 20, 472–483 (2018).

  39. Wagner, A. H. et al. Recurrent WNT pathway alterations are frequent in relapsed small cell lung cancer. Nat. Commun. 9, 3787 (2018).

  40. Duncavage, E. J. et al. Mutation clearance after transplantation for myelodysplastic syndrome. N. Engl. J. Med. 379, 1028–1041 (2018).

  41. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).

  42. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).

  43. Picard Tools (Broad Institute, 2018); http://broadinstitute.github.io/picard/

  44. Varoquaux, G. et al. Scikit-learn: machine learning without learning the machinery. GetMobile 19, 29–33 (2015).

  45. Nelli, F. Machine Learning with scikit-learn. In Python Data Analytics 2nd edn, Ch. 7, 237–264 (Apress, New York, 2015).

  46. Oliphant, T. E. Python for scientific computing. Comput. Sci. Eng. 9, 10–20 (2007).

  47. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).

  48. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).

  49. Grüning, B. et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat. Methods 15, 475–476 (2018).

Acknowledgements

The authors thank A. Petti, G. Chang, T. Li, C. Miller, L. Trani, R. Lesurf, Z. Skidmore, K. Krysiak, A. Ramu, and F. Gomez for assisting in data assembly. We also acknowledge L. Trani for performing manual review and for valuable discussion on the project. We gratefully acknowledge L. Wartman, J. DiPersio, M. Jacoby, B. Van Tine, R. Fields, B. Tan, S. Chi, D. Gutmann, and T. Ley for sharing genomic data that made this project possible. The authors also thank the patients and their families for their selfless contribution to the advancement of science. Part of this work was performed as part of the Washington University School of Medicine Genomics Tumor Board, which was funded with private research support from the Division of Oncology and the McDonnell Genome Institute. E.K.B. was supported by the National Cancer Institute (T32GM007200 and U01CA209936). T.E.R. received support from the National Institutes of Health/National Cancer Institute (NIH/NCI) (R01CA142942) and the Breast Cancer Research Foundation. Select sample data was funded by the Genomics of AML PPG (T. Ley, PI, P01 CA101937). A.H.W. was supported by the NCI (NIH NCI F32CA206247). B.J.A. was supported by the Siteman Cancer Center. S. Swamidass is funded by the National Library of Medicine (NIH NLM R01LM012222 and NIH NLM R01LM012482) and acknowledges support from the Institute for Informatics at Washington University School of Medicine. M.G. is funded by the National Human Genome Research Institute (NIH NHGRI R00HG007940). O.L.G. is funded by the National Cancer Institute (NIH NCI K22CA188163 and NIH NCI U01CA209936).

Author information

Contributions

B.J.A. designed the study, assembled and cleaned training data, performed feature engineering, designed model architecture, tuned hyperparameters, performed model training and analysis, performed manual review, assembled validation data, wrote code, created figures, and wrote the manuscript. E.K.B. designed the study, performed manual review, performed model training and analysis, performed clinical data analysis, assembled validation data, wrote code, created figures, and wrote the manuscript. P.R. and K.M.C. wrote code, performed manual review, and edited the manuscript. A.H.W. wrote code. T.E.R., R.G., R.U., G.P.D., and T.A.F. shared genomic data that were used in training the model and revised the paper. M.G., E.R.M., S.J.S., and O.L.G. designed the study, supervised the project, and revised the paper.

Corresponding authors

Correspondence to S. Joshua Swamidass or Obi L. Griffith.

Ethics declarations

Competing interests

R.G. consults for Eli Lilly and Genentech. R.G. serves on boards and/or receives honoraria from EMD Serono, Bristol-Myers Squibb, Genentech, Pfizer, Nektar, Merck, Celgene, Adaptimmune, GlaxoSmithKline, and Phillips Gilmore. All remaining authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 The deep learning model performs well on the held-out test set (n = 13,530 variants), on tenfold cross-validation with a simplified disease feature (n = 27,470 variants), and on tenfold cross-validation with the reviewer feature removed (n = 27,470 variants).

a, ROC curve and reliability diagram performance of the deep learning model on the held-out test set with all 71 described features. b, ROC curve and reliability diagram performance of the deep learning model on the tenfold cross-validation set with the cancer type simplified to solid versus liquid tumor status. c, ROC curve and reliability diagram performance of the deep learning model on the tenfold cross-validation set with the reviewer feature removed.

Supplementary Figure 2 Deep learning model outputs from the held-out test set (n = 13,530 variants) are well scaled across all predicted classes (ambiguous, fail, and somatic).

The correlation between the model output and the manual review call was assessed for all three classes of calls (ambiguous, fail, and somatic). For each class, model outputs were binned into ten groups ranging from 0.00 to 1.00. For each bin, the total numbers of manual review calls that agree and disagree with the individual class were plotted. The ratio of agreement to disagreement was plotted for each bin and compared to the identity line (x = y) using Pearson's correlation coefficient (r).
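
For readers who want to reproduce this style of calibration check, a minimal sketch follows. It assumes `probs` holds one class's predicted probabilities and `labels` marks whether the manual review call matches that class; the exact binning and agreement metric used for the published figure may differ.

```python
# Minimal calibration sketch under the assumptions stated above.
import numpy as np
from scipy.stats import pearsonr

def binned_calibration(probs, labels, n_bins=10):
    """probs, labels: 1D numpy arrays; labels is 1 where manual review agrees with the class."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    mids, observed = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & (probs <= hi) if hi == 1.0 else (probs >= lo) & (probs < hi)
        if in_bin.sum() == 0:
            continue                                # skip empty bins
        mids.append((lo + hi) / 2)                  # expected frequency (bin midpoint)
        observed.append(labels[in_bin].mean())      # observed agreement frequency
    r, _ = pearsonr(mids, observed)                 # agreement with the identity line
    return np.array(mids), np.array(observed), r

# Toy usage with synthetic, roughly calibrated scores.
rng = np.random.default_rng(0)
p = rng.uniform(size=5000)
y = (rng.uniform(size=5000) < p).astype(int)
_, _, r = binned_calibration(p, y)
print(f"Pearson r versus the identity line: {r:.3f}")
```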

Supplementary Figure 3 The deep learning model performs better than the random forest model on independent sequencing data with manual review labels (n = 4 small-cell lung cancer cases with 2,686 total variants).

a, ROC curves outlining deep learning and random forest model performances on independent sequencing data with manual review labels (n = 4 small-cell lung cancer cases with 2,686 total variants). b, Curves showing batch effect correction after re-training machine learning models with incremental subsets of variants from the independent sequencing data. Independent sequencing data were partitioned in random stratified increments of 5% (from 0–75%) and used to train a new model (increments = 179 variants). The x axis outlines the number of independent variants included in training. The y axis plots the resulting model’s ROC AUC. The ambiguous class shows significant stochasticity due to low representation in the test dataset (n = 15 variants).
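
A hedged sketch of this incremental re-training procedure is given below. It uses a random forest stand-in, binary labels (somatic versus not), and simple random rather than stratified 5% slices, so it illustrates the shape of the analysis rather than reproducing the published curves.

```python
# Illustrative batch-effect correction curve; the model choice and slicing
# scheme are assumptions, not the published implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def retraining_curve(X_orig, y_orig, X_new, y_new, max_frac=0.75, step=0.05, seed=0):
    """y_* are binary labels; returns (fraction of new data used, ROC AUC) pairs."""
    rng = np.random.default_rng(seed)
    results = []
    for frac in np.arange(0.0, max_frac + 1e-9, step):
        idx = rng.permutation(len(y_new))
        n_in = int(round(frac * len(y_new)))
        idx_in, idx_out = idx[:n_in], idx[n_in:]
        X_train = np.vstack([X_orig, X_new[idx_in]])
        y_train = np.concatenate([y_orig, y_new[idx_in]])
        clf = RandomForestClassifier(n_estimators=200, random_state=seed)
        clf.fit(X_train, y_train)
        # Evaluate only on the never-trained-on remainder of the new cohort.
        auc = roc_auc_score(y_new[idx_out], clf.predict_proba(X_new[idx_out])[:, 1])
        results.append((float(frac), auc))
    return results
```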

Supplementary Figure 4 IGV snapshots of clinically relevant variants that were originally labeled as somatic by manual reviewers but were subsequently identified as fail using the deep learning model and manual re-review.

a, Failure due to short inserts and directional artifacts. b, Failure due to multiple mismatches across variant-supporting reads. c, Failure due to multiple variant artifacts. d, Failure due to ends of reads artifact.

Supplementary Figure 5 IGV snapshots of clinically relevant variants that were originally labeled as fail or ambiguous by manual reviewers but were subsequently identified as somatic using the deep learning model and manual re-review.

For each snapshot, the normal tracks and the tumor tracks show aligned reads that were obtained from normal tissue and the tumor tissue, respectively. Variant summaries obtained from CIViC show gene name, variant type, variant coordinates, clinical summary, and relevant clinical action items. a, The original reviewer conservatively labeled both PIK3CA variants as ambiguous owing to multiple mismatches in reads; however, both variants appear to be somatic and occur at known cancer driver hotspots (E542K/E545K). b, The original reviewer failed this variant owing to high levels of variant reads in the normal track; however, given that this variant was derived from a hematologic malignancy, this level of tumor in normal is permissible.

Supplementary Figure 6 Variants that show disagreement between the classifier and the original manual review demonstrate high levels of inter-reviewer variability.

Of the 10.7% of variants that disagree with the original manual review call, 179 variants were sampled to conduct manual re-review. When comparing the classifier call to the re-review consensus call, 41.9% of variants showed high inter-reviewer variability and/or inability to determine a consensus.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–6

Reporting Summary

Supplementary Table 1

List and description of the 71 features used to train the original machine learning models

Supplementary Table 2

Cross-tabulation of performance on the held-out test set, parsed by reviewer, disease, normal sequencing depth, and tumor sequencing depth

Supplementary Table 3

Distribution of orthogonal validation calls from the AML31 case and the 106 tumor/normal pairs from The Cancer Genome Atlas (TCGA) used to assess model performance

Supplementary Table 4

Distribution of manual review calls from the 37 cases used to assess model performance by independent sequencing data with manual review

Supplementary Table 5

Overlap between discrepant variants and CIViC annotations

Supplementary Table 6

Manual re-review of 179 variants by seven reviewers to develop a consensus call

Supplementary Table 7

Guide to development of a consensus label based on manual review calls from seven reviewers

Supplementary Table 8

Data availability of sequencing results for all cases used in model development

Supplementary Table 9

Sequence Ontology IDs and descriptions used to identify variants within the CIViC database that can be analyzed on DNA sequencing platforms

Cite this article

Ainscough, B.J., Barnell, E.K., Ronning, P. et al. A deep learning approach to automate refinement of somatic variant calling from cancer sequencing data. Nat Genet 50, 1735–1743 (2018). https://doi.org/10.1038/s41588-018-0257-y
