A deep learning approach to automate refinement of somatic variant calling from cancer sequencing data

Ainscough, Benjamin J.; Barnell, Erica K.; Ronning, Peter; Campbell, Katie M.; Wagner, Alex H.; Fehniger, Todd A.; Dunn, Gavin P.; Uppaluri, Ravindra; Govindan, Ramaswamy; Rohan, Thomas E.; Griffith, Malachi; Mardis, Elaine R.; Swamidass, S. Joshua; Griffith, Obi L.

doi:10.1038/s41588-018-0257-y

Technical Report
Published: 05 November 2018

Nature Genetics volume 50, pages 1735–1743 (2018)

Abstract

Cancer genomic analysis requires accurate identification of somatic variants in sequencing data. Manual review to refine somatic variant calls is required as a final step after automated processing. However, manual variant refinement is time-consuming, costly, poorly standardized, and non-reproducible. Here, we systematized and standardized somatic variant refinement using a machine learning approach. The final model incorporates 41,000 variants from 440 sequencing cases. This model accurately recapitulated manual refinement labels for three independent testing sets (13,579 variants) and accurately predicted somatic variants confirmed by orthogonal validation sequencing data (212,158 variants). The model improves on manual somatic refinement by reducing bias on calls otherwise subject to high inter-reviewer variability.

**Fig. 1: Deep learning and random forest models achieved very high manual review classification performance during tenfold cross-validation.**

**Fig. 2: Machine learning models and manual reviewers use similar features when making manual review classification decisions.**

**Fig. 3: Model confidence closely parallels reviewer confidence.**

**Fig. 4: Machine learning models accurately predict orthogonal validation sequencing results.**

**Fig. 5: The deep learning model performance on three independent test sets validated with manual review and subsequent correction for batch effect using re-training with 5% increments of the test data.**

**Fig. 6: Manual review misclassifications recovered by the deep learning model.**

Data availability

All analysis, preprocessing code, readcount training data, manual review calls, and trained deep learning and random forest models are available on the DeepSVR GitHub repository (https://github.com/griffithlab/DeepSVR). The raw sequencing data are publicly available for most projects included in this study (Supplementary Table 8). Users can access the classifier command line interface via our open-sourced GitHub repository and can install the package through Bioconda⁴⁹. After installation, the tool can be used to (1) train and save a deep learning classifier, (2) prepare data for training a classifier or classification, and (3) classify data using either the provided deep learning model or a custom model. A walkthrough of this process is available on the DeepSVR GitHub Wiki.

References

Griffith, M. et al. Genome modeling system: a knowledge management platform for genomics. PLoS Comput. Biol. 11, e1004274 (2015).
Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).
Robinson, J. T., Thorvaldsdóttir, H., Wenger, A. M., Zehir, A. & Mesirov, J. P. Variant review with the integrative genomics viewer. Cancer Res. 77, e31–e34 (2017).
Li, M. M. et al. Standards and guidelines for the interpretation and reporting of sequence variants in cancer: a joint consensus recommendation of the Association for Molecular Pathology, American Society of Clinical Oncology, and College of American Pathologists. J. Mol. Diagn. 19, 4–23 (2017).
Roy, S. et al. Standards and guidelines for validating next-generation sequencing bioinformatics pipelines: a joint recommendation of the Association for Molecular Pathology and the College of American Pathologists. J. Mol. Diagn. 20, 4–27 (2017).
Rheinbay, E. et al. Recurrent and functional regulatory mutations in breast cancer. Nature 547, 55–60 (2017).
Ott, P. A. et al. An immunogenic personal neoantigen vaccine for patients with melanoma. Nature 547, 217–221 (2017).
Ma, C. X. et al. A phase I trial of BKM120 (Buparlisib) in combination with fulvestrant in postmenopausal women with estrogen receptor-positive metastatic breast cancer. Clin. Cancer Res. 22, 1583–1591 (2016).
The Cancer Genome Atlas Research Network. Comprehensive, integrative genomic analysis of diffuse lower-grade gliomas. N. Engl. J. Med. 372, 2481–2498 (2015).
Rasche, L. et al. Spatial genomic heterogeneity in multiple myeloma revealed by multi-region sequencing. Nat. Commun. 8, 268 (2017).
Barnell, E. K. et al. Standard operating procedure for somatic variant refinement of sequencing data with paired tumor and normal samples. Genet. Med. https://doi.org/10.1038/s41436-018-0278-z (2018).
Griffith, O. L. et al. Truncating prolactin receptor mutations promote tumor growth in murine estrogen receptor-alpha mammary carcinomas. Cell Rep. 17, 249–260 (2016).
Koboldt, D. C. et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25, 2283–2285 (2009).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–2871 (2009).
Simola, D. F. & Kim, J. Sniper: improved SNP discovery by multiply mapping deep sequenced reads. Genome. Biol. 12, R55 (2011).
Saunders, C. T. et al. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 28, 1811–1817 (2012).
Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).
Ding, J. et al. Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics 28, 167–175 (2012).
Spinella, J.-F. et al. SNooPer: a machine learning-based method for somatic variant identification from low-pass next-generation sequencing. BMC Genomics 17, 912 (2016).
Strom, S. P. Current practices and guidelines for clinical next-generation sequencing oncology testing. Cancer Biol. Med. 13, 3–11 (2016).
Bettegowda, C. et al. Detection of circulating tumor DNA in early- and late-stage human malignancies. Sci. Transl. Med. 6, 224ra24 (2014).
McHugh, M. L. Interrater reliability: the kappa statistic. Biochem. Med. 22, 276–282 (2012).
Griffith, M. et al. Optimizing cancer genome sequencing and analysis. Cell Syst. 1, 210–223 (2015).
Ellrott, K. et al. Scalable open science approach for mutation calling of tumor exomes using multiple genomic pipelines. Cell Syst. 6, 271–281.e7 (2018).
Griffith, M. et al. CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nat. Genet. 49, 170–174 (2017).
Costello, M. et al. Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res. 41, e67 (2013).
Swamidass, S. J., Bittker, J. A., Bodycombe, N. E., Ryder, S. P. & Clemons, P. A. An economic framework to prioritize confirmatory tests after a high-throughput screen. J. Biomol. Screen. 15, 680–686 (2010).
Settles, B. & Craven, M. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing - EMNLP ’08, https://doi.org/10.3115/1613715.1613855 (Association for Computational Linguistics, Stroudsburg, PA, USA; 2008).
Settles, B. in Synthesis Lectures on Artificial Intelligence and Machine Learning Vol. 6 1–114 (Morgan & Claypool, San Rafael, 2012).
Griffith, M. et al. Comprehensive genomic analysis reveals FLT3 activation and a therapeutic strategy for a patient with relapsed adult B-lymphoblastic leukemia. Exp. Hematol. 44, 603–613 (2016).
Krysiak, K. et al. Recurrent somatic mutations affecting B-cell receptor signaling pathway genes in follicular lymphoma. Blood 129, 473–483 (2017).
Klco, J. M. et al. Association between mutation clearance after induction therapy and outcomes in acute myeloid leukemia. JAMA 314, 811–822 (2015).
Uy, G. L. et al. Dynamic changes in the clonal structure of MDS and AML in response to epigenetic therapy. Leukemia 31, 872–881 (2017).
Lesurf, R. et al. Genomic characterization of HER2-positive breast cancer and response to neoadjuvant trastuzumab and chemotherapy-results from the ACOSOG Z1041 (Alliance) trial. Ann. Oncol. 28, 1070–1077 (2017).
Welch, J. S. et al. TP53 and decitabine in acute myeloid leukemia and myelodysplastic syndromes. N. Engl. J. Med. 375, 2023–2036 (2016).
Rohan, T. E. et al. Somatic mutations in benign breast disease tissue and risk of subsequent invasive breast cancer. Br. J. Cancer 118, 1662–1664 (2018).
Mahlokozera, T. et al. Biological and therapeutic implications of multisector sequencing in newly diagnosed glioblastoma. Neuro. Oncol. 20, 472–483 (2018).
Wagner, A. H. et al. Recurrent WNT pathway alterations are frequent in relapsed small cell lung cancer. Nat. Commun. 9, 3787 (2018).
Duncavage, E. J. et al. Mutation clearance after transplantation for myelodysplastic syndrome. N. Engl. J. Med. 379, 1028–1041 (2018).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
Picard Tools (Broad Institute, 2018); http://broadinstitute.github.io/picard/
Varoquaux, G. et al. Scikit-learn: machine learning without learning the machinery. GetMobile 19, 29–33 (2015).
Nelli, F. Machine Learning with scikit-learn. In Python Data Analytics 2nd edn, Ch. 7 237–264 (Apress, New York, 2015).
Oliphant, T. E. Python for scientific computing. Comput. Sci. Eng. 9, 10–20 (2007).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
Grüning, B. et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat. Methods 15, 475–476 (2018).

Acknowledgements

The authors thank A. Petti, G. Chang, T. Li, C. Miller, L. Trani, R. Lesurf, Z. Skidmore, K. Krysiak, A. Ramu, and F. Gomez for assisting in data assembly. We also acknowledge L. Trani for performing manual review and for valuable discussion on the project. We gratefully acknowledge L. Wartman, J. DiPersio, M. Jacoby, B. Van Tine, R. Fields, B. Tan, S. Chi, D. Gutmann, and T. Ley for sharing genomic data that made this project possible. The authors also thank the patients and their families for their selfless contribution to the advancement of science. Part of this work was performed as part of the Washington University School of Medicine Genomics Tumor Board, which was funded with private research support from the Division of Oncology and the McDonnell Genome Institute. E.K.B. was supported by the National Cancer Institute (T32GM007200 and U01CA209936). T.E.R. received support from the National Institutes of Health/National Cancer Institute (NIH/NCI) (R01CA142942) and the Breast Cancer Research Foundation. Select sample data was funded by the Genomics of AML PPG (T. Ley, PI, P01 CA101937). A.H.W. was supported by the NCI (NIH NCI F32CA206247). B.J.A. was supported by the Siteman Cancer Center. S. Swamidass is funded by the National Library of Medicine (NIH NLM R01LM012222 and NIH NLM R01LM012482) and acknowledges support from the Institute for Informatics at Washington University School of Medicine. M.G. is funded by the National Human Genome Research Institute (NIH NHGRI R00HG007940). O.L.G. is funded by the National Cancer Institute (NIH NCI K22CA188163 and NIH NCI U01CA209936).

Author information

These authors contributed equally: Benjamin J. Ainscough, Erica K. Barnell.

Authors and Affiliations

McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
Benjamin J. Ainscough, Erica K. Barnell, Peter Ronning, Katie M. Campbell, Alex H. Wagner, Malachi Griffith & Obi L. Griffith
Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO, USA
Benjamin J. Ainscough, Todd A. Fehniger, Ramaswamy Govindan, Malachi Griffith & Obi L. Griffith
Division of Oncology, Department of Medicine, Washington University School of Medicine, St. Louis, MO, USA
Todd A. Fehniger, Ramaswamy Govindan, Malachi Griffith & Obi L. Griffith
Department of Neurological Surgery, Center for Human Immunology and Immunotherapy Programs, Washington University School of Medicine, St. Louis, MO, USA
Gavin P. Dunn
Department of Surgery/Otolaryngology, Brigham and Women’s Hospital and Dana-Farber Cancer Institute, Boston, MA, USA
Ravindra Uppaluri
Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, NY, USA
Thomas E. Rohan
Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
Malachi Griffith & Obi L. Griffith
Institute for Genomic Medicine, The Research Institute at Nationwide Children’s Hospital, Columbus, OH, USA
Elaine R. Mardis
Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
Elaine R. Mardis
Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA
S. Joshua Swamidass
Institute for Informatics, Washington University School of Medicine, St. Louis, MO, USA
S. Joshua Swamidass

Contributions

B.J.A. designed the study, assembled and cleaned training data, performed feature engineering, designed model architecture, tuned hyperparameters, performed model training and analysis, performed manual review, assembled validation data, wrote code, created figures, and wrote the manuscript. E.K.B. designed the study, performed manual review, performed model training and analysis, performed clinical data analysis, assembled validation data, wrote code, created figures, and wrote the manuscript. P.R. and K.M.C. wrote code, performed manual review, and edited the manuscript. A.H.W. wrote code. T.E.R., R.G., R.U., G.P.D., and T.A.F. shared genomic data that was used in training the model and revised the paper. M.G., E.R.M., S.J.S., and O.L.G. designed the study, supervised the project, and revised the paper.

Corresponding authors

Correspondence to S. Joshua Swamidass or Obi L. Griffith.

Ethics declarations

Competing interests

R.G. consults for Eli Lilly and Genentech. R.G. is on the board/honorarium for EMD Serono, Bristol-Myers Squibb, Genentech, Pfizer, Nektar, Merck, Celgene, Adaptimmune, GlaxoSmithKline, and Phillips Gilmore. All remaining authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 The deep learning model performs well on the hold out test set (n = 13,530 variants), tenfold cross-validation with a simplified disease feature (n = 27,470 variants), and tenfold cross-validation with the reviewer feature removed (n = 27,470 variants).

a, ROC curve and reliability diagram performance of the deep learning model on the hold out test set with all 71 described features. b, ROC curve and reliability diagram performance of the deep learning model tenfold cross-validation set with the cancer type simplified to solid versus liquid tumor status. c, ROC curve and reliability diagram performance of the deep learning model tenfold cross-validation set with the reviewer feature removed.

Supplementary Figure 2 Deep learning model outputs from the hold out test set (n = 13,530 variants) are well scaled across all predicted classes (ambiguous, fail, and somatic).

The correlation between the model output and the manual review call was assessed for all three different classes of calls (ambiguous, fail, and somatic). For each class, model outputs were binned into ten groups ranging from 0.00–1.00. For each bin, the total number of manual review calls that agree and disagree with the individual class were plotted. The ratio of agreement to disagreement was plotted for each bin and compared to the identity line (x = y) using the Pearson’s correlation coefficient (r).
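The binning procedure described for Supplementary Figure 2 can be illustrated with a minimal sketch. It is not the authors' code: the function names `reliability_bins` and `pearson_r` are hypothetical, the toy data are invented, and two choices are assumptions, namely that each bin is plotted at its midpoint and that the plotted quantity is the fraction of manual review calls in the bin that agree with the class.

```python
import math


def reliability_bins(scores, agrees, n_bins=10):
    """Return (bin midpoint, fraction agreeing) for every non-empty bin.

    scores: model outputs in [0, 1] for one class.
    agrees: booleans, True when the manual review call matches that class.
    """
    counts = [[0, 0] for _ in range(n_bins)]  # [agree, disagree] per bin
    for score, agree in zip(scores, agrees):
        index = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 joins the last bin
        counts[index][0 if agree else 1] += 1
    points = []
    for index, (agree, disagree) in enumerate(counts):
        total = agree + disagree
        if total:  # skip empty bins (assumption)
            points.append(((index + 0.5) / n_bins, agree / total))
    return points


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (hypothetical): 20 calls per bin, agreement rising with model output.
scores, agrees = [], []
for i in range(10):
    for j in range(20):
        scores.append((i + 0.5) / 10)
        agrees.append(j < 2 * i + 1)

points = reliability_bins(scores, agrees)
r = pearson_r([x for x, _ in points], [y for _, y in points])
print(f"Pearson r between binned output and agreement: {r:.3f}")
```

A well-scaled model yields points close to the identity line, so r approaches 1. The same procedure would be repeated for each of the three classes (ambiguous, fail, and somatic).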

Supplementary Figure 3 The deep learning model performs better than the random forest model on independent sequencing data with manual review labels (n = 4 small-cell lung cancer cases with 2,686 total variants).

a, ROC curves outlining deep learning and random forest model performances on independent sequencing data with manual review labels (n = 4 small-cell lung cancer cases with 2,686 total variants). b, Curves showing batch effect correction after re-training machine learning models with incremental subsets of variants from the independent sequencing data. Independent sequencing data were partitioned in random stratified increments of 5% (from 0–75%) and used to train a new model (increments = 179 variants). The x axis outlines the number of independent variants included in training. The y axis plots the resulting model’s ROC AUC. The ambiguous class shows significant stochasticity due to low representation in the test dataset (n = 15 variants).
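The random stratified partitioning used for the batch effect correction in panel b can be sketched as follows. This is an illustration only, under stated assumptions: the function name `stratified_increments` is hypothetical, the subsets are assumed to be nested (each increment extends the previous one), per-class counts are rounded to the nearest integer, and the class counts in the example are invented. Re-training a model on each subset and computing ROC AUC on the remaining variants are omitted.

```python
import random
from collections import defaultdict


def stratified_increments(labels, step=0.05, max_fraction=0.75, seed=0):
    """Yield (fraction, indices) for nested subsets that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    for indices in by_class.values():
        rng.shuffle(indices)  # random order within each class

    n_steps = round(max_fraction / step)
    for k in range(n_steps + 1):
        fraction = k * step
        chosen = []
        for indices in by_class.values():
            # Take the same fraction of every class (rounding is an assumption).
            chosen.extend(indices[: round(fraction * len(indices))])
        yield fraction, sorted(chosen)


# Hypothetical label distribution, chosen only for illustration.
labels = ["somatic"] * 100 + ["fail"] * 80 + ["ambiguous"] * 20
subsets = list(stratified_increments(labels))
for fraction, indices in subsets:
    print(f"{fraction:.2f} of each class -> {len(indices)} variants for re-training")
```

Because every class contributes the same fraction at each step, a rare class such as ambiguous stays represented in the re-training subset, although with very few variants, which is consistent with the stochasticity noted for that class.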

Supplementary Figure 4 IGV snapshots of clinically relevant variants that were originally labeled as somatic by manual reviewers but were subsequently identified as fail using the deep learning model and manual re-review.

a, Failure due to short inserts and directional artifacts. b, Failure due to multiple mismatches across variant-supporting reads. c, Failure due to multiple variant artifacts. d, Failure due to ends of reads artifact.

Supplementary Figure 5 IGV snapshots of clinically relevant variants that were originally labeled as fail or ambiguous by manual reviewers but were subsequently identified as somatic using the deep learning model and manual re-review.

For each snapshot, the normal tracks and the tumor tracks show aligned reads that were obtained from normal tissue and the tumor tissue, respectively. Variant summaries obtained from CIViC show gene name, variant type, variant coordinates, clinical summary, and relevant clinical action items. a, The original reviewer conservatively labeled both PIK3CA variants as ambiguous owing to multiple mismatches in reads; however, both variants appear to be somatic and occur at known cancer driver hotspots (E542K/E545K). b, The original reviewer failed this variant owing to high levels of variant reads in the normal track; however, given that this variant was derived from a hematologic malignancy, this level of tumor in normal is permissible.

Supplementary Figure 6 Variants that show disagreement between the classifier and original manual review demonstrate high levels of inter-reviewer variability.

Of the 10.7% of variants that disagree with the original manual review call, 179 variants were sampled to conduct manual re-review. When comparing the classifier call to the re-review consensus call, 41.9% of variants showed high inter-reviewer variability and/or inability to determine a consensus.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–6

Reporting Summary

Supplementary Table 1

List and description of the 71 features used to train the original machine learning models

Supplementary Table 2

Cross-tabulation performance on hold out test set parsed by reviewer, disease, normal sequencing depth, and tumor sequencing depth

Supplementary Table 3

Distribution of orthogonal validation calls from the AML31 case and the 106 The Cancer Genome Atlas (TCGA) tumor/normal pairs used to assess model performance

Supplementary Table 4

Distribution of manual review calls from the 37 cases used to assess model performance by independent sequencing data with manual review

Supplementary Table 5

Overlap between discrepant variants and CIViC annotations

Supplementary Table 6

Manual re-review of 179 variants by seven reviewers to develop a consensus call

Supplementary Table 7

Guide to development of a consensus label based on manual review calls from seven reviewers

Supplementary Table 8

List of the data availability of sequencing results for all cases used in the model development

Supplementary Table 9

Sequence Ontology IDs and description used to identify variants within the CIViC database that can be analyzed on DNA sequencing platforms

About this article

Cite this article

Ainscough, B.J., Barnell, E.K., Ronning, P. et al. A deep learning approach to automate refinement of somatic variant calling from cancer sequencing data. Nat Genet 50, 1735–1743 (2018). https://doi.org/10.1038/s41588-018-0257-y

Received: 26 March 2018
Accepted: 14 September 2018
Published: 05 November 2018
Issue Date: December 2018
DOI: https://doi.org/10.1038/s41588-018-0257-y
