Strelka2: fast and accurate calling of germline and somatic variants

Kim, Sangtae; Scheffler, Konrad; Halpern, Aaron L.; Bekritsky, Mitchell A.; Noh, Eunho; Källberg, Morten; Chen, Xiaoyu; Kim, Yeonbin; Beyter, Doruk; Krusche, Peter; Saunders, Christopher T.

doi:10.1038/s41592-018-0051-x

Brief Communication
Published: 16 July 2018

Nature Methods volume 15, pages 591–594 (2018)

Abstract

We describe Strelka2 (https://github.com/Illumina/strelka), an open-source small-variant-calling method for research and clinical germline and somatic sequencing applications. Strelka2 introduces a novel mixture-model-based estimation of insertion/deletion error parameters from each sample, an efficient tiered haplotype-modeling strategy, and a normal sample contamination model to improve liquid tumor analysis. For both germline and somatic calling, Strelka2 substantially outperformed the current leading tools in terms of both variant-calling accuracy and computing cost.

**Fig. 1: Germline-variant-calling accuracy and runtime.**

**Fig. 2: Somatic-variant-calling accuracy and runtime.**

Acknowledgements

We thank S. Kruglyak, B. Moore, J. O’Connell, and E. Kanterakis for helpful discussions and comments.

Author information

Morten Källberg
Present address: Seven Bridges Genomics, London, UK
Doruk Beyter
Present address: deCODE Genetics/Amgen, Inc., Reykjavik, Iceland
These authors contributed equally: Sangtae Kim and Konrad Scheffler.

Authors and Affiliations

Illumina, Inc., San Diego, CA, USA
Sangtae Kim, Konrad Scheffler, Aaron L. Halpern, Eunho Noh, Xiaoyu Chen, Yeonbin Kim & Christopher T. Saunders
Illumina Cambridge Ltd, Essex, UK
Mitchell A. Bekritsky, Morten Källberg & Peter Krusche
Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
Doruk Beyter

Contributions

S.K., K.S., A.L.H., M.A.B., E.N., M.K., X.C., Y.K., D.B., P.K., and C.T.S. designed the algorithms and implemented the Strelka2 software. S.K. and C.T.S. designed and performed the analyses. S.K., K.S., and C.T.S. wrote the manuscript, with input from all other authors.

Corresponding author

Correspondence to Christopher T. Saunders.

Ethics declarations

Competing interests

S.K., K.S., A.L.H., M.A.B., E.N., X.C., Y.K., P.K., and C.T.S. are employees of Illumina, Inc., a public company that develops and markets systems for genetic analysis.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Strelka2 variant-calling workflows.

Strelka2 supports detection of germline variants in small sample cohorts (up to ~10 individuals), and somatic variants from matched tumor-normal sample pairs. These two types of analyses share several high-level steps, including: (1) parameter estimation, (2) candidate variant discovery, (3) realignment and variant probability inference, and (4) empirical scoring and filtration. Here we diagram an overview of the major workflow components for both (a) germline and (b) somatic analyses.
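As a rough illustration of how the two analysis types are launched, the sketch below builds the command lines for a germline run on a small cohort and for a somatic run on a matched tumor-normal pair. The script and option names are given as best recalled from the Strelka user guide in the repository linked in the abstract, and should be checked against the installed version. The helper functions and all file paths are hypothetical and exist only for this example.

```python
from typing import List, Sequence


def germline_config_cmd(bams: Sequence[str], reference_fasta: str, run_dir: str) -> List[str]:
    """Build the configuration command for a germline analysis.

    One --bam option is emitted per sample, matching the small-cohort use case.
    """
    if not bams:
        raise ValueError("at least one BAM file is required")
    cmd = ["configureStrelkaGermlineWorkflow.py"]
    for bam in bams:
        cmd += ["--bam", bam]
    cmd += ["--referenceFasta", reference_fasta, "--runDir", run_dir]
    return cmd


def somatic_config_cmd(normal_bam: str, tumor_bam: str, reference_fasta: str, run_dir: str) -> List[str]:
    """Build the configuration command for a matched tumor-normal analysis."""
    return [
        "configureStrelkaSomaticWorkflow.py",
        "--normalBam", normal_bam,
        "--tumorBam", tumor_bam,
        "--referenceFasta", reference_fasta,
        "--runDir", run_dir,
    ]


def run_cmd(run_dir: str, jobs: int) -> List[str]:
    """Build the command that executes a configured workflow on the local machine."""
    return [f"{run_dir}/runWorkflow.py", "-m", "local", "-j", str(jobs)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(germline_config_cmd(["s1.bam", "s2.bam"], "ref.fa", "germline_run")))
    print(" ".join(somatic_config_cmd("normal.bam", "tumor.bam", "ref.fa", "somatic_run")))
    print(" ".join(run_cmd("somatic_run", 8)))
```

The commands are returned as argument lists so they can be passed to `subprocess.run` without shell quoting; nothing is executed by the sketch itself.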

Supplementary Figure 2 Structure of the germline indel error and variant-calling models in probabilistic graphical model plate notation (some details omitted).

(a) Indel error model. At each locus l, a preliminary estimate of the indel allele count vector C is modeled as a mixture binomial distribution governed by the two true haplotypes h₁ and h₂ (a function of the unobserved genotype hypothesis H), a set of indel error rates e (unobserved) and the total count X (observed). The error rates are selected from the full set of error parameters E according to the sequence context (summarized as an integer pair denoting the size s and number r of STR repeats; observed) and a binary state variable N (unobserved) categorizing the locus as clean (essentially zero error rates) or noisy (prone to indel errors). The genotype H and the noisy-clean state variable N are drawn from prior distributions that depend, respectively, on a context-specific mutation rate θ shared across samples and a context-specific noisy-state probability p_n.

(b) Variant calling model. The reads d_j at every locus are modeled as depending on the corresponding base call quality strings q_j, the unobserved haplotype h_j that generated the read, and the locus-specific error rates e. The read-specific haplotype is drawn from the set of haplotypes in the locus-specific hypothesis H, of which the prior again depends on a parameter selected from θ according to context. The error rates are again selected from the global vector E of error parameters (now treated as fixed), with the difference that all loci analyzed by this model are assumed to be in the noisy state.
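The noisy/clean mixture in panel (a) can be illustrated with a deliberately simplified sketch for a single indel allele in a single sequence context. It is not the Strelka2 implementation. The three-genotype prior (θ for heterozygous, θ/2 for homozygous variant), the expected indel read fractions, the clean-state error rate of 1e-8, and all function names are assumptions made for this example only.

```python
import math


def binom_pmf(c, x, p):
    """Binomial probability of c indel-supporting reads among x reads at rate p (0 < p < 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    log_coef = math.lgamma(x + 1) - math.lgamma(c + 1) - math.lgamma(x - c + 1)
    return math.exp(log_coef + c * math.log(p) + (x - c) * math.log1p(-p))


def genotype_prior(theta):
    """Assumed diploid prior over genotypes given mutation rate theta (sums to 1)."""
    return {"ref": 1.0 - 1.5 * theta, "het": theta, "hom": 0.5 * theta}


def expected_indel_fraction(genotype, e):
    """Assumed fraction of reads supporting the indel, given genotype and error rate e."""
    return {"ref": e, "het": 0.5, "hom": 1.0 - e}[genotype]


def locus_likelihood(c, x, theta, p_noisy, e_noisy, e_clean=1e-8):
    """Marginal likelihood of count c out of x, and the posterior noisy-state probability.

    The locus is noisy with probability p_noisy (error rate e_noisy) and clean
    otherwise (error rate e_clean, essentially zero). Within each state the
    count is marginalized over the genotype prior.
    """
    weighted = {}
    for state, p_state, e in (("noisy", p_noisy, e_noisy),
                              ("clean", 1.0 - p_noisy, e_clean)):
        lik = sum(prior * binom_pmf(c, x, expected_indel_fraction(g, e))
                  for g, prior in genotype_prior(theta).items())
        weighted[state] = p_state * lik
    total = weighted["noisy"] + weighted["clean"]
    return total, weighted["noisy"] / total


if __name__ == "__main__":
    # Two indel-supporting reads out of 30 is poorly explained by the clean state.
    print(locus_likelihood(c=2, x=30, theta=1e-4, p_noisy=0.1, e_noisy=0.02))
    # Zero indel-supporting reads leaves the locus most likely clean.
    print(locus_likelihood(c=0, x=30, theta=1e-4, p_noisy=0.1, e_noisy=0.02))
```

In an estimation setting, the marginal likelihood returned here would be summed in log space over many loci of the same context and maximized with respect to the error rate and noisy-state probability; that optimization step is omitted from the sketch.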

Supplementary Figure 3 Germline-indel-calling accuracy stratified by indel size and type.

The indel calling accuracy of various pipelines is plotted for the Consistency challenge Garvan dataset (left), the Truth challenge HG002 dataset (center) and the GIAB HG005 dataset (right), for insertions (denoted by Ins) and deletions (denoted by Del) of different sizes (length 1-5: 88% of cases; length 6-15: 10% of cases; length 16+: 2% of cases). For FreeBayes, the recall dropped substantially for long indels. For Strelka2 and GATK4, both of which employ local assembly, the recall drop was considerably smaller.
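The stratification used in this figure can be reproduced with a small helper that assigns each indel to a type and size bin. The sketch below assumes simple VCF-style REF and ALT strings with a single alternate allele and takes the indel length as the difference in allele lengths. The function name is hypothetical.

```python
def classify_indel(ref: str, alt: str):
    """Return (type, size_bin) for an indel given VCF-style REF and ALT alleles.

    Type is "Ins" or "Del"; size bins follow the caption: 1-5, 6-15, 16+.
    """
    diff = len(alt) - len(ref)
    if diff == 0:
        raise ValueError("alleles have equal length; not a simple indel")
    kind = "Ins" if diff > 0 else "Del"
    size = abs(diff)
    if size <= 5:
        size_bin = "1-5"
    elif size <= 15:
        size_bin = "6-15"
    else:
        size_bin = "16+"
    return kind, size_bin


if __name__ == "__main__":
    print(classify_indel("A", "ATG"))        # 2-base insertion
    print(classify_indel("ACGTACGT", "A"))   # 7-base deletion
```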

Supplementary Figure 4 Accuracy of germline indel and SNV calling for additional test datasets.

Results are shown for the Consistency challenge HLI dataset (left) and the Truth challenge HG001 dataset (right). Filled circles denote the pass threshold of each tool.

Supplementary Figure 5 Comparison of performance characteristics for Strelka2 versus Strelka.

(a) Comparison of somatic variant calling accuracy for the in-silico germline mixtures datasets described in Fig. 2a. Strelka2 has improved indel accuracy on impure tumor samples and is far more robust to contamination in the normal sample. (b) Comparison of runtime (wallclock time) and memory usage (peak resident set size) for the same datasets, measured on servers with two Intel Xeon E5-2680 v4 CPUs (total 28 physical cores) with 256 GB of memory.

About this article

Cite this article

Kim, S., Scheffler, K., Halpern, A.L. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat Methods 15, 591–594 (2018). https://doi.org/10.1038/s41592-018-0051-x

Received: 26 September 2017
Accepted: 09 May 2018
Published: 16 July 2018
Issue Date: August 2018
DOI: https://doi.org/10.1038/s41592-018-0051-x

Supplementary information

Supplementary Text and Figures

Reporting Summary

Supplementary Table 1

Supplementary Software 1
