We describe Strelka2 (https://github.com/Illumina/strelka), an open-source small-variant-calling method for research and clinical germline and somatic sequencing applications. Strelka2 introduces a novel mixture-model-based estimation of insertion/deletion error parameters from each sample, an efficient tiered haplotype-modeling strategy, and a normal sample contamination model to improve liquid tumor analysis. For both germline and somatic calling, Strelka2 substantially outperformed the current leading tools in terms of both variant-calling accuracy and computing cost.
We thank S. Kruglyak, B. Moore, J. O’Connell, and E. Kanterakis for helpful discussions and comments.
Integrated supplementary information
Strelka2 supports detection of germline variants in small sample cohorts (up to ~10 individuals), and somatic variants from matched tumor-normal sample pairs. These two types of analyses share several high-level steps, including: (1) parameter estimation, (2) candidate variant discovery, (3) realignment and variant probability inference, and (4) empirical scoring and filtration. Here we diagram an overview of the major workflow components for both (a) germline and (b) somatic analyses.
Supplementary Figure 2 Structure of the germline indel error and variant-calling models in probabilistic graphical model plate notation (some details omitted).
(a) Indel error model. At each locus l, a preliminary estimate of the indel allele count vector C is modeled as a mixture binomial distribution governed by the two true haplotypes h1 and h2 (a function of the unobserved genotype hypothesis H), a set of indel error rates e (unobserved) and the total count X (observed). The error rates are selected from the full set of error parameters E according to the sequence context (summarized as an integer pair denoting the size s and number r of STR repeats; observed) and a binary state variable N (unobserved) categorizing the locus as clean (essentially zero error rates) or noisy (prone to indel errors). The genotype H and the noisy-clean state variable N are drawn from prior distributions that depend, respectively, on a context-specific mutation rate θ shared across samples and a context-specific noisy-state probability pn. (b) Variant calling model. The reads dj at every locus are modeled as depending on the corresponding base call quality strings qj, the unobserved haplotype hj that generated the read, and the locus-specific error rates e. The read-specific haplotype is drawn from the set of haplotypes in the locus-specific hypothesis H, of which the prior again depends on a parameter selected from θ according to context. The error rates are again selected from the global vector E of error parameters (now treated as fixed), with the difference that all loci analyzed by this model are assumed to be in the noisy state.
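The clean/noisy mixture in panel (a) can be illustrated with a small numerical sketch. This is not Strelka2's implementation: it collapses the count vector C to a single indel-supporting count, uses one symmetric error rate per state, and assumes a conventional diploid genotype prior built from θ. The function names, the `e_clean` default, and the prior weights are all assumptions made for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def locus_likelihood(indel_count, total_count, alt_fraction,
                     e_noisy, p_noisy, e_clean=1e-8):
    """P(indel_count | genotype) marginalized over the noisy/clean state N.

    alt_fraction is the expected indel allele fraction under the genotype
    hypothesis (0.0, 0.5 or 1.0 for a diploid sample). e_clean is an
    assumed near-zero error rate for the clean state.
    """
    def observed_rate(e):
        # A read shows the indel if it truly carries it and is read
        # correctly, or does not carry it and suffers an indel error.
        return alt_fraction * (1 - e) + (1 - alt_fraction) * e

    noisy = binom_pmf(indel_count, total_count, observed_rate(e_noisy))
    clean = binom_pmf(indel_count, total_count, observed_rate(e_clean))
    return p_noisy * noisy + (1 - p_noisy) * clean


def genotype_posterior(indel_count, total_count, theta, e_noisy, p_noisy):
    """Posterior over three genotype hypotheses (assumed prior from theta)."""
    hypotheses = {
        "hom_ref": (0.0, 1 - 1.5 * theta),
        "het": (0.5, theta),
        "hom_alt": (1.0, theta / 2),
    }
    joint = {
        name: prior * locus_likelihood(indel_count, total_count, frac,
                                       e_noisy, p_noisy)
        for name, (frac, prior) in hypotheses.items()
    }
    norm = sum(joint.values())
    return {name: value / norm for name, value in joint.items()}


if __name__ == "__main__":
    print(genotype_posterior(15, 30, theta=1e-4, e_noisy=0.05, p_noisy=0.2))
```

In the model described in the caption, the error rates and noisy-state probability are estimated per sequence context (STR unit size and repeat count) rather than passed in as fixed numbers, so this sketch shows only the likelihood structure at a single locus.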
The indel-calling accuracy of various pipelines is plotted for the Consistency challenge Garvan dataset (left), the Truth challenge HG002 dataset (center), and the GIAB HG005 dataset (right), for insertions (denoted by Ins) and deletions (denoted by Del) of different sizes (length 1–5: 88% of cases; length 6–15: 10% of cases; length 16+: 2% of cases). For FreeBayes, recall dropped substantially for long indels. For Strelka2 and GATK4, both of which employ local assembly, the recall drop was considerably smaller.
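The size stratification used in this figure (Ins/Del, lengths 1–5, 6–15, and 16+) can be sketched as follows. This is an illustrative calculation only. It assumes variants are given as `(chrom, pos, ref, alt)` tuples and uses exact matching, which stands in for the haplotype-aware comparison a real benchmark would need. The function names are hypothetical.

```python
from collections import defaultdict


def size_bin(length):
    """Map an indel length to the size bins used in the figure."""
    if length <= 5:
        return "1-5"
    if length <= 15:
        return "6-15"
    return "16+"


def recall_by_size(truth, called):
    """Recall per (indel type, size bin).

    truth and called are sets of (chrom, pos, ref, alt) tuples.
    Non-indel records (equal ref/alt lengths) are skipped.
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for variant in truth:
        _, _, ref, alt = variant
        length = abs(len(alt) - len(ref))
        if length == 0:
            continue  # SNV or MNV, not an indel
        kind = "Ins" if len(alt) > len(ref) else "Del"
        key = (kind, size_bin(length))
        total[key] += 1
        if variant in called:
            found[key] += 1
    return {key: found[key] / total[key] for key in total}


if __name__ == "__main__":
    truth = {
        ("chr1", 100, "A", "AT"),            # 1 bp insertion
        ("chr1", 200, "ACGTACGT", "A"),      # 7 bp deletion
        ("chr1", 300, "A", "A" + "G" * 20),  # 20 bp insertion
    }
    called = {("chr1", 100, "A", "AT"), ("chr1", 200, "ACGTACGT", "A")}
    for key, value in sorted(recall_by_size(truth, called).items()):
        print(key, value)
```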
Results are shown for the Consistency challenge HLI dataset (left) and the Truth challenge HG001 dataset (right). Filled circles denote the pass threshold of each tool.
(a) Comparison of somatic variant calling accuracy for the in-silico germline mixtures datasets described in Fig. 2a. Strelka2 has improved indel accuracy on impure tumor samples and is far more robust to contamination in the normal sample. (b) Comparison of runtime (wallclock time) and memory usage (peak resident set size) for the same datasets, measured on servers with two Intel Xeon E5-2680 v4 CPUs (total 28 physical cores) with 256 GB of memory.