Brief Communication | Published:

Strelka2: fast and accurate calling of germline and somatic variants

Abstract

We describe Strelka2 (https://github.com/Illumina/strelka), an open-source small-variant-calling method for research and clinical germline and somatic sequencing applications. Strelka2 introduces a novel mixture-model-based estimation of insertion/deletion error parameters from each sample, an efficient tiered haplotype-modeling strategy, and a normal sample contamination model to improve liquid tumor analysis. For both germline and somatic calling, Strelka2 substantially outperformed the current leading tools in terms of both variant-calling accuracy and computing cost.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Acknowledgements

We thank S. Kruglyak, B. Moore, J. O’Connell, and E. Kanterakis for helpful discussions and comments.

Author information

S.K., K.S., A.L.H., M.A.B., E.N., M.K., X.C., Y.K., D.B., P.K., and C.T.S. designed the algorithms and implemented the Strelka2 software. S.K. and C.T.S. designed and performed the analyses. S.K., K.S., and C.T.S. wrote the manuscript, with input from all other authors.

Competing interests

S.K., K.S., A.L.H., M.A.B., E.N., X.C., Y.K., P.K., and C.T.S. are employees of Illumina, Inc., a public company that develops and markets systems for genetic analysis.

Correspondence to Christopher T. Saunders.

Integrated supplementary information

Supplementary Figure 1 Strelka2 variant-calling workflows.

Strelka2 supports detection of germline variants in small sample cohorts (up to ~10 individuals), and somatic variants from matched tumor-normal sample pairs. These two types of analyses share several high-level steps, including: (1) parameter estimation, (2) candidate variant discovery, (3) realignment and variant probability inference, and (4) empirical scoring and filtration. Here we diagram an overview of the major workflow components for both (a) germline and (b) somatic analyses.

Supplementary Figure 2 Structure of the germline indel error and variant-calling models in probabilistic graphical model plate notation (some details omitted).

(a) Indel error model. At each locus l, a preliminary estimate of the indel allele count vector C is modeled as a mixture binomial distribution governed by the two true haplotypes h1 and h2 (a function of the unobserved genotype hypothesis H), a set of indel error rates e (unobserved) and the total count X (observed). The error rates are selected from the full set of error parameters E according to the sequence context (summarized as an integer pair denoting the size s and number r of STR repeats; observed) and a binary state variable N (unobserved) categorizing the locus as clean (essentially zero error rates) or noisy (prone to indel errors). The genotype H and the noisy-clean state variable N are drawn from prior distributions that depend, respectively, on a context-specific mutation rate θ shared across samples and a context-specific noisy-state probability pn. (b) Variant calling model. The reads dj at every locus are modeled as depending on the corresponding base call quality strings qj, the unobserved haplotype hj that generated the read, and the locus-specific error rates e. The read-specific haplotype is drawn from the set of haplotypes in the locus-specific hypothesis H, of which the prior again depends on a parameter selected from θ according to context. The error rates are again selected from the global vector E of error parameters (now treated as fixed), with the difference that all loci analyzed by this model are assumed to be in the noisy state.
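The clean/noisy mixture in panel (a) can be illustrated with a minimal sketch for a single sequence context. This is not the Strelka2 implementation. It assumes one symmetric indel error rate per state, a simplified diploid genotype prior built from θ, and hypothetical names (`state_likelihood`, `locus_likelihood`, `noisy_posterior`) and parameter values chosen only for illustration.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials, computed in log space."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def state_likelihood(c, x, theta, e):
    """P(C = c | X = x, error rate e), marginalized over the genotype H.

    Assumption: H is summarized by the true indel allele fraction f
    (0 = hom-ref, 0.5 = het, 1 = hom-alt) with an illustrative prior.
    """
    genotype_prior = {0.0: 1.0 - 1.5 * theta, 0.5: theta, 1.0: 0.5 * theta}
    total = 0.0
    for f, p_h in genotype_prior.items():
        # Assumption: a symmetric error rate e flips reads between alleles.
        p_indel = f * (1.0 - e) + (1.0 - f) * e
        total += p_h * binom_pmf(c, x, p_indel)
    return total


def locus_likelihood(c, x, theta, p_noisy, e_noisy, e_clean=1e-8):
    """Mixture over the binary state N: clean (near-zero error) or noisy."""
    return ((1.0 - p_noisy) * state_likelihood(c, x, theta, e_clean)
            + p_noisy * state_likelihood(c, x, theta, e_noisy))


def noisy_posterior(c, x, theta, p_noisy, e_noisy, e_clean=1e-8):
    """Posterior probability that the locus is in the noisy state."""
    joint_noisy = p_noisy * state_likelihood(c, x, theta, e_noisy)
    return joint_noisy / locus_likelihood(c, x, theta, p_noisy, e_noisy, e_clean)
```

Under these assumptions, a locus with a few indel-supporting reads out of many is explained far better by the noisy state than by a clean heterozygous genotype, which is the behavior the state variable N is meant to capture. Estimating E, θ, and pn across loci and contexts (for example by maximizing the product of `locus_likelihood` values) is left out of the sketch.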

Supplementary Figure 3 Germline-indel-calling accuracy stratified by indel size and type.

The indel-calling accuracy of various pipelines is plotted for the Consistency challenge Garvan dataset (left), the Truth challenge HG002 dataset (center), and the GIAB HG005 dataset (right), for insertions (Ins) and deletions (Del) of different sizes (length 1–5: 88% of cases; length 6–15: 10% of cases; length 16+: 2% of cases). For FreeBayes, recall dropped substantially for long indels. For Strelka2 and GATK4, both of which employ local assembly, the drop in recall was considerably smaller.
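The stratification used in this figure can be sketched as follows. This is an illustrative example only: the function names are hypothetical, variants are assumed to be `(chrom, pos, ref, alt)` tuples, and calls are matched to truth by exact equality, which is a simplification of how benchmarking tools compare variant representations.

```python
from collections import defaultdict


def size_bin(length):
    """Map an indel length to the size bins used in the figure."""
    if length <= 5:
        return "1-5"
    if length <= 15:
        return "6-15"
    return "16+"


def classify_indel(ref, alt):
    """Return (type, size bin) for a simple indel given REF and ALT alleles."""
    diff = len(alt) - len(ref)
    if diff == 0:
        raise ValueError("not an indel: REF and ALT have equal length")
    return ("Ins" if diff > 0 else "Del", size_bin(abs(diff)))


def stratified_recall(truth, called):
    """Recall per (type, size bin), using exact-match comparison (assumption)."""
    called_set = set(called)
    totals = defaultdict(int)
    hits = defaultdict(int)
    for variant in truth:
        _chrom, _pos, ref, alt = variant
        key = classify_indel(ref, alt)
        totals[key] += 1
        if variant in called_set:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}
```

For example, a truth set containing one short insertion, one 7-base deletion, and one 20-base insertion, of which only the first two are called, yields a recall of 1.0 in the `("Ins", "1-5")` and `("Del", "6-15")` strata and 0.0 in `("Ins", "16+")`.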

Supplementary Figure 4 Accuracy of germline indel and SNV calling for additional test datasets.

Results are shown for the Consistency challenge HLI dataset (left) and the Truth challenge HG001 dataset (right). Filled circles denote the pass threshold of each tool.

Supplementary Figure 5 Comparison of performance characteristics for Strelka2 versus Strelka.

(a) Comparison of somatic variant calling accuracy for the in-silico germline mixtures datasets described in Fig. 2a. Strelka2 has improved indel accuracy on impure tumor samples and is far more robust to contamination in the normal sample. (b) Comparison of runtime (wallclock time) and memory usage (peak resident set size) for the same datasets, measured on servers with two Intel Xeon E5-2680 v4 CPUs (total 28 physical cores) with 256 GB of memory.

Supplementary information

Supplementary Text and Figures

Supplementary Figs. 1–5 and Supplementary Notes 1–3

Reporting Summary

Supplementary Table 1

Germline variant calling accuracy

Supplementary Software 1

Strelka2 source code for version 2.9.0

Fig. 1: Germline-variant-calling accuracy and runtime.
Fig. 2: Somatic-variant-calling accuracy and runtime.