A unified haplotype-based method for accurate and comprehensive variant calling

Abstract

Almost all haplotype-based variant callers were designed specifically for detecting common germline variation in diploid populations, and give suboptimal results in other scenarios. Here we present Octopus, a variant caller that uses a polymorphic Bayesian genotyping model capable of modeling sequencing data from a range of experimental designs within a unified haplotype-aware framework. Octopus combines sequencing reads and prior information to phase called genotypes of arbitrary ploidy, including those with somatic mutations. We show that Octopus accurately calls germline variants in individuals, including single nucleotide variants, indels and small complex replacements such as microinversions. Using a synthetic tumor data set derived from clean sequencing data from a sample with known germline haplotypes and observed mutations in a large cohort of tumor samples, we show that Octopus is more sensitive to low-frequency somatic variation, yet calls considerably fewer false positives than other methods. Octopus also outputs realigned evidence BAM files to aid validation and interpretation.
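
To make the idea of a haplotype-aware Bayesian genotype model for arbitrary ploidy more concrete, here is a minimal sketch. It is not Octopus's implementation. It assumes that per-read log-likelihoods for each candidate haplotype are already available, that each read originates from one of the genotype's haplotype copies with equal probability, and that the genotype prior is uniform unless one is supplied. All function and variable names are hypothetical.

```python
import itertools
import math


def enumerate_genotypes(num_haplotypes, ploidy):
    """All unordered genotypes: multisets of `ploidy` haplotype indices."""
    return list(
        itertools.combinations_with_replacement(range(num_haplotypes), ploidy)
    )


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def genotype_posteriors(read_log_likelihoods, ploidy, log_prior=None):
    """Posterior probability of each genotype given the reads.

    read_log_likelihoods[r][h] is log P(read r | haplotype h).
    Assumption: a read is drawn uniformly from the haplotype copies in the
    genotype (no allelic bias, no contamination).
    log_prior maps genotype tuples to log prior; uniform if omitted.
    """
    num_haplotypes = len(read_log_likelihoods[0])
    genotypes = enumerate_genotypes(num_haplotypes, ploidy)
    log_joint = []
    for genotype in genotypes:
        total = 0.0 if log_prior is None else log_prior[genotype]
        for read in read_log_likelihoods:
            # Equal-weight mixture over the haplotype copies in this genotype.
            total += log_sum_exp([read[h] for h in genotype]) - math.log(ploidy)
        log_joint.append(total)
    log_evidence = log_sum_exp(log_joint)
    return {g: math.exp(lj - log_evidence) for g, lj in zip(genotypes, log_joint)}


# Toy data: two candidate haplotypes, five reads supporting each one.
reads = [[math.log(0.9), math.log(0.1)]] * 5 + [[math.log(0.1), math.log(0.9)]] * 5
posteriors = genotype_posteriors(reads, ploidy=2)
best = max(posteriors, key=posteriors.get)
print(best, round(posteriors[best], 3))
```

With balanced support for both haplotypes, the heterozygous genotype `(0, 1)` receives most of the posterior mass in this toy example. Changing `ploidy` changes only the set of genotypes that is enumerated, which is the sense in which a single model can cover different ploidies. A somatic model would additionally need unequal mixture weights, which this sketch leaves out.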

Fig. 1: Overview of the unified haplotype-based algorithm, showing joint calling of two samples with the population calling model.
Fig. 2: Germline variant calling accuracy.
Fig. 3: Overview of synthetic-tumor creation.
Fig. 4: Somatic mutation calling accuracy with a paired normal sample.
Fig. 5: Somatic mutation calling accuracy in synthetic PACA tumors without a paired normal sample for various sequencing depths.

Data availability

All germline data used in this manuscript are publicly available from GIAB, Precision FDA and ENA. Links are provided in Supplementary Note 1. Trio data from the WGS500 project are available from the European Nucleotide Archive under accession no. PRJEB9151 (samples AW_SC_4654, AW_SC_4655 and AW_SC_4659). The synthetic-tumor data have been deposited in the Sequence Read Archive under BioProject accession no. PRJNA694520. The corresponding truth sets have been deposited to figshare (https://doi.org/10.6084/m9.figshare.13902212).

Code availability

Octopus source code and documentation are freely available under the MIT licence from https://github.com/luntergroup/octopus. Custom code used for data analysis is available from https://github.com/luntergroup/octopus-paper.

Acknowledgements

This work was supported by The Wellcome Trust Genomic Medicine and Statistics PhD Program (grant no. 203735/Z/16/Z to D.P.C.). The computational aspects of this research were supported by the Wellcome Trust Core Award grant number 203141/Z/16/Z and the NIHR Oxford BRC. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

Author information

Contributions

D.P.C. and G.L. designed the algorithm and wrote the manuscript. D.P.C. implemented the algorithm and performed the evaluation. D.C.W. provided data for the synthetic tumors and critically reviewed the manuscript.

Corresponding author

Correspondence to Daniel P. Cooke.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Biotechnology thanks Federico Abascal and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–6, Tables 1–4 and Notes 1–4.

Reporting Summary

Cite this article

Cooke, D.P., Wedge, D.C. & Lunter, G. A unified haplotype-based method for accurate and comprehensive variant calling. Nat Biotechnol (2021). https://doi.org/10.1038/s41587-021-00861-3
