Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications

Rimmer, Andy; Phan, Hang; Mathieson, Iain; Iqbal, Zamin; Twigg, Stephen R F; Wilkie, Andrew O M; McVean, Gil; Lunter, Gerton

doi:10.1038/ng.3036

Technical Report
Published: 13 July 2014

Andy Rimmer¹, Hang Phan¹, Iain Mathieson¹, Zamin Iqbal¹, Stephen R F Twigg², WGS500 Consortium, Andrew O M Wilkie², Gil McVean¹,³ & Gerton Lunter¹

Nature Genetics volume 46, pages 912–918 (2014)

Abstract

High-throughput DNA sequencing technology has transformed genetic research and is starting to make an impact on clinical practice. However, analyzing high-throughput sequencing data remains challenging, particularly in clinical settings where accuracy and turnaround times are critical. We present a new approach to this problem, implemented in a software package called Platypus. Platypus achieves high sensitivity and specificity for SNPs, indels and complex polymorphisms by using local de novo assembly to generate candidate variants, followed by local realignment and probabilistic haplotype estimation. It is an order of magnitude faster than existing tools and generates calls from raw aligned read data without preprocessing. We demonstrate the performance of Platypus in clinically relevant experimental designs by comparing with SAMtools and GATK on whole-genome and exome-capture data, by identifying de novo variation in 15 parent-offspring trios with high sensitivity and specificity, and by estimating human leukocyte antigen genotypes directly from variant calls.
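The candidate-generation step mentioned in the abstract (local de novo assembly) can be illustrated with a toy example. The sketch below assumes a plain de Bruijn graph over k-mers, which is one common way to do local assembly; the abstract does not say which graph construction Platypus uses, and the function names (`build_graph`, `candidate_haplotypes`) and the parameters `k` and `max_len` are hypothetical. It enumerates paths through the graph that start and end on the reference but differ from it, and reports them as candidate haplotypes.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield the successive overlapping k-mers of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_graph(sequences, k):
    """Build a de Bruijn-style graph: k-mer -> set of k-mers that follow it."""
    graph = defaultdict(set)
    for seq in sequences:
        ks = list(kmers(seq, k))
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)
    return graph


def candidate_haplotypes(reference, reads, k, max_len=None):
    """Return non-reference sequences spelled by paths from the first to the
    last reference k-mer (illustrative only; not the Platypus implementation)."""
    graph = build_graph([reference] + list(reads), k)
    start, end = reference[:k], reference[-k:]
    max_len = max_len or 2 * len(reference)  # assumed bound on path length
    found = set()
    stack = [(start, start, {start})]  # (current k-mer, sequence so far, visited)
    while stack:
        node, hap, seen = stack.pop()
        if node == end:
            if hap != reference:
                found.add(hap)
            continue
        if len(hap) >= max_len:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:  # skip cycles to keep the search finite
                stack.append((nxt, hap + nxt[-1], seen | {nxt}))
    return sorted(found)


reference = "ACGTTGCAATCGGA"
reads = ["GTTGCTATCG", "TTGCTATCGG"]  # reads carrying an A>T substitution
print(candidate_haplotypes(reference, reads, k=4))
# → ['ACGTTGCTATCGGA']
```

In a real caller the candidates found this way would then be scored against the reads, for example by realignment and a probabilistic haplotype model as the abstract describes; that step is not shown here.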

**Figure 1: Simplified flow diagram of the integrated calling algorithm.**

**Figure 2: Size distribution of indel calls in the NA12878 trio.**

**Figure 3: Genotypes of the *HLA-A*, *HLA-B* and *HLA-C* loci at two- and four-digit resolution.**

Acknowledgements

This study was funded by Biotechnology and Biological Sciences Research Council (BBSRC) grant BB/I02593X/1 (G.L., G.M., A.R. and H.P.), by Wellcome Trust grants 102731/Z/13/Z (A.O.M.W. and S.R.F.T.), 089250/Z/09/Z (I.M.) and 090532/Z/09/Z (G.M., G.L. and A.R.), and by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre Programme. The views expressed are those of the authors and not necessarily those of the National Health Service (NHS), NIHR or the UK Department of Health.

Author information

Andy Rimmer, Hang Phan and Gil McVean: These authors contributed equally to this work.

Authors and Affiliations

Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
Andy Rimmer, Hang Phan, Iain Mathieson, Zamin Iqbal, Gil McVean & Gerton Lunter
Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Headington, Oxford, UK
Stephen R F Twigg & Andrew O M Wilkie
Department of Statistics, University of Oxford, Oxford, UK
Gil McVean

Consortia

WGS500 Consortium

Contributions

A.R. developed Platypus. A.R., H.P., I.M., Z.I. and G.L. contributed code and algorithms. A.R., H.P. and G.L. analyzed data. H.P., S.R.F.T. and A.O.M.W. performed validation experiments. WGS500 contributed data. A.O.M.W., G.M. and G.L. wrote the manuscript. G.L. initiated and led the project.

Corresponding author

Correspondence to Gerton Lunter.

Ethics declarations

Competing interests

G.M. and G.L. are cofounders and shareholders of Genomics, Ltd. A.R. is currently employed by Genomics, Ltd. The other authors declare no competing financial interests.

Additional information

A list of members and affiliations appears in the Supplementary Note.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–5, Supplementary Tables 1–6 and Supplementary Note. (PDF 13583 kb)

About this article

Cite this article

Rimmer, A., Phan, H., Mathieson, I. et al. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat Genet 46, 912–918 (2014). https://doi.org/10.1038/ng.3036

Received: 22 November 2013
Accepted: 23 June 2014
Published: 13 July 2014
Issue Date: August 2014
DOI: https://doi.org/10.1038/ng.3036
