A fundamental requirement for genetic studies is an accurate determination of sequence variation. While human genome sequence diversity is increasingly well characterized, there is a need for efficient ways to use this knowledge in sequence analysis. Here we present Graphtyper, a publicly available novel algorithm and software for discovering and genotyping sequence variants. Graphtyper realigns short-read sequence data to a pangenome, a variation-aware graph structure that encodes sequence variation within a population by representing possible haplotypes as graph paths. Our results show that Graphtyper is fast, highly scalable, and provides sensitive and accurate genotype calls. Graphtyper genotyped 89.4 million sequence variants in the whole genomes of 28,075 Icelanders using less than 100,000 CPU days, including detailed genotyping of six human leukocyte antigen (HLA) genes. We show that Graphtyper is a valuable tool in characterizing sequence variation in both small and population-scale sequencing studies.
We are grateful to our colleagues from deCODE Genetics/Amgen for their contributions. We also wish to thank all research participants who provided biological samples to deCODE Genetics.
All authors are employees of deCODE Genetics/Amgen, Inc.
Integrated supplementary information
Supplementary Figure 1 IGV visualization of mapped sequence reads of two Icelanders carrying a 40-bp deletion.
The genomic region shown is chr. 21: 21,559,430–21,559,518 (GRCh38), and the deleted sequence lies between the two vertical green lines. (a) A heterozygous carrier of the deletion. Graphtyper was the only genotyping pipeline that correctly recognized the individual as a carrier. The other pipelines called false sequence variants because of misalignments around the indel (red boxes). (b) A homozygous carrier of the deletion. In this case, most of the reads are correctly mapped as carrying the deletion, but some misalignment artifacts are still observed (red box).
Supplementary Figure 2 Alternative allele transmission rate in 230 Icelandic parent–offspring trios.
All genotyping pipelines have an excess of sequence variants that are never transmitted from parent to offspring, which may be calls due to sequencing error or non-germline variation. Bin width is 0.1.
Supplementary Figure 3 Alternative allele transmission rate in 230 Icelandic parent–offspring trios by SNP mutation type.
The ratio of transitions to transversions is estimated to be around two in the human autosomal genome. We observed that the transition/transversion ratio improved at higher transmission rates, indicating that the transmission rate reflects call quality. There was also a large excess of transversions that were not transmitted.
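As an illustration of the transition/transversion ratio discussed in this caption, the following is a minimal Python sketch. The function names and the toy SNP list are hypothetical and are not taken from Graphtyper; the sketch only classifies single-base substitutions as transitions (purine to purine, or pyrimidine to pyrimidine) or transversions and reports their ratio.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True if ref -> alt stays within purines or within pyrimidines."""
    if ref == alt:
        return False
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def is_transversion(ref, alt):
    """True if ref -> alt swaps a purine for a pyrimidine or vice versa."""
    bases = PURINES | PYRIMIDINES
    return ref in bases and alt in bases and ref != alt and not is_transition(ref, alt)


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) pairs.

    Returns None when there are no transversions to divide by.
    """
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = sum(1 for ref, alt in snps if is_transversion(ref, alt))
    return ts / tv if tv else None


# Toy data (made up for illustration): four transitions, two transversions.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ts_tv_ratio(example_snps))  # 2.0
```

A call set dominated by errors would be expected to drift away from a ratio of about two, which is why the ratio can serve as a rough quality indicator.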
(a) Semi-global banded alignment of a sequenced read to an extracted reference sequence. (b) Observed variation with respect to the reference sequence.
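To make the idea of semi-global alignment concrete, here is a simplified Python sketch. It is not Graphtyper's implementation: it omits banding, uses a linear gap penalty, and the scoring values and function name are assumptions chosen for illustration. The whole read must be aligned, while reference bases before and after the read are free.

```python
def semiglobal_align(read, ref, match=1, mismatch=-1, gap=-2):
    """Score the best alignment of the entire read to any substring of ref.

    Returns (best_score, end) where end is the number of reference bases
    consumed at the end of the best alignment. Unbanded, for clarity.
    """
    m = len(ref)
    # Row 0 is all zeros: skipping leading reference bases costs nothing.
    prev = [0] * (m + 1)
    for i in range(1, len(read) + 1):
        # Column 0: read bases aligned before the reference starts are gaps.
        cur = [prev[0] + gap] + [0] * m
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # Trailing reference bases are free: take the best cell of the last row.
    end = max(range(m + 1), key=lambda j: prev[j])
    return prev[end], end


score, end = semiglobal_align("ACGT", "TTACGTTT")
print(score, end)  # 4 6
```

A banded variant would fill only the cells within a fixed distance of an expected diagonal, which reduces the work per read when the approximate mapping position is already known.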
(a) An example of a graph with two sequence variants. The graph has a total of six haplotypes. (b) Two sequence variants are closer than 5 bp to each other and are grouped together. (c) If we only observed four of six haplotypes in a population, we can reduce the number of haplotypes in the graph to four as shown here.
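The three panels can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Graphtyper's data structure: the alleles, positions, and function names are hypothetical, and the six haplotypes are modeled here as one site with two alleles followed by one site with three alleles.

```python
from itertools import product


def enumerate_haplotypes(variant_sites):
    """Every path through a chain of variant sites, one allele per site."""
    return list(product(*variant_sites))


def group_nearby(positions, max_gap=5):
    """Group variant positions whose neighbors are closer than max_gap bp."""
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] < max_gap:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


def restrict_to_observed(haplotypes, observed):
    """Keep only the haplotypes that were seen in the population."""
    observed = set(observed)
    return [hap for hap in haplotypes if hap in observed]


# Hypothetical sites: a biallelic SNP and a site with three alleles.
sites = [("A", "G"), ("C", "T", "CTT")]
all_haplotypes = enumerate_haplotypes(sites)
print(len(all_haplotypes))  # 6

# Positions 100 and 103 are closer than 5 bp, so they share a group.
print(group_nearby([100, 103, 120]))  # [[100, 103], [120]]

# Suppose only four of the six haplotypes were observed (made-up data).
seen = [("A", "C"), ("A", "T"), ("G", "C"), ("G", "CTT")]
print(len(restrict_to_observed(all_haplotypes, seen)))  # 4
```

Restricting the graph to observed haplotypes keeps the number of paths from growing multiplicatively with every additional nearby variant.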
(a) Both parents are homozygous and the offspring’s genotype can be inferred. (b) At least one parent is heterozygous and we cannot uniquely infer the offspring’s genotype. We can measure the transmission rate of an allele in these types of trios.
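The two trio cases can be sketched in Python as follows. The genotype encoding (0 for the reference allele, 1 for the alternative allele) and the function names are assumptions for illustration. The transmission rate is computed here under one plausible definition that the source does not spell out: among trios with one heterozygous parent and one homozygous-reference parent, the fraction of offspring carrying the alternative allele.

```python
def infer_offspring_genotype(father, mother):
    """Offspring genotype when both parents are homozygous, else None."""
    if father[0] == father[1] and mother[0] == mother[1]:
        return tuple(sorted((father[0], mother[0])))
    return None


def transmission_rate(trios):
    """Fraction of informative trios in which the alternative allele is passed on.

    A trio is (father, mother, child). It counts as informative here when one
    parent is heterozygous and the other is homozygous reference (an assumed
    definition). Returns None if no trio is informative.
    """
    informative = 0
    transmitted = 0
    for father, mother, child in trios:
        parents = sorted([tuple(sorted(father)), tuple(sorted(mother))])
        if parents == [(0, 0), (0, 1)]:
            informative += 1
            if 1 in child:
                transmitted += 1
    return transmitted / informative if informative else None


print(infer_offspring_genotype((1, 1), (0, 0)))  # (0, 1)
print(infer_offspring_genotype((0, 1), (0, 0)))  # None

# Made-up trios: three are informative, two of those transmit the allele.
example_trios = [
    ((0, 1), (0, 0), (0, 1)),
    ((0, 0), (0, 1), (0, 0)),
    ((1, 1), (0, 0), (0, 1)),
    ((0, 0), (1, 0), (0, 1)),
]
rate = transmission_rate(example_trios)
```

For true germline heterozygous variants the rate is expected to be close to 0.5, so variants that are never transmitted are candidates for sequencing errors or non-germline variation.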
Cite this article
Eggertsson, H., Jonsson, H., Kristmundsdottir, S. et al. Graphtyper enables population-scale genotyping using pangenome graphs. Nat Genet 49, 1654–1660 (2017). https://doi.org/10.1038/ng.3964