Abstract
A fundamental requirement for genetic studies is an accurate determination of sequence variation. While human genome sequence diversity is increasingly well characterized, there is a need for efficient ways to use this knowledge in sequence analysis. Here we present Graphtyper, a publicly available novel algorithm and software for discovering and genotyping sequence variants. Graphtyper realigns short-read sequence data to a pangenome, a variation-aware graph structure that encodes sequence variation within a population by representing possible haplotypes as graph paths. Our results show that Graphtyper is fast, highly scalable, and provides sensitive and accurate genotype calls. Graphtyper genotyped 89.4 million sequence variants in the whole genomes of 28,075 Icelanders using less than 100,000 CPU days, including detailed genotyping of six human leukocyte antigen (HLA) genes. We show that Graphtyper is a valuable tool in characterizing sequence variation in both small and population-scale sequencing studies.
Acknowledgements
We are grateful to our colleagues from deCODE Genetics/Amgen for their contributions. We also wish to thank all research participants who provided biological samples to deCODE Genetics.
Author information
Contributions
H.P.E. implemented the Graphtyper software. H.P.E., P.M., and B.V.H. designed the Graphtyper algorithm. H.P.E., D.F.G., P.M., B.V.H., and K.S. designed the experiments. H.P.E., E.H., G.M., and F.Z. ran all evaluated genotypers. H.P.E., H.J., and K.E.H. analyzed the call sets. Aslaug Jonasdottir, Adalbjorg Jonasdottir, and I.J. were responsible for PCR validation. H.J. and S.K. contributed software for the project. H.P.E. wrote the initial version of the manuscript, and H.J., S.K., B.K., P.M., B.V.H., and K.S. contributed to subsequent versions. All authors reviewed and approved the final version of the manuscript.
Ethics declarations
Competing interests
All authors are employees of deCODE Genetics/Amgen, Inc.
Integrated supplementary information
Supplementary Figure 1 IGV visualization of mapped sequence reads of two Icelanders carrying a 40-bp deletion.
The genomic region shown is chr. 21: 21,559,430–21,559,518 (GRCh38), and the deleted sequence lies between the two vertical green lines. (a) A heterozygous carrier of the deletion. Graphtyper was the only genotyping pipeline that correctly recognized the individual as a carrier. The other pipelines called false sequence variants because of misalignments around the indel (red boxes). (b) A homozygous carrier of the deletion. In this case, most of the reads are correctly mapped as carrying the deletion, but some misalignment artifacts are still observed (red box).
Supplementary Figure 2 Alternative allele transmission rate in 230 Icelandic parent–offspring trios.
All genotyping pipelines have an excess of sequence variants that are never transmitted from parent to offspring, which may be calls due to sequencing error or non-germline variation. Bin width is 0.1.
Supplementary Figure 3 Alternative allele transmission rate in 230 Icelandic parent–offspring trios by SNP mutation type.
The ratio of transitions to transversions is estimated to be around two in the human autosomal genome. We observed that the transition/transversion ratio improved at higher transmission rates, indicating that transmission rate reflects call quality. There was also a large excess of transversions that are not transmitted.
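As an illustration of the quantity discussed in this caption, the following minimal sketch computes a transition/transversion ratio from a list of SNPs. The function names and the example SNP list are hypothetical and are not taken from Graphtyper or from the study data.

```python
# Illustrative sketch only: names and example data are assumptions.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Return True if the SNP swaps two purines or two pyrimidines."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("reference and alternative alleles are identical")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    if transversions == 0:
        return float("inf")
    return transitions / transversions


# Made-up example: four transitions and two transversions.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"),
                ("A", "C"), ("G", "T")]
print(titv_ratio(example_snps))
```

Computing this ratio separately for each transmission-rate bin would give the kind of comparison the figure describes.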
Supplementary Figure 4 Detection of novel alleles.
(a) Semi-global banded alignment of a sequenced read to an extracted reference sequence. (b) Observed variation with respect to the reference sequence.
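The idea in this caption can be sketched as follows: a read is aligned end to end against a reference sequence, gaps before and after the read are free, and the differences are then reported relative to the reference. This sketch is unbanded and uses unit edit costs, both of which are simplifying assumptions. It is not Graphtyper's implementation, and the function name is hypothetical.

```python
# Illustrative sketch only: unbanded, unit-cost semi-global alignment.
def semiglobal_align(read, ref):
    """Align all of `read` to the best-matching substring of `ref`.

    Returns (edits, ref_start, variants). Each variant is a tuple
    (ref_position, ref_allele, read_allele), with "-" marking a gap.
    """
    m, n = len(read), len(ref)
    # cost[i][j]: minimum edits aligning read[:i] to a substring of ref
    # that ends just before reference index j. Row 0 is all zeros, so
    # skipping a reference prefix is free.
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = cost[i - 1][j - 1] + (read[i - 1] != ref[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Skipping a reference suffix is free: take the best end column.
    j = min(range(n + 1), key=lambda col: cost[m][col])
    edits = cost[m][j]

    # Trace back to recover the differences from the reference.
    i, variants = m, []
    while i > 0:
        if j > 0 and cost[i][j] == cost[i - 1][j - 1] + (read[i - 1] != ref[j - 1]):
            if read[i - 1] != ref[j - 1]:
                variants.append((j - 1, ref[j - 1], read[i - 1]))  # mismatch
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            variants.append((j, "-", read[i - 1]))  # base inserted in read
            i -= 1
        else:
            variants.append((j - 1, ref[j - 1], "-"))  # base deleted from read
            j -= 1
    return edits, j, variants[::-1]


edits, start, variants = semiglobal_align("GTATGT", "ACGTACGTAC")
print(edits, start, variants)
```

A banded version would fill only the cells within a fixed distance of the expected diagonal, which reduces run time when the approximate read position is already known.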
Supplementary Figure 5 Merging sequence variants can reduce the number of haplotypes in the graph.
(a) An example of a graph with two sequence variants. The graph has a total of six haplotypes. (b) Two sequence variants are closer than 5 bp to each other and are grouped together. (c) If only four of the six haplotypes are observed in a population, the number of haplotypes in the graph can be reduced to four, as shown here.
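The haplotype counts in this caption can be reproduced with a short sketch. It assumes, purely for illustration, one variant with two alleles and a nearby variant with three alleles, which gives six possible paths. The allele strings, the observed set, and the function names are hypothetical.

```python
# Illustrative sketch only: alleles and observed haplotypes are made up.
from itertools import product


def all_haplotypes(sites):
    """Every combination of one allele per site (all paths through the sites)."""
    return list(product(*sites))


def merge_to_observed(sites, observed):
    """Keep only the haplotypes seen in the population as merged alleles."""
    possible = set(product(*sites))
    unknown = set(observed) - possible
    if unknown:
        raise ValueError(f"haplotypes not representable by the sites: {unknown}")
    return sorted(set(observed))


# Two nearby variants: the first has two alleles, the second has three.
sites = [["A", "G"], ["C", "T", "CT"]]
observed = [("A", "C"), ("A", "T"), ("G", "C"), ("G", "CT"), ("A", "C")]

print(len(all_haplotypes(sites)))               # all paths through both sites
print(len(merge_to_observed(sites, observed)))  # paths kept after merging
```

With these inputs the full graph has six paths and the merged variant keeps the four observed ones, matching the counts in panels (a) and (c).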
Supplementary Figure 6 Mendelian inheritance of alleles in parent–offspring trios.
(a) Both parents are homozygous and the offspring’s genotype can be inferred. (b) At least one parent is heterozygous and we cannot uniquely infer the offspring’s genotype. We can measure the transmission rate of an allele in these types of trios.
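The two cases in this caption can be sketched in code. Genotypes are represented as pairs of allele indices (0 for reference, 1 for alternative). The transmission-rate function makes a simplifying assumption: it counts only trios where one parent is heterozygous and the other is homozygous for the reference allele. The study's exact definition may differ, and all names are hypothetical.

```python
# Illustrative sketch only: genotype encoding and trio filter are assumptions.
def offspring_if_determined(mother, father):
    """Case (a): both parents homozygous, so the offspring genotype is fixed.

    Returns the expected offspring genotype, or None when at least one
    parent is heterozygous (case (b)).
    """
    if mother[0] == mother[1] and father[0] == father[1]:
        return tuple(sorted((mother[0], father[0])))
    return None


def transmission_rate(trios):
    """Fraction of informative trios in which the alternative allele is passed on.

    A trio (mother, father, child) is treated as informative when one
    parent is 0/1 and the other is 0/0. Returns None if no trio qualifies.
    """
    informative = 0
    transmitted = 0
    for mother, father, child in trios:
        parents = sorted([tuple(sorted(mother)), tuple(sorted(father))])
        if parents == [(0, 0), (0, 1)]:
            informative += 1
            if 1 in child:
                transmitted += 1
    if informative == 0:
        return None
    return transmitted / informative


# Made-up trios: (mother, father, child).
example_trios = [
    ((0, 0), (0, 1), (0, 1)),  # informative, allele transmitted
    ((0, 1), (0, 0), (0, 0)),  # informative, allele not transmitted
    ((0, 0), (0, 0), (0, 0)),  # not informative
    ((1, 1), (0, 0), (0, 1)),  # genotype determined, not counted here
]
print(offspring_if_determined((1, 1), (0, 0)))
print(transmission_rate(example_trios))
```

For a true germline variant the rate is expected to be close to 0.5, so variants that are never transmitted stand out as likely errors.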
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–6, Supplementary Tables 1–3 and Supplementary Note (PDF 1978 kb)
About this article
Cite this article
Eggertsson, H., Jonsson, H., Kristmundsdottir, S. et al. Graphtyper enables population-scale genotyping using pangenome graphs. Nat Genet 49, 1654–1660 (2017). https://doi.org/10.1038/ng.3964
DOI: https://doi.org/10.1038/ng.3964