Graphtyper enables population-scale genotyping using pangenome graphs

Eggertsson, Hannes P; Jonsson, Hakon; Kristmundsdottir, Snaedis; Hjartarson, Eirikur; Kehr, Birte; Masson, Gisli; Zink, Florian; Hjorleifsson, Kristjan E; Jonasdottir, Aslaug; Jonasdottir, Adalbjorg; Jonsdottir, Ingileif; Gudbjartsson, Daniel F; Melsted, Pall; Stefansson, Kari; Halldorsson, Bjarni V

doi:10.1038/ng.3964

Technical Report
Published: 25 September 2017

Hannes P Eggertsson (ORCID: orcid.org/0000-0002-1674-9978)¹,²,
Hakon Jonsson (ORCID: orcid.org/0000-0001-6197-494X)¹,
Snaedis Kristmundsdottir¹,³,
Eirikur Hjartarson¹,
Birte Kehr¹,⁴,
Gisli Masson¹,
Florian Zink¹,
Kristjan E Hjorleifsson¹,
Aslaug Jonasdottir¹,
Adalbjorg Jonasdottir¹,
Ingileif Jonsdottir (ORCID: orcid.org/0000-0001-8339-150X)¹,⁵,
Daniel F Gudbjartsson (ORCID: orcid.org/0000-0002-5222-9857)¹,²,
Pall Melsted¹,²,
Kari Stefansson¹,⁵ &
Bjarni V Halldorsson (ORCID: orcid.org/0000-0003-0756-0767)¹,³

Nature Genetics volume 49, pages 1654–1660 (2017)

Abstract

A fundamental requirement for genetic studies is an accurate determination of sequence variation. While human genome sequence diversity is increasingly well characterized, there is a need for efficient ways to use this knowledge in sequence analysis. Here we present Graphtyper, a publicly available novel algorithm and software for discovering and genotyping sequence variants. Graphtyper realigns short-read sequence data to a pangenome, a variation-aware graph structure that encodes sequence variation within a population by representing possible haplotypes as graph paths. Our results show that Graphtyper is fast, highly scalable, and provides sensitive and accurate genotype calls. Graphtyper genotyped 89.4 million sequence variants in the whole genomes of 28,075 Icelanders using less than 100,000 CPU days, including detailed genotyping of six human leukocyte antigen (HLA) genes. We show that Graphtyper is a valuable tool in characterizing sequence variation in both small and population-scale sequencing studies.
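To make the idea of haplotypes as graph paths concrete, the following is a minimal Python sketch. It is not Graphtyper's data structure: the `VariantSite` class, the `enumerate_haplotypes` helper, and the toy sequences are all assumptions made for illustration, and the sketch assumes that variant sites do not overlap.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class VariantSite:
    """A reference position where the population carries several alleles (hypothetical type)."""
    position: int   # 0-based offset into the reference
    alleles: tuple  # alleles[0] is the reference allele


def enumerate_haplotypes(reference, sites):
    """Yield every sequence spelled by a path through the toy graph.

    Each site is a bubble with one branch per allele, so a path picks
    exactly one allele at every site. Sites are assumed not to overlap.
    """
    sites = sorted(sites, key=lambda s: s.position)
    for choice in product(*(s.alleles for s in sites)):
        pieces, cursor = [], 0
        for site, allele in zip(sites, choice):
            pieces.append(reference[cursor:site.position])  # shared backbone
            pieces.append(allele)                           # chosen branch
            cursor = site.position + len(site.alleles[0])   # skip the reference allele
        pieces.append(reference[cursor:])
        yield "".join(pieces)


reference = "ACGTACGT"
sites = [
    VariantSite(2, ("G", "T")),   # a SNP
    VariantSite(5, ("CG", "C")),  # a 1-bp deletion
]
haplotypes = list(enumerate_haplotypes(reference, sites))
```

With one two-allele SNP and one two-allele deletion, the toy graph spells four haplotypes, the first of which is the reference itself. A read carrying both alternative alleles matches one of these paths exactly, whereas against the linear reference it would show a mismatch and a gap.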

**Figure 1: Genotyping pipeline designs.**

**Figure 2: Importance of variation-aware alignment.**

**Figure 3: Graphtyper's sequence alignment algorithm.**


Acknowledgements

We are grateful to our colleagues from deCODE Genetics/Amgen for their contributions. We also wish to thank all research participants who provided biological samples to deCODE Genetics.

Author information

Authors and Affiliations

deCODE Genetics/Amgen, Inc., Reykjavik, Iceland
Hannes P Eggertsson, Hakon Jonsson, Snaedis Kristmundsdottir, Eirikur Hjartarson, Birte Kehr, Gisli Masson, Florian Zink, Kristjan E Hjorleifsson, Aslaug Jonasdottir, Adalbjorg Jonasdottir, Ingileif Jonsdottir, Daniel F Gudbjartsson, Pall Melsted, Kari Stefansson & Bjarni V Halldorsson
School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
Hannes P Eggertsson, Daniel F Gudbjartsson & Pall Melsted
School of Science and Engineering, Reykjavik University, Reykjavik, Iceland
Snaedis Kristmundsdottir & Bjarni V Halldorsson
Berlin Institute of Health (BIH), Berlin, Germany
Birte Kehr
Faculty of Medicine, School of Health Sciences, University of Iceland, Reykjavik, Iceland
Ingileif Jonsdottir & Kari Stefansson

Contributions

H.P.E. implemented the Graphtyper software. H.P.E., P.M., and B.V.H. designed the Graphtyper algorithm. H.P.E., D.F.G., P.M., B.V.H., and K.S. designed the experiments. H.P.E., E.H., G.M., and F.Z. ran all evaluated genotypers. H.P.E., H.J., and K.E.H. analyzed the call sets. Aslaug Jonasdottir, Adalbjorg Jonasdottir, and I.J. were responsible for PCR validation. H.J. and S.K. contributed software for the project. H.P.E. wrote the initial version of the manuscript, and H.J., S.K., B.K., P.M., B.V.H., and K.S. contributed to subsequent versions. All authors reviewed and approved the final version of the manuscript.

Corresponding authors

Correspondence to Hannes P Eggertsson or Bjarni V Halldorsson.

Ethics declarations

Competing interests

All authors are employees of deCODE Genetics/Amgen, Inc.

Integrated supplementary information

Supplementary Figure 1 IGV visualization of mapped sequence reads of two Icelanders carrying a 40-bp deletion.

The genomic region shown is chr. 21: 21,559,430–21,559,518 (GRCh38), and the deleted sequence is between the two vertical green lines. (a) A heterozygous carrier of the deletion. Graphtyper was the only genotyping pipeline that correctly recognized the individual as a carrier. The other pipelines called false sequence variants because of misalignments around the indel (red boxes). (b) A homozygous carrier of the deletion. In this case, most of the reads are correctly mapped as carrying the deletion, but again some misalignment artifacts are observed (red box).

Supplementary Figure 2 Alternative allele transmission rate in 230 Icelandic parent–offspring trios.

All genotyping pipelines have an excess of sequence variants that are never transmitted from parent to offspring, which may be calls due to sequencing error or non-germline variation. Bin width is 0.1.

Supplementary Figure 3 Alternative allele transmission rate in 230 Icelandic parent–offspring trios by SNP mutation type.

The ratio of transitions to transversions is estimated to be around two in the human autosomal genome. We observed that the transition/transversion ratio improved at higher transmission rates, indicating that transmission rate is a measure of call quality. There was also a large excess of transversions that are not transmitted.
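The transition/transversion ratio used in this legend can be computed as in the short Python sketch below. The function names and the example SNP list are assumptions for illustration and are not taken from the paper's analysis code.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition swaps a purine for a purine or a pyrimidine for a pyrimidine."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snps):
    """Return transitions / transversions for a list of (ref, alt) SNPs."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")


# Made-up example: three transitions and one transversion.
example_snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
ratio = ti_tv_ratio(example_snps)
```

Applying `ti_tv_ratio` separately to the SNPs in each transmission-rate bin would give the kind of per-bin comparison the legend describes.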

Supplementary Figure 4 Detection of novel alleles.

(a) Semi-global banded alignment of a sequenced read to an extracted reference sequence. (b) Observed variation with respect to the reference sequence.
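As a rough illustration of what a semi-global banded alignment computes, the Python sketch below aligns a whole read against any stretch of a reference while only filling cells near an expected diagonal. It is a generic unit-cost formulation, not Graphtyper's implementation; the function name, the `band` and `shift` parameters, and the scoring are assumptions.

```python
def semi_global_banded(read, ref, band=3, shift=0):
    """Minimum edit cost of aligning all of `read` to some substring of `ref`.

    Semi-global: reference bases before and after the read are free.
    Banded: only cells with |(j - i) - shift| <= band are filled, where
    `shift` is the expected start offset of the read in `ref`.
    """
    INF = float("inf")
    n, m = len(read), len(ref)
    # Row 0: starting anywhere inside the band costs nothing.
    prev = [0 if abs(j - shift) <= band else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        lo = max(0, i + shift - band)
        hi = min(m, i + shift + band)
        for j in range(lo, hi + 1):
            best = prev[j] + 1  # read base with no reference partner
            if j > 0:
                best = min(
                    best,
                    cur[j - 1] + 1,  # reference base skipped by the read
                    prev[j - 1] + (read[i - 1] != ref[j - 1]),  # match or mismatch
                )
            cur[j] = best
        prev = cur
    return min(prev)  # free end: the read may stop at any reference column


exact = semi_global_banded("ACGT", "TTACGTTT", band=1, shift=2)
one_mismatch = semi_global_banded("ACCT", "TTACGTTT", band=1, shift=2)
```

A traceback over the same table would recover the individual mismatches and gaps, which correspond to the observed variation shown in panel (b).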

Supplementary Figure 5 Merging sequence variants can reduce the number of haplotypes in the graph.

(a) An example of a graph with two sequence variants. The graph has a total of six haplotypes. (b) Two sequence variants are closer than 5 bp to each other and are grouped together. (c) If we only observed four of six haplotypes in a population, we can reduce the number of haplotypes in the graph to four as shown here.
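The counting in this legend can be reproduced with a few lines of Python. This is a sketch under assumptions: the positions, the allele counts (two alleles at one site and three at the other, which gives the six haplotypes in panel a), and the observed haplotype set are invented for illustration.

```python
from math import prod


def group_nearby(positions, max_gap=5):
    """Group variant positions whose neighbours are closer than `max_gap` bp."""
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] < max_gap:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


# Hypothetical sites: position -> number of alleles at that site.
allele_counts = {100: 2, 103: 3}

# (a) Treated independently, the sites combine into 2 * 3 = 6 haplotypes.
all_haplotypes = prod(allele_counts.values())

# (b) The sites are 3 bp apart, so they fall into one group.
groups = group_nearby(allele_counts)

# (c) Only the allele combinations seen in the population are kept.
observed = {("A", "T"), ("A", "TG"), ("C", "T"), ("C", "TGG")}
kept_haplotypes = len(observed)
```

Here the merged group keeps four of the six possible haplotypes, so the graph has fewer paths to consider at that locus.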

Supplementary Figure 6 Mendelian inheritance of alleles in parent–offspring trios.

(a) Both parents are homozygous and the offspring’s genotype can be inferred. (b) At least one parent is heterozygous and we cannot uniquely infer the offspring’s genotype. We can measure the transmission rate of an allele in these types of trios.
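The two trio cases can be sketched in Python as follows. Genotypes are encoded as pairs of allele indices (0 for reference, 1 for alternative), which is an assumption of this example. The transmission rate is computed here only over trios with one heterozygous parent and one homozygous-reference parent; that is one possible definition chosen for illustration, not necessarily the one used in the paper.

```python
def expected_offspring(father, mother):
    """Case (a): if both parents are homozygous, the offspring genotype is fixed."""
    if father[0] == father[1] and mother[0] == mother[1]:
        return tuple(sorted((father[0], mother[0])))
    return None  # case (b): not uniquely determined


def transmission_rate(trios, alt=1, ref=0):
    """Fraction of informative trios in which the offspring carries `alt`.

    Informative (assumed definition): one parent is ref/alt heterozygous
    and the other is homozygous for the reference allele.
    """
    informative = transmitted = 0
    for father, mother, child in trios:
        parents = sorted([sorted(father), sorted(mother)])
        if parents == [[ref, ref], [ref, alt]]:
            informative += 1
            transmitted += alt in child
    return transmitted / informative if informative else None


# Made-up trios as (father, mother, child) genotypes.
trios = [
    ((0, 1), (0, 0), (0, 1)),  # informative, allele transmitted
    ((0, 0), (0, 1), (0, 0)),  # informative, allele not transmitted
    ((0, 0), (0, 0), (0, 0)),  # not informative
    ((0, 1), (0, 1), (1, 1)),  # not informative under this definition
]
rate = transmission_rate(trios)
```

For a true germline variant the rate should be close to 0.5 across many trios, whereas a variant that is never transmitted, as in Supplementary Figure 2, would have a rate near zero.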

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–6, Supplementary Tables 1–3 and Supplementary Note (PDF 1978 kb)

Life Sciences Reporting Summary (PDF 1102 kb)

About this article

Cite this article

Eggertsson, H., Jonsson, H., Kristmundsdottir, S. et al. Graphtyper enables population-scale genotyping using pangenome graphs. Nat Genet 49, 1654–1660 (2017). https://doi.org/10.1038/ng.3964

Received: 20 June 2017
Accepted: 01 September 2017
Published: 25 September 2017
Issue Date: 01 November 2017
DOI: https://doi.org/10.1038/ng.3964
