Graphtyper enables population-scale genotyping using pangenome graphs

Abstract

A fundamental requirement for genetic studies is an accurate determination of sequence variation. While human genome sequence diversity is increasingly well characterized, there is a need for efficient ways to use this knowledge in sequence analysis. Here we present Graphtyper, a publicly available novel algorithm and software for discovering and genotyping sequence variants. Graphtyper realigns short-read sequence data to a pangenome, a variation-aware graph structure that encodes sequence variation within a population by representing possible haplotypes as graph paths. Our results show that Graphtyper is fast, highly scalable, and provides sensitive and accurate genotype calls. Graphtyper genotyped 89.4 million sequence variants in the whole genomes of 28,075 Icelanders using less than 100,000 CPU days, including detailed genotyping of six human leukocyte antigen (HLA) genes. We show that Graphtyper is a valuable tool in characterizing sequence variation in both small and population-scale sequencing studies.

Figure 1: Genotyping pipeline designs.
Figure 2: Importance of variation-aware alignment.
Figure 3: Graphtyper's sequence alignment algorithm.
Figure 4: Genotyping time summary.

Acknowledgements

We are grateful to our colleagues from deCODE Genetics/Amgen for their contributions. We also wish to thank all research participants who provided biological samples to deCODE Genetics.

Author information

Contributions

H.P.E. implemented the Graphtyper software. H.P.E., P.M., and B.V.H. designed the Graphtyper algorithm. H.P.E., D.F.G., P.M., B.V.H., and K.S. designed the experiments. H.P.E., E.H., G.M., and F.Z. ran all evaluated genotypers. H.P.E., H.J., and K.E.H. analyzed the call sets. Aslaug Jonasdottir, Adalbjorg Jonasdottir, and I.J. were responsible for PCR validation. H.J. and S.K. contributed software for the project. H.P.E. wrote the initial version of the manuscript, and H.J., S.K., B.K., P.M., B.V.H., and K.S. contributed to subsequent versions. All authors reviewed and approved the final version of the manuscript.

Corresponding authors

Correspondence to Hannes P Eggertsson or Bjarni V Halldorsson.

Ethics declarations

Competing interests

All authors are employees of deCODE Genetics/Amgen, Inc.

Integrated supplementary information

Supplementary Figure 1 IGV visualization of mapped sequence reads of two Icelanders carrying a 40-bp deletion.

The genomic region shown is chr. 21: 21,559,430–21,559,518 (GRCh38), and the deleted sequence is between the two vertical green lines. (a) A heterozygous carrier of the deletion. Graphtyper was the only genotyping pipeline that correctly recognized the individual as a carrier. The other pipelines called false sequence variants due to misalignments around the indel (red boxes). (b) A homozygous carrier of the deletion. In this case, most of the reads are correctly mapped as carrying the deletion, but again some misalignment artifacts are observed (red box).

Supplementary Figure 2 Alternative allele transmission rate in 230 Icelandic parent–offspring trios.

All genotyping pipelines have an excess of sequence variants that are never transmitted from parent to offspring, which may be calls due to sequencing error or non-germline variation. Bin width is 0.1.

Supplementary Figure 3 Alternative allele transmission rate in 230 Icelandic parent–offspring trios by SNP mutation type.

The mutation ratio of transitions and transversions is estimated to be around two in the human autosomal genome. We observed that the transition/transversion ratio improved at higher transmission rates, indicating that transmission rate is measuring quality. There was also a large excess of transversions that are not transmitted.
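As a rough illustration of the quantity discussed in this legend, the following sketch computes a transition/transversion ratio from a list of SNPs. The function names and the toy SNP list are hypothetical and are not taken from the Graphtyper software or the study data; each SNP is assumed to be a (reference, alternative) pair of distinct bases.

```python
# Purines and pyrimidines; a transition stays within one class,
# a transversion switches between the two classes.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Return True for A<->G or C<->T substitutions."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def ts_tv_ratio(snps):
    """Return transitions / transversions for (ref, alt) pairs.

    Assumes every pair holds two different bases from A, C, G, T.
    """
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    if tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return ts / tv


# Hypothetical toy SNP list (not data from the study):
# four transitions and two transversions.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"),
            ("T", "C"), ("A", "C"), ("G", "T")]
print(ts_tv_ratio(toy_snps))
```

Computing this ratio separately for each transmission-rate bin would give the kind of comparison the legend describes, with the ratio expected to approach two in bins dominated by true germline variants.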

Supplementary Figure 4 Detection of novel alleles.

(a) Semi-global banded alignment of a sequenced read to an extracted reference sequence. (b) Observed variation with respect to the reference sequence.
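A minimal sketch of semi-global alignment may help here: the read must be aligned end to end, while unaligned reference sequence before and after the read is not penalized. This is not Graphtyper's implementation. The scoring values and function name are assumptions, and the band that restricts the search to cells near the expected diagonal is omitted for brevity.

```python
def semi_global_align(read, ref, match=1, mismatch=-1, gap=-2):
    """Align all of `read` against any substring of `ref`.

    Returns (best_score, ref_end), where ref_end is the position in
    `ref` just past the last aligned reference base. Scoring values
    are illustrative assumptions. No band is applied.
    """
    n, m = len(read), len(ref)
    # Row 0 is all zeros: skipping a reference prefix is free.
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        # Column 0: read bases aligned against nothing cost gaps.
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,   # match or mismatch
                          prev[j] + gap,       # read base against a gap
                          curr[j - 1] + gap)   # reference base against a gap
        prev = curr
    # Skipping a reference suffix is free: take the best cell of the last row.
    best_end = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_end], best_end


print(semi_global_align("ACGT", "TTACGTTT"))
```

A traceback from the best cell would yield the mismatches and gaps relative to the reference, which corresponds to the observed variation shown in panel (b).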

Supplementary Figure 5 Merging sequence variants can reduce the number of haplotypes in the graph.

(a) An example of a graph with two sequence variants. The graph has a total of six haplotypes. (b) Two sequence variants that are less than 5 bp apart are grouped together. (c) If only four of the six haplotypes are observed in a population, the number of haplotypes in the graph can be reduced to four, as shown here.
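The counting argument in this legend can be sketched as follows. The alleles are hypothetical: one variant with two alleles and one with three is simply one combination that yields six haplotypes, and the figure itself may use a different example. The function names are likewise illustrative and do not come from the Graphtyper code base.

```python
from itertools import product


def all_haplotypes(variant_alleles):
    """Enumerate every combination of alleles across grouped variants."""
    return [tuple(combo) for combo in product(*variant_alleles)]


def reduce_to_observed(variant_alleles, observed):
    """Keep only haplotypes that are both possible and observed.

    Duplicates in `observed` are dropped; first-seen order is kept.
    """
    possible = set(all_haplotypes(variant_alleles))
    kept = []
    for hap in observed:
        if hap in possible and hap not in kept:
            kept.append(hap)
    return kept


# Hypothetical alleles: 2 alleles x 3 alleles = 6 possible haplotypes.
alleles = [["A", "G"], ["C", "T", "CT"]]
# Hypothetical population sample in which only four haplotypes occur.
sample = [("A", "C"), ("A", "T"), ("G", "C"),
          ("A", "C"), ("G", "CT"), ("G", "C")]

print(len(all_haplotypes(alleles)))
print(reduce_to_observed(alleles, sample))
```

Because the number of possible haplotypes grows multiplicatively with each grouped variant, restricting the graph to observed haplotypes keeps the number of paths manageable.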

Supplementary Figure 6 Mendelian inheritance of alleles in parent–offspring trios.

(a) Both parents are homozygous and the offspring’s genotype can be inferred. (b) At least one parent is heterozygous and we cannot uniquely infer the offspring’s genotype. We can measure the transmission rate of an allele in these types of trios.
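The two cases in this legend can be sketched in a few lines. The definition of transmission rate used below is a simplifying assumption, since the legend does not spell one out: only trios in which exactly one parent carries a single copy of the allele and the other parent carries none are counted. Function names and the toy trios are hypothetical.

```python
def infer_offspring_genotype(father, mother):
    """Return the offspring genotype if both parents are homozygous.

    Genotypes are 2-tuples of alleles. Returns None when at least one
    parent is heterozygous, because the genotype is then not unique.
    """
    if father[0] == father[1] and mother[0] == mother[1]:
        return tuple(sorted((father[0], mother[0])))
    return None


def transmission_rate(trios, allele):
    """Fraction of informative trios in which the child carries `allele`.

    Assumption: a trio is informative when one parent has exactly one
    copy of the allele and the other parent has none.
    """
    informative = transmitted = 0
    for father, mother, child in trios:
        copies = sorted((father.count(allele), mother.count(allele)))
        if copies != [0, 1]:
            continue
        informative += 1
        if allele in child:
            transmitted += 1
    return transmitted / informative if informative else None


# Hypothetical toy trios as (father, mother, child) genotypes.
toy_trios = [
    (("A", "G"), ("A", "A"), ("A", "G")),  # transmitted
    (("A", "A"), ("A", "G"), ("A", "A")),  # not transmitted
    (("A", "G"), ("A", "A"), ("A", "A")),  # not transmitted
    (("A", "A"), ("A", "G"), ("A", "G")),  # transmitted
    (("G", "G"), ("A", "A"), ("A", "G")),  # both homozygous: skipped
]

print(infer_offspring_genotype(("A", "A"), ("G", "G")))
print(transmission_rate(toy_trios, "G"))
```

Under this definition a true germline variant is expected to be transmitted in about half of the informative trios, so variants with a rate near zero are candidates for sequencing error or non-germline variation.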

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–6, Supplementary Tables 1–3 and Supplementary Note (PDF 1978 kb)

Life Sciences Reporting Summary (PDF 1102 kb)

Cite this article

Eggertsson, H., Jonsson, H., Kristmundsdottir, S. et al. Graphtyper enables population-scale genotyping using pangenome graphs. Nat Genet 49, 1654–1660 (2017). https://doi.org/10.1038/ng.3964
