Accurate genotyping across variant classes and lengths using variant graphs

Sibbesen, Jonas Andreas; Maretty, Lasse; Krogh, Anders

doi:10.1038/s41588-018-0145-5

Technical Report
Published: 18 June 2018

Jonas Andreas Sibbesen, Lasse Maretty, The Danish Pan-Genome Consortium & Anders Krogh (ORCID: orcid.org/0000-0002-5147-6282)

Nature Genetics volume 50, pages 1054–1059 (2018)

Abstract

Genotype estimates from short-read sequencing data are typically based on the alignment of reads to a linear reference, but reads originating from more complex variants (for example, structural variants) often align poorly, resulting in biased genotype estimates. This bias can be mitigated by first collecting a set of candidate variants across discovery methods, individuals and databases, and then realigning the reads to the variants and reference simultaneously. However, this realignment problem has proved computationally difficult. Here, we present a new method (BayesTyper) that uses exact alignment of read k-mers to a graph representation of the reference and variants to efficiently perform unbiased, probabilistic genotyping across the variation spectrum. We demonstrate that BayesTyper generally provides superior variant sensitivity and genotyping accuracy relative to existing methods when used to integrate variants across discovery approaches and individuals. Finally, we demonstrate that including a ‘variation-prior’ database containing already known variants significantly improves sensitivity.
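
To make the k-mer idea concrete, the following is a minimal Python sketch and not BayesTyper's implementation. It enumerates the k-mers along the reference and alternate paths through a single variant site, keeps the k-mers unique to each allele, and counts exact matches among read k-mers. The function names, the toy sequences and the choice of k = 5 are illustrative assumptions, only the forward strand is considered, and the probabilistic genotyping model itself is not reproduced here.

```python
from collections import Counter

def kmers(seq, k):
    """Return all k-mers of seq in order (forward strand only)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def allele_specific_kmers(left, alleles, right, k):
    """Map each allele name to the k-mers found only on that allele's path."""
    paths = {name: set(kmers(left + seq + right, k))
             for name, seq in alleles.items()}
    specific = {}
    for name, path_kmers in paths.items():
        # Union of the k-mers on every other allele's path.
        others = set().union(*(v for n, v in paths.items() if n != name))
        specific[name] = path_kmers - others
    return specific

def count_support(reads, specific, k):
    """Count exact read k-mer matches to each allele-specific k-mer set."""
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    return {name: sum(read_counts[km] for km in ks)
            for name, ks in specific.items()}

# Toy site (hypothetical sequences): an SNV with flanking reference sequence.
K = 5
left, right = "ACGTACGGTC", "TTGACCATGA"
alleles = {"ref": "G", "alt": "C"}
reads = ["CGGTCCTTGAC", "CGGTCCTTGAC"]  # both reads carry the alt allele

specific = allele_specific_kmers(left, alleles, right, K)
support = count_support(reads, specific, K)
print(support)
```

In this toy case only the alternate allele receives k-mer support, which is the kind of evidence a k-mer-based genotyper would weigh when assigning genotype probabilities.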

**Fig. 2: Comparison of genotyping methods on PG data (50×).**

**Fig. 3: Effect of using BayesTyper with a variation prior on structural variation calling performance on the PG (50×) and GoNL (13×) data sets.**

Acknowledgements

This work was supported by grants from the Novo Nordisk Foundation (grant number NNF10SA1016550) to A.K. and Innovation Fund Denmark (grant number 019-2011-2). We thank the Platinum Genomes project and the Genome of the Netherlands consortium for providing access to their data.

Author information

These authors contributed equally: Jonas Andreas Sibbesen, Lasse Maretty.
A complete list of consortium members is provided in the Supplementary Note.

Authors and Affiliations

The Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
Jonas Andreas Sibbesen, Lasse Maretty & Anders Krogh

Consortia

The Danish Pan-Genome Consortium

Contributions

J.A.S. designed and implemented the algorithm, performed the analyses and wrote the manuscript. L.M. designed and implemented the algorithm, performed the analyses and wrote the manuscript. A.K. designed the algorithm and wrote the manuscript.

Corresponding author

Correspondence to Anders Krogh.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated Supplementary Information

Supplementary Figure 1 Comparison of homopolymer genotyping performance across methods on Platinum Genomes data (50×).

BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from 13 individuals in the Platinum Genomes pedigree using variants discovered by merging calls from four different methods as variant candidate input (Table 1). a,b, The number of called variant alleles (i.e., variant sensitivity) (a) and genotyping accuracy (b) estimated by validating genotypes using pedigree inheritance information shown as a function of the reference homopolymer length.

Supplementary Figure 2 Comparison of genotyping methods on Platinum Genomes data (50×) against the Genome in a Bottle ‘ground-truth’ set for NA12878.

BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from 13 individuals in the Platinum Genomes pedigree using variants discovered by merging calls from four different methods as variant candidate input (Table 1). Variants longer than 50 nt were excluded from the analyses. a, Variant allele sensitivity (right) and precision (left) across variant classes; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. b, Sensitivity (two upper panels) and precision (two lower panels) for structural variants as a function of the net change in sequence length relative to the reference (5-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.
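
As an illustration of the two metrics named in this caption, the sketch below computes sensitivity and precision of called variant alleles against a truth set. It assumes that alleles have already been normalized so that exact matching on hypothetical (chromosome, position, ref, alt) tuples is meaningful; benchmark comparisons in practice may use more elaborate, haplotype-aware matching, which is not shown.

```python
def sensitivity_precision(called, truth):
    """Return (sensitivity, precision) for two sets of allele tuples."""
    true_positives = len(called & truth)
    sensitivity = true_positives / len(truth) if truth else float("nan")
    precision = true_positives / len(called) if called else float("nan")
    return sensitivity, precision

# Toy data (hypothetical alleles).
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "AT", "A"),
    ("chr1", 400, "C", "CTTG"),
    ("chr2", 50, "G", "T"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 400, "C", "CTTG"),
    ("chr2", 75, "T", "C"),  # not in the truth set: a false positive
}

sens, prec = sensitivity_precision(called, truth)
print(f"sensitivity={sens:.3f} precision={prec:.3f}")
```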

Supplementary Figure 3 Marginal contribution of discovery methods to the Platinum Genomes (50×) genotypes.

Variant discovery was conducted by merging calls from HaplotypeCaller, Platypus, FreeBayes and Manta across 13 individuals in the Platinum Genomes pedigree (Table 1). Plots show the absolute number of variants called by BayesTyper for each discovery method (top) and the marginal fraction of all called variants identified by each discovery method (bottom) for structural variants as a function of the net change in sequence length relative to the reference (50-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 4 Comparison of genotyping methods on simulated data (30×).

We simulated 30× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, SVTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). Values were aggregated across all ten individuals to provide a single estimate. a, Variant allele sensitivity (right) and precision (left) across variant classes; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. b, Sensitivity (two upper panels) and precision (two lower panels) for structural variants as a function of the net change in sequence length relative to the reference (50-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.
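
The binning by net length change used in panel b can be sketched as follows. The caption specifies only the bin width (50 nt) and that alleles with no net length change are omitted; the bin boundaries chosen here (1 to 50, 51 to 100, and mirrored for deletions) and the function names are assumptions made for illustration.

```python
from collections import Counter

def net_length_change(ref, alt):
    """Net change in sequence length of the alt allele relative to ref."""
    return len(alt) - len(ref)

def length_bin(delta, width=50):
    """Return the (low, high) bin containing delta, or None if delta is 0."""
    if delta == 0:
        return None  # e.g. SNVs: no net change, omitted from the plot
    index = (abs(delta) - 1) // width
    low, high = index * width + 1, (index + 1) * width
    return (low, high) if delta > 0 else (-high, -low)

def bin_counts(alleles, width=50):
    """Count alleles per net-length-change bin, skipping zero-change alleles."""
    counts = Counter()
    for ref, alt in alleles:
        b = length_bin(net_length_change(ref, alt), width)
        if b is not None:
            counts[b] += 1
    return counts

# Toy alleles (hypothetical): insertion, deletion, SNV, longer insertion.
alleles = [("A", "ATTT"), ("ACGT", "A"), ("A", "G"), ("A", "A" + "T" * 60)]
counts = bin_counts(alleles)
print(sorted(counts.items()))
```

Sensitivity and precision would then be computed separately within each bin.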

Supplementary Figure 5 Comparison of indel genotyping performance across methods on simulated data (30×).

We simulated 30× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). Values were aggregated across all ten individuals to provide a single estimate. Plots show sensitivity (upper two panels) and precision (lower two panels) for indels as a function of the net change in sequence length relative to the reference (5-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 6 Receiver-operator curves on simulated data (30×).

We simulated 30× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). a,b, Receiver-operator curves were computed for genotype quality (posterior probability for BayesTyper) across all methods for each of the ten simulated individuals for SNVs (a) and non-SNVs (b). Triangles indicate the genotype quality threshold used in the benchmark.
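
A curve of this kind can be traced by sweeping a threshold over the genotype quality (or posterior probability) and recording the error rates at each step. The sketch below uses the textbook axes, true-positive rate against false-positive rate among the called genotypes; the caption does not state the exact axes used in the figure, so this choice and the input layout are assumptions.

```python
def roc_points(calls):
    """Return (fpr, tpr) points from a list of (quality, is_correct) calls.

    Calls are accepted in order of decreasing quality; calls with equal
    quality are accepted together so each threshold yields one point.
    """
    positives = sum(1 for _, ok in calls if ok)
    negatives = len(calls) - positives
    ordered = sorted(calls, key=lambda c: c[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ordered):
        quality = ordered[i][0]
        while i < len(ordered) and ordered[i][0] == quality:
            if ordered[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr = fp / negatives if negatives else 0.0
        tpr = tp / positives if positives else 0.0
        points.append((fpr, tpr))
    return points

# Toy calls (hypothetical): (genotype quality, genotype is correct).
calls = [(0.99, True), (0.9, True), (0.8, False), (0.6, True), (0.5, False)]
points = roc_points(calls)
print(points)
```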

Supplementary Figure 7 Comparison of genotyping methods on Genome of the Netherlands data (13×).

BayesTyper, SVTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from 30 individuals (ten parent–offspring trios) from the Genome of the Netherlands (GoNL) project using variant calls for the entire GoNL cohort (n = 769) obtained from the GoNL project as variant candidate input (Table 1). The number of called variant alleles was used as a measure of sensitivity, whereas genotyping accuracy was estimated as the fraction of variants with no Mendelian errors across the ten trios. a, Sensitivity (right) and accuracy (left) across variant classes; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. b, Sensitivity (top, log scale) and accuracy (bottom) for structural variants as a function of the net change in sequence length relative to the reference (50-nt and 5-nt bins for the ±500-nt and ±50-nt scales, respectively). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.
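
The trio-based accuracy measure described in this caption can be sketched as below. The code assumes diploid autosomal genotypes encoded as pairs of allele indices, with no missing genotypes; the data layout and function names are illustrative assumptions rather than the evaluation code used in the study.

```python
from itertools import product

def mendelian_consistent(child, mother, father):
    """True if the child can inherit one allele from each parent."""
    target = sorted(child)
    return any(sorted((m, f)) == target for m, f in product(mother, father))

def accuracy_no_mendelian_errors(variants):
    """Fraction of variants with no Mendelian error in any trio.

    variants: list of variants, each a list of (child, mother, father)
    genotype triples, one triple per trio.
    """
    if not variants:
        return float("nan")
    consistent = sum(
        all(mendelian_consistent(*trio) for trio in trios)
        for trios in variants
    )
    return consistent / len(variants)

# Toy data (hypothetical): two variants, two trios each.
variants = [
    [((0, 1), (0, 0), (1, 1)), ((0, 0), (0, 1), (0, 1))],  # no errors
    [((1, 1), (0, 0), (0, 1)), ((0, 1), (0, 1), (0, 0))],  # first trio fails
]
accuracy = accuracy_no_mendelian_errors(variants)
print(accuracy)
```

Because a variant counts as correct only when every trio is consistent, this measure becomes stricter as the number of trios grows.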

Supplementary Figure 8 Comparison of homopolymer genotyping performance across methods on Genome of the Netherlands data (13×).

BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from 30 individuals (ten parent–offspring trios) from the Genome of the Netherlands (GoNL) project using variant calls for the entire GoNL cohort (n = 769) obtained from the GoNL project as variant candidate input (Table 1). a,b, The number of called variant alleles (i.e., variant sensitivity) (a) and genotyping accuracy (b) assessed by the fraction of variants with no Mendelian errors across the ten trios shown as a function of the reference homopolymer length.

Supplementary Figure 9 Comparison of genotyping methods on simulated data (10×).

We simulated 10× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, SVTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). Values were aggregated across all ten individuals to provide a single estimate. a, Variant allele sensitivity (right) and genotyping accuracy (left) across variant classes; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. b, Sensitivity (upper two panels) and precision (lower two panels) for structural variants as a function of the net change in sequence length relative to the reference (50-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 10 Comparison of indel genotyping performance across methods on simulated data (10×).

We simulated 10× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). Values were aggregated across all ten individuals to provide a single estimate. Plots show sensitivity (upper two panels) and precision (lower two panels) for indel variants as a function of the net change in sequence length relative to the reference (5-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 11 Receiver-operator curves on simulated data (10×).

We simulated 10× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). a,b, Receiver-operator curves were computed for genotype quality (posterior probability for BayesTyper) across all methods for each of the ten simulated individuals for SNVs (a) and non-SNVs (b). Triangles indicate the genotype quality threshold used in the benchmark.

Supplementary Figure 12 Effect of using BayesTyper with a variation prior on genotyping performance across variant classes on the Platinum Genomes (50×) and Genome of the Netherlands (13×) datasets.

A ‘variation prior’ database was constructed by combining SNVs and structural variants from different databases and studies (Supplementary Table 1). BayesTyper was then run on variant candidates obtained by merging the variation prior with variants discovered using four different methods (Table 1). a, Sensitivity (right) and accuracy (left) of BayesTyper across variant classes on the Platinum Genomes (50×) dataset with and without the variation prior; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. Genotyping accuracy was estimated by validating genotypes using pedigree inheritance information. b, Same analyses as in a when running BayesTyper on the Genome of the Netherlands data (13×), where genotyping accuracy was estimated as the fraction of variants with no Mendelian errors across the ten trios.

Supplementary Figure 13 Effect of changing the ‘maximum allele length’ threshold in BayesTyper on genotyping performance across variant classes on the Platinum Genomes data (50×).

BayesTyper was run on data from 13 individuals in the Platinum Genomes pedigree using variants discovered by merging calls from four different methods as variant candidate input (Table 1) and using two different thresholds for the maximum allele length (10,000 and 500,000 nt). The number of called variant alleles was used as a measure of sensitivity, whereas the genotyping accuracy was estimated by validating genotypes using pedigree inheritance information. a, Sensitivity (right) and accuracy (left) across variant classes; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. b, Sensitivity (top, log scale) and accuracy (bottom) for structural variants as a function of the net change in sequence length relative to the reference (50-nt and 5-nt bins for the ±500-nt and ±50-nt scales, respectively). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 14 Variant cluster group definition.

Variant clusters within the copied or deleted sequence of an upstream structural variant are dependent, as they will share k-mers. These clusters are therefore defined to belong to the same inference group. Their dependency structure is represented as a tree, where the colors correspond to different variant clusters (VC); green and purple triangles are deletions, and the red triangle is a copy number insertion.
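
One way to picture the grouping is as connected components: any two variant clusters that share a k-mer end up in the same inference group. The sketch below implements that idea with a small union-find structure. It is an assumption-laden illustration only; it does not build the dependency tree described in the caption, and the cluster identifiers and k-mer sets are hypothetical.

```python
def group_clusters_by_shared_kmers(cluster_kmers):
    """Group cluster ids that are linked, directly or transitively, by a shared k-mer.

    cluster_kmers: dict mapping cluster id -> set of k-mers in that cluster.
    Returns a list of sets of cluster ids.
    """
    parent = {cid: cid for cid in cluster_kmers}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # k-mer -> first cluster seen containing it
    for cid, kmer_set in cluster_kmers.items():
        for kmer in kmer_set:
            if kmer in first_owner:
                parent[find(cid)] = find(first_owner[kmer])
            else:
                first_owner[kmer] = cid

    groups = {}
    for cid in cluster_kmers:
        groups.setdefault(find(cid), set()).add(cid)
    return list(groups.values())

# Toy clusters (hypothetical k-mers): VC1, VC2 and VC4 are linked; VC3 is not.
cluster_kmers = {
    "VC1": {"AAA", "AAC"},
    "VC2": {"AAC", "CCG"},
    "VC3": {"GGG"},
    "VC4": {"CCG", "TTT"},
}
groups = group_clusters_by_shared_kmers(cluster_kmers)
print(sorted(sorted(g) for g in groups))
```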

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–14, Supplementary Tables 1–4 and Supplementary Note

Reporting Summary

About this article

Cite this article

Sibbesen, J.A., Maretty, L., The Danish Pan-Genome Consortium. et al. Accurate genotyping across variant classes and lengths using variant graphs. Nat Genet 50, 1054–1059 (2018). https://doi.org/10.1038/s41588-018-0145-5

Received: 15 June 2016
Accepted: 20 April 2018
Published: 18 June 2018
Issue Date: July 2018
DOI: https://doi.org/10.1038/s41588-018-0145-5
