Accurate genotyping across variant classes and lengths using variant graphs

Abstract

Genotype estimates from short-read sequencing data are typically based on the alignment of reads to a linear reference, but reads originating from more complex variants (for example, structural variants) often align poorly, resulting in biased genotype estimates. This bias can be mitigated by first collecting a set of candidate variants across discovery methods, individuals and databases, and then realigning the reads to the variants and reference simultaneously. However, this realignment problem has proved computationally difficult. Here, we present a new method (BayesTyper) that uses exact alignment of read k-mers to a graph representation of the reference and variants to efficiently perform unbiased, probabilistic genotyping across the variation spectrum. We demonstrate that BayesTyper generally provides superior variant sensitivity and genotyping accuracy relative to existing methods when used to integrate variants across discovery approaches and individuals. Finally, we demonstrate that including a ‘variation-prior’ database containing already known variants significantly improves sensitivity.
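The idea of matching read k-mers exactly against paths through a reference-plus-variants graph can be illustrated with a toy sketch. It is not BayesTyper's implementation or its probabilistic model: the function names, the single-site layout (left flank, allele, right flank) and the plain counting of allele-specific k-mers are all assumptions made for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def allele_kmer_support(ref_left, alleles, ref_right, reads, k):
    """Count read k-mers that exactly match k-mers specific to each allele path.

    alleles maps a hypothetical allele name to its sequence; each path through
    this one-site "graph" is left flank + allele + right flank.
    """
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    path_kmers = {
        name: set(kmers(ref_left + seq + ref_right, k))
        for name, seq in alleles.items()
    }
    support = {}
    for name, kms in path_kmers.items():
        # k-mers shared with any other path cannot discriminate between alleles
        others = set().union(*(v for n, v in path_kmers.items() if n != name))
        support[name] = sum(read_counts[km] for km in kms - others)
    return support


reads = ["TACCGGA", "TACCGGA", "GTACAGG"]
support = allele_kmer_support("ACGTAC", {"ref": "A", "alt": "C"}, "GGATCC", reads, k=3)
print(support)
```

Because only exact k-mer matches are counted, no read-to-reference alignment is needed, and reference and alternative alleles are treated symmetrically in this sketch.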

Fig. 1: BayesTyper.
Fig. 2: Comparison of genotyping methods on PG data (50×).
Fig. 3: Effect of using BayesTyper with a variation prior on structural variation calling performance on the PG (50×) and GoNL (13×) data sets.

Acknowledgements

This work was supported by grants from the Novo Nordisk Foundation (grant number NNF10SA1016550) to A.K. and Innovation Fund Denmark (grant number 019-2011-2). We thank the Platinum Genomes project and the Genome of the Netherlands consortium for providing access to their data.

Author information

Contributions

J.A.S. designed and implemented the algorithm, performed the analyses and wrote the manuscript. L.M. designed and implemented the algorithm, performed the analyses and wrote the manuscript. A.K. designed the algorithm and wrote the manuscript.

Corresponding author

Correspondence to Anders Krogh.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated Supplementary Information

Supplementary Figure 1 Comparison of homopolymer genotyping performance across methods on Platinum Genomes data (50×).

BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from 13 individuals in the Platinum Genomes pedigree using variants discovered by merging calls from four different methods as variant candidate input (Table 1). a,b, The number of called variant alleles (i.e., variant sensitivity) (a) and genotyping accuracy, estimated by validating genotypes using pedigree inheritance information (b), each shown as a function of the reference homopolymer length.

Supplementary Figure 2 Comparison of genotyping methods on Platinum Genomes data (50×) against the Genome in a Bottle ‘ground-truth’ set for NA12878.

BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from 13 individuals in the Platinum Genomes pedigree using variants discovered by merging calls from four different methods as variant candidate input (Table 1). Variants longer than 50 nt were excluded from the analyses. a, Variant allele sensitivity (right) and precision (left) across variant classes; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. b, Sensitivity (two upper panels) and precision (two lower panels) for structural variants as a function of the net change in sequence length relative to the reference (5-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 3 Marginal contribution of discovery methods to the Platinum Genomes (50×) genotypes.

Variant discovery was conducted by merging calls from HaplotypeCaller, Platypus, FreeBayes and Manta across 13 individuals in the Platinum Genomes pedigree (Table 1). Plots show the absolute number of variants called by BayesTyper for each discovery method (top) and the marginal fraction of all called variants identified by each discovery method (bottom) for structural variants as a function of the net change in sequence length relative to the reference (50-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 4 Comparison of genotyping methods on simulated data (30×).

We simulated 30× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, SVTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). Values were aggregated across all ten individuals to provide a single estimate. a, Variant allele sensitivity (right) and precision (left) across variant classes; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. b, Sensitivity (two upper panels) and precision (two lower panels) for structural variants as a function of the net change in sequence length relative to the reference (50-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.
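The per-bin sensitivity and precision described in this caption can be sketched as follows. This is a simplified illustration rather than the benchmark actually used: alleles are compared by exact (chrom, pos, ref, alt) identity, the bin edges come from floor division, and all function names are hypothetical.

```python
from collections import defaultdict


def length_bin(ref, alt, bin_size=50):
    """Lower edge of the bin holding the net length change, or None if it is zero."""
    delta = len(alt) - len(ref)
    if delta == 0:
        return None  # e.g. SNVs, which the caption omits
    # Floor division gives deletions the bins [-50, 0), [-100, -50), ...
    return (delta // bin_size) * bin_size


def binned_sensitivity_precision(called, truth, bin_size=50):
    """called and truth are sets of (chrom, pos, ref, alt) tuples.

    Returns {bin: (sensitivity, precision)}; a value is None when undefined.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for allele in called | truth:
        b = length_bin(allele[2], allele[3], bin_size)
        if b is None:
            continue
        if allele in called and allele in truth:
            tp[b] += 1
        elif allele in called:
            fp[b] += 1
        else:
            fn[b] += 1
    result = {}
    for b in set(tp) | set(fp) | set(fn):
        sens = tp[b] / (tp[b] + fn[b]) if tp[b] + fn[b] else None
        prec = tp[b] / (tp[b] + fp[b]) if tp[b] + fp[b] else None
        result[b] = (sens, prec)
    return result


called = {("1", 100, "A", "A" + "T" * 10), ("1", 200, "A" * 21, "A"), ("1", 300, "A", "G")}
truth = {("1", 100, "A", "A" + "T" * 10), ("1", 400, "A", "A" + "C" * 60)}
stats = binned_sensitivity_precision(called, truth)
```

Exact allele matching undercounts agreement when the same haplotype is represented differently in two call sets, so a real comparison would need a representation-aware matching step.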

Supplementary Figure 5 Comparison of indel genotyping performance across methods on simulated data (30×).

We simulated 30× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). Values were aggregated across all ten individuals to provide a single estimate. Plots show sensitivity (upper two panels) and precision (lower two panels) for indels as a function of the net change in sequence length relative to the reference (5-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 6 Receiver-operator curves on simulated data (30×).

We simulated 30× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). a,b, Receiver-operator curves were computed for genotype quality (posterior probability for BayesTyper) across all methods for each of the ten simulated individuals for SNVs (a) and non-SNVs (b). Triangles indicate the genotype quality threshold used in the benchmark.
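A curve of this kind can be traced by sweeping a threshold over the genotype quality and recording the true- and false-positive rates at each step. The sketch below assumes each call has already been labeled correct or incorrect against the simulated truth; the input layout and function name are hypothetical, and the axes used in the figure may be defined differently.

```python
def roc_points(scored_calls):
    """scored_calls: list of (quality, is_correct) pairs.

    Returns (false-positive rate, true-positive rate) pairs, ordered from the
    strictest quality threshold to the most permissive one.
    """
    pos = sum(1 for _, ok in scored_calls if ok)
    neg = len(scored_calls) - pos
    points = []
    for threshold in sorted({q for q, _ in scored_calls}, reverse=True):
        tp = sum(1 for q, ok in scored_calls if q >= threshold and ok)
        fp = sum(1 for q, ok in scored_calls if q >= threshold and not ok)
        points.append((fp / neg if neg else 0.0, tp / pos if pos else 0.0))
    return points


calls = [(0.99, True), (0.9, True), (0.9, False), (0.5, False)]
curve = roc_points(calls)
```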

Supplementary Figure 7 Comparison of genotyping methods on Genome of the Netherlands data (13×).

BayesTyper, SVTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from 30 individuals (ten parent–offspring trios) from the Genome of the Netherlands (GoNL) project using variant calls for the entire GoNL cohort (n = 769) obtained from the GoNL project as variant candidate input (Table 1). The number of called variant alleles was used as a measure of sensitivity, whereas genotyping accuracy was estimated as the fraction of variants with no Mendelian errors across the ten trios. a, Sensitivity (right) and accuracy (left) across variant classes; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. b, Sensitivity (top, log scale) and accuracy (bottom) for structural variants as a function of the net change in sequence length relative to the reference (50-nt and 5-nt bins for the ±500-nt and ±50-nt scales, respectively). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.
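The accuracy measure in this caption, the fraction of variants with no Mendelian errors across the trios, can be sketched as below. The sketch assumes autosomal, diploid, fully called genotypes written as allele-index pairs; the data layout and function names are hypothetical, and missing genotypes are not handled.

```python
from itertools import product


def is_mendelian_consistent(father, mother, child):
    """True if the child's unordered genotype can be formed from one allele of each parent."""
    target = tuple(sorted(child))
    return any(tuple(sorted((f, m))) == target for f, m in product(father, mother))


def mendelian_accuracy(variants):
    """Fraction of variants with no Mendelian error in any trio.

    variants: list of variants, each a list of (father, mother, child)
    genotype tuples, one per trio.
    """
    if not variants:
        return 0.0
    consistent = sum(
        all(is_mendelian_consistent(*trio) for trio in trios) for trios in variants
    )
    return consistent / len(variants)


variants = [
    [((0, 1), (0, 0), (0, 1)), ((1, 1), (0, 1), (1, 1))],  # consistent in both trios
    [((0, 0), (0, 0), (0, 1)), ((0, 1), (0, 1), (0, 0))],  # error in the first trio
]
accuracy = mendelian_accuracy(variants)
```

Mendelian consistency is a lower bound on error: a trio can be consistent even when a genotype is wrong, for example when all three individuals are miscalled in the same way.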

Supplementary Figure 8 Comparison of homopolymer genotyping performance across methods on Genome of the Netherlands data (13×).

BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from 30 individuals (ten parent–offspring trios) from the Genome of the Netherlands (GoNL) project using variant calls for the entire GoNL cohort (n = 769) obtained from the GoNL project as variant candidate input (Table 1). a,b, The number of called variant alleles (i.e., variant sensitivity) (a) and genotyping accuracy, assessed as the fraction of variants with no Mendelian errors across the ten trios (b), each shown as a function of the reference homopolymer length.

Supplementary Figure 9 Comparison of genotyping methods on simulated data (10×).

We simulated 10× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, SVTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). Values were aggregated across all ten individuals to provide a single estimate. a, Variant allele sensitivity (right) and genotyping accuracy (left) across variant classes; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. b, Sensitivity (upper two panels) and precision (lower two panels) for structural variants as a function of the net change in sequence length relative to the reference (50-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 10 Comparison of indel genotyping performance across methods on simulated data (10×).

We simulated 10× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). Values were aggregated across all ten individuals to provide a single estimate. Plots show sensitivity (upper two panels) and precision (lower two panels) for indel variants as a function of the net change in sequence length relative to the reference (5-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 11 Receiver-operator curves on simulated data (10×).

We simulated 10× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). a,b, Receiver-operator curves were computed for genotype quality (posterior probability for BayesTyper) across all methods for each of the ten simulated individuals for SNVs (a) and non-SNVs (b). Triangles indicate the genotype quality threshold used in the benchmark.

Supplementary Figure 12 Effect of using BayesTyper with a variation prior on genotyping performance across variant classes on the Platinum Genomes (50×) and Genome of the Netherlands (13×) datasets.

A ‘variation prior’ database was constructed by combining SNVs and structural variants from different databases and studies (Supplementary Table 1). BayesTyper was then run on variant candidates obtained by merging the variation prior with variants discovered using four different methods (Table 1). a, Sensitivity (right) and accuracy (left) of BayesTyper across variant classes on the Platinum Genomes (50×) dataset with and without the variation prior; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. Genotyping accuracy was estimated by validating genotypes using pedigree inheritance information. b, Same analyses as in a when running BayesTyper on the Genome of the Netherlands data (13×), where genotyping accuracy was estimated as the fraction of variants with no Mendelian errors across the ten trios.

Supplementary Figure 13 Effect of changing the ‘maximum allele length’ threshold in BayesTyper on genotyping performance across variant classes on the Platinum Genomes data (50×).

BayesTyper was run on data from 13 individuals in the Platinum Genomes pedigree using variants discovered by merging calls from four different methods as variant candidate input (Table 1) and using two different thresholds for the maximum allele length (10,000 and 500,000 nt). The number of called variant alleles was used as a measure of sensitivity, whereas the genotyping accuracy was estimated by validating genotypes using pedigree inheritance information. a, Sensitivity (right) and accuracy (left) across variant classes; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. b, Sensitivity (top, log scale) and accuracy (bottom) for structural variants as a function of the net change in sequence length relative to the reference (50-nt and 5-nt bins for the ±500-nt and ±50-nt scales, respectively). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 14 Variant cluster group definition.

Variant clusters within the copied or deleted sequence of an upstream structural variant are dependent, as they will share k-mers. These clusters are therefore defined to belong to the same inference group. Their dependency structure is represented as a tree, where the colors correspond to different variant clusters (VC); green and purple triangles are deletions, and the red triangle is a copy number insertion.
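One way to picture how clusters that share k-mers end up in the same inference group is a union-find pass over each cluster's k-mer set. This is an illustrative sketch only: the caption describes a tree-shaped dependency structure, whereas the code below only recovers the group membership, and the function name and input layout are hypothetical.

```python
def group_clusters_by_shared_kmers(cluster_kmers):
    """cluster_kmers: {cluster_id: set of k-mers}.

    Returns a sorted list of groups; clusters land in the same group when they
    are linked, directly or transitively, by at least one shared k-mer.
    """
    parent = {cid: cid for cid in cluster_kmers}

    def find(x):
        # Follow parent links to the root, halving the path along the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    owner = {}  # first cluster seen for each k-mer
    for cid, kms in cluster_kmers.items():
        for km in kms:
            if km in owner:
                parent[find(cid)] = find(owner[km])
            else:
                owner[km] = cid

    groups = {}
    for cid in cluster_kmers:
        groups.setdefault(find(cid), []).append(cid)
    return sorted(sorted(members) for members in groups.values())


clusters = {
    "VC1": {"AAA", "AAC"},
    "VC2": {"AAC", "CCG"},
    "VC3": {"TTT"},
    "VC4": {"CCG"},
}
groups = group_clusters_by_shared_kmers(clusters)
```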

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–14, Supplementary Tables 1–4 and Supplementary Note

Cite this article

Sibbesen, J.A., Maretty, L., The Danish Pan-Genome Consortium. et al. Accurate genotyping across variant classes and lengths using variant graphs. Nat Genet 50, 1054–1059 (2018). https://doi.org/10.1038/s41588-018-0145-5

  • DOI: https://doi.org/10.1038/s41588-018-0145-5
