Accurate genotyping across variant classes and lengths using variant graphs

Abstract

Genotype estimates from short-read sequencing data are typically based on the alignment of reads to a linear reference, but reads originating from more complex variants (for example, structural variants) often align poorly, resulting in biased genotype estimates. This bias can be mitigated by first collecting a set of candidate variants across discovery methods, individuals and databases, and then realigning the reads to the variants and reference simultaneously. However, this realignment problem has proved computationally difficult. Here, we present a new method (BayesTyper) that uses exact alignment of read k-mers to a graph representation of the reference and variants to efficiently perform unbiased, probabilistic genotyping across the variation spectrum. We demonstrate that BayesTyper generally provides superior variant sensitivity and genotyping accuracy relative to existing methods when used to integrate variants across discovery approaches and individuals. Finally, we demonstrate that including a ‘variation-prior’ database containing already known variants significantly improves sensitivity.
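The k-mer idea in the abstract can be illustrated with a minimal Python sketch. It is not BayesTyper's implementation: it only counts exact matches between read k-mers and k-mers that are unique to each allele path through a toy variant site, and it ignores reverse complements, sequencing errors and the probabilistic model. The function names (kmers, allele_kmer_support), the sequences and k = 5 are assumptions made for illustration.

```python
from collections import Counter

def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def allele_kmer_support(alleles, reads, k):
    """Count exact read k-mer matches to k-mers unique to each allele path.

    alleles: dict mapping allele name -> sequence including flanks (toy input).
    reads:   list of read sequences.
    """
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    allele_sets = {name: set(kmers(seq, k)) for name, seq in alleles.items()}
    support = {}
    for name, kset in allele_sets.items():
        # k-mers shared with any other allele carry no information here.
        others = set().union(*(s for n, s in allele_sets.items() if n != name))
        support[name] = sum(read_counts[km] for km in kset - others)
    return support

# Toy site: reference allele versus a 3-nt deletion (GGG removed).
alleles = {
    "ref": "TTGACCAGGGTCATGCA",
    "del": "TTGACCATCATGCA",
}
reads = ["GACCATCATG"]  # a read spanning the deletion breakpoint
print(allele_kmer_support(alleles, reads, k=5))  # {'ref': 0, 'del': 4}
```

Because only exact k-mer lookups are needed, no read has to be aligned base by base to the graph, which is the efficiency argument the abstract makes.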


Fig. 1: BayesTyper.
Fig. 2: Comparison of genotyping methods on PG data (50×).
Fig. 3: Effect of using BayesTyper with a variation prior on structural variation calling performance on the PG (50×) and GoNL (13×) data sets.


Acknowledgements

This work was supported by grants from the Novo Nordisk Foundation (grant number NNF10SA1016550) to A.K. and Innovation Fund Denmark (grant number 019-2011-2). We thank the Platinum Genomes project and the Genome of the Netherlands consortium for providing access to their data.

Author information

J.A.S. designed and implemented the algorithm, performed the analyses and wrote the manuscript. L.M. designed and implemented the algorithm, performed the analyses and wrote the manuscript. A.K. designed the algorithm and wrote the manuscript.

Correspondence to Anders Krogh.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated Supplementary Information

Supplementary Figure 1 Comparison of homopolymer genotyping performance across methods on Platinum Genomes data (50×).

BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from 13 individuals in the Platinum Genomes pedigree using variants discovered by merging calls from four different methods as variant candidate input (Table 1). a,b, The number of called variant alleles (i.e., variant sensitivity) (a) and genotyping accuracy, estimated by validating genotypes using pedigree inheritance information (b), shown as a function of the reference homopolymer length.

Supplementary Figure 2 Comparison of genotyping methods on Platinum Genomes data (50×) against the Genome in a Bottle ‘ground-truth’ set for NA12878.

BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from 13 individuals in the Platinum Genomes pedigree using variants discovered by merging calls from four different methods as variant candidate input (Table 1). Variants longer than 50 nt were excluded from the analyses. a, Variant allele sensitivity (right) and precision (left) across variant classes; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. b, Sensitivity (two upper panels) and precision (two lower panels) for structural variants as a function of the net change in sequence length relative to the reference (5-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 3 Marginal contribution of discovery methods to the Platinum Genomes (50×) genotypes.

Variant discovery was conducted by merging calls from HaplotypeCaller, Platypus, FreeBayes and Manta across 13 individuals in the Platinum Genomes pedigree (Table 1). Plots show the absolute number of variants called by BayesTyper for each discovery method (top) and the marginal fraction of all called variants identified by each discovery method (bottom) for structural variants as a function of the net change in sequence length relative to the reference (50-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 4 Comparison of genotyping methods on simulated data (30×).

We simulated 30× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, SVTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). Values were aggregated across all ten individuals to provide a single estimate. a, Variant allele sensitivity (right) and precision (left) across variant classes; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. b, Sensitivity (two upper panels) and precision (two lower panels) for structural variants as a function of the net change in sequence length relative to the reference (50-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.
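The binned sensitivity and precision described in panel b can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark code used in the paper: alleles are compared by exact (chromosome, position, REF, ALT) identity, and the function name binned_sensitivity_precision and the tuple layout are hypothetical.

```python
from collections import defaultdict

def binned_sensitivity_precision(truth, called, bin_size=50):
    """Sensitivity and precision per net-length-change bin.

    truth, called: sets of (chrom, pos, ref, alt) tuples (assumed layout).
    Returns {bin_lower_edge: (sensitivity, precision)}; a value is None
    when its denominator is zero.
    """
    tp, fn, fp = defaultdict(int), defaultdict(int), defaultdict(int)
    for allele in truth | called:
        net = len(allele[3]) - len(allele[2])  # net change relative to reference
        if net == 0:
            continue  # e.g. SNVs: no net change in sequence length, omitted
        bin_edge = (net // bin_size) * bin_size  # floor to the bin's lower edge
        if allele in truth and allele in called:
            tp[bin_edge] += 1
        elif allele in truth:
            fn[bin_edge] += 1
        else:
            fp[bin_edge] += 1
    result = {}
    for b in sorted(set(tp) | set(fn) | set(fp)):
        n_truth, n_called = tp[b] + fn[b], tp[b] + fp[b]
        result[b] = (
            tp[b] / n_truth if n_truth else None,
            tp[b] / n_called if n_called else None,
        )
    return result

truth = {("1", 100, "A", "ATTT"), ("1", 200, "ACGT", "A"), ("1", 300, "A", "G")}
called = {("1", 100, "A", "ATTT"), ("1", 400, "A", "ACC"), ("1", 300, "A", "G")}
print(binned_sensitivity_precision(truth, called, bin_size=5))
```

Deletions have a negative net change and land in negative bins, insertions in non-negative bins, so one pass covers both sides of the length axis.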

Supplementary Figure 5 Comparison of indel genotyping performance across methods on simulated data (30×).

We simulated 30× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). Values were aggregated across all ten individuals to provide a single estimate. Plots show sensitivity (upper two panels) and precision (lower two panels) for indels as a function of the net change in sequence length relative to the reference (5-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 6 Receiver-operator curves on simulated data (30×).

We simulated 30× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). a,b, Receiver-operator curves were computed for genotype quality (posterior probability for BayesTyper) across all methods for each of the ten simulated individuals for SNVs (a) and non-SNVs (b). Triangles indicate the genotype quality threshold used in the benchmark.
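As a rough sketch of how such a curve can be traced, the Python snippet below sweeps a genotype-quality threshold from high to low and records the cumulative counts of incorrect and correct calls at each distinct threshold. The input layout, a list of (quality, is_correct) pairs, and the function name roc_points are assumptions for illustration; the paper's exact axes and normalization are not reproduced here.

```python
def roc_points(calls):
    """Trace a ROC-style curve over genotype-quality thresholds.

    calls: list of (quality, is_correct) pairs, one per genotype call.
    Returns a list of (threshold, n_false, n_true) tuples, where the counts
    include every call with quality >= threshold.
    """
    ordered = sorted(calls, key=lambda c: c[0], reverse=True)
    points = []
    n_true = n_false = 0
    i = 0
    while i < len(ordered):
        threshold = ordered[i][0]
        # Consume all calls tied at this quality before emitting a point.
        while i < len(ordered) and ordered[i][0] == threshold:
            if ordered[i][1]:
                n_true += 1
            else:
                n_false += 1
            i += 1
        points.append((threshold, n_false, n_true))
    return points

calls = [(0.99, True), (0.99, True), (0.9, False), (0.9, True), (0.5, False)]
print(roc_points(calls))  # [(0.99, 0, 2), (0.9, 1, 3), (0.5, 2, 3)]
```

A fixed benchmark threshold, such as the one marked by triangles in the figure, corresponds to a single point on this list.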

Supplementary Figure 7 Comparison of genotyping methods on Genome of the Netherlands data (13×).

BayesTyper, SVTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from 30 individuals (ten parent–offspring trios) from the Genome of the Netherlands (GoNL) project using variant calls for the entire GoNL cohort (n = 769) obtained from the GoNL project as variant candidate input (Table 1). The number of called variant alleles was used as a measure of sensitivity, whereas genotyping accuracy was estimated as the fraction of variants with no Mendelian errors across the ten trios. a, Sensitivity (right) and accuracy (left) across variant classes; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. b, Sensitivity (top, log scale) and accuracy (bottom) for structural variants as a function of the net change in sequence length relative to the reference (50-nt and 5-nt bins for the ±500-nt and ±50-nt scales, respectively). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.
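The accuracy measure used here, the fraction of variants with no Mendelian errors across trios, can be sketched in a few lines of Python. The sketch assumes diploid autosomal genotypes with no missing calls, encoded as pairs of allele indices; the function names and the input layout are hypothetical.

```python
def is_mendelian_consistent(child, father, mother):
    """True if the child can inherit one allele from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)

def fraction_without_mendelian_errors(variants):
    """Fraction of variants that are consistent in every trio.

    variants: list with one entry per variant; each entry is a list of
    (child, father, mother) genotype triples, one per trio.
    """
    if not variants:
        return None
    n_ok = sum(
        all(is_mendelian_consistent(*trio) for trio in trios)
        for trios in variants
    )
    return n_ok / len(variants)

variants = [
    # Variant 1: both trios consistent.
    [((0, 1), (0, 0), (1, 1)), ((0, 0), (0, 1), (0, 1))],
    # Variant 2: child 1/1 cannot inherit allele 1 from a 0/0 father.
    [((1, 1), (0, 0), (1, 1))],
]
print(fraction_without_mendelian_errors(variants))  # 0.5
```

Note that Mendelian consistency is a necessary but not sufficient condition for correctness: a trio in which every member is wrongly called homozygous reference still passes.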

Supplementary Figure 8 Comparison of homopolymer genotyping performance across methods on Genome of the Netherlands data (13×).

BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from 30 individuals (ten parent–offspring trios) from the Genome of the Netherlands (GoNL) project using variant calls for the entire GoNL cohort (n = 769) obtained from the GoNL project as variant candidate input (Table 1). a,b, The number of called variant alleles (i.e., variant sensitivity) (a) and genotyping accuracy, assessed as the fraction of variants with no Mendelian errors across the ten trios (b), shown as a function of the reference homopolymer length.

Supplementary Figure 9 Comparison of genotyping methods on simulated data (10×).

We simulated 10× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, SVTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). Values were aggregated across all ten individuals to provide a single estimate. a, Variant allele sensitivity (right) and precision (left) across variant classes; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. b, Sensitivity (upper two panels) and precision (lower two panels) for structural variants as a function of the net change in sequence length relative to the reference (50-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 10 Comparison of indel genotyping performance across methods on simulated data (10×).

We simulated 10× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as variant candidate input (Table 1). Values were aggregated across all ten individuals to provide a single estimate. Plots show sensitivity (upper two panels) and precision (lower two panels) for indel variants as a function of the net change in sequence length relative to the reference (5-nt bins). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 11 Receiver-operator curves on simulated data (10×).

We simulated 10× paired-end sequencing data for ten Yoruba individuals from Ibadan, Nigeria (YRI) based on their 1000 Genomes genotype estimates. BayesTyper, HaplotypeCaller, Platypus and FreeBayes were run on data from the ten simulated individuals using variants discovered by merging calls from four different methods as input (Table 1). a,b, Receiver-operator curves were computed for genotype quality (posterior probability for BayesTyper) across all methods for each of the ten simulated individuals for SNVs (a) and non-SNVs (b). Triangles indicate the genotype quality threshold used in the benchmark.

Supplementary Figure 12 Effect of using BayesTyper with a variation prior on genotyping performance across variant classes on the Platinum Genomes (50×) and Genome of the Netherlands (13×) datasets.

A ‘variation prior’ database was constructed by combining SNVs and structural variants from different databases and studies (Supplementary Table 1). BayesTyper was then run on variant candidates obtained by merging the variation prior with variants discovered using four different methods (Table 1). a, Sensitivity (right) and accuracy (left) of BayesTyper across variant classes on the Platinum Genomes (50×) data set with and without the variation prior; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. Genotyping accuracy was estimated by validating genotypes using pedigree inheritance information. b, Same analyses as in a when running BayesTyper on the Genome of the Netherlands data (13×), where genotyping accuracy was estimated as the fraction of variants with no Mendelian errors across the ten trios.

Supplementary Figure 13 Effect of changing the ‘maximum allele length’ threshold in BayesTyper on genotyping performance across variant classes on the Platinum Genomes data (50×).

BayesTyper was run on data from 13 individuals in the Platinum Genomes pedigree using variants discovered by merging calls from four different methods as variant candidate input (Table 1) and using two different thresholds for the maximum allele length (10,000 and 500,000 nt). The number of called variant alleles was used as a measure of sensitivity, whereas the genotyping accuracy was estimated by validating genotypes using pedigree inheritance information. a, Sensitivity (right) and accuracy (left) across variant classes; variants not classified as SNVs, insertions, deletions or inversions were labeled as complex. b, Sensitivity (top, log scale) and accuracy (bottom) for structural variants as a function of the net change in sequence length relative to the reference (50-nt and 5-nt bins for the ±500-nt and ±50-nt scales, respectively). Variant alleles that do not entail a net change in sequence length (e.g., SNVs) were omitted.

Supplementary Figure 14 Variant cluster group definition.

Variant clusters within the copied or deleted sequence of an upstream structural variant are dependent, as they will share k-mers. These clusters are therefore defined to belong to the same inference group. Their dependency structure is represented as a tree, where the colors correspond to different variant clusters (VC); green and purple triangles are deletions, and the red triangle is a copy number insertion.
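The grouping rule can be sketched in Python as below. This is a simplified illustration, not the BayesTyper implementation: each variant cluster is reduced to a half-open (start, end) span of affected reference sequence, nesting is decided by interval containment only, and copied sequence placed elsewhere in the genome is not modeled. The function name build_inference_groups and the cluster names are hypothetical.

```python
def build_inference_groups(clusters):
    """Group variant clusters nested inside an enclosing cluster's span.

    clusters: dict mapping cluster name -> (start, end) half-open span.
    Returns (parent, groups): parent maps each cluster to its innermost
    enclosing cluster (or None); groups maps each root cluster to the
    list of clusters in its dependency tree.
    """
    # Sort by start; for ties, the longer (enclosing) span comes first.
    order = sorted(clusters, key=lambda n: (clusters[n][0], -clusters[n][1]))
    parent = {}
    stack = []  # enclosing clusters still open, outermost first
    for name in order:
        end = clusters[name][1]
        # Drop clusters that end before this one does: they cannot contain it.
        while stack and clusters[stack[-1]][1] < end:
            stack.pop()
        parent[name] = stack[-1] if stack else None
        stack.append(name)
    groups = {}
    for name in order:
        root = name
        while parent[root] is not None:  # walk up to the tree root
            root = parent[root]
        groups.setdefault(root, []).append(name)
    return parent, groups

clusters = {
    "VC1": (100, 500),  # e.g. a large deletion
    "VC2": (150, 160),
    "VC3": (300, 320),
    "VC4": (600, 610),  # independent cluster
    "VC5": (305, 310),  # nested inside VC3, which is nested inside VC1
}
parent, groups = build_inference_groups(clusters)
print(groups)  # {'VC1': ['VC1', 'VC2', 'VC3', 'VC5'], 'VC4': ['VC4']}
```

Clusters in the same group share k-mers and are therefore inferred together, whereas separate groups can be processed independently.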

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–14, Supplementary Tables 1–4 and Supplementary Note

Reporting Summary


Cite this article

Sibbesen, J.A., Maretty, L. & Krogh, A. Accurate genotyping across variant classes and lengths using variant graphs. Nat Genet 50, 1054–1059 (2018). https://doi.org/10.1038/s41588-018-0145-5
