Abstract
Pangenomes reduce reference bias by representing genetic diversity better than a single reference sequence. Yet when comparing a sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, causing false read mappings. These irrelevant variants are generally rarer in terms of allele frequency, and have previously been dealt with by filtering rare variants. However, this blunt heuristic both fails to remove some irrelevant variants and removes many relevant variants. We propose a new approach that imputes a personalized pangenome subgraph by sampling local haplotypes according to k-mer counts in the reads. We implement the approach in the vg toolkit (https://github.com/vgteam/vg) for the Giraffe short-read aligner and compare its accuracy to state-of-the-art methods using human pangenome graphs from the Human Pangenome Reference Consortium. This reduces small variant genotyping errors by four times relative to the Genome Analysis Toolkit and makes short-read structural variant genotyping of known variants competitive with long-read variant discovery methods.
This is a preview of subscription content, access via your institution
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 print issues and online access
$259.00 per year
only $21.58 per issue
Buy this article
- Purchase on SpringerLink
- Instant access to full article PDF
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
Data availability
This work was done using publicly available data. HPRC v.1.1 graphs and VCF files for the variants included in them are available at https://github.com/human-pangenomics/hpp_pangenome_resources. The underlying assemblies, including GRCh38, can be found at https://github.com/human-pangenomics/HPP_Year1_Assemblies. We used Illumina and Element short reads for HG001, HG002, HG003, HG003 and HG005 available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free and https://console.cloud.google.com/storage/browser/brain-genomics-public/research/element/cloudbreak_wgs, respectively. The GIAB small variant benchmark sets for the same samples can be found at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/. GIAB and challenging medically relevant gene SV sets for HG002 is available at the same location. The T2T assembly of HG002 is available at https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/hg002v0.9.fasta.gz. See Supplementary Section 1 for further details.
Code availability
The haplotype sampling approach described in this article is part of the vg toolkit available under MIT license at https://github.com/vgteam/vg. There is an example dataset in directory test/haplotype-sampling. Documentation can be found at https://github.com/vgteam/vg/wiki/Haplotype-Sampling. See Supplementary Sections 4 and 5 for details on other software used.
References
Eizenga, J. M. et al. Pangenome graphs. Ann. Rev. Genomics Hum. Genet. 24, 139–162 (2020).
Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018).
Rautiainen, M. & Marschall, T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 21, 253 (2020).
Sirén, J. et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374, abg8871 (2021).
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–64 (2015).
Pritt, J., Chen, Nae-Chyun & Langmead, B. FORGe: prioritizing variants for graph genomes. Genome Biol. 19, 220 (2018).
Liao, Wen-Wei et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
Dilthey, A., Cox, C., Iqbal, Z., Nelson, M. R. & McVean, G. Improved genome inference in the MHC using a population reference graph. Nat. Genet. 47, 682–688 (2015).
Vaddadi, K., Mun, T. & Langmead, B. Minimizing reference bias with an impute-first approach. Preprint bioRxiv https://doi.org/10.1101/2023.11.30.568362 (2023).
Hickey, G. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01793-w (2023).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Hickey, G. et al. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol. 21, 35 (2020).
Ebler, J. et al. Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nat. Genet. 54, 518–525 (2022).
Human Pangenome Reference Consortium. HPRC Pangenome Resources. GitHub https://github.com/human-pangenomics/hpp_pangenome_resources/ (2023).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint https://arxiv.org/abs/1303.3997 (2013).
Kokot, M., Długosz, M. & Deorowicz, S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761 (2017).
Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2, 100128 (2022).
Baid, G. et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. Preprint at bioRxiv https://doi.org/10.1101/2020.12.11.422022 (2020).
Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv https://doi.org/10.1101/201178 (2018).
Carroll, A. et al. Accurate human genome analysis with Element Avidity sequencing. Preprint at bioRxiv https://doi.org/10.1101/2023.08.11.553043 (2023).
Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 40, 672–680 (2022).
Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).
Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).
Kolmogorov, M. et al. Scalable nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation. Nat. Methods 20, 1483–1492 (2023).
Marchet, C. et al. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res. 31, 1–12 (2021).
Feuk, L., Carson, A. R. & Scherer, S. W. Structural variation in the human genome. Nat. Rev. Genet. 7, 85–97 (2006).
Rausch, T. et al. Delly: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333–i339 (2012).
Mohiyuddin, M. et al. Metasv: an accurate and integrative structural-variant caller for next generation sequencing. Bioinformatics 31, 2741–2744 (2015).
Chen, X. et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 32, 1220–1222 (2016).
Fang, H. et al. Indel variant analysis of short-read sequencing data with scalpel. Nat. Protoc. 11, 2529–2548 (2016).
Wala, J. A. et al. Svaba: genome-wide detection of structural variants and indels by local assembly. Genome Res. 28, 581–591 (2018).
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
Jiang, T. et al. Long-read-based human genomic structural variation detection with cutesv. Genome Biol. 21, 189 (2020).
Smolka, M., Paulin, L.F., Grochowski, C.M. et al. Detection of mosaic and population-level structural variants with Sniffles2. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-02024-y (2024).
Sirén, J. & Paten, B. GBZ file format for pangenome graphs. Bioinformatics 38, 5012–5018 (2022).
Sirén, J., Garrison, E., Novak, A. M., Paten, B. & Durbin, R. Haplotype-aware graph indexes. Bioinformatics 36, 400–407 (2020).
Paten, B. et al. Superbubbles, ultrabubbles, and cacti. J. Comput. Biol. 25, 649–663 (2018).
Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
Chang, X., Eizenga, J., Novak, A. M., Sirén, J. & Paten, B. Distance indexing and seed clustering in sequence graphs. Bioinformatics 36, i146–i153 (2020).
Gagie, T., Navarro, G. & Prezza, N. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. Assoc. Comput. Mach. 67, 2 (2020).
Dufresne, Y. et al. The k-mer file format: a standardized and compact disk representation of sets of k-mers. Bioinformatics 38, 4423–4425 (2022).
Acknowledgements
This work was supported in part by the National Human Genome Research Institute and the National Institutes of Health (NIH). B.P. was partly supported by NIH grant nos. R01HG010485, U24HG010262, U24HG011853, OT3HL142481, U01HG010961 and OT2OD033761. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Authors and Affiliations
Contributions
J.S. and B.P. conceived the method for haplotype sampling, and J.S. developed and implemented it. J.S., P.E., M.T.U. and M.K. performed the analyses shown in the paper. J.S., G.H., J.M.E., A.M.N., X.C. and J.M. contributed to the vg software on which the method is based and helped modify it for this work. P.-C.C. and A.C. trained and provided support on using DeepVariant for the paper. J.S., P.E., M.T.U. and B.P. wrote the paper. All authors reviewed and edited the draft.
Corresponding authors
Ethics declarations
Competing interests
P.-C.C. and A.C. are employees of Google LLC and own Alphabet stock as part of the standard compensation package. The other authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Rayan Chikhi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Lei Tang, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Sections 1–5, Tables 1–4 and Figs. 1–4.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sirén, J., Eskandar, P., Ungaro, M.T. et al. Personalized pangenome references. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02407-2
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41592-024-02407-2