
Personalized pangenome references

Abstract

Pangenomes reduce reference bias by representing genetic diversity better than a single reference sequence. Yet when comparing a sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, causing false read mappings. These irrelevant variants are generally rarer in terms of allele frequency, and have previously been dealt with by filtering rare variants. However, this blunt heuristic both fails to remove some irrelevant variants and removes many relevant variants. We propose a new approach that imputes a personalized pangenome subgraph by sampling local haplotypes according to k-mer counts in the reads. We implement the approach in the vg toolkit (https://github.com/vgteam/vg) for the Giraffe short-read aligner and compare its accuracy to state-of-the-art methods using human pangenome graphs from the Human Pangenome Reference Consortium. The approach reduces small-variant genotyping errors fourfold relative to the Genome Analysis Toolkit and makes short-read structural variant genotyping of known variants competitive with long-read variant discovery methods.
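
The idea of sampling local haplotypes according to k-mer counts in the reads can be illustrated with a toy Python sketch. It is not the vg implementation: the function names, the +1/-1 scoring rule and the example sequences are all assumptions made for illustration, and a real pangenome block is a subgraph rather than a dictionary of strings. The sketch counts k-mers in the reads, scores each candidate haplotype in one block by how many of its distinct k-mers the reads support, and keeps the best-scoring ones.

```python
from collections import Counter


def kmers(seq, k):
    """Return every k-mer of seq, in order (with repeats)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_read_kmers(reads, k):
    """Count k-mer occurrences over all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def sample_haplotypes(block_haplotypes, read_counts, k, n):
    """Pick the n haplotypes of one block that best agree with the reads.

    Hypothetical scoring rule (an assumption, not taken from the article):
    +1 for each distinct haplotype k-mer seen in the reads, -1 otherwise.
    Ties are broken by haplotype name so the result is deterministic.
    """
    scored = []
    for name, seq in block_haplotypes.items():
        hap_kmers = set(kmers(seq, k))
        score = sum(1 if read_counts[km] > 0 else -1 for km in hap_kmers)
        scored.append((score, name))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:n]]


# Toy data: three made-up haplotypes in one block and two made-up reads.
block = {
    "hapA": "ACGTACGTTT",
    "hapB": "ACGTTCGTTT",
    "hapC": "ACGGACGTTT",
}
reads = ["ACGTACG", "GTACGTTT"]
read_counts = count_read_kmers(reads, k=4)
chosen = sample_haplotypes(block, read_counts, k=4, n=2)
print(chosen)
```

In this toy example the reads were drawn from hapA, so hapA is ranked first. Haplotypes carrying variants the reads do not support accumulate penalties and drop out, which is the intuition behind building a personalized subgraph instead of filtering by allele frequency.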


Fig. 1: Illustrating haplotype sampling at adjacent blocks in the pangenome.
Fig. 2: Mapping 30× NovaSeq reads for HG002 to GRCh38 (with BWA-MEM) and to HPRC graphs (with Giraffe).
Fig. 3: Small variants evaluation across samples HG001 to HG005.
Fig. 4: SVs benchmark evaluation.

Data availability

This work was done using publicly available data. HPRC v.1.1 graphs and VCF files for the variants included in them are available at https://github.com/human-pangenomics/hpp_pangenome_resources. The underlying assemblies, including GRCh38, can be found at https://github.com/human-pangenomics/HPP_Year1_Assemblies. We used Illumina and Element short reads for HG001, HG002, HG003, HG004 and HG005 available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free and https://console.cloud.google.com/storage/browser/brain-genomics-public/research/element/cloudbreak_wgs, respectively. The GIAB small variant benchmark sets for the same samples can be found at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/. The GIAB and challenging medically relevant gene SV sets for HG002 are available at the same location. The T2T assembly of HG002 is available at https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/hg002v0.9.fasta.gz. See Supplementary Section 1 for further details.

Code availability

The haplotype sampling approach described in this article is part of the vg toolkit available under MIT license at https://github.com/vgteam/vg. There is an example dataset in directory test/haplotype-sampling. Documentation can be found at https://github.com/vgteam/vg/wiki/Haplotype-Sampling. See Supplementary Sections 4 and 5 for details on other software used.


Acknowledgements

This work was supported in part by the National Human Genome Research Institute and the National Institutes of Health (NIH). B.P. was partly supported by NIH grant nos. R01HG010485, U24HG010262, U24HG011853, OT3HL142481, U01HG010961 and OT2OD033761. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Contributions

J.S. and B.P. conceived the method for haplotype sampling, and J.S. developed and implemented it. J.S., P.E., M.T.U. and M.K. performed the analyses shown in the paper. J.S., G.H., J.M.E., A.M.N., X.C. and J.M. contributed to the vg software on which the method is based and helped modify it for this work. P.-C.C. and A.C. trained and provided support on using DeepVariant for the paper. J.S., P.E., M.T.U. and B.P. wrote the paper. All authors reviewed and edited the draft.

Corresponding authors

Correspondence to Jouni Sirén or Benedict Paten.

Ethics declarations

Competing interests

P.-C.C. and A.C. are employees of Google LLC and own Alphabet stock as part of the standard compensation package. The other authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Rayan Chikhi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Lei Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Sections 1–5, Tables 1–4 and Figs. 1–4.

Reporting Summary

Peer Review File

Supplementary Tables 5–13

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sirén, J., Eskandar, P., Ungaro, M.T. et al. Personalized pangenome references. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02407-2

  • DOI: https://doi.org/10.1038/s41592-024-02407-2
