
Pangenome graph construction from genome alignments with Minigraph-Cactus

Abstract

Pangenome references address biases of reference genomes by storing a representative set of diverse haplotypes and their alignment, usually as a graph. Alternate alleles determined by variant callers can be used to construct pangenome graphs, but advances in long-read sequencing are leading to widely available, high-quality phased assemblies. Constructing a pangenome graph directly from assemblies, as opposed to variant calls, leverages the graph’s ability to represent variation at different scales. Here we present the Minigraph-Cactus pangenome pipeline, which creates pangenomes directly from whole-genome alignments, and demonstrate its ability to scale to 90 human haplotypes from the Human Pangenome Reference Consortium. The method builds graphs containing all forms of genetic variation while allowing use of current mapping and genotyping tools. We measure the effect of the quality and completeness of reference genomes used for analysis within the pangenomes and show that using the CHM13 reference from the Telomere-to-Telomere Consortium improves the accuracy of our methods. We also demonstrate construction of a Drosophila melanogaster pangenome.
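To make the idea of storing haplotypes "as a graph" concrete, the following is a minimal sketch of a toy sequence graph in which nodes hold DNA segments and each haplotype is a path through them. It is an illustration only and not the Minigraph-Cactus algorithm: the class `ToyPangenomeGraph`, its methods, and the example sequences are all hypothetical, and only forward-strand traversals are modelled. The export writes a small subset of GFA 1.0 (`H`, `S`, `L` and `P` records).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyPangenomeGraph:
    """Toy sequence graph: nodes hold DNA segments, paths are haplotypes.

    Hypothetical illustration; forward-strand traversals only.
    """

    segments: dict[str, str] = field(default_factory=dict)
    edges: set[tuple[str, str]] = field(default_factory=set)
    paths: dict[str, list[str]] = field(default_factory=dict)

    def add_haplotype(self, name: str, walk: list[tuple[str, str]]) -> None:
        """Thread a haplotype through the graph as (node_id, sequence) steps."""
        for node_id, seq in walk:
            # A node shared by several haplotypes must carry one sequence.
            if self.segments.setdefault(node_id, seq) != seq:
                raise ValueError(f"conflicting sequence for node {node_id}")
        ids = [node_id for node_id, _ in walk]
        # Consecutive nodes on the walk become directed edges.
        self.edges.update(zip(ids, ids[1:]))
        self.paths[name] = ids

    def sequence(self, name: str) -> str:
        """Spell out a haplotype by concatenating the segments on its path."""
        return "".join(self.segments[n] for n in self.paths[name])

    def to_gfa(self) -> str:
        """Serialize to a small subset of GFA 1.0."""
        lines = ["H\tVN:Z:1.0"]
        for node_id, seq in self.segments.items():
            lines.append(f"S\t{node_id}\t{seq}")
        for src, dst in sorted(self.edges):
            lines.append(f"L\t{src}\t+\t{dst}\t+\t0M")
        for name, ids in self.paths.items():
            steps = ",".join(f"{n}+" for n in ids)
            lines.append(f"P\t{name}\t{steps}\t*")
        return "\n".join(lines) + "\n"


def build_example() -> ToyPangenomeGraph:
    """Two made-up haplotypes that differ by one SNP and one insertion."""
    g = ToyPangenomeGraph()
    g.add_haplotype(
        "ref",
        [("s1", "ACGT"), ("s2", "A"), ("s4", "TTGA"), ("s6", "GG")],
    )
    g.add_haplotype(
        "alt",
        [("s1", "ACGT"), ("s3", "G"), ("s4", "TTGA"), ("s5", "CCCCC"), ("s6", "GG")],
    )
    return g


graph = build_example()
print(graph.sequence("ref"))
print(graph.sequence("alt"))
print(graph.to_gfa(), end="")
```

In this toy graph the SNP appears as a one-base bubble (`s2` versus `s3`) and the insertion as an optional node (`s5`), so variation at both scales sits in the same structure and each input haplotype can be recovered exactly from its path.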


Fig. 1: Minigraph-Cactus pangenome construction.
Fig. 2: Evaluating GRCh38-based and T2T-CHM13-based human pangenomes.
Fig. 3: Comparing pangenome SV genotyping.
Fig. 4: A D. melanogaster pangenome.

Data availability

All data, software versions and commands are available at https://github.com/ComparativeGenomicsToolkit/cactus/tree/master/doc/mc-paper.

HPRC graphs can be downloaded from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=pangenomes/freeze/freeze1/minigraph-cactus/. Consult the Data Portal for explanations of the different files: https://github.com/human-pangenomics/hpp_pangenome_resources/. Variant calls can be downloaded from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/mc_2022/hprc-human/. SV genotyping results are available at https://doi.org/10.5281/zenodo.7669083. D. melanogaster graphs can be downloaded from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/mc_2022/mc_pangenomes/16-fruitfly-mc-2022-05-26/. Consult the Data Portal for explanations of the different files: https://github.com/ComparativeGenomicsToolkit/cactus/tree/master/doc/mc-pangenomes. D. melanogaster mapping and calling results can be downloaded from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/mc_2022/fruitfly/.

Code availability

All source code for the Minigraph-Cactus pangenome pipeline, as well as release binaries, Docker images and user manuals, can be found at https://github.com/ComparativeGenomicsToolkit/cactus.


Acknowledgements

We thank A. D. Long for many suggestions and insights regarding the D. melanogaster data and the whole vg team for their work to create and maintain vg, upon which much of this work depends. B.P., A.N., J.M.E. and J.M. were partly supported by National Institutes of Health (NIH) grants R01HG010485, U24HG010262, U24HG011853, OT3HL142481, U01HG010961 (with H.L.) and OT2OD033761. H.L. was partly supported by NIH grant R01HG010040 and T.M. by U01HG010973. Computational infrastructure and support for running PanGenie were provided by the Centre for Information and Media Technology at Heinrich Heine University Düsseldorf.

Author information

Contributions

G.H., J.M., H.L. and B.P. designed the method. G.H., J.M. and J.E. contributed to the results and analysis. G.H., J.M., A.N., J.E. and B.P. wrote the manuscript. All authors contributed to the software. B.P. led the project.

Corresponding authors

Correspondence to Glenn Hickey or Benedict Paten.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–20 and Supplementary Tables 1–6.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hickey, G., Monlong, J., Ebler, J. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat Biotechnol 42, 663–673 (2024). https://doi.org/10.1038/s41587-023-01793-w

  • DOI: https://doi.org/10.1038/s41587-023-01793-w
