Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm

Abstract

Haplotype-resolved de novo assembly is the ultimate solution to the study of sequence variations in a genome. However, existing algorithms either collapse heterozygous alleles into one consensus copy or fail to cleanly separate the haplotypes to produce high-quality phased assemblies. Here we describe hifiasm, a de novo assembler that takes advantage of long high-fidelity sequence reads to faithfully represent the haplotype information in a phased assembly graph. Unlike other graph-based assemblers that only aim to maintain the contiguity of one haplotype, hifiasm strives to preserve the contiguity of all haplotypes. This feature enables the development of a graph trio binning algorithm that greatly advances over standard trio binning. On three human and five nonhuman datasets, including California redwood with a ~30-Gb hexaploid genome, we show that hifiasm frequently delivers better assemblies than existing tools and consistently outperforms others on haplotype-resolved assembly.
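For orientation, the standard trio binning that the abstract uses as its baseline (ref. 11) can be sketched as follows: k-mers seen in only one parent are used to assign each child read to a haplotype before assembly. This is a minimal illustration of that read-level idea, not hifiasm's graph trio binning, which per the abstract works on the phased assembly graph. The function names, the toy sequences and the value of k are assumptions made for the example; real tools also use canonical k-mers and frequency filters, which are omitted here.

```python
def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def parent_specific_kmers(paternal_reads, maternal_reads, k):
    """Return (paternal-only, maternal-only) k-mer sets.

    Hypothetical helper: no count threshold or canonical k-mers are applied.
    """
    pat = {km for read in paternal_reads for km in kmers(read, k)}
    mat = {km for read in maternal_reads for km in kmers(read, k)}
    return pat - mat, mat - pat


def bin_read(read, pat_only, mat_only, k):
    """Assign a child read by majority vote of parent-specific k-mers."""
    p = sum(km in pat_only for km in kmers(read, k))
    m = sum(km in mat_only for km in kmers(read, k))
    if p > m:
        return "paternal"
    if m > p:
        return "maternal"
    return "ambiguous"  # no informative k-mers, or a tie


# Toy data: the parents differ at a single base (G vs C).
pat_only, mat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTACCTAA"], k=4)
for child_read in ["GTACGTAA", "GTACCTAA", "ACGTA"]:
    print(child_read, bin_read(child_read, pat_only, mat_only, k=4))
```

In this scheme a read that is assigned to the wrong parent stays there, because the assignment is made once per read before assembly.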

Fig. 1: Outline of the hifiasm algorithm.
Fig. 2: Effect of false read binning.

Data availability

All HiFi data were obtained from the NCBI Sequence Read Archive: SRR11606869 for Z. mays, SRR11606870 for M. musculus, SRR11606867 for F.×ananassa, SRR11606868 and SRR12048570 for R. muscosa, SRP251156 for S. sempervirens, SRR11292120–SRR11292123 for CHM13, ERX3831682 for HG00733, and four runs (SRR10382244, SRR10382245, SRR10382248 and SRR10382249) for HG002. For trio binning and computing QV, short reads were also downloaded: SRR7782677 for HG00733, ERR3241754 for HG00731 (father), ERR3241755 for HG00732 (mother) and SRX1082031 for CHM13. GIAB’s ‘homogeneity Run01’ short-read runs were used for the HG002 trio. These HG002 reads were downsampled to 30-fold coverage. The BAC libraries of CHM13 and HG00733 can be found at https://www.ncbi.nlm.nih.gov/nuccore/?term=VMRC59+and+complete/ and https://www.ncbi.nlm.nih.gov/nuccore/?term=VMRC62+and+complete/, respectively. The HG002 major histocompatibility complex reference sequences can be found at https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/MHC/assembly/MHCv1.1/ (ref. 26). For BUSCO, the Embryophyta, Tetrapoda and Mammalia datasets are available at https://busco-data.ezlab.org/v4/data/lineages/embryophyta_odb10.2020-09-10.tar.gz, https://busco.ezlab.org/v2/datasets/tetrapoda_odb9.tar.gz and https://busco.ezlab.org/v2/datasets/mammalia_odb9.tar.gz, respectively. The CHM13 reference (v0.9) generated by the T2T consortium can be found at https://s3.amazonaws.com/nanopore-human-wgs/chm13/assemblies/chm13.draft_v0.9.fasta.gz. The hifiasm assemblies produced in this work are available at https://zenodo.org/record/4393631 and https://zenodo.org/record/4393750.
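The statement does not say how the HG002 short reads were downsampled to 30-fold coverage. As a rough illustration of the arithmetic only, the fraction of reads to keep can be derived from the total sequenced bases and an assumed genome size. The function name and the numbers below are hypothetical and are not taken from the paper.

```python
def downsample_fraction(total_bases, genome_size, target_coverage):
    """Fraction of reads to keep so that mean coverage is about target_coverage.

    Coverage is approximated as total sequenced bases divided by genome size.
    """
    current_coverage = total_bases / genome_size
    if current_coverage <= target_coverage:
        return 1.0  # already at or below the target: keep everything
    return target_coverage / current_coverage


# Hypothetical numbers: 186 Gb of reads over an assumed 3.1-Gb genome (60-fold).
fraction = downsample_fraction(186_000_000_000, 3_100_000_000, 30)
print(f"keep {fraction:.2%} of reads")
```

The resulting fraction would then be passed to whichever read-sampling tool is in use.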

Code availability

Hifiasm code is available at https://github.com/chhylp123/hifiasm/.

References

  1. Chin, C.-S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).

  2. Berlin, K. et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol. 33, 623–630 (2015).

  3. Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).

  4. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).

  5. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).

  6. Chin, C.-S. & Khalak, A. Human genome assembly in 100 minutes. Preprint at bioRxiv https://doi.org/10.1101/705616 (2019).

  7. Ruan, J. & Li, H. Fast and accurate long-read assembly with wtdbg2. Nat. Methods 17, 155–158 (2020).

  8. Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).

  9. Chen, Y. et al. Efficient assembly of nanopore reads via highly accurate and intact error correction. Nat. Commun. 12, 60 (2021).

  10. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).

  11. Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. 36, 1174–1182 (2018).

  12. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).

  13. Garg, S. et al. Chromosome-scale, haplotype-resolved assembly of human genomes. Nat. Biotechnol. https://doi.org/10.1038/s41587-020-0711-0 (2020).

  14. Porubsky, D. et al. Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nat. Biotechnol. https://doi.org/10.1038/s41587-020-0719-5 (2020).

  15. Martin, M. et al. WhatsHap: fast and accurate read-based phasing. Preprint at bioRxiv https://doi.org/10.1101/085050 (2016).

  16. Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).

  17. Li, H. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics 28, 1838–1844 (2012).

  18. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).

  19. Myers, E. W. The fragment assembly string graph. Bioinformatics 21, ii79–ii85 (2005).

  20. Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36, 2896–2898 (2020).

  21. Hon, T. et al. Highly accurate long-read HiFi sequencing data for five complex genomes. Sci. Data 7, 399 (2020).

  22. Edger, P. P. et al. Origin and evolution of the octoploid strawberry genome. Nat. Genet. 51, 541–547 (2019).

  23. Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).

  24. Hizume, M., Kondo, T., Shibata, F. & Ishizuka, R. Flow cytometric determination of genome size in the Taxodiaceae, Cupressaceae sensu stricto and Sciadopityaceae. Cytologia 66, 307–311 (2001).

  25. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).

  26. Chin, C.-S. et al. A diploid assembly-based benchmark for variants in the major histocompatibility complex. Nat. Commun. 11, 4794 (2020).

  27. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  28. Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21, 265 (2020).

  29. Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).

  30. Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 37, 561–566 (2019).

  31. Chaisson, M. J. P. et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1784 (2019).

  32. Cleary, J. G. et al. Comparing variant call files for performance benchmarking of next-generation sequencing variant calling pipelines. Preprint at bioRxiv https://doi.org/10.1101/023754 (2015).

  33. Myers, G. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46, 395–415 (1999).

  34. Cheng, H., Jiang, H., Yang, J., Xu, Y. & Shang, Y. BitMapper: an efficient all-mapper based on bit-vector computing. BMC Bioinformatics 16, 192 (2015).

  35. Miller, J. R. et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24, 2818–2824 (2008).

Acknowledgements

This study was supported by grants from the US National Institutes of Health (R01HG010040, U01HG010961 and U41HG010972 to H.L.).

Author information

Contributions

H.C. and H.L. designed the algorithm, implemented hifiasm and drafted the manuscript. H.C. benchmarked hifiasm and other assemblers. G.T.C. ran hifiasm for S. sempervirens, HiCanu for R. muscosa, Peregrine for S. sempervirens and R. muscosa, and Falcon-Unzip for all datasets. X.F. helped with evaluation of the manuscript. H.Z. provided valuable suggestions for error correction and ran BUSCO.

Corresponding author

Correspondence to Heng Li.

Ethics declarations

Competing interests

G.T.C. is an employee of PacBio. H.L. is a consultant of Integrated DNA Technologies and on the Scientific Advisory Boards of Sentieon, BGI and OrigiMed.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Peer review information Nature Methods thanks Benedict Paten and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Supplementary information

Supplementary Information

Supplementary software commands, Supplementary Tables 1–10 and Supplementary Fig. 1.

Reporting Summary

About this article

Cite this article

Cheng, H., Concepcion, G.T., Feng, X. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175 (2021). https://doi.org/10.1038/s41592-020-01056-5

Further reading

  • Haplotype-resolved diverse human genomes and integrated analysis of structural variation

    Peter Ebert, Peter A. Audano, Qihui Zhu, Bernardo Rodriguez-Martin, David Porubsky, Marc Jan Bonder, Arvis Sulovari, Jana Ebler, Weichen Zhou, Rebecca Serra Mari, Feyza Yilmaz, Xuefang Zhao, PingHsun Hsieh, Joyce Lee, Sushant Kumar, Jiadong Lin, Tobias Rausch, Yu Chen, Jingwen Ren, Martin Santamarina, Wolfram Höps, Hufsah Ashraf, Nelson T. Chuang, Xiaofei Yang, Katherine M. Munson, Alexandra P. Lewis, Susan Fairley, Luke J. Tallon, Wayne E. Clarke, Anna O. Basile, Marta Byrska-Bishop, André Corvelo, Uday S. Evani, Tsung-Yu Lu, Mark J.P. Chaisson, Junjie Chen, Chong Li, Harrison Brand, Aaron M. Wenger, Maryam Ghareghani, William T. Harvey, Benjamin Raeder, Patrick Hasenfeld, Allison A. Regier, Haley J. Abel, Ira M. Hall, Paul Flicek, Oliver Stegle, Mark B. Gerstein, Jose M.C. Tubio, Zepeng Mu, Yang I. Li, Xinghua Shi, Alex R. Hastie, Kai Ye, Zechen Chong, Ashley D. Sanders, Michael C. Zody, Michael E. Talkowski, Ryan E. Mills, Scott E. Devine, Charles Lee, Jan O. Korbel, Tobias Marschall & Evan E. Eichler

    Science (2021)
