  • Brief Communication

Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph

Abstract

Despite advances in long-read sequencing technologies, constructing a near telomere-to-telomere assembly is still computationally demanding. Here we present hifiasm (UL), an efficient de novo assembly algorithm that combines multiple sequencing technologies to scale up population-wide near telomere-to-telomere assemblies. Applied to 22 human and two plant genomes, our algorithm produces better diploid assemblies at a cost an order of magnitude lower than that of existing methods, and it also works with polyploid genomes.

Fig. 1: Hybrid assembly with PacBio HiFi and ONT ultra-long reads.
Fig. 2: Statistics of different assemblies.

Data availability

Human reference genomes: GRCh38 and CHM13v2; HiFi reads of HPRC year 2 samples except HG002: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/1E2DD570-3B26-418B-B50F-5417F64C5679--HIFI_DEEPCONSENSUS/; ONT ultra-long reads of HPRC year 2 samples except HG002 (R9.4.1 flow cells): https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/90A1F283-2752-438B-917F-53AE76C9C43E--UCSC_HPRC_nanopore_Year2/; Hi-C reads of HPRC year 2 samples except HG002: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/4C696EB9-9AD2-47A2-8011-2F43977CC4E0--Y2-HIC/; parental short reads of HPRC year 2 samples except HG002: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/AD30A684-C7A8-4D24-89B2-040DFF021B0C--Y2_1000G_DATA/; HG002 HiFi reads: https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_NA24385_son_v1/PacBio_HiFi/20kb/m64011_190830_220126.Q20.fastq, https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_NA24385_son_v1/PacBio_HiFi/20kb/m64011_190901_095311.Q20.fastq, https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_NA24385_son_v1/PacBio_HiFi/15kb/m64012_190920_173625.Q20.fastq, https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_NA24385_son_v1/PacBio_HiFi/15kb/m64012_190921_234837.Q20.fastq; HG002 ultra-long reads (R9.4.1 flow cells, pass reads only): https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/nanopore/ultra-long/03_08_22_R941_HG002_1_Guppy_6.1.2_5mc_cg_prom_sup.tar, https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/nanopore/ultra-long/03_08_22_R941_HG002_2_Guppy_6.1.2_5mc_cg_prom_sup.tar, https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/nanopore/ultra-long/03_08_22_R941_HG002_3_Guppy_6.1.2_5mc_cg_prom_sup.tar; HG002 parental short reads: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=working/HPRC_PLUS/HG002/raw_data/Illumina/parents/; HG002 Hi-C reads: ‘HG002.HiC_1*’ from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=working/HPRC_PLUS/HG002/raw_data/hic/downsampled/. All reads of HPRC year 1 samples (R9.4.1 flow cells for ONT reads): https://github.com/human-pangenomics/HPP_Year1_Data_Freeze_v1.0. All reads of Arabidopsis (R9.4.1 flow cells for ONT reads): https://ngdc.cncb.ac.cn/search/?dbId=gsa&q=CRA004538. All reads of potato (R9.4.1 flow cells for ONT ultra-long reads): https://ngdc.cncb.ac.cn/gsa/browse/CRA006012. Hifiasm (UL) assemblies of HPRC year 2 samples: ‘*hifiasm_v0.19.5*’ from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/53FEE631-4264-4627-8FB6-09D7364F4D3B--ASM-COMP/. Verkko assemblies of HPRC year 2 samples: ‘*verkko_1.3.1*’ from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/53FEE631-4264-4627-8FB6-09D7364F4D3B--ASM-COMP/. All evaluated HPRC year 1 and plant assemblies are available at https://zenodo.org/record/7996422 (ref. 25) and https://zenodo.org/record/7962930 (ref. 26), respectively.

Code availability

Hifiasm (UL) along with its source code is freely available at https://github.com/chhylp123/hifiasm.

References

  1. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

  2. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).

  3. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).

  4. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).

  5. Luo, X., Kang, X. & Schönhuth, A. phasebook: haplotype-aware de novo assembly of diploid genomes from long reads. Genome Biol. 22, 299 (2021).

  6. Porubsky, D. et al. Gaps and complex structurally variant loci in phased genome assemblies. Genome Res. 33, 496–510 (2023).

  7. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).

  8. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).

  9. Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat. Biotechnol. 41, 1474–1482 (2023).

  10. Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D. & Pevzner, P. A. Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nat. Biotechnol. 40, 1075–1081 (2022).

  11. Rautiainen, M. & Marschall, T. MBG: minimizer-based sparse de Bruijn graph construction. Bioinformatics 37, 2476–2478 (2021).

  12. Myers, E. W. The fragment assembly string graph. Bioinformatics 21, ii79–ii85 (2005).

  13. Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).

  14. Lorig-Roach, R. et al. Phased nanopore assembly with Shasta and modular graph phasing with GFAse. Genome Res. https://genome.cshlp.org/content/early/2024/04/16/gr.278268.123 (2024).

  15. Wang, B. et al. High-quality Arabidopsis thaliana genome assembly with nanopore and HiFi long reads. Genomics, Proteom. Bioinforma. 20, 4–13 (2022).

  16. Naish, M. et al. The genetic and epigenetic landscape of the Arabidopsis centromeres. Science 374, eabi7489 (2021).

  17. Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).

  18. Bao, Z. et al. Genome architecture and tetrasomic inheritance of autotetraploid potato. Mol. Plant 15, 1211–1226 (2022).

  19. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  20. Cheng, H. et al. Haplotype-resolved assembly of diploid genomes without parental data. Nat. Biotechnol. 40, 1332–1335 (2022).

  21. Jain, C. Coverage-preserving sparsification of overlap graphs for long-read assembly. Bioinformatics 39, btad124 (2023).

  22. Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21, 265 (2020).

  23. Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).

  24. Martin, M. et al. WhatsHap: fast and accurate read-based phasing. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/085050v2 (2016).

  25. Cheng, H., Asri, M., Lucas, J., Koren, S. & Li, H. HPRC Y1 assemblies (HiFi + UL) evaluated in the hifiasm (UL) paper. Zenodo https://doi.org/10.5281/zenodo.7996421 (2023).

  26. Cheng, H., Asri, M., Lucas, J., Koren, S. & Li, H. Plant assemblies evaluated in the hifiasm (UL) paper. Zenodo https://doi.org/10.5281/zenodo.7962929 (2023).

Acknowledgements

This study was supported by the US National Institutes of Health (grant nos. R01HG010040, U01HG010971 and U41HG010972 to H.L. and K99HG012798 to H.C.). We thank the HPRC for making the year 1 and year 2 datasets publicly available.

Author information

Contributions

H.C. and H.L. designed the algorithm, implemented hifiasm (UL) and drafted the paper. H.C. benchmarked hifiasm (UL) and other assemblers. M.A., J.L. and S.K. designed the evaluation of human genome assemblies.

Corresponding author

Correspondence to Heng Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Derek Bickhart and Antoine Limasset for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Accurate HiFi string graph combining PacBio HiFi and ONT ultra-long reads.

(a) Effect of contained reads in the string graph. Rectangles in orange and blue represent heterozygous HiFi reads from haplotype 1 and haplotype 2, respectively. Green rectangles are HiFi reads originating from homozygous regions, whereas red rectangles are contained reads. The string graph is constructed using all reads, except for two contained reads. (b) Hifiasm (UL) aligns ultra-long reads to a HiFi string graph that keeps contained reads, which alleviates the contained-read problem. The alignment paths of ultra-long reads from haplotype 1 and haplotype 2 are represented by orange and blue lines, respectively. Although h12 is a contained read, it is retained as a critical read because it is covered by ultra-long reads u6 and u7. To ensure accurate graph cleaning, hifiasm (UL) also tracks the number of ultra-long reads that support each edge and uses this count as the edge weight. For instance, the edge weight between h5 and h8 is 2 because ultra-long reads u4 and u5 cover it.
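
The bookkeeping described in panel (b) can be illustrated with a minimal Python sketch. It is not hifiasm's implementation: the alignment paths are hypothetical (the figure is not reproduced here), edges are treated as undirected pairs of HiFi reads for simplicity, and the function names `edge_weights` and `rescued_contained` are introduced only for illustration.

```python
from collections import Counter


def edge_weights(ul_paths):
    """Count how many ultra-long reads traverse each edge between HiFi reads."""
    weights = Counter()
    for path in ul_paths.values():
        # A set ensures that one ultra-long read supports a given edge at most once.
        edges = {frozenset(pair) for pair in zip(path, path[1:])}
        weights.update(edges)
    return weights


def rescued_contained(ul_paths, contained):
    """Return the contained HiFi reads that lie on at least one ultra-long path."""
    on_path = {node for path in ul_paths.values() for node in path}
    return contained & on_path


# Hypothetical alignment paths: each ultra-long read maps to an ordered
# list of HiFi reads in the string graph (assumed, not taken from the figure).
ul_paths = {
    "u4": ["h3", "h5", "h8"],
    "u5": ["h5", "h8", "h9"],
    "u6": ["h10", "h12", "h13"],
    "u7": ["h12", "h13"],
}

weights = edge_weights(ul_paths)
print(weights[frozenset(("h5", "h8"))])             # 2: supported by u4 and u5
print(rescued_contained(ul_paths, {"h12", "h4"}))   # {'h12'}: covered by u6 and u7
```

In this toy input, the edge between h5 and h8 receives weight 2 and the contained read h12 is kept, matching the two cases named in the caption; a real string graph is bidirected and would need orientation-aware edges.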

Supplementary information

Supplementary Information

Supplementary Tables 1–5, Fig. 1 and text.

Reporting Summary

Peer Review File

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cheng, H., Asri, M., Lucas, J. et al. Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02269-8

  • DOI: https://doi.org/10.1038/s41592-024-02269-8
