Abstract
Routine haplotype-resolved genome assembly from single samples remains an unresolved problem. Here we describe an algorithm that combines PacBio HiFi reads and Hi-C chromatin interaction data to produce a haplotype-resolved assembly without the sequencing of parents. Applied to human and other vertebrate samples, our algorithm consistently outperforms existing single-sample assembly pipelines and generates assemblies of similar quality to the best pedigree-based assemblies.
Data availability
Human reference genome: GRCh38; CHM13 genome: GCA_009914755.3; HG002 HiFi reads: SRR10382244, SRR10382245, SRR10382248 and SRR10382249; HG002 Hi-C reads: ‘HG002.HiC_1*.fastq.gz’ from https://github.com/human-pangenomics/HG002_Data_Freeze_v1.0; HG002 parental short reads: from the same HG002 data freeze; HG00733 HiFi reads: ERX3831682; HG00733 Hi-C reads: SRR11347815; HG00733 parental short reads: ERR3241754 for HG00731 (father) and ERR3241755 for HG00732 (mother); European badger: PRJEB46293; sterlet: PRJEB19273; South Island takahe: https://vgp.github.io/genomeark/Porphyrio_hochstetteri/; and black rhinoceros: https://vgp.github.io/genomeark/Diceros_bicornis/. All evaluated assemblies are available at https://zenodo.org/record/5948487 and https://zenodo.org/record/5953248.
Code availability
Hifiasm is available at https://github.com/chhylp123/hifiasm.
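As a minimal sketch of how a Hi-C-integrated run might be launched, the Python snippet below only builds the command line; it does not execute anything. The options `-o`, `-t`, `--h1` and `--h2` follow the hifiasm README at the repository above. The helper name `hifiasm_hic_command`, the thread count and all file names are hypothetical placeholders, not values taken from this study.

```python
import shlex
from typing import List


def hifiasm_hic_command(
    prefix: str,
    hifi_reads: List[str],
    hic_r1: str,
    hic_r2: str,
    threads: int = 32,
) -> List[str]:
    """Build an argv list for a hifiasm run that uses paired-end Hi-C reads.

    prefix     -- output prefix passed to -o
    hifi_reads -- one or more HiFi read files (positional arguments)
    hic_r1/r2  -- Hi-C read 1 and read 2 files, passed to --h1 and --h2
    """
    if not hifi_reads:
        raise ValueError("at least one HiFi read file is required")
    return [
        "hifiasm",
        "-o", prefix,
        "-t", str(threads),
        "--h1", hic_r1,
        "--h2", hic_r2,
        *hifi_reads,
    ]


if __name__ == "__main__":
    # Placeholder file names; substitute the downloaded read sets.
    cmd = hifiasm_hic_command(
        "sample.asm",
        ["sample.hifi.fq.gz"],
        "sample.hic_R1.fq.gz",
        "sample.hic_R2.fq.gz",
    )
    # Print a shell-safe command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Returning an argument list rather than a shell string avoids quoting problems when file names contain spaces or wildcards.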
Acknowledgements
This study was supported by the US National Institutes of Health (grants R01HG010040, U01HG010961, U01HG010971 and U41HG010972 to H.L.) and Howard Hughes Medical Institute funds to E.D.J. We thank members of the Vertebrate Genome Lab at The Rockefeller University and the Sanger genome team at the Sanger Institute for help with producing data for the non-human vertebrate species. Presentation and analyses of the completed reference genome assemblies will be reported on separately. We also thank the Human Pangenome Reference Consortium for making the HiFi and Hi-C data of HG002 and HG00733 publicly available. K.-P.K. thanks the International Rhino Foundation for providing funding to generate the black rhinoceros assembly (grant no. R-2018-1). The South Island takahe genome was funded by Revive and Restore and the University of Otago. The South Island takahe reference genome was created in direct collaboration with the Takahē Recovery Team (Department of Conservation, New Zealand) and Ngāi Tahu, the Māori kaitiaki (‘guardians’) of this taonga (‘treasured’) species. L.U. was supported by a Feodor Lynen Fellowship of the Alexander von Humboldt Foundation, the Revive and Restore Catalyst Science Fund and the University of Otago.
Author information
Contributions
H.C. and H.L. designed the algorithm, implemented hifiasm and drafted the manuscript. H.C. benchmarked hifiasm and other assemblers. E.D.J. and O.F. coordinated generation of the non-human vertebrate species data as part of the Vertebrate Genomes Project. K.-P.K. sponsored the black rhinoceros genome. L.U. obtained the South Island takahe samples, all necessary permits and funding for the South Island takahe reference genome. L.U. and N.G. sponsored the South Island takahe genome.
Ethics declarations
Competing interests
H.L. is a consultant of Integrated DNA Technologies and is on the Scientific Advisory Boards of Sentieon and Innozeen. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks Rayan Chikhi, David Rank, Riccardo Vicedomini and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Extended data
Extended Data Fig. 1 Chromosome-level phasing results for hifiasm (Hi-C) human assemblies.
All contigs were aligned to the T2T CHM13 reference and the Y chromosome of GRCh38, and then the corresponding regions of contigs on the reference were determined based on the alignment results. For each chromosome, the top track and the bottom track indicate haplotype 1 contigs and haplotype 2 contigs, respectively. The phase density of contigs was calculated by the parental short reads. Gray bars indicate centromeric regions. (a) Chromosome-level phasing results for HG002 with 30X HiFi and 30X Hi-C. (b) Chromosome-level phasing results for HG00733 with 30X HiFi and 30X Hi-C.
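The contig-to-reference placement described in this legend can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: it assumes the alignments are available in PAF format (the tab-separated format written by minimap2), places each contig on the reference sequence that receives the most aligned contig bases, and reports the span of those alignments. The function name `assign_contigs` and the `min_mapq` threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def assign_contigs(
    paf_lines: Iterable[str], min_mapq: int = 1
) -> Dict[str, Tuple[str, int, int]]:
    """Place each contig on the reference sequence with the most aligned bases.

    Returns {contig: (ref_name, ref_start, ref_end)}; the interval is the span
    of all retained alignments of that contig on the chosen reference sequence.
    Overlapping alignments are not merged, so aligned bases may be counted twice.
    """
    aligned = defaultdict(lambda: defaultdict(int))  # contig -> ref -> aligned bases
    span = {}  # (contig, ref) -> [min ref start, max ref end]
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # not a complete PAF record
        contig, qstart, qend = f[0], int(f[2]), int(f[3])
        ref, tstart, tend, mapq = f[5], int(f[7]), int(f[8]), int(f[11])
        if mapq < min_mapq:
            continue  # skip ambiguous placements (assumed threshold)
        aligned[contig][ref] += qend - qstart
        key = (contig, ref)
        if key in span:
            span[key][0] = min(span[key][0], tstart)
            span[key][1] = max(span[key][1], tend)
        else:
            span[key] = [tstart, tend]
    placement = {}
    for contig, per_ref in aligned.items():
        best = max(per_ref, key=per_ref.get)
        start, end = span[(contig, best)]
        placement[contig] = (best, start, end)
    return placement


if __name__ == "__main__":
    # Synthetic records for illustration only; contig and reference names are made up.
    demo = [
        "ctgA\t1000\t0\t600\t+\tchr1\t5000\t100\t700\t590\t600\t60",
        "ctgA\t1000\t600\t1000\t+\tchr1\t5000\t700\t1100\t395\t400\t60",
        "ctgA\t1000\t0\t200\t-\tchr2\t4000\t50\t250\t150\t200\t0",
    ]
    print(assign_contigs(demo))
```

Choosing the reference by total aligned bases is one simple rule; a chimeric or misjoined contig would need its alignments inspected per reference sequence rather than collapsed to a single span.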
Supplementary information
Supplementary Section 1, Tables 1–3 and Fig. 1.
About this article
Cite this article
Cheng, H., Jarvis, E.D., Fedrigo, O. et al. Haplotype-resolved assembly of diploid genomes without parental data. Nat Biotechnol 40, 1332–1335 (2022). https://doi.org/10.1038/s41587-022-01261-x