Main

Humans contain two homologous copies of every chromosome, and deriving the genome sequence of each copy is essential to correctly understand allele-specific DNA methylation and gene expression, and to analyze evolution, forensics and genetic diseases1. However, traditional de novo assembly algorithms that reconstruct genome sequences often represent the sample as a haploid genome. For a diploid genome such as the human genome, this collapsed representation results in the loss of half of the heterozygous variation in the genome, may introduce assembly errors in regions diverged between haplotypes and may lead to inflated assembly for species with high heterozygosity2. Several algorithms have been proposed to generate haplotype-resolved assemblies, also known as phased assemblies. Early efforts such as FALCON-Unzip3, Supernova4 and our previous work5 used relatively short-range sequence data for phasing and can resolve haplotypes only up to several megabases for human samples. These methods are unable to phase through centromeres or long repeats. FALCON-Phase6, which extends FALCON-Unzip, uses Hi-C to connect phased sequence blocks and can generate longer haplotypes, but it cannot achieve chromosome-long phasing. Trio binning7,8 is the only published method that can both assemble and phase entire chromosomes. It uses sequence reads from both parents to partition the offspring’s long reads and then assembles each partition separately. However, trio binning is unable to resolve regions heterozygous in all three samples in the trio and will leave such regions unphased. More importantly, parental samples are not always available, for example for samples caught in the wild or when parents are deceased. For Mendelian diseases, de novo mutations in the offspring will not be captured and phased with the parents if there are no other heterozygotes nearby. This limits the application of trio binning.
Therefore, we currently lack methods that can accurately produce phased assembly for a single individual and keep pace with sequence technology innovations.

To overcome the limitations in existing methods, we combined recent advances in long-read assembly and Hi-C-based phasing to develop DipAsm, which accurately reconstructs the two haplotypes in a diploid individual using only PacBio’s long high-fidelity (HiFi) reads9 and Hi-C data10, both at ~30-fold coverage, without any pedigree information (Fig. 1). Starting with an unphased Peregrine11 assembly scaffolded by 3D-DNA12 or HiRise13, our pipeline calls small variants with DeepVariant14, phases them with WhatsHap15 and HapCUT2 (ref. 16), partitions the reads and assembles each partition independently with Peregrine again (Methods). Grouping contigs into chromosome-long scaffolds is necessary for phasing of entire chromosomes by WhatsHap and HapCUT2.

Fig. 1: Outline of the phased assembly algorithm, DipAsm.

Assemble HiFi reads into unphased contigs using Peregrine (1); group and order contigs into scaffolds with Hi-C data using HiRise/3D-DNA (3D de novo assembly) (2); map HiFi reads to scaffolds and call heterozygous SNPs using DeepVariant (3); phase heterozygous SNP calls with both HiFi and Hi-C data using WhatsHap plus HapCUT2 (4); partition reads based on their phase using WhatsHap (5); assemble partitioned reads into phased contigs using Peregrine (6).

We demonstrate our method on four human genomes: PGP1 from the Personal Genome Project, HG002 and NA12878 from the Genome in a Bottle dataset17,18 (GIAB) and HG00733 from the Human Genome Structural Variation Consortium (HGSVC)19. We produced HiFi data for the PGP1 genome and Hi-C data for HG002 and HG00733, and assembled the samples with DipAsm (Table 1). For HG002, we also generated a trio-binning-based assembly with Peregrine using parental Illumina reads (Trio Peregrine in Table 1) and obtained a published Trio Canu assembly9 for comparison (Table 1). All HG002 assemblies took the same HiFi data as input. For HG00733, we downloaded a FALCON-Phase assembly6 and a recent assembly built from HiFi and Strand-seq data20. The Strand-seq assembly and our assembly use the same HiFi data, while the FALCON-Phase assembly uses noisy continuous long reads (CLR) and a different Hi-C dataset.

Table 1 Assembly statistics

From sample HG002, we generated a phased de novo assembly of 5.95 gigabases (Gb) in total, including both parental haplotypes. Half of the assembly is contained in contigs of length ~25 Mb (that is, N50), achieving better contiguity than trio-binning-based assemblies. The scaffold N50 for each parent is >130 Mb. In comparison to GIAB’s single-nucleotide polymorphisms (SNPs) phased by trio, our phasing disagrees at only 0.49% of heterozygous SNPs. This low Hamming error rate over the whole genome suggests we have phased almost every chromosome into maternal and paternal haplotypes, and that the switch errors occurring result in only small local errors in phasing of a small fraction of variants.
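As a reminder of how the contiguity figure above is defined, the following minimal sketch computes N50 from a list of contig lengths. The function name and the toy lengths are illustrative assumptions and are not taken from the DipAsm pipeline or from Table 1.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


# Toy contig lengths (arbitrary units), not data from the paper.
toy_contigs = [40, 25, 20, 10, 5]
print(n50(toy_contigs))
```

The same function applies to scaffold lengths to obtain a scaffold N50.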

To evaluate the consensus accuracy of our assembly, we ran the dipcall pipeline21 to align the phased contigs of HG002 against the human reference genome, called SNPs and short insertions and deletions (INDELs) from the alignment and then compared the assembly-based variant calls to GIAB truth calls. Within the 2.36 Gb of confident regions in GIAB, our de novo assembly yields 5,753 false SNP alleles (0.19% of called SNPs) and 65,302 false INDEL alleles (11.86% of called INDELs); 77% of INDEL errors are 1-base-pair (bp) deletions, consistent with a previous observation that 1-bp deletion is the major error mode for this dataset9. On the assumption that false-positive calls are all consensus errors and not structural assembly errors or contig alignment errors, this gives a per-base error rate of $1.5 \times 10^{-5}$, that is, $(5{,}753 + 65{,}302)/(2 \times 2.36 \times 10^{9})$, or Q48 on the Phred scale. Notably, our de novo assembly achieves a consensus accuracy comparable to that of the Arrow-polished Trio Canu assembly. This suggests that signal-based Arrow polishing may not be necessary for HiFi data.
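The arithmetic behind the per-base error rate and its Phred-scaled value can be reproduced in a few lines. This is a sketch using the false SNP and INDEL allele counts and the confident-region size reported above; the variable names are ours, and the factor of two reflects the two haplotypes being evaluated.

```python
import math

# Counts reported in the text for HG002 within GIAB confident regions.
false_snp_alleles = 5_753
false_indel_alleles = 65_302
confident_bases = 2.36e9  # 2.36 Gb
haplotypes = 2            # both phased haplotypes are evaluated

# Per-base error rate, assuming every false call is a consensus error.
error_rate = (false_snp_alleles + false_indel_alleles) / (
    haplotypes * confident_bases
)

# Phred scale: Q = -10 * log10(error probability).
phred_q = -10 * math.log10(error_rate)

print(f"error rate = {error_rate:.2e}, Q = {phred_q:.1f}")
```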

Comparison to GIAB truth data also reveals the phasing power. During assembly, failure to partition reads in heterozygous regions leads to the loss of heterozygotes and thus the elevated false-negative rate in Table 1. On this metric, our Hi-C-based assemblies miss only 0.4% of heterozygous SNPs, around eight times better than trio-binning-based assemblies. Trio binning is less powerful potentially because it is unable to phase a heterozygote when all individuals in a trio are heterozygous at the same site. In addition, trio binning breaks short reads into k-mers, which also reduces power in comparison to mapping of full-length, paired-end Hi-C reads in our pipeline.
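The limitation of trio-based phasing at sites where all three individuals are heterozygous can be illustrated at the genotype level. The sketch below is a simplification under our own assumptions: trio binning itself works on parental k-mers rather than called genotypes, and the function name and allele encoding (0 for reference, 1 for alternate) are hypothetical.

```python
def phase_child_site(child, mother, father):
    """Try to assign the child's two alleles to a maternal and paternal origin.

    Each genotype is a pair of alleles, e.g. (0, 1).
    Returns (maternal_allele, paternal_allele), or None when the parental
    genotypes are consistent with both assignments or with neither.
    """
    a, b = child
    consistent = {
        (m, p)
        for m, p in ((a, b), (b, a))
        if m in mother and p in father
    }
    return consistent.pop() if len(consistent) == 1 else None


# Mother homozygous reference: the alternate allele must be paternal.
print(phase_child_site((0, 1), (0, 0), (0, 1)))
# All three heterozygous: both assignments fit, so the site stays unphased.
print(phase_child_site((0, 1), (0, 1), (0, 1)))
# Neither parent carries the alternate allele (a de novo mutation).
print(phase_child_site((0, 1), (0, 0), (0, 0)))
```

In this toy model the de novo case also returns None, which mirrors the point made in the introduction that such mutations cannot be phased from parental data alone.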

The dipcall pipeline outputs phased long INDELs along with small variants. Evaluated against the GIAB SV truth set22 (v.0.6) with Truvari v.1.3.2, our de novo assembly-based callset shows a sensitivity of 93.4% and precision of 92.6% (Table 1). The sensitivity of trio-binning-based callsets is ~3% lower, consistent with their lower sensitivity on small variants. Nearly all of the putative false-positive calls are low-complexity sequences. We manually inspected some of these false-positive calls from the de novo assembly. In many cases, our long INDEL calls are apparent in both HiFi read alignment and contig alignment but they are often split into multiple INDEL calls that sum to the same length as the GIAB call. Current SV benchmarking tools are unable to match SVs between variant call format (VCF) files when a single SV is represented as multiple events22. Therefore, our precision is probably substantially higher than 92.6% within GIAB SV benchmark regions.

We additionally ran RepeatMasker23 on SV insertion sequences (9.1 Mb total length) and discovered that 831, 540 and 2,303 of these are within LINEs (long interspersed nuclear elements), LTRs (long terminal repeats) and SINEs (short interspersed nuclear elements), respectively. There are 123 microsatellites, 3,582 simple repeats and 270 low-complexity sequences. We also found 21 inversions relative to the reference genome in these HG002 haplotigs (maximum length 25 kb, average length 5 kb). A subset of SVs called from our haplotype assemblies is analyzed in Fig. 2b.

Fig. 2: Applications of phased assemblies.

a, Local sequence divergence relative to the reference haplotypes in the HLA (top) and KIR (bottom) regions of GRCh38. b, SV density (per 100 kb) on chromosome 1 for HG002 (inner), NA12878 (middle) and PGP1 (outer).

Our HG00733 assembly has similar contiguity to the Strand-seq assembly. Evaluated against the phased SNP calls generated by the HGSVC project19, our assembly has a slightly lower phasing error rate and phases more heterozygous SNPs. It is worth noting that the HGSVC calls are not curated. Some of the false negatives in the table may be false positives by HGSVC. We also cannot estimate false-positive rates because HGSVC does not provide confident regions. Both the Strand-seq assembly and our assembly can phase entire chromosomes but the FALCON-Phase assembly cannot, as indicated by the 35.8% Hamming error rate. The FALCON-Phase assembly swaps large blocks of haplotypes between the two phases.
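The difference between a high Hamming error rate and isolated switch errors can be made concrete with a toy example. The definitions below follow common usage and are our assumption; the evaluation tools used for Table 1 may define these metrics slightly differently. Each list encodes, per heterozygous SNP, which allele (0 or 1) sits on the first haplotype.

```python
def hamming_error_rate(truth, pred):
    """Fraction of sites phased differently from the truth, taking the
    better of the two possible haplotype labelings."""
    n = len(truth)
    mismatches = sum(t != p for t, p in zip(truth, pred))
    return min(mismatches, n - mismatches) / n


def switch_count(truth, pred):
    """Number of adjacent site pairs at which agreement with the truth flips."""
    agree = [t == p for t, p in zip(truth, pred)]
    return sum(a != b for a, b in zip(agree, agree[1:]))


truth = [0] * 10
block_swap = [0] * 5 + [1] * 5                 # one long-range swap
local_flip = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]    # one isolated wrong site

print(hamming_error_rate(truth, block_swap), switch_count(truth, block_swap))
print(hamming_error_rate(truth, local_flip), switch_count(truth, local_flip))
```

A single switch in the middle of a chromosome drives the Hamming error towards 50%, whereas an isolated flipped site costs two switches but barely moves the Hamming error. This is why a Hamming error rate in the tens of percent points to large swapped blocks.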

We assembled two further human genomes, NA12878 and PGP1, with DipAsm. We achieved chromosome-long phasing despite the shorter read length of NA12878 and the lower read coverage of PGP1. Compared again to GIAB, the NA12878 assembly has even better consensus accuracy, measured at Q55 in GIAB confident regions. Notably, the raw HiFi base quality of NA12878 and HG002 is similar. To understand why NA12878 has better consensus, we counted distinct 31-mers in both assemblies and HiFi reads. We found for NA12878 that 3.63% of 31-mers occurring at least three times in reads are absent from the assembly but, for HG002, the percentage rises to 6.35%. Given that the completeness of NA12878 and HG002 is similar, the higher percentage suggests that there are more recurrent sequencing errors in HG002, which could explain the lower consensus accuracy of HG002.
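The k-mer comparison described above can be sketched as follows. This is a toy, in-memory illustration under our own assumptions: it ignores reverse complements, and a genome-scale analysis would need a dedicated k-mer counter rather than Python dictionaries. The function names and the example sequences are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of a sequence, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def missing_solid_fraction(reads, assembly_seqs, k=31, min_count=3):
    """Fraction of distinct k-mers seen at least `min_count` times in the
    reads that do not occur anywhere in the assembly."""
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    solid = {km for km, count in read_counts.items() if count >= min_count}
    if not solid:
        return 0.0
    assembled = {km for seq in assembly_seqs for km in kmers(seq, k)}
    return len(solid - assembled) / len(solid)


# Toy data with k=3: half of the reads carry a recurrent difference
# (ACGAA instead of ACGTA) that the assembly does not contain.
toy_reads = ["ACGTA"] * 3 + ["ACGAA"] * 3
toy_assembly = ["ACGTA"]
print(missing_solid_fraction(toy_reads, toy_assembly, k=3, min_count=3))
```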

The HLA and KIR regions are among the most polymorphic in the human genome. Our phased assemblies can reconstruct most of these regions with two contigs for each haplotype. Based on the pattern of local sequence divergence (Fig. 2a), we can see that the two haplotypes in each individual are distinct from one another. Such regions can be faithfully assembled only when we phase through the entire regions.

We present a method to generate a phased assembly for a single human individual or, potentially, a diploid sample of other species. It accurately produces chromosome-long phasing using only two types of input data: HiFi and Hi-C. In comparison to other published single-sample phased assembly algorithms, our method is capable of chromosome-long phasing. While Strand-seq, in combination with HiFi, has recently been used to phase entire chromosomes as well20, Hi-C is easier to produce and more widely used. In comparison to trio binning, our method is not restricted to samples having pedigree data and can phase de novo mutations. It gives more contiguous assembly and phases a larger fraction of the genome for human samples. Meanwhile, our assembly strategy is not without limitations. First, relying on accurate SNP calls from long reads and using Peregrine for assembly, our pipeline does not work with noisy long reads at present. It is possible to switch to a noisy read assembler and to add Illumina data for SNP calling, but assembly accuracy may be reduced due to the elevated sequencing error rate. Second, starting with an unphased assembly, we may miss highly heterozygous regions involving long SVs, as demonstrated in our previous work on small genomes5,8. A potential solution is to retain heterozygous events in the initial assembly graph and to scaffold and dissect these events later to generate a phased assembly. Nevertheless, our improved de novo method sets a milestone. Its ability to generate phased assemblies without using a reference sequence will enable the unbiased characterization of human genome diversity and construction of a comprehensive human pangenome, which are currently goals of the Human Genome Reference Project. The ability to accurately resolve highly polymorphic regions of biological importance, such as HLA and KIR, will further the goals of precision medicine.

Methods

PacBio circular consensus sequencing for PGP1

Library preparation: genomic DNA was converted into a SMRTbell library as previously described9, but with several modifications to generate slightly larger inserts. Specifically, gDNA was sheared using Megaruptor from Diagenode with the 30-kb shearing protocol using a long hydropore cartridge. Before library preparation, the size distribution of sheared DNA was characterized on the Agilent Femto Pulse System. A sequencing library was constructed from this sheared gDNA using the SMRTbell Template Prep Kit v.1.0 (Pacific Biosciences, no. 100-259-100). To tighten the size distribution of the SMRTbell library, it was size fractionated using the SageELF System from Sage Science. Approximately 4 µg of the SMRTbell library was prepared with loading solution/Marker40; next, the sample was loaded onto a 0.75% agarose 10–40-kb gel cassette and size fractionated using a run target size of 7,000 bp set for elution well 12. A total of 8 µg was fractionated on two cassettes. Fractions having the desired size distribution range were identified on the Agilent Femto Pulse System. Fractions centered at 11 kb were pooled to generate an 11-kb library, and those centered at 16 kb were pooled to create a 16-kb library. Both libraries were used for sequencing.

Sequencing: sequencing reactions were performed on the PacBio Sequel System with Sequel Sequencing Kit 3.0 chemistry. The samples were pre-extended without exposure to illumination for 12 h to enable transition of the polymerase enzymes into the highly processive strand-displacing state, and sequencing data were collected for 24 h to ensure maximal yield of high-quality HiFi reads. Sequencing reactions were also performed on the PacBio Sequel II System using Sequel II Sequencing Kit 1.0 chemistry. On the Sequel II system, data collection was extended to 30 h to ensure suitable amounts of data.

Hi-C sequencing for HG002 and HG00733

A Hi-C library was generated on HG002 and HG00733 by Arima Genomics using a modified version of the Arima-HiC kit. Briefly, the current Arima-HiC kit (no. A510008) utilizes two restriction enzymes for simultaneous chromatin digestion. In the modified protocol, four restriction enzymes were deployed to enable more uniform per-base coverage of the genome while maintaining the highest long-range contiguity signal, thereby benefiting analyses such as variant discovery, base polishing, scaffolding and phasing. After modified chromatin digestion, digested ends were labeled, proximally ligated and then proximally ligated DNA was purified. Following the modified Arima-HiC protocol, Illumina-compatible sequencing libraries were prepared by first shearing purified Arima-HiC ligation products and then size selecting DNA fragments using SPRI beads. The size-selected fragments containing ligation junctions were enriched using enrichment beads provided with the Arima-HiC kit, and converted into Illumina-compatible sequencing libraries using the Swift Accel-NGS 2S Plus kit (no. 21024) reagents. After adapter ligation, DNA was PCR amplified and purified using SPRI beads. Purified DNA underwent standard quality control (quantitative PCR and Bioanalyzer) and was sequenced on HiSeq X following the manufacturer’s protocols.

Phased sequence assembly

We ran Peregrine v.0.1.5.2 with the following command line: ‘peregrine asm reads.lst 24 24 24 24 24 24 24 24 24 --with-consensus --shimmer-r 3 --best_n_ovlp 8 --output asm’, where file ‘reads.lst’ gives the list of input read files and directory ‘asm’ holds the output assembly. We mapped Hi-C reads to contigs with BWA-MEM v.0.7.17 and scaffolded the Peregrine contigs with juicer v.1.5 and 3D-DNA v.180922. We preprocessed data with ‘juicer.sh -d juicer -p chrom.sizes -y cut-sites.txt -z contigs.fa -D’, where file ‘cut-sites.txt’ was generated using the generate_site_positions_Arima.py script; this juicer run outputs merged_nodups.txt. The scaffolds were produced with ‘run-asm-pipeline.sh -m haploid contigs.fa merged_nodups.txt’. We then called small variants using DeepVariant v.0.8.0 with the pretrained PacBio model. We mapped Hi-C reads to the scaffolds and ran HapCUT2 v.1.1 over heterozygous SNP sites to obtain sparse phasing at the chromosome scale. The resulting haplotypes were then combined with PacBio HiFi data using WhatsHap v.0.18, with default parameters, to generate fine-scale, chromosome-long phasing. We partitioned HiFi reads based on the phases of SNPs residing on these reads, and ran Peregrine again for reads on the same haplotype from the same scaffold. This provided the final phased assembly.
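The read-partitioning step can be pictured as a vote over the phased SNPs that each read covers. The sketch below is a simplified illustration under our own assumptions and is not the algorithm implemented in WhatsHap, which performs this step in the pipeline; the function name, the `min_margin` parameter and the data layout are hypothetical.

```python
def assign_haplotype(read_alleles, phased_snps, min_margin=1):
    """Assign a read to haplotype 1 or 2 by majority vote.

    read_alleles: {position: allele observed on the read}
    phased_snps:  {position: (haplotype-1 allele, haplotype-2 allele)}
    Returns 1, 2, or None when the read covers no informative site or
    the vote margin is below `min_margin`.
    """
    votes = [0, 0]
    for pos, allele in read_alleles.items():
        if pos not in phased_snps:
            continue  # not a phased heterozygous site
        hap1_allele, hap2_allele = phased_snps[pos]
        if allele == hap1_allele:
            votes[0] += 1
        elif allele == hap2_allele:
            votes[1] += 1
    if abs(votes[0] - votes[1]) < min_margin:
        return None
    return 1 if votes[0] > votes[1] else 2


# Toy phased SNPs on one scaffold (positions and alleles are made up).
phased = {100: ("A", "G"), 250: ("C", "T"), 400: ("G", "A")}
print(assign_haplotype({100: "A", 250: "C"}, phased))
print(assign_haplotype({100: "G", 400: "A"}, phased))
print(assign_haplotype({100: "A", 250: "T"}, phased))
```

Reads that return None in this sketch correspond to reads that cannot be placed on either haplotype; how such reads are handled is a design decision that the sketch leaves open.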

Evaluation of variant calling accuracy

For GIAB samples HG002 and NA12878, we compared small variant calls to GIAB v.3.3.2 with RTG’s vcfeval v.3.8.4. We extracted allelic errors with the ‘hapdip.js rtgeval’ script from the syndip pipeline21. For sample HG002, we used Truvari v.1.3.2 to evaluate long INDEL accuracy against GIAB SV v.0.6. We specified the option ‘--passonly --multimatch’ to skip filtered calls in the GIAB VCF and to allow matching of base calls to multiple comparison calls, and vice versa. Increasing evaluation distance from the default 500 to 1,000 with ‘-r 1000’ only marginally improved precision, from 92.6% to 93.3%.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary in this article.