Introduction

De novo genome assembly is a fundamental task in genomic research1,2,3. Using long noisy reads from sequencing technologies such as continuous long read (CLR) sequencing from Pacific Biosciences (PacBio) and Nanopore sequencing from Oxford Nanopore Technologies (ONT), many assemblers4,5,6,7,8,9 can now effectively reconstruct high-quality genome sequences for haploid or inbred species. However, a significant fraction of the genetic information of a diploid genome is lost in those assemblies, and most of them contain many haplotype switch errors. Because long noisy reads, such as PacBio CLR reads and Nanopore reads, usually contain 5–15% sequencing errors10, it is difficult to distinguish heterozygotes from sequencing errors in long noisy reads11,12,13,14,15, which prevents diploid assemblers from generating long haplotype-specific contigs. Recently, by combining long noisy reads with additional highly accurate sequencing data, such as parental short reads16,17,18, PacBio HiFi reads17,18, Hi-C reads19, Strand-seq data20, or gamete cell data21, assemblers can now produce more contiguous haplotype-specific contigs. However, the requirement for additional sequencing data increases costs and limits their application in practice22,23. Meanwhile, it is easier to identify haplotype differences using highly accurate PacBio HiFi reads24 (< 1% error rate), which have been widely used for haplotype-resolved assemblies25,26,27,28. However, the average lengths of HiFi reads (10–25 kb) are shorter than those of long noisy reads. For example, ultra-long Nanopore reads can reach 1 Mb in length, with read N50 > 100 kb29. Longer reads usually help to assemble more contiguous contigs30. Therefore, it would be useful to develop an assembler that can take advantage of long noisy reads to generate more contiguous haplotype-specific contigs for diploid genomes.

To achieve high-quality assemblies, error correction is usually a useful step in genome assembly from long noisy reads. One challenge for correct-then-assemble pipelines in assembling a diploid genome from long noisy reads is how to retain heterozygotes during error correction. If the sequencing error rate exceeds the haplotype divergence, current correct-then-assemble pipelines4,5,6,9 either eliminate heterozygotes as sequencing errors or mix alleles of different haplotypes in a read. Therefore, corrected reads do not contain the heterozygote information needed for haplotype phasing. Assemblers such as FALCON-Unzip4 used raw reads instead of corrected reads in their “unzip” phasing step, then mapped the raw reads to the corrected reads for the later diploid assembly. This led to assembly errors caused by heterozygote errors in the corrected reads and by mapping errors between corrected reads and raw reads, because the lengths of reads and their SNP positions changed after correction4. Furthermore, phasing raw reads is more difficult because of their high error rate.

Another challenge is partitioning reads according to their haplotypes in the phasing step. After error correction, sequencing errors still remain in the corrected reads. The error profile of corrected reads is also more complex than that of PacBio HiFi reads. Because of these sequencing errors, a significant fraction of reads is difficult to phase to the correct haplotype. This leads to inconsistent overlaps25, that is, overlaps between reads that come from different haplotypes or from different copies of a segmental duplication. Inconsistent overlaps can cause haplotype switch errors or unresolved repeats. Accurately phasing the reads, and then identifying and removing inconsistent overlaps, is important in overlap-graph-based diploid genome assembly. Even when a genome is assembled from PacBio HiFi reads, an error correction step is still needed to improve the phasing accuracy25. The higher error rate of corrected long noisy reads makes phasing and identifying inconsistent overlaps more difficult. It is therefore necessary to design a more accurate and robust method to identify inconsistent overlaps for assemblers based on long noisy reads.

Furthermore, to construct two haplotypes of diploid genomes, the assemblers often need higher coverage of sequencing data. However, with the increasing amount of sequencing data, the running time of assembly increases nonlinearly. Therefore, assembling large diploid genomes at an acceptable time is another challenge. Overlap-graph-based assemblers have their unique advantages to assemble diploid genomes since the overlaps can be reused in subsequent steps once they are found. This helps to design an efficient multi-round assembly strategy. The overlap-graph-based assemblers usually use seed (i.e., k-mers, minimizers31) based methods to find candidate overlaps, then perform local alignment to find the true overlaps. The local alignment is the major computational bottleneck of overlap-graph-based approaches. Although skipping local alignment can accelerate the overlap-finding step32, it leads to lots of false-positive overlaps and introduces errors in assembly graphs. String-graph-based33 approaches, a type of overlap-graph-based approach, ignore the reads contained by other reads and mark most edges in the graph as transitive edges33 that don’t contribute to the construction of contigs. Then, it is not necessary to perform local alignment on those read pairs. Therefore, it is possible to design a fast string-graph-based diploid genome assembler by minimizing the number of local alignments needed.

In this work, we present PECAT, a Phased Error Correction and Assembly Tool, designed to reconstruct diploid genomes from long noisy reads, including PacBio CLR reads and Nanopore reads. PECAT follows the correct-then-assemble strategy, including a haplotype-aware error correction module, which can retain heterozygote alleles while correcting sequencing errors, and a two-round string graph-based assembly module. To accelerate the assembling, PECAT only performs local alignment when it is necessary instead of performing local alignments on all candidate overlaps. PECAT outputs two sets of contigs either in primary/alternate format (long contigs with the mosaic of homologous haplotypes and short haplotype-specific contigs) or in the dual assembly format27 (two sets of long contigs with the mosaic of homologous haplotypes). PECAT can efficiently assemble diploid genomes using Nanopore R9, PacBio CLR or more accurate Nanopore R10 reads only and generate more contiguous haplotype-specific contigs compared to other assemblers.

Results

Haplotype-aware error correction

Our error correction method (Fig. 1a and Supplementary Fig. 1) is based on the partial-order alignment (POA) graph34 method. For the template read to be corrected, a POA graph is built from the alignment of supporting reads. Then, the path with the highest weight is found in the POA graph to construct the consensus sequence for the template read. However, if the sequencing error rate exceeds haplotype divergence, current methods inevitably select supporting reads from different haplotypes. This causes either heterozygote alleles in the template reads to be eliminated as sequencing errors or heterozygous alleles from different haplotypes to be mixed in corrected reads.
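As a minimal sketch of the consensus step described above, the following Python finds the highest-weight path in a small directed acyclic graph by dynamic programming. The function name `heaviest_path`, the input layout (nodes listed in topological order, edge weights in a dictionary) and the toy graph are assumptions made for illustration only; they are not PECAT's data structures.

```python
from collections import defaultdict


def heaviest_path(nodes, edges):
    """Return the bases along the maximum-weight path of a DAG.

    nodes: list of (node_id, base) pairs, assumed to be in topological order.
    edges: dict mapping (source_id, target_id) -> edge weight, e.g. the
           number of supporting reads that traverse the edge.
    """
    base = dict(nodes)
    preds = defaultdict(list)
    for (u, v), w in edges.items():
        preds[v].append((u, w))

    best = {}  # best[v]: weight of the heaviest path ending at v
    back = {}  # back[v]: predecessor of v on that path, or None
    for node_id, _ in nodes:
        best[node_id] = 0
        back[node_id] = None
        for u, w in preds[node_id]:
            if best[u] + w > best[node_id]:
                best[node_id] = best[u] + w
                back[node_id] = u

    # Trace back from the node with the highest accumulated weight.
    current = max(best, key=best.get)
    path = []
    while current is not None:
        path.append(current)
        current = back[current]
    return "".join(base[n] for n in reversed(path))


# Toy graph with two parallel branches (G or T) between C and the final A.
toy_nodes = [(0, "A"), (1, "C"), (2, "G"), (3, "T"), (4, "A")]
toy_edges = {(0, 1): 5, (1, 2): 4, (1, 3): 1, (2, 4): 4, (3, 4): 1}
print(heaviest_path(toy_nodes, toy_edges))  # ACGA
```

In this toy graph the G branch carries most of the weight, so the T branch is discarded. When the minority branch is a true allele of the template read rather than an error, this is the situation in which a heterozygote allele would be lost.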

Fig. 1: Overview of PECAT.

a Haplotype-aware error correction method. The reads in different colors (in green or in yellow) are from different haplotypes. The reads with a dark gray background color indicate that they are not selected. The supporting reads that are more likely from the same haplotype are selected to correct the template read. b The first round of assembly. PECAT finds overlaps between reads. The dashed lines indicate there are overlaps between reads. Our string-graph-based assembler is performed to construct haplotype-collapsed contigs. The reads in green or in yellow are from heterozygosity regions. The same color indicates that the reads come from the same haplotype. The reads in grey are from homozygosity regions. c The second round of assembly. PECAT identifies inconsistent overlaps by calling and analyzing SNP alleles in reads. After removing inconsistent overlaps from the overlaps found in the first round of assembly, PECAT performs our string-graph-based assembler again to construct the contigs in the primary/alternate format or the dual assembly format.

To maintain the heterozygote alleles in the template read, we need to select supporting reads from the same haplotype to correct it. After analyzing the POA graph, we found that the difference between random sequencing errors and heterozygotes is reflected in the graph. In the case of heterozygotes, there are two dominant parallel branches in the POA graph. In the case of random errors, there tends to be only one dominant branch (Supplementary Fig. 1b). Based on this finding, we design a scoring algorithm to estimate the likelihood that the supporting read and the template read are from the same haplotype (Methods). For each position at which there are two dominant parallel branches, the score is increased by 1 if the supporting read and the template read pass through the same dominant branch, and decreased by 1 if they pass through different dominant branches (Supplementary Fig. 1c). A higher score means that the template read and the supporting read are more likely to be from the same haplotype. If the supporting reads come from different haplotypes, the histogram of the scores of all supporting reads should show two peaks (Supplementary Fig. 1d). We then select the high-scoring supporting reads, which are very likely to be from the same haplotype as the template read, to correct the template read. To further increase the likelihood of selecting supporting reads from the same haplotype, we assign different weights to the reads according to their scores: the higher the score of a read, the larger the weight assigned to it (Methods). We then remove unselected reads from the POA graph by setting their weights to 0. Finally, we use a dynamic programming approach to find the path with the highest weight in the POA graph, concatenate the nodes in the path, and generate the consensus sequence (Methods).
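The scoring rule above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation described in Methods: each read is represented as a dictionary from a POA position with two dominant branches to the branch (0 or 1) that the read follows, the names `haplotype_score` and `read_weights` are hypothetical, and both the `min_score` cutoff and the choice of using the score itself as the weight are placeholders for the selection and weighting scheme given in Methods.

```python
def haplotype_score(template_branches, support_branches):
    """Score how likely a supporting read shares the template's haplotype.

    Both arguments map a POA position that has two dominant parallel
    branches to the branch id (0 or 1) the read passes through. Positions
    where a read follows neither dominant branch are simply absent.
    """
    score = 0
    for pos, t_branch in template_branches.items():
        s_branch = support_branches.get(pos)
        if s_branch is None:
            continue  # no evidence at this position
        score += 1 if s_branch == t_branch else -1
    return score


def read_weights(template_branches, supports, min_score=1):
    """Assign a weight to every supporting read (hypothetical scheme).

    Reads scoring below min_score are treated as unselected and get
    weight 0; selected reads get their score as weight, so that a higher
    score yields a larger weight.
    """
    weights = {}
    for name, branches in supports.items():
        score = haplotype_score(template_branches, branches)
        weights[name] = score if score >= min_score else 0
    return weights


template = {10: 0, 50: 1, 90: 0}
supports = {
    "r1": {10: 0, 50: 1, 90: 0},  # agrees at all three positions
    "r2": {10: 1, 50: 0},         # disagrees at both covered positions
    "r3": {10: 0},                # agrees at the only covered position
}
print(read_weights(template, supports))  # {'r1': 3, 'r2': 0, 'r3': 1}
```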

We evaluate the performance of the selection method on seven diploid datasets: S. cerevisiae (SK1×Y12), A. thaliana (Col-0×Cvi-0), D. melanogaster (ISO1×A4), B. taurus (Angus×Brahman), A. thaliana (Col-0×C24), B. taurus (Bison×Simmental) and HG002 (Nanopore R9), all with reads classified by the trio-binning algorithm16 (Methods). As shown in Supplementary Table 1, without the selection method, 36.7%, 31.5%, 41.1%, 28.1%, 33.8%, 37.2% and 27.4% of supporting reads, respectively, are inconsistent reads, that is, reads from a haplotype different from that of the template read. In contrast, only 2.8%, 3.5%, 4.9%, 3.6%, 4.3%, 2.9%, and 4.2% of the selected supporting reads, respectively, are inconsistent. After further re-weighting the scores, the percentages of inconsistent reads are reduced to 2.1%, 3.1%, 4.0%, 3.1%, 3.5%, 2.3%, and 3.4%, respectively.

Since most of the selected supporting reads are from the same haplotype as the template read, most heterozygote alleles in the template read are correctly retained by our error correction method. We compare our method with other methods, including Canu5, MECAT26, and NECAT9, on simulated PacBio CLR data and Nanopore data (Fig. 2a, Supplementary Table 2, and Supplementary Note 1). The overall accuracies of the corrected reads are similar and all exceed 99%. However, the accuracies of SNP alleles in corrected reads are less than 80% for Canu and less than 60% for MECAT2 and NECAT, which are far worse than those of raw reads and of reads corrected by PECAT. When the heterozygosity rate of the simulated reads is greater than or equal to 0.0005, the accuracy of SNP alleles in reads corrected by PECAT is greater than 99%, which exceeds that of raw reads (96% or 98%). However, when the heterozygosity rate of the simulated reads is 0.0001, the accuracy of SNP alleles in reads corrected by PECAT drops to 92% ~ 94%, which is lower than that of raw reads (96% ~ 98%). Therefore, for genome regions with high heterozygosity (>= 0.0005), our haplotype-aware error correction preserves the SNPs well. For genome regions with low heterozygosity (< 0.0005), however, we need to combine it with other methods to improve the accuracy of SNP calling (Methods).

Fig. 2: Performance comparison of error correction.

a Accuracy of raw and corrected reads and accuracy of SNP alleles in raw and corrected reads on the simulated datasets with different heterozygosity rates. b Accuracy of raw reads and corrected reads by NECAT and PECAT in difficult-to-map regions and low-complexity regions of HG002 reference genome. c, d Consistency, and completeness of raw reads and corrected reads by Canu, FALCON, MECAT2, NECAT, and PECAT on the seven diploid datasets. The metrics by Canu on B. taurus (PacBio CLR and ONT) and HG002 (ONT) and the metrics by FALCON on B. taurus (PacBio CLR) are excluded because they could not finish correcting in three weeks. Consistency is defined as \(\sum \max ({k}_{p},{k}_{m})/\sum ({k}_{p}+{k}_{m})\), in which \({k}_{p}\) and \({k}_{m}\) are the number of paternal and maternal haplotype-specific k-mers in each read. Completeness is the percentage of parent-specific k-mers (occurrences \(\ge 4\)) in the 40X longest reads. e Consistency of D. melanogaster (ISO1 × A4) raw reads and corrected reads by the different methods. Each point corresponds to a read. Its coordinate gives the proportion of the parental specific k-mers in the read, where k is 18. All 40X longest reads are shown in each sub-figure.

Fast string graph-based assembler

After error correction, PECAT implements two rounds of string graph-based assembly. In each round of assembly, we first construct the overlaps between the corrected reads using the seed-based alignment method (minimap235), which allows us to build the overlaps quickly. However, seed-based alignment can produce low-quality overlaps with low identity or with long overhangs. Those low-quality overlaps could introduce errors during assembly. Performing local alignment on overlaps to identify low-quality ones becomes the major computational cost. To speed up the assembly, PECAT performs local alignments only when necessary during the construction of the string graph (Methods). First, to reduce the overhangs of overlaps, we use the diff algorithm36 to extend the candidate overlaps to the ends of the reads. Here, we only perform local alignment on the overhangs instead of on the whole overlap. We remove the overlaps if their overhangs are still long (longer than \(\min (100, 0.01\cdot l)\) for PacBio reads and \(\min (300, 0.03\cdot l)\) for Nanopore reads, where \(l\) is the length of the read). On the diploid datasets of S. cerevisiae (SK1×Y12), A. thaliana (Col-0×Cvi-0), D. melanogaster (ISO1×A4), B. taurus (Angus×Brahman), A. thaliana (Col-0×C24), B. taurus (Bison × Simmental) and HG002, 5.3%, 10.3%, 82.2%, 12.8%, 20.0%, 59.4% and 56.3% of candidate overlaps, respectively, still have long overhangs and are removed as low quality (Supplementary Table 3). Then, we further filter out the overlaps whose reads are contained in other reads or have low coverage. Only 0.15%, 0.18%, 0.86%, 0.86%, 0.03%, 1.23%, and 0.15% of overlaps remain on the above seven diploid datasets (Supplementary Table 3). We then construct a directed string graph from the remaining overlaps. We find the transitive edges using Myers’ algorithm33 and mark them as inactive edges, which are not used for constructing contigs. On the above diploid datasets, only 25.1%, 16.7%, 16.3%, 18.5%, 20.7%, 15.4%, and 23.6% of edges are active (Supplementary Table 3).
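The overhang filter can be illustrated with a short Python sketch. The thresholds \(\min(100, 0.01\cdot l)\) and \(\min(300, 0.03\cdot l)\) are taken from the text; everything else is an assumption made for illustration: the function names `max_overhang` and `keep_overlap` are hypothetical, and the sketch assumes that each read in an overlap has one remaining overhang length that is compared against the threshold for that read's own length.

```python
def max_overhang(read_length, platform):
    """Longest overhang tolerated for a read of the given length."""
    if platform == "pacbio":
        return min(100, 0.01 * read_length)
    if platform == "nanopore":
        return min(300, 0.03 * read_length)
    raise ValueError(f"unknown platform: {platform}")


def keep_overlap(overhang_a, overhang_b, len_a, len_b, platform):
    """Keep an overlap only if neither read has a long remaining overhang.

    overhang_a and overhang_b are the unaligned lengths left on each read
    after the candidate overlap has been extended toward the read ends.
    """
    return (overhang_a <= max_overhang(len_a, platform)
            and overhang_b <= max_overhang(len_b, platform))


# A 20 kb PacBio read tolerates at most 100 bp; a 5 kb one only 1% of 5 kb.
print(max_overhang(20000, "pacbio"))
print(max_overhang(5000, "pacbio"))
# The same overlap is rejected for PacBio but accepted for Nanopore reads.
print(keep_overlap(30, 120, 20000, 20000, "pacbio"))
print(keep_overlap(30, 120, 20000, 20000, "nanopore"))
```

Because the threshold is the minimum of an absolute and a relative term, the absolute cap applies to long reads, while short reads are held to the stricter proportional limit.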

To remove low-quality edges in the string graph, we calculate the identity of the overlaps (the active edges) in the string graph using local alignments. Since most edges have been marked as inactive, we only need to perform local alignments for a small portion of overlaps. On the above diploid datasets, 2.1%, 12.2%, 9.5%, 2.2%, 13.2%, 1.8%, and 6.9% of active edges have been removed because their identities are less than the threshold (Methods) (Supplementary Table 3). After the above step, some paths in the graphs connected by low-quality edges are broken. Those broken paths need to be connected using the transitive edges that have been labeled as inactive edges. We then select transitive edges with the longest alignment and their identity greater than the threshold to connect the broken paths. On the above diploid datasets, about 0.67%, 0.12%, 0.15%, 0.09%, 0.13%, 0.05%, and 0.16% of transitive edges have been reactivated (Supplementary Table 3). Finally, we attempt to use the contained reads to connect the broken paths, and about 0.30%, 0.39%, 0.17%, 0.11%, 0.77%, 0.13%, and 0.45% of new edges are added to the graph on the above diploid datasets (Supplementary Table 3). In the first round of assembly (Fig. 1b), PECAT finds linear paths from this string graph and constructs haplotype-collapsed contigs. In the second round of assembly (Fig. 1c), PECAT identifies and removes inconsistent overlaps (next Section) in the string graph, and then generates two sets of contigs in primary/alternate format or dual assembly format. After generating contigs, we use corrected reads (CLR data) or raw reads (Nanopore data) to polish them to improve the quality (Methods).

Identification of inconsistent overlaps

The inconsistent overlaps connect reads from different haplotypes or different copies of the segmental duplication in the overlap graph. They cause haplotype switch errors or unresolved repeats (Supplementary Fig. 2). After identifying and filtering out inconsistent overlaps, only the overlaps between reads from the same haplotype or the same copies of the segmental duplication are left, and then the assembler can naturally generate contigs of two haplotypes. Since our corrected reads contain allele information, we may use the SNP allele information to identify inconsistent overlaps. If the SNP alleles on a pair of reads are different, these two reads should come from different haplotypes and the overlaps between them should be inconsistent. At the error correction step, we have scored the possibility that two raw reads come from the same haplotype. However, the accuracy of the previous scoring that is based on the alignments between two raw reads cannot meet the requirements for identifying inconsistent overlaps. Therefore, we developed a read-level SNP caller and a read grouping method for identifying inconsistent overlaps.

First, we map all corrected reads onto the haplotype-collapsed contigs from the first round of assembly using minimap235. We call the heterozygous SNP sites based on the base frequencies in the alignments (Methods). Then, we take each read in turn as a template read and collect the set of reads that share SNP sites with it. We cluster this set of reads according to the SNP alleles in the reads. Reads from the same haplotype are more likely to be clustered into the same subgroup. We then verify and correct the SNP alleles in the template read using the other reads in the same subgroup. After identifying the SNP alleles in each read, we use this SNP information to remove the inconsistent overlaps from the candidate overlaps found in the first round of assembly. Then, we construct the string graph again. For genome regions with high heterozygosity (>= 0.0005), this approach works well because of the high accuracy of SNPs in corrected reads. However, for genome regions with low heterozygosity (< 0.0005), an SNP caller that uses corrected reads only cannot identify inconsistent overlaps reliably, because the accuracy of SNPs in corrected reads decreases dramatically in those regions. Therefore, we also use raw reads to identify and filter inconsistent overlaps. We first use corrected reads to identify inconsistent overlaps. Then, we call SNPs using raw reads and identify inconsistent overlaps again (Methods). For Nanopore reads, we use Clair315 to call heterozygous SNP sites. For PacBio CLR reads, we currently do not have a tool to call SNPs from raw reads.
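The core idea, that reads carrying different alleles at shared heterozygous SNP sites should not be joined, can be sketched as follows. This is a simplified illustration rather than the method in Methods: it skips the clustering and allele-correction steps, the function names are hypothetical, and the decision rule (at least `min_diff` disagreeing sites, and more disagreements than agreements) is an assumed placeholder for the actual criteria.

```python
def compare_alleles(alleles_a, alleles_b):
    """Count agreeing and disagreeing alleles at SNP sites shared by two reads.

    Each argument maps an SNP site, here a (contig, position) tuple, to the
    allele observed in the read.
    """
    shared = alleles_a.keys() & alleles_b.keys()
    same = sum(1 for site in shared if alleles_a[site] == alleles_b[site])
    return same, len(shared) - same


def is_inconsistent(alleles_a, alleles_b, min_diff=2):
    """Hypothetical rule: enough disagreeing sites, outnumbering agreements."""
    same, diff = compare_alleles(alleles_a, alleles_b)
    return diff >= min_diff and diff > same


def filter_overlaps(overlaps, read_alleles, min_diff=2):
    """Drop overlaps whose two reads look like they are from different haplotypes."""
    return [
        (a, b) for a, b in overlaps
        if not is_inconsistent(read_alleles.get(a, {}),
                               read_alleles.get(b, {}), min_diff)
    ]


read_alleles = {
    "r1": {("ctg1", 100): "A", ("ctg1", 250): "G", ("ctg1", 400): "T"},
    "r2": {("ctg1", 100): "A", ("ctg1", 250): "G"},
    "r3": {("ctg1", 100): "C", ("ctg1", 250): "T", ("ctg1", 400): "T"},
}
overlaps = [("r1", "r2"), ("r1", "r3"), ("r2", "r3"), ("r1", "r4")]
print(filter_overlaps(overlaps, read_alleles))  # [('r1', 'r2'), ('r1', 'r4')]
```

Reads without any called SNP allele (such as the hypothetical `r4`) provide no evidence either way, so their overlaps are kept in this sketch.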

We evaluate the performance of the inconsistent overlaps identification method using the haplotype reads classified by the trio-binning algorithm16. As shown in Supplementary Table 4, compared with the first round of assemblies, the percentages of inconsistent overlaps in simplified graphs of the second round of assemblies of diploid datasets S. cerevisiae (SK1×Y12), A. thaliana (Col-0×Cvi-0), D. melanogaster (ISO1×A4), B. taurus (Angus×Brahman), A. thaliana (Col-0×C24), B. taurus (Bison × Simmental) and HG002 decrease sharply. The percentages of inconsistent overlaps have decreased from 36.98%, 16.00%, 15.94%, 36.55%, 16.62%, 18.08%, and 37.45% in the first round of assemblies to 0.32%, 0.57%, 0.90%, 0.63%, 0.79%, 0.03%, and 1.09% in the second round of assemblies, respectively. For the three Nanopore datasets, we compare the performance of the method using corrected reads, raw reads, and both (Supplementary Table 4). Using both raw and corrected reads, we get lower percentages of inconsistent overlaps in simplified graphs of the second round of assemblies of all three Nanopore datasets. The high performance of the inconsistent overlap identification method ensures that PECAT can generate the haplotype-specific contigs effectively. After filtering inconsistent overlaps, collapsed regions in the first round of assembly are separated in the second round of assembly. The sizes of alternate contigs are close to their corresponding reference genome in the second round of assembly (Supplementary Table 5).

Moreover, our inconsistent overlap identification method can also help resolve nearly identical repeats without knowing their copy number. The clustering step can automatically separate nearly identical repeats if there are SNPs that distinguish the copies. After filtering out the inconsistent overlaps at the repeats, PECAT can resolve the repeats and generate more contiguous contigs in the second round of assembly. As a result, PECAT achieves more contiguous assemblies in the second round of assembly on most of our diploid datasets (Supplementary Table 5). As shown in Supplementary Figs. 3 and 4, PECAT resolves the repeats in the NCTC9024 and NCTC9006 datasets and reconstructs two circular contigs, while other assemblers obtain fragmented results7,37.

Performance of PECAT error correction method

We evaluate the performance of the PECAT error correction method using four PacBio diploid datasets: S. cerevisiae (SK1×Y12), A. thaliana (Col-0×Cvi-0), D. melanogaster (ISO1×A4), B. taurus (Angus×Brahman), and three Nanopore diploid datasets: A. thaliana (Col-0×C24), B. taurus (Bison × Simmental), and HG002. We compare PECAT with the error correction methods in four other tools: Canu5, FALCON4, MECAT26, and NECAT9 (Fig. 2b–e and Supplementary Table 6). All methods report high accuracy (>98.6%). We also evaluate the performance of PECAT and NECAT on the difficult-to-map regions38 and low-complexity regions of HG002 (Methods). On average, the accuracies of PECAT are 0.75% higher than those of NECAT, and the accuracy of corrected reads in these regions is similar to that in normal regions (Fig. 2b and Supplementary Table 7). We then assess the haplotype-specific k-mer completeness and consistency of the corrected reads (Methods). These two metrics evaluate the ability to retain consistent heterozygote alleles. The reads corrected by PECAT have higher completeness and consistency on all datasets (Fig. 2c, d). In particular, the haplotype-specific k-mer consistencies of the reads corrected by PECAT are greater than or equal to 99.4% on all datasets, while those of the reads corrected by other methods are all less than or equal to 94.8%. The haplotype-specific k-mer completeness of reads corrected by PECAT is also higher than that of reads corrected by other methods, especially for the three Nanopore datasets. We plot the raw reads and corrected reads using haplotype-specific k-mers for the seven datasets. As shown in Fig. 2e and Supplementary Figs. 5–10, PECAT effectively avoids mixing heterozygous alleles from two haplotypes, and its corrected reads tend to contain only one type of haplotype-specific k-mer. In summary, the reads corrected by PECAT contain more haplotype-specific k-mers than those corrected by other methods.
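For concreteness, the consistency metric defined in the Fig. 2 caption, \(\sum \max ({k}_{p},{k}_{m})/\sum ({k}_{p}+{k}_{m})\), can be computed as in the sketch below. The input format (one pair of paternal and maternal haplotype-specific k-mer counts per read) and the function name `consistency` are assumptions for illustration; counting the k-mers themselves is outside the scope of this sketch.

```python
import math


def consistency(kmer_counts):
    """Compute sum(max(k_p, k_m)) / sum(k_p + k_m) over all reads.

    kmer_counts: iterable of (k_p, k_m) pairs, the numbers of paternal and
    maternal haplotype-specific k-mers found in each read.
    """
    counts = list(kmer_counts)
    total = sum(k_p + k_m for k_p, k_m in counts)
    if total == 0:
        return math.nan  # no haplotype-specific k-mers observed
    return sum(max(k_p, k_m) for k_p, k_m in counts) / total


# One purely paternal read, one mostly maternal read, one fully mixed read.
reads = [(10, 0), (1, 9), (5, 5)]
print(round(consistency(reads), 3))  # 0.8
```

A value of 1 means that every read carries k-mers from only one parent, while a value of 0.5 is reached when each read carries equal numbers of paternal and maternal k-mers.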

Performance of PECAT assembler

We also assess the performance of the PECAT assembler using four PacBio diploid datasets: S. cerevisiae (SK1×Y12), A. thaliana (Col-0×Cvi-0), D. melanogaster (ISO1×A4), B. taurus (Angus×Brahman), and three Nanopore diploid datasets: A. thaliana (Col-0×C24), B. taurus (Bison × Simmental) and HG002. We compare PECAT with four diploid genome assembly pipelines: Canu5+purge_dups39, FALCON-Unzip4, Flye7+HapDup14,40, and Shasta41 (Supplementary Note 2). We evaluate the assembled genomes with respect to assembly size, contiguity (contig NG50 and phase block NG50), and quality (BUSCO42, the base quality reported by pomoxis (https://github.com/nanoporetech/pomoxis) and merqury43, the ‘Intra-block switch error’ from merqury, and the Hamming error rate) (Methods).
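As a reminder of how the contiguity metric is obtained, the sketch below computes NG50 under its usual definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the reference genome size. The function name `ng50` and the choice to return 0 when the assembly covers less than half of the genome are conventions assumed here, not taken from the evaluation scripts used in this work.

```python
def ng50(lengths, genome_size):
    """Return the NG50 of a set of contig (or phase block) lengths.

    lengths: iterable of lengths in bp.
    genome_size: size of the reference genome in bp.
    """
    half = genome_size / 2
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return 0  # assembly covers less than half of the genome


contigs = [50, 40, 30, 20, 10]
print(ng50(contigs, 200))           # 30: 50 + 40 + 30 reaches 100
print(ng50(contigs, sum(contigs)))  # 40: N50, using the assembly size instead
```

Using the reference genome size rather than the assembly size makes the values comparable between assemblies of different total length.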

For the four PacBio CLR datasets, all pipelines output contigs in primary/alternate format. As shown in Table 1, the sizes of both the primary and the alternate contigs of all four assemblies are close to those of their corresponding reference genomes, except for the alternate contigs of the S. cerevisiae (SK1×Y12) genome assembled by Canu+Purge_dups. Compared to the other two pipelines, PECAT obtains the highest NG50 with 0.8/0.8, 14.3/7.8, 24.5/11.9, and 72.4/2.8 Mb and the highest ‘phase block NG50’ with 0.8/0.8, 12.1/7.4, 16.1/11.8, and 4.5/2.4 Mb for the four assemblies. As shown in Supplementary Figs. 11–14, the alternate contigs generated by all pipelines are haplotigs. Unlike the other two pipelines, most primary contigs generated by PECAT are also haplotigs, except those of B. taurus (Angus×Brahman). These results are consistent with the ‘Hamming error’ results, where the ‘Hamming error’ of the primary contigs of B. taurus (Angus×Brahman) is exceptionally high. All pipelines have similar ‘Quality’ and BUSCO scores, except that Canu+Purge_dups has low BUSCO scores on the alternate contigs of S. cerevisiae.

Table 1 Performance comparison of assembly on PacBio CLR datasets

For the three Nanopore datasets, Canu+Purge_dups and Shasta output contigs in primary/alternate format, Flye+HapDup outputs contigs in dual assembly format, and PECAT can output both formats. As shown in Table 2, the sizes of both sets of contigs of all three assemblies are close to those of their corresponding reference genomes, except for the A. thaliana (Col-0×C24) genomes assembled by Canu+Purge_dups. Compared to the other pipelines, PECAT obtains higher NG50 for the three genomes. The phase block NG50 reported by PECAT is at least 10 times higher than those reported by other pipelines for A. thaliana (Col-0×C24) and B. taurus (Bison×Simmental). In particular, PECAT reports an assembly with phase block NG50 of 79.6/86.1 Mb for B. taurus (Bison×Simmental), which exceeds the assembly with phase block N50 of 68.5/70.6 Mb reported by the trio-binning method using additional parental reads44. Meanwhile, for HG002, PECAT reports a higher phase block NG50 than Flye+HapDup, and both are at least 25 times higher than that reported by Shasta. PECAT reports smaller Intra-block switch errors and Hamming errors on A. thaliana (Col-0×C24) and B. taurus (Bison×Simmental). Most of the contigs reported by PECAT on A. thaliana (Col-0×C24) and B. taurus (Bison × Simmental) are haplotigs (Fig. 3a and Supplementary Fig. 15). For HG002, PECAT and Flye+HapDup report similar Intra-block switch errors, and both are much lower than those reported by Shasta. Although PECAT and Flye+HapDup report smaller Hamming errors than Shasta, their Hamming error rates are still high, which may be because some low heterozygosity regions of HG002 are not successfully phased. Moreover, most contigs reported by Flye+HapDup and PECAT on HG002 are not haplotigs (Supplementary Figs. 16,17).

Table 2 Performance comparison of assembly on Nanopore and PacBio HiFi datasets

Fig. 3: Performance comparison of assembly.

a Haplotype-specific k-mer blob plots of the B. taurus (Bison × Simmental) reference genome and assemblies by Flye+HapDup, Shasta, and PECAT. pri/alt or dual represents that the assembly is the primary/alternate format or the dual assembly format. Each blob corresponds to a contig. The coordinate of the blob gives the count of the parental specific k-mers in the contig, where k is 21. Blob size is proportional to contig length. b Precisions and recalls of small variants (SNP, INDEL) and structural variants (SV) in HG002 assemblies. c Genome fraction of the assemblies, which are evaluated by QUAST.

For the metrics ‘Quality’ and BUSCO score, Shasta reports the lowest scores on all three Nanopore datasets, while Flye+HapDup and PECAT report similar scores on A. thaliana (Col-0×C24), B. taurus (Bison×Simmental) and HG002, except that the BUSCO scores of the alternate contigs/the haplotype 2 contigs reported by PECAT are slightly lower than those reported by Flye+HapDup (91.0%/91.6% vs. 94.1%). The main reason for this difference is that PECAT places the contigs of X and Y chromosomes in the primary contigs or the haplotype 1 contigs while Flye+HapDup outputs two copies of contigs of X and Y chromosomes to two sets of assemblies at the same time. We remove the contigs of X and Y chromosomes in the assemblies of Flye+HapDup (dual) and PECAT (dual). The BUSCO scores of the assembly by Flye+HapDup (dual) are reduced from 94.2%/94.1% to 91.2%/91.1%, while the BUSCO scores of the assembly by PECAT (dual) are reduced from 94.6%/91.6% to 91.6%/91.5%, which are similar to those reported by Flye+HapDup (dual).

We also compare the HG002 genome assembled from Nanopore reads by PECAT with those assembled from HiFi reads by Hifiasm25,27. The long Nanopore reads help the assemblers to obtain longer phased blocks. PECAT reports much longer NG50 and phase block NG50 than those reported by Hifiasm using HiFi reads only, and even longer than those reported by Hifiasm using both HiFi reads and Hi-C reads. The Intra-block switch errors reported by PECAT are similar to those reported by Hifiasm. On the other hand, the high-quality HiFi reads allow Hifiasm to report higher base qualities, with better ‘Quality’ scores. Moreover, Hifiasm reports higher Hamming errors when using HiFi reads only. However, with the help of Hi-C data, Hifiasm reports a much smaller Hamming error, and almost all of its contigs are haplotigs (Supplementary Figs. 16,17). We then evaluate the small variants (SNP, INDEL) and the structural variants (SV) in the HG002 assemblies against the GIAB benchmark (Methods). As shown in Fig. 3b and Supplementary Table 8, the precisions and recalls of SNPs and SVs of genomes assembled by Flye+HapDup and PECAT using Nanopore reads are similar to those of genomes assembled by Hifiasm using HiFi reads. However, the precisions and recalls of INDELs of genomes assembled by Flye+HapDup and PECAT using Nanopore reads are much lower than those of genomes assembled by Hifiasm using HiFi reads, which is consistent with a previous finding14.

In the second round of assembly of PECAT, it reassembles the filtered string graph, which will help correct errors or collapsed regions in the first round of assembly. As shown in Fig. 3c and Supplementary Data 1, PECAT reports higher ‘Genome fraction’ values than those reported by other pipelines. We then map the HG002 reference genome and HG002 assemblies to GRCh38. As an example (Fig. 4), the region [130,005,418, 130,116,766] in chromosome 3 includes the gene EVA1CP6 and has a large INDEL with a length of about 43,365 bp between the parental reference genomes. Both PECAT and Hifiasm reconstruct the INDEL in their assemblies, but Flye+HapDup (dual) fails. This large INDEL cannot be preserved in the haplotype-collapsed contigs in the first round of assembly and reads from another haplotype may not be mapped onto the haplotype-collapsed contigs. Therefore, it is difficult for the polish-based assemblies, such as Flye+HapDup, to restore the INDEL in the second round of assembly, which may be the reason that Flye+Hapdup reports lower Genome fraction scores than those reported by PECAT (Fig. 3c and Supplementary Data 1) on three Nanopore datasets.

Fig. 4: Screenshot of HG002 reference, assembly and read alignment to GRCh38.

The screenshot shows the region [130,005,418, 130,116,766] in chromosome 3. Small INDELs in all alignments and mismatches in read alignments are not shown. Paternal reference and maternal reference are HG002 parental references. The assemblies by Flye, Flye+HapDup, and PECAT are from Nanopore reads. The assembly by Hifiasm is from HiFi reads. The assembly by Flye consists of haplotype-collapsed contigs. The other assemblies are in the dual assembly format. The two sets of contigs are labeled as haplotype 1 and haplotype 2.

We further compare the computational resources required by each pipeline (Supplementary Tables 9, 10). Canu on B. taurus (Angus×Brahman), B. taurus (Bison×Simmental), and HG002, and FALCON-Unzip on B. taurus (Angus×Brahman), are excluded from the comparison because their assemblies were not completed within three weeks. PECAT is at least 8.7 times faster than traditional correct-then-assemble pipelines, such as Canu+purge_dups and FALCON-Unzip, but it is slower than assemble-then-correct pipelines, such as Shasta and Flye+HapDup. The peaks of memory usage and disk space usage of PECAT are also recorded in Supplementary Table 10.

Performance on highly accurate long reads

With the development of sequencing technology, the accuracy of long reads has greatly improved recently24,45. We evaluate the performance of PECAT using Nanopore R10 sequencing (ultra-long), Nanopore R10 duplex sequencing and PacBio HiFi sequencing reads. The accuracy of the 40X longest reads for those datasets is 98.25%, 99.67%, and 99.77% (Supplementary Fig. 18 and Supplementary Table 11), respectively, which is much higher than the accuracy of the Nanopore R9 (ultra-long) dataset (94.97%). However, there are still less accurate reads in those datasets and PECAT error correction can still improve the qualities of those reads, except for a 0.1–0.2% decrease in the metric “Completeness” (Supplementary Table 11). The Nanopore R10 (ultra-long) dataset with higher accuracy and longer read length improves the assemblies of all pipelines (Table 2). Compared with the assemblies from the Nanopore R9 (ultra-long) dataset (Table 2), the assemblies from the Nanopore R10 (ultra-long) dataset have higher QV when evaluated by pomoxis and merqury. In particular, the assembly of PECAT from the Nanopore R10 (ultra-long) dataset has the highest phase block NG50, with 59.4/58.0 Mb in primary/alternate format and 63.8/59.2 Mb in dual format, which is twice that of the PECAT assembly from Nanopore R9 (ultra-long) data. Furthermore, the assemblies from the Nanopore R10 (ultra-long) dataset report a Hamming error that is an order of magnitude smaller, and more contigs in these assemblies are haplotigs (Supplementary Figs. 19, 20). Similar to the Nanopore R9 (ultra-long) dataset, the assembly of PECAT from the Nanopore R10 (ultra-long) dataset outperforms those of Flye+HapDup and Shasta in terms of NG50 and phase block NG50. Meanwhile, the assemblies for the Nanopore R10 duplex and PacBio HiFi datasets (Supplementary Table 12) have lower NG50 and phase block NG50 than those of the assembly from the Nanopore R9 (ultra-long) dataset (Table 2), although they have higher quality measures. Moreover, fewer contigs in the assemblies of Nanopore R10 duplex and PacBio HiFi reads are haplotigs (Supplementary Figs. 21–24). This result indicates that the length of reads is more important for obtaining contiguous haplotype-specific contigs.

We then evaluate the small variants (SNP, INDEL) and the structural variants (SV) in those assemblies using Nanopore R10 sequencing (ultra-long), Nanopore R10 duplex sequencing and PacBio HiFi sequencing reads against the GIAB benchmark (Supplementary Tables 13–15 and Supplementary Figs. 25–27). The assemblers report similar metrics. The precisions and recalls of SNP and SV of the assemblies are similar to those of the assemblies using Nanopore R9 (ultra-long) reads (Fig. 3b and Supplementary Table 8). However, the precisions and recalls of INDEL of genomes assembled from Nanopore R10 (ultra-long) reads and Nanopore R10 duplex reads are much greater than those of assemblies using R9 (ultra-long) reads, while still lower than those of assemblies using HiFi reads.

Discussion

Although long noisy reads, especially Nanopore reads, have the advantage of generating highly contiguous contigs, using them for diploid genome assembly remains a challenge. Due to the high sequencing error rate, directly phasing the raw reads and then assembling the diploid genome cannot produce a high-quality assembly46. Here, we first develop a haplotype-aware error correction method that keeps most of the heterozygote alleles while correcting sequencing errors. However, even after error correction, it is not possible to accurately call SNPs by just aligning corrected reads as Hifiasm does for HiFi reads. PECAT therefore first generates a haplotype-collapsed genome and calls SNPs by aligning the corrected reads to the haplotype-collapsed genome. Nevertheless, the accuracy of this SNP calling is still low for regions with a low heterozygosity rate. Therefore, we use Clair3 to call SNPs again for Nanopore reads. Our experiments show that the SNPs called from raw reads complement the SNPs called by aligning corrected reads. Combining both sets of SNPs helps achieve the best performance.

Another advantage of our haplotype-aware error-corrected reads is that they allow the overlaps built in the first round of assembly to be reused: we only simplify the overlaps by removing inconsistent ones before reassembling the diploid genome. This strategy is like that used in Hifiasm. Compared to methods that directly separate the collapsed contigs into two haplotypes, such as Flye+HapDup, this strategy can correct errors or collapsed regions in the first round of assembly, and thus achieve a better assembly.

Furthermore, PECAT does not simply phase the reads into two copies, but uses the read grouping method to separate reads from nearly identical repeats into multiple copies, which can help solve the repeats with more than two copies. Meanwhile, in order to obtain high-quality phased assembly, PECAT needs higher coverage of data. As shown in Supplementary Table 16, the contiguity and qualities of the assembly of HG002 using 37X Nanopore reads are less than those of assembly using 59X Nanopore reads.

Although the new generation of long reads from Nanopore and PacBio has much higher accuracy, PECAT error correction can still improve their quality. PECAT can also efficiently assemble those highly accurate reads. It can leverage the advantages of read length and accuracy to obtain better assemblies. As a result, PECAT achieves a phase block NG50 of 59.4/58.0 Mb in primary/alternate format and 63.8/59.2 Mb in dual format using only Nanopore R10 (ultra-long) reads. However, our current error correction method may not be effective enough to distinguish small errors from heterozygote alleles in regions with a very low heterozygosity rate, even in highly accurate reads (Nanopore duplex and PacBio HiFi reads). To remain compatible with long noisy reads, PECAT does not take full advantage of the high accuracy of reads as it does of read length; therefore, it achieves mediocre performance on Nanopore duplex and PacBio HiFi reads (Supplementary Table 12 and Supplementary Figs. 21–24). We will resolve this issue in subsequent PECAT releases. Overall, PECAT is an efficient assembly pipeline for diploid genomes.

Methods

Diploid datasets for benchmarking

We evaluate the performance of PECAT using seven diploid datasets (Supplementary Note 1 and Supplementary Table 17): S. cerevisiae (SK1×Y12), A. thaliana (Col-0×Cvi-0), D. melanogaster (ISO1×A4), B. taurus (Angus×Brahman), A. thaliana (Col-0×C24), B. taurus (Bison×Simmental) and HG002. The first four datasets are PacBio CLR datasets and the last three are Nanopore datasets (the last two are ultra-long raw reads and the HG002 dataset is generated by Nanopore R9). The heterozygosity rates of those species are 0.85%, 1.04%, 0.84%, 1.12%, 0.83%, 1.48%, and 0.34%, respectively (Supplementary Table 1 and Supplementary Note 3). The accuracies of reads in those datasets are 87.80%, 88.24%, 89.58%, 86.25%, 92.33%, 89.12%, and 94.97%, respectively (Supplementary Table 6). In addition, we evaluate the performance of PECAT on more accurate datasets, i.e., three other HG002 datasets generated using Nanopore R10 sequencing (ultra-long), Nanopore R10 duplex sequencing and PacBio HiFi sequencing. Their accuracies are 98.25%, 99.67%, and 99.77%, respectively (Supplementary Table 11). For each of the above datasets, PECAT and NECAT extract the longest 80X raw reads, or all reads if the dataset is less than 80X, for error correction and assembly. Canu corrects the longest 100X raw reads for assembly. Other tools use all raw reads for error correction or assembly (Supplementary Note 2).

Haplotype-aware error correction

The PECAT error correction method is based on the partial-order alignment (POA) graph method4,47. Instead of assigning the same weight to each read for error correction, we select reads that are more likely to come from the same haplotype or the same copy of a nearly identical repeat and assign different weights to prevent heterozygotes in reads from being eliminated as sequencing errors. First, we find candidate overlaps between raw reads using minimap2 with parameters “-x ava-pb” for PacBio CLR reads and parameters “-x ava-ont” for Nanopore reads. For each template read to be corrected, we collect a group of supporting reads that have candidate overlaps with the template read. Then, we perform pairwise alignment between the template read and each supporting read using the diff36 or edlib48 algorithm and build a POA graph based on the alignments of the reads as shown in Supplementary Fig. 1. Each node in the graph is labeled by a triple \((c,r,b)\), corresponding to a base pair in the alignment. \(c\) is the location in the template read, \(r\) is the number of consecutive insertions (if \(r=0\), there is a match, a mismatch, or a deletion), and \(b\) is the base on the supporting read or a deletion, which is one of \(\{\text{'A'},\text{'C'},\text{'G'},\text{'T'},\text{'-'}\}\). Each edge in the graph means that the two base pairs of its nodes appear consecutively in the alignment. The support count of each edge is defined as the number of alignments that pass the edge. For the convenience of analysis, we add a trivial alignment between the template read and itself to the graph. According to our observations, for each location \(c\), every path in the graph must pass exactly one of the nodes \(\{(c,0,b_{i}) \mid b_{i}\in \{\text{'A'},\text{'C'},\text{'G'},\text{'T'},\text{'-'}\}\}\). If these nodes have more than one in-edge with large support counts, there may be heterozygotes rather than random sequencing errors (Supplementary Fig. 1b). Therefore, we compute the support count \(s_{i}\) for the in-edges of the nodes. We mark an in-edge as an important one if \(s_{i}\ge \left\{\begin{array}{ll}r_{l}\cdot S & S\le S_{l}\\ \frac{r_{h}\cdot S_{h}-r_{l}\cdot S_{l}}{S_{h}-S_{l}}\cdot \left(S-S_{l}\right)+r_{l}\cdot S_{l} & S_{l} < S\le S_{h}\\ r_{h}\cdot S & S_{h} < S\end{array}\right.\), where \(S=\sum s_{i}\), and \(r_{l}\), \(r_{h}\), \(S_{l}\), and \(S_{h}\) are user-set parameters (the default values are 0.5, 0.2, 10, and 200). We mark a location \(c\) as an important one if (1) there is no homopolymer at location \(c\), (2) there are two or more important in-edges and a bubble structure that can be detected along the reverse direction of the in-edges, (3) more than half of the reads through the important in-edge pass the corresponding path in the bubble, and (4) there is no INDEL variant between the sequences represented by the paths in the bubble. Then, we calculate a score for each supporting read based on whether the supporting read and the template read pass the same important in-edges at important locations. We increase the score by 1 if a supporting read passes the same important in-edge as the template read and decrease it by 1 if the supporting read and the template read pass different important in-edges (Supplementary Fig. 1c). For uniformity, the score is divided by the number of important locations in the supporting read. As shown in Supplementary Fig. 1d, the histogram of scores shows two or more peaks if the supporting reads come from different haplotypes or different copies of the segmental duplication. We select the reads whose scores fall into the first peak for error correction. If there is only one peak, we select the half of the supporting reads with larger scores. We linearly map the scores of selected reads to a range, which is \([0.4,0.8]\) by default, as the final weights of supporting reads. The weight of each edge is the sum of the weights of the selected supporting reads that pass the edge.
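The thresholding, scoring, and weighting rules above can be sketched in a few lines of Python. This is a minimal illustration rather than PECAT's implementation: the function names, the dictionary representation of a read (important location mapped to the in-edge it passes), the choice to give higher scores higher weights, and the handling of identical scores are all assumptions made for the example.

```python
def important_edge_threshold(S, r_l=0.5, r_h=0.2, S_l=10, S_h=200):
    """Minimum support count s_i for an in-edge to be marked important,
    given the total support S at the location (piecewise-linear rule)."""
    if S <= S_l:
        return r_l * S
    if S <= S_h:
        slope = (r_h * S_h - r_l * S_l) / (S_h - S_l)
        return slope * (S - S_l) + r_l * S_l
    return r_h * S


def read_score(template_edges, supporting_edges):
    """Agreement score of a supporting read against the template read.

    Both arguments map an important location to the important in-edge the
    read passes there (hypothetical representation). +1 for the same
    in-edge, -1 for a different one, normalized by the number of important
    locations in the supporting read.
    """
    if not supporting_edges:
        return 0.0
    raw = 0
    for loc, edge in supporting_edges.items():
        if loc in template_edges:
            raw += 1 if edge == template_edges[loc] else -1
    return raw / len(supporting_edges)


def map_weights(scores, lo=0.4, hi=0.8):
    """Linearly map the scores of the selected reads onto [lo, hi].
    Assumption: higher scores get higher weights; equal scores all get hi."""
    s_min, s_max = min(scores), max(scores)
    if s_max == s_min:
        return [hi for _ in scores]
    return [lo + (s - s_min) * (hi - lo) / (s_max - s_min) for s in scores]
```

With the default parameters, a location with total support 200 requires an in-edge support of 40, and the required fraction falls from 0.5 at low coverage to 0.2 at high coverage.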

Finally, we find the path in the POA graph to generate a consensus for the template read. For each node \(v\) labeled by the triple \((c,r,b)\), if it has \(N\) in-edges \(\{({u}_{i},v) \mid i\le N\}\), it gets the score \({S}_{v}=\mathop{\max }\limits_{i\le N}\{{S}_{{u}_{i}}+{W}_{\left({u}_{i},v\right)}-{P}_{c}\}\), where \({W}_{\left({u}_{i},v\right)}\) is the weight of the edge \(\left({u}_{i},v\right)\). In the previous work47, \({P}_{c}\) is half of the coverage at the location \(c\) in the template read. In our work, \({P}_{c}=\max \left({0.4}^{r},0.3\right)\cdot {W}_{c}\) instead, where \({W}_{c}\) is the sum of the weights of the supporting reads that pass the location \(c\). The score of a node without any in-edge is set to 0. We calculate the scores for all nodes by dynamic programming in topological order and record, for each node, the edge that gives the maximum score. The node with the highest score is selected and backtracking is done to obtain the path for the consensus. The consensus sequence for the template read is generated by concatenating the bases of the nodes in the path.
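The dynamic programming and backtracking described above can be sketched as follows. This is an illustrative sketch, not PECAT's code: the graph representation, the function names, and the decision to skip deletion nodes ('-') when concatenating bases are assumptions. The penalty \(P_c\) is passed in as a function of \((c, r)\), so either the half-coverage rule of the previous work or the weighted rule given in the text can be supplied.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def pecat_penalty(W_c):
    """Penalty rule from the text: P_c = max(0.4**r, 0.3) * W_c[c]."""
    return lambda c, r: max(0.4 ** r, 0.3) * W_c[c]


def consensus_path(nodes, edges, penalty):
    """nodes: list of (c, r, b) triples; edges: dict {(u, v): weight};
    penalty: function (c, r) -> P_c. Returns the consensus string."""
    preds = {v: [] for v in nodes}
    for (u, v), w in edges.items():
        preds[v].append((u, w))
    # TopologicalSorter expects node -> predecessors.
    order = TopologicalSorter(
        {v: [u for u, _ in preds[v]] for v in nodes}
    ).static_order()

    score, back = {}, {}
    for v in order:
        score[v], back[v] = 0.0, None  # nodes without in-edges score 0
        if preds[v]:
            u, w = max(preds[v], key=lambda uw: score[uw[0]] + uw[1])
            score[v] = score[u] + w - penalty(v[0], v[1])
            back[v] = u

    # Backtrack from the highest-scoring node.
    node = max(score, key=score.get)
    path = []
    while node is not None:
        path.append(node)
        node = back[node]
    path.reverse()
    # Assumption: deletion nodes contribute no base to the consensus.
    return "".join(b for _, _, b in path if b != "-")
```

For example, with nodes `(0,0,'A')`, `(1,0,'C')`, `(1,0,'G')`, `(2,0,'T')`, edge weights 3 on the A-C-T route and 1 on the A-G-T route, and the half-coverage penalty at coverage 4, the highest-scoring path runs through 'C'.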

Read-level SNP caller and read grouping method for identifying inconsistent overlaps

Calling heterozygous SNPs in haplotype-collapsed contigs and SNP alleles in reads

We map corrected reads to the first round of assembly using minimap2 with parameters “-x map-pb -c -p 0.5 -r 1000” for PacBio CLR reads and parameters “-x map-ont -c -p 0.5 -r 1000” for Nanopore reads. It performs base-level alignment and generates CIGAR strings. We scan the CIGAR strings and call heterozygous SNPs for each contig. We call a base site a heterozygous SNP site if it meets the following two conditions. (1) Its coverage is in the range [10, 1000]. (2) The count of the second-most common base is greater than or equal to \(\left\{\begin{array}{ll}{r}_{l}\cdot c & c\le {c}_{l}\\ \frac{{r}_{h}\cdot {c}_{h}-{r}_{l}\cdot {c}_{l}}{{c}_{h}-{c}_{l}}\cdot \left(c-{c}_{l}\right)+{r}_{l}\cdot {c}_{l} & {c}_{l} < c\le {c}_{h}\\ {r}_{h}\cdot c & {c}_{h} < c\end{array}\right.\), in which \(c\) is the site coverage and \({r}_{l}\), \({r}_{h}\), \({c}_{l}\), and \({c}_{h}\) are user-set parameters (default values are 0.4, 0.2, 10 and 100). After calling heterozygous SNPs, we call SNP alleles in reads. We define a function \({H}_{r}(s)\) for each read \(r\). For any SNP site \(s\), \({H}_{r}(s)\) is defined as 1 or 2 if the read \(r\) covers the site \(s\) and the base of the read \(r\) is equal to the first-most or second-most common base at site \(s\), respectively. Otherwise, \({H}_{r}(s)\) is defined as 0.
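The two calling conditions and the allele function \(H_r(s)\) can be illustrated with a short Python sketch. It assumes the bases observed at a site have already been extracted from the CIGAR strings into a list; the function names and that input format are hypothetical, and coverage is taken here simply as the number of observed bases.

```python
from collections import Counter


def snp_threshold(c, r_l=0.4, r_h=0.2, c_l=10, c_h=100):
    """Minimum count of the second-most common base at coverage c."""
    if c <= c_l:
        return r_l * c
    if c <= c_h:
        slope = (r_h * c_h - r_l * c_l) / (c_h - c_l)
        return slope * (c - c_l) + r_l * c_l
    return r_h * c


def call_site(bases, min_cov=10, max_cov=1000):
    """Return (first, second) most common bases if the site is called as a
    heterozygous SNP, else None. `bases` lists the base of every read
    covering the site (assumed input format)."""
    cov = len(bases)
    if not (min_cov <= cov <= max_cov):
        return None
    common = Counter(bases).most_common(2)
    if len(common) < 2:
        return None
    (first, _), (second, n_second) = common
    return (first, second) if n_second >= snp_threshold(cov) else None


def allele_code(base, site_alleles):
    """H_r(s): 1 for the most common base, 2 for the second-most common
    base, 0 otherwise (including reads that do not cover the site)."""
    first, second = site_alleles
    if base == first:
        return 1
    if base == second:
        return 2
    return 0
```

At coverage 20 with default parameters the threshold is about 5.8, so a 12/8 split is called heterozygous while an 18/2 split is not.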

Verifying and correcting SNP alleles

To verify and correct SNP alleles in a read, we need to find which reads are from the same haplotype within its local region. For each read labeled as the template read \(t\), we assume that it covers a set of SNP sites \(S=\left\{{s}_{1},{s}_{2},\ldots,{s}_{N}\right\}\). We collect the query reads that cover 3 common SNP sites with the template read \(t\). The template read and the query reads are put into a group \(G\). We cluster the reads in the group \(G\) according to the SNP alleles in them. The reads in the same cluster can be considered to be from the same haplotype. To facilitate the description of the method, we make some definitions here. We define a centroid \(C=\{\left({v}_{s},{n}_{s}\right) \mid s\in S\}\) for the group \(G\) at the SNP site set \(S\). \({v}_{s}\) is defined as \({v}_{s}^{+}-{v}_{s}^{-}\), where \({v}_{s}^{+}\) is the size of the read set \(\{r \mid {H}_{r}\left(s\right)=1, r\in G\}\) and \({v}_{s}^{-}\) is the size of the read set \(\{r \mid {H}_{r}\left(s\right)=2, r\in G\}\). \({n}_{s}\) is defined as \({v}_{s}^{+}+{v}_{s}^{-}\). A single read can be regarded as a group that includes only itself. We define the following three formulas to get the verified SNP sites of the centroid \(C\).

$$\begin{array}{c}{V}_{ > }\left(C,{p}_{0},{p}_{1}\right)=\left\{s \mid {v}_{s} > 0,\ \left|{v}_{s}\right| \ge \max ({p}_{0}\cdot {n}_{s},{p}_{1}),\ ({v}_{s},{n}_{s})\in C\right\}\\ {V}_{ < }\left(C,{p}_{0},{p}_{1}\right)=\left\{s \mid {v}_{s} < 0,\ \left|{v}_{s}\right| \ge \max ({p}_{0}\cdot {n}_{s},{p}_{1}),\ ({v}_{s},{n}_{s})\in C\right\}\\ {V}_{\ne }\left(C,{p}_{0},{p}_{1}\right)={V}_{ > }\left(C,{p}_{0},{p}_{1}\right)\cup {V}_{ < }\left(C,{p}_{0},{p}_{1}\right)\end{array}$$

The parameters \({p}_{0}\) and \({p}_{1}\) are used to control which sites are valid. Next, we define the operations between two centroids.

$$\begin{array}{c}A\left({C}_{1},{C}_{2},{p}_{0},{p}_{1}\right)={V}_{\ne }({C}_{1},{p}_{0},{p}_{1})\cap {V}_{\ne }({C}_{2},{p}_{0},{p}_{1})\\ S\left({C}_{1},{C}_{2},{p}_{0},{p}_{1}\right)=\left({V}_{ < }({C}_{1},{p}_{0},{p}_{1})\cap {V}_{ < }({C}_{2},{p}_{0},{p}_{1})\right)\cup \left({V}_{ > }({C}_{1},{p}_{0},{p}_{1})\cap {V}_{ > }({C}_{2},{p}_{0},{p}_{1})\right)\\ D\left({C}_{1},{C}_{2},{p}_{0},{p}_{1}\right)=\left({V}_{ < }({C}_{1},{p}_{0},{p}_{1})\cap {V}_{ > }({C}_{2},{p}_{0},{p}_{1})\right)\cup \left({V}_{ > }({C}_{1},{p}_{0},{p}_{1})\cap {V}_{ < }({C}_{2},{p}_{0},{p}_{1})\right)\end{array}$$

\(A\left({C}_{1},{C}_{2},{p}_{0},{p}_{1}\right)\) is the set of common verified SNP sites of the centroids \({C}_{1}\) and \({C}_{2}\). \(S\left({C}_{1},{C}_{2},{p}_{0},{p}_{1}\right)\) and \(D\left({C}_{1},{C}_{2},{p}_{0},{p}_{1}\right)\) are used to describe the similarity and the distance of the centroids \({C}_{1}\) and \({C}_{2}\). A read \(r\) can also be used as the parameter, which means the centroid of \(\{r\}\).
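These definitions translate directly into set operations. The sketch below is an illustration under assumed data structures, not PECAT's implementation: each read is a dictionary mapping an SNP site to \(H_r(s)\), and a centroid is a dictionary mapping a site to the pair \((v_s, n_s)\). The uppercase function names simply mirror the symbols in the formulas.

```python
def centroid(reads, sites):
    """Centroid of a group: site -> (v_s, n_s), where v_s = v+ - v- and
    n_s = v+ + v-. Each read is a dict site -> H_r(s) in {0, 1, 2}."""
    C = {}
    for s in sites:
        plus = sum(1 for r in reads if r.get(s, 0) == 1)
        minus = sum(1 for r in reads if r.get(s, 0) == 2)
        C[s] = (plus - minus, plus + minus)
    return C


def V_gt(C, p0, p1):
    """Verified sites where the first allele dominates."""
    return {s for s, (v, n) in C.items() if v > 0 and abs(v) >= max(p0 * n, p1)}


def V_lt(C, p0, p1):
    """Verified sites where the second allele dominates."""
    return {s for s, (v, n) in C.items() if v < 0 and abs(v) >= max(p0 * n, p1)}


def V_ne(C, p0, p1):
    return V_gt(C, p0, p1) | V_lt(C, p0, p1)


def A(C1, C2, p0, p1):
    """Common verified sites of two centroids."""
    return V_ne(C1, p0, p1) & V_ne(C2, p0, p1)


def S(C1, C2, p0, p1):
    """Sites where both centroids support the same allele (similarity)."""
    return (V_lt(C1, p0, p1) & V_lt(C2, p0, p1)) | (V_gt(C1, p0, p1) & V_gt(C2, p0, p1))


def D(C1, C2, p0, p1):
    """Sites where the centroids support different alleles (distance)."""
    return (V_lt(C1, p0, p1) & V_gt(C2, p0, p1)) | (V_gt(C1, p0, p1) & V_lt(C2, p0, p1))
```

A single read is handled by calling `centroid([r], sites)`, matching the convention that a read can stand in for the centroid of \(\{r\}\).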

For each template read \(t\) and its related group \(G\), we use a divide-then-combine strategy to cluster reads (Supplementary Fig. 28). In the dividing step, we use a modified bisecting k-means algorithm49 to divide the group \(G\) into small ones with the following steps:

1. The centroids \({C}_{1}\) and \({C}_{2}\) of the sets \(\{{r}_{1}\}\) and \(\{{r}_{2}\}\) are initially selected, where reads \({r}_{1}\) and \({r}_{2}\) are the read pair in the group \(G\) with the largest distance (\(({r}_{1},{r}_{2})=\arg \max_{{r}_{1},{r}_{2}\in G}\left|D({r}_{1},{r}_{2},0,0)\right|\)).

2. We divide the group \(G\) into three subgroups: two subgroups corresponding to the centroids and a separate subgroup containing the reads far away from the centroids. A read \(r\) is considered far away from the centroids, and is assigned to the separate subgroup, if it meets the conditions \(\frac{\left|A\left(r,{C}_{i},0,0\right)\right|}{\left|{V}_{\ne }(r,0,0)\right|} < {p}_{2}\) or \(\frac{\left|D(r,{C}_{i},0,0)\right|}{\left|A(r,{C}_{i},0,0)\right|} > {p}_{3}\), \(i\in \{1,2\}\), where \({p}_{2}\) and \({p}_{3}\) are user-set parameters (default values are 0.3 and 0.5). Otherwise, it is assigned to the nearest subgroup \(i=\arg \min_{i\in \{1,2\}}\frac{\left|D\left(r,{C}_{i},0,0\right)\right|}{\left|A\left(r,{C}_{i},0,0\right)\right|}\). The purpose of the separate subgroup is to prevent the centroids from changing dramatically in each iteration.

3. We calculate the centroids for the first two subgroups and repeat step 2 until the three subgroups don’t change.

After the group is divided into three subgroups, we repeat the above steps to continue dividing the subgroups. If the subgroup \(G\) and its centroid \(C\) meet the following conditions: \(\left|G\right|\le 3\) or \(\frac{{\sum }_{r\in G}\left|D\left(r,C,0,0\right)\right|}{{\sum }_{r\in G}\left|A\left(r,C,0,0\right)\right|}\le {p}_{4}\) and \({\sum }_{r\in G}\left|D\left(r,C,0,0\right)\right|\le {p}_{5}\), it is no longer divided into small subgroups. \({p}_{4}\) and \({p}_{5}\) are user-set parameters (the default values are 0.02 and 4).

After dividing the groups, we get a set of subgroups \({SG}\). We combine a pair of subgroups into a bigger one if their centroids are close to each other. First, we create an empty list \(L\) and add the subgroup that contains the template read to the list as the first element \({L}_{1}\). Then, we combine another subgroup \(g\) into \({L}_{1}\) if it meets the conditions \(\left|D\left(g,{L}_{1},{p}_{6},{p}_{7}\right)\right| < \min ({p}_{8},\left|A\left(g,{L}_{1},{p}_{6},{p}_{7}\right)\right|\cdot {p}_{9})\) and \(\left|D\left(g,{L}_{1},{p}_{6},{p}_{7}/2\right)\right| < \left|A\left(g,{L}_{1},{p}_{6},{p}_{7}/2\right)\right|\cdot {p}_{9}\), where \({p}_{6}\), \({p}_{7}\), \({p}_{8}\) and \({p}_{9}\) are user-set parameters (the default values are 0.66, 6, 4 and 0.2). The remaining subgroups are put into another list \({L}^{\prime}\) and sorted in ascending order of their distance to \({L}_{1}\) (\(\left|D({L}_{1},{L}_{i}^{\prime},{p}_{6},{p}_{7})\right|\)). For each subgroup \(i\) in the list \({L}^{\prime}\), we combine it into the subgroup \(j\) in the list \(L\) if it meets the following conditions: (1) \(j=\arg \max_{j\in L}\left|S\left(i,j,{p}_{6},0\right)\right|\); (2) \(\left|D\left(i,j,{p}_{6},{p}_{7}\right)\right| < \max ({p}_{8},\left|A\left(i,j,{p}_{6},{p}_{7}\right)\right|\cdot {p}_{9})\); (3) \(\left|D\left(i,j,{p}_{6},0\right)\right| < \left|A\left(i,j,{p}_{6},0\right)\right|\cdot {p}_{9}\). Otherwise, the subgroup \(i\) is added to the end of the list \(L\).

After combining the subgroups, we think the reads in the same subgroup in the list \(L\) are from the same haplotype or the same copy of the repeat. The SNP alleles in the read can be verified by the centroid of its subgroup, which helps to find inconsistent overlaps more accurately. In addition, since the group \({L}_{1}\) contains the template read \(t\), we use the centroid of the group \({L}_{1}\) to correct SNP alleles in the template read \(t\). Here, the read is corrected when it is treated as a template read. After all reads are corrected, we run the verifying and correcting SNP alleles step again to obtain more robust results. We run the step twice by default.

Finding inconsistent overlaps

In the last round of verifying and correcting SNP alleles, we identify inconsistent overlaps. After combining steps, for each template read \(t\), we obtain a subgroup list \(L\) and the first subgroup \({L}_{1}\) of \(L\) contains the template read \(t\). We consider a query read \(r\) and the template read \(t\) to be inconsistent if the query read \(r\) and its subgroup \({L}_{r}\) (\(r\in {L}_{r}\)) meet the conditions:

$$\left|D\left({L}_{r},{L}_{1},{p}_{6},{p}_{7}\right)\right|\ge \max \left({p}_{8},\left|A\left({L}_{r},{L}_{1},{p}_{6},{p}_{7}\right)\right|\cdot {p}_{9}\right); \tag{1}$$
$$\left|D\left(r,{L}_{1},0,0\right)\cap A\left({L}_{r},{L}_{1},{p}_{6},{p}_{7}\right)\right|\ge \max \left({p}_{8},\left|A\left(r,{L}_{1},0,0\right)\cap A\left({L}_{r},{L}_{1},{p}_{6},{p}_{7}\right)\right|\cdot {p}_{9}\right). \tag{2}$$

The parameters \({p}_{6}\), \({p}_{7}\), \({p}_{8}\) and \({p}_{9}\) are the same as those mentioned above. We record the location and direction of the reads in the haplotype-collapsed contigs. The overlaps between the inconsistent read pairs are identified as inconsistent overlaps if the directions and distances of the reads in the overlaps do not conflict with those in the haplotype-collapsed contigs (by default, the difference between the two distances should be less than 1000). The distance of the reads is defined as the distance between the 3’ ends of the reads. We record the inconsistent overlap information and SNP alleles in reads for subsequent steps.

Combining Nanopore raw reads to identify inconsistent overlaps

For Nanopore datasets, we combine the corrected reads and raw reads to identify inconsistent overlaps. We first identify inconsistent overlaps using corrected reads with the strict threshold that the pair of reads should contain 8 different SNP alleles. Then, we use raw reads to identify inconsistent overlaps with the loose threshold that the pair of reads should contain 6 different SNP alleles. The procedure for raw reads is similar to that for corrected reads, except that we use Clair315 to call heterozygous SNPs for each contig. Since PECAT doesn’t change the names of the reads during error correction, inconsistent overlaps between raw reads can be regarded as inconsistent overlaps between corrected reads. As with corrected reads, we also check whether the directions and distances of the reads in the overlaps conflict with those in the haplotype-collapsed contigs. We record the location of corrected reads in raw reads during error correction. Therefore, we can calculate the distance of corrected reads in the haplotype-collapsed contigs by linear mapping. The distance threshold is set to \(\max (2500,0.05\cdot {D}_{c})\) by default, where \({D}_{c}\) is the distance of corrected reads in the haplotype-collapsed contigs.

Fast string-graph-based assembler

According to the characteristics of k-mer-based alignment and string graphs, we propose a fast string-graph-based assembler to balance the quality and speed of assembling. First, we use minimap2 with parameters “-X -g3000 -w30 -k19 -m100 -r500” to find candidate overlaps between corrected reads. Minimap235 with those parameters invokes the k-mer-based alignment. To reduce overhangs of overlaps, we extend the alignment to the ends of the reads with the diff36 algorithm and filter out overlaps that still have long overhangs. Next, we remove the overlaps whose reads are contained in other reads or have low coverage. A directed string graph is constructed using the remaining overlaps. Myers’ algorithm33 is used to mark transitive edges as inactive ones. We implement local alignment using the edlib48 algorithm to calculate the identity of each active edge, which is defined as the identity of its related overlap. Identities need to be calculated for only a few edges because most of the edges are marked as inactive ones. The edges whose identities are less than the threshold are marked as low quality and removed. The threshold is determined by the formula \((({m}_{1}-6\cdot 1.253\cdot {{MAD}}_{1})+2\cdot ({m}_{2}-6\cdot 1.4826\cdot {{MAD}}_{2}))/3\), where \({m}_{1}\) and \({m}_{2}\) are the mean and the median of the identities of all active edges in the string graph and \({{MAD}}_{1}\) and \({{MAD}}_{2}\) are the mean absolute deviation and the median absolute deviation of them. This step also breaks some paths connected by low-quality edges in the graph. To repair those paths, we check dead-end nodes whose outdegree or indegree is equal to 0. We calculate the identities of their transitive edges and reactivate the edge with the longest alignment and an identity greater than the threshold. Considering that some breaks in paths are caused by reads being contained in other reads from a different haplotype or a different copy of the repeat, we extend the dead-end nodes with contained reads to repair the breaks using a similar method. In this way, an appropriate string graph is constructed. After performing other simplifying processes, such as removing the ambiguous edges (tips, bubbles, and spurious links) in the graph, we identify linear paths from the graph and generate contigs.
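The identity threshold formula above can be written out as a small Python function. This is a sketch under one stated assumption: the mean absolute deviation is taken around the mean and the median absolute deviation around the median, which the text does not spell out. The function name is hypothetical.

```python
from statistics import mean, median


def identity_threshold(identities):
    """Low-quality edge threshold from the identities of active edges:
    ((m1 - 6*1.253*MAD1) + 2*(m2 - 6*1.4826*MAD2)) / 3."""
    m1 = mean(identities)
    m2 = median(identities)
    # Assumption: MAD1 is measured around the mean, MAD2 around the median.
    mad1 = mean([abs(x - m1) for x in identities])
    mad2 = median([abs(x - m2) for x in identities])
    return ((m1 - 6 * 1.253 * mad1) + 2 * (m2 - 6 * 1.4826 * mad2)) / 3
```

The constants 1.253 and 1.4826 are the usual factors that scale the mean and median absolute deviations to a standard deviation under a normal distribution, so each bracketed term is roughly "center minus six standard deviations", with the median-based term weighted twice.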

Improvement of best overlap graph algorithm

The best overlap graph algorithm50 only retains the best out-edge and the best in-edge of each node according to the overlap length. After removing transitive edges using Myers’ algorithm33, PECAT performs the best overlap graph algorithm to further simplify the string graph. However, the original algorithm is not suitable for diploid assembly. The SNP alleles in reads are more important than the overlap length for deciding which edge is the best one. Although most inconsistent overlaps are removed, there remain some undetected inconsistent overlaps whose corresponding reads contain different SNP alleles. We improve the algorithm and use the following steps to determine the best edges. First, for the in-edges or out-edges of each node, we sort the edges in descending order of the edge score \({s}_{i}=({n}_{i}^{+}\cdot w-{n}_{i}^{-},{l}_{i})\). \({n}_{i}^{+}\) and \({n}_{i}^{-}\) are the numbers of SNP alleles that are the same or different between the two related reads, which are obtained in the step of finding inconsistent overlaps. \(w\) is a user-defined parameter. Its default value is set to 0.5. \({l}_{i}\) is the overlap length. The first edge \({e}_{0}\) is marked as a candidate best edge. The other edges meeting the following conditions are also marked as candidate best edges: (1) Its related read is inconsistent with the reads related to the candidate best edges. (2) Its score is not too much less than that of the first edge \({e}_{0}\), which means \({s}_{0}\left[0\right]-{s}_{i}\left[0\right] < \left\{\begin{array}{ll}\max \left(C,{s}_{i}\left[0\right]\cdot {R}_{1}\right), & {s}_{i}\left[0\right]\ge 0\\ \max \left(C,-{s}_{i}\left[0\right]\cdot {R}_{2}\right), & {s}_{i}\left[0\right] < 0\end{array}\right.\), where \(C\), \({R}_{1}\) and \({R}_{2}\) are user-defined parameters (the default values are 4, 2, and 0.66). Next, if an edge is marked as a candidate best edge twice (as an in-edge and as an out-edge), it is selected as a best edge. Then, if a node has no best in-edge or best out-edge, its first in-edge or first out-edge is selected as the best one. Finally, the edges selected as the best edges are retained and the other edges are removed from the string graph.
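The candidate selection for one side of a node (its in-edges or its out-edges) can be sketched as below. This is an illustrative reading of the rules, not PECAT's code: the edge dictionaries, the `inconsistent` predicate, and the interpretation that a new candidate must be inconsistent with every candidate chosen so far are assumptions. Python compares tuples lexicographically, so sorting by the score tuple ranks edges by the SNP term first and by overlap length second.

```python
def candidate_best_edges(edges, inconsistent, w=0.5, C=4, R1=2, R2=0.66):
    """edges: list of dicts with keys 'read', 'n_same', 'n_diff', 'length'
    (hypothetical layout). inconsistent(a, b) -> bool tells whether two
    reads were found inconsistent. Returns the candidate best edges."""

    def score(e):
        return (e["n_same"] * w - e["n_diff"], e["length"])

    ranked = sorted(edges, key=score, reverse=True)
    if not ranked:
        return []

    s0 = score(ranked[0])[0]
    candidates = [ranked[0]]  # the first edge e_0 is always a candidate
    for e in ranked[1:]:
        si = score(e)[0]
        slack = max(C, si * R1) if si >= 0 else max(C, -si * R2)
        close_enough = s0 - si < slack
        # Assumption: inconsistent with all candidates selected so far.
        other_haplotype = all(inconsistent(e["read"], c["read"]) for c in candidates)
        if close_enough and other_haplotype:
            candidates.append(e)
    return candidates
```

An edge would then be kept as a best edge only if it appears in the candidate list computed from both of its end nodes, with the first-ranked edge used as a fallback when a node has none.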

Generating two sets of contigs

After simplifying the string graph, PECAT identifies three structures from the string graph, as shown in Supplementary Fig. 29a. First, we use FALCON-Unzip’s heuristic algorithm4 to identify bubble structures. Generally, the paths in a bubble structure are from different haplotypes. Then, we also identify alternate branches, which meet the following conditions: (1) There are only two branches. (2) The shorter branch is linear and does not contain other branches. (3) More than 30% of reads in the shorter branch are inconsistent with the reads in the other branch. We consider the two branches that meet these conditions to be from different haplotypes. We identify the paths in the string graph and do not break the paths if they encounter bubble structures or alternate branches. This trick can increase the continuity of primary contigs. The other paths in bubble structures or alternate branches are output as alternate contigs. We also check each pair of independent contigs. If more than 30% of reads in the shorter contig are inconsistent with the reads in the other contig, the shorter contig is labeled as an alternate contig. Any contig not labeled as an alternate contig is output as a primary contig.

To generate a dual assembly, PECAT connects the paths in two adjacent bubble structures. In the step of inconsistent overlap identification, we have called SNP alleles in each read. The information is used to determine the pair of paths in adjacent bubble structures that should be connected. As shown in Supplementary Fig. 29b, we build two read groups \({I}_{1}\) and \({I}_{2}\) for in-edges \(i{n}_{1}\) and \(i{n}_{2}\) respectively. If the reads only overlap with the reads in in-edges \(i{n}_{1}\) or \(i{n}_{2}\), they are assigned to \({I}_{1}\) or \({I}_{2}\), respectively. If the reads overlap with the reads in in-edges \(i{n}_{1}\) and \(i{n}_{2}\) at the same time, the reads are assigned to the group whose centroid the reads are closer to. The concepts of centroid and distance are defined in the previous section on verifying and correcting SNP alleles. In the same way, we get read groups \({O}_{1}\) or \({O}_{2}\) for out-edges \({{out}}_{1}\) or \({{out}}_{2}\). If the centroid of \({I}_{1}\) or \({I}_{2}\) does not have a common heterozygous site with the centroid of \({O}_{1}\) or \({O}_{2}\), we randomly connect the paths of bubble structures. Otherwise, we choose the paths to minimize the distance between their corresponding read group centroids.

Polishing two sets of contigs

After generating contigs in primary/alternate format or dual assembly format, we use corrected reads or raw reads to polish them. We map the reads to the contigs using minimap2. In the previous steps, we recorded which read pairs are inconsistent and which reads were used to construct each contig. If a read is inconsistent with the reads used to construct a contig, its alignments to that contig are removed. This filtering helps to reduce haplotype switch errors. After removing other low-quality alignments, we run racon51 with default parameters to polish the contigs.
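A minimal sketch of this alignment filter is given below, assuming the alignments are available as PAF lines (query name in column 1, target name in column 6). The function names and the in-memory inputs (`contig_reads`, `inconsistent_pairs`) are hypothetical; PECAT's own bookkeeping and file formats may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_partner_index(
    inconsistent_pairs: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Map each read to the set of reads it is inconsistent with."""
    partners: Dict[str, Set[str]] = defaultdict(set)
    for a, b in inconsistent_pairs:
        partners[a].add(b)
        partners[b].add(a)
    return partners


def filter_paf(
    paf_lines: Iterable[str],
    contig_reads: Dict[str, Set[str]],
    inconsistent_pairs: Iterable[Tuple[str, str]],
) -> List[str]:
    """Drop alignments of reads that are inconsistent with any read
    used to construct the target contig."""
    partners = build_partner_index(inconsistent_pairs)
    kept = []
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        read, contig = fields[0], fields[5]  # PAF: query name, target name
        if partners.get(read, set()) & contig_reads.get(contig, set()):
            continue  # read conflicts with the haplotype of this contig
        kept.append(line)
    return kept


paf = [
    "r1\t100\t0\t100\t+\tctg1\t1000\t0\t100\t90\t100\t60",
    "r2\t100\t0\t100\t+\tctg1\t1000\t200\t300\t95\t100\t60",
]
kept = filter_paf(paf, {"ctg1": {"r3"}}, [("r1", "r3")])
```

In this example `r1` is inconsistent with `r3`, which was used to build `ctg1`, so only the alignment of `r2` would be passed on to the polisher.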

For assemblies built from Nanopore reads, we use Medaka (https://github.com/nanoporetech/medaka) to further improve the assembly quality. The steps are similar to those for polishing with racon. The key step is to filter out alignments between reads and contigs that are inconsistent according to the inconsistent overlap information.

Evaluation

To evaluate the effectiveness of the correction methods fairly, we extract the longest 40X of the corrected reads and evaluate them. We map the reads to the reference genome using minimap2 with parameters “-c --eqx”, which generates CIGAR strings. We scan the CIGAR strings to calculate the accuracy of each corrected dataset. To evaluate the accuracy of the sequences in the difficult-to-map regions and low-complexity regions of the HG002 reference genome, we map the HG002 reference genome to GRCh38. The regions are located in the HG002 reference genome according to the GIAB v2.0 genome stratification BED files and the alignment between the HG002 reference genome and GRCh38. Then, we calculate the accuracy of the sequences in these regions separately. To evaluate haplotype-specific k-mer completeness, we calculate the percentage of parent-specific k-mers present in the longest 40X of reads. Because the datasets have 40X coverage, we count only k-mers that occur at least 4 times. To evaluate haplotype-specific k-mer consistency, we calculate the metric \(\sum \max ({k}_{p},{k}_{m})/\sum ({k}_{p}+{k}_{m})\), in which \({k}_{p}\) and \({k}_{m}\) are the numbers of paternal and maternal haplotype-specific k-mers in each read and the sums are taken over all reads.
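The two per-read computations described here can be sketched as follows. The text does not spell out the accuracy formula, so the sketch assumes accuracy is the number of matching bases divided by the total number of aligned columns (matches, mismatches, insertions and deletions), with clipping ignored. The function names are hypothetical.

```python
import re
from typing import Iterable, Tuple

# With "--eqx", matches are reported as "=" and mismatches as "X".
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_accuracy(cigar: str) -> float:
    """Assumed definition: '=' bases / ('=' + 'X' + 'I' + 'D') bases."""
    counts = {"=": 0, "X": 0, "I": 0, "D": 0}
    for length, op in _CIGAR_OP.findall(cigar):
        if op in counts:  # soft/hard clips and other operations are skipped
            counts[op] += int(length)
    total = sum(counts.values())
    return counts["="] / total if total else 0.0


def kmer_consistency(per_read_counts: Iterable[Tuple[int, int]]) -> float:
    """sum(max(k_p, k_m)) / sum(k_p + k_m) over all reads, where each
    tuple holds the paternal and maternal k-mer counts of one read."""
    counts = list(per_read_counts)
    numerator = sum(max(kp, km) for kp, km in counts)
    denominator = sum(kp + km for kp, km in counts)
    return numerator / denominator if denominator else 0.0


accuracy = cigar_accuracy("5S90=5X3I2D")         # 90 / 100
consistency = kmer_consistency([(9, 1), (2, 8)])  # (9 + 8) / 20
```

A consistency of 1.0 would mean that every read carries k-mers from only one parent, while 0.5 would correspond to reads that mix both haplotypes equally.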

To evaluate the performance of selecting supporting reads and of finding inconsistent overlaps, we use the Illumina reads of the parents to classify long reads with a trio-binning algorithm16. We adjust the threshold to reduce false positives. A read is classified as a paternal read if it meets the condition \(\frac{{k}_{p}}{{K}_{p}} > \frac{{k}_{m}+\max (10,{k}_{m}\cdot 0.1)}{{K}_{m}}\), where \({k}_{p}\) and \({k}_{m}\) are the numbers of paternal and maternal haplotype-specific k-mers in the read, and \({K}_{p}\) and \({K}_{m}\) are the numbers of all paternal and maternal haplotype-specific k-mers. A read is classified as a maternal read if it meets the condition \(\frac{{k}_{m}}{{K}_{m}} > \frac{{k}_{p}+\max (10,{k}_{p}\cdot 0.1)}{{K}_{p}}\). Otherwise, the read is classified as untagged. We consider a pair of reads inconsistent when one read is a paternal read and the other is a maternal read.
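The classification rule translates directly into code. The sketch below follows the two inequalities as written; the function names and the string labels are hypothetical, and \({K}_{p}\) and \({K}_{m}\) are assumed to be positive.

```python
def classify_read(kp: int, km: int, Kp: int, Km: int) -> str:
    """Classify a read from its parent-specific k-mer counts.

    kp, km: paternal / maternal haplotype-specific k-mers in the read.
    Kp, Km: total paternal / maternal haplotype-specific k-mers (> 0).
    """
    if kp / Kp > (km + max(10, km * 0.1)) / Km:
        return "paternal"
    if km / Km > (kp + max(10, kp * 0.1)) / Kp:
        return "maternal"
    return "untagged"


def is_inconsistent_pair(label_a: str, label_b: str) -> bool:
    """A pair is inconsistent when one read is paternal and the other maternal."""
    return {label_a, label_b} == {"paternal", "maternal"}


a = classify_read(kp=50, km=2, Kp=1000, Km=1000)  # 0.05 > 0.012
b = classify_read(kp=2, km=50, Kp=1000, Km=1000)
c = classify_read(kp=8, km=5, Kp=1000, Km=1000)   # neither margin is met
```

The margin term \(\max (10, 0.1\cdot k)\) means that a read with only a handful of parent-specific k-mers stays untagged, so pairs involving such reads are never counted as inconsistent.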

We use the k-mer-based assembly evaluation tool merqury43 to evaluate the diploid assemblies generated by each pipeline and use BUSCO42 to evaluate the gene completeness of the assemblies. The Hamming error rate of the assemblies is calculated from the output of merqury. The parameters used in this study are described in detail in Supplementary Notes 4 and 5. We use the reference-based evaluation tool Pomoxis (https://github.com/nanoporetech/pomoxis) to evaluate the base quality of the diploid assemblies (Supplementary Note 6). The genome fraction is evaluated by QUAST52, and its parameters are described in Supplementary Note 7.

We obtain the small variant (SNP and INDEL) sets from the HG002 assemblies against GRCh38 using dipcall53 and compare them against the HG002 GIAB benchmark to evaluate their precision, recall, and F1 score using hap.py54. We call structural variant (SV) sets from the HG002 assemblies against GRCh37 using hapdiff (https://github.com/KolmogorovLab/hapdiff) and compare them against the curated set of SVs in the HG002 genome55 to evaluate their precision, recall, and F1 score using truvari56 (Supplementary Note 8).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.