Introduction

Reconstructing the genome sequence of a species or individual in a population is one of the most important tasks in genomics1,2,3. Single-molecule sequencing (SMS) technologies, developed by Pacific Biosciences and Oxford Nanopore, yield long reads that can significantly increase the number of resolvable repetitive genome regions and improve the contiguity of assembly4,5,6,7. However, SMS reads usually have high error rates8. The two strategies currently used for de novo genome assembly from SMS reads are “correction then assembly” and “assembly then correction.” Assemblers such as Falcon9, Canu10, and MECAT11 first correct SMS reads and then assemble the genome using the corrected reads. Conversely, assemblers such as miniasm12, Flye13, wtdbg214, Shasta15, Smartdenovo16, and Raven17 assemble the genome using error-prone reads and then correct the assembled genome. Due to the high computational cost of error correction, the “correction then assembly” approach is usually slower than the “assembly then correction” approach. However, directly assembling the genome from error-prone SMS reads can increase assembly errors, which lowers the quality of the reference genome and introduces bias into downstream analyses, especially in complicated genome regions10,18. On the other hand, the “correction then assembly” approach can provide highly contiguous and accurate genome assemblies9,10,11.

The recently released R9 flow cell from Oxford Nanopore Technologies can generate reads that are up to 1 Mb in length with a read N50 > 100 kb, which may significantly improve the contiguity of assembly5,6,7,19. However, errors in nanopore reads are more complex than those in PacBio reads20,21 (see “Results”). Error correction tools in current assemblers were originally designed for PacBio SMRT (Single Molecule, Real-Time) reads and cannot efficiently correct nanopore reads. For example, correcting 30X coverage human nanopore reads using the error correction tool in Canu requires 29,000 central processing unit (CPU) hours19. Moreover, the average identity of reads corrected by Canu is only 92%, which is far less accurate than that of corrected PacBio SMRT reads. These high error rates in corrected nanopore reads can introduce misassemblies. Furthermore, high-error-rate subsequences (HERS) in nanopore reads are usually trimmed during error correction, which reduces both the length of the original reads and the contiguity of the final assembly.

In this work, we develop NECAT, an error correction and de novo assembly tool designed to overcome the problem of complex errors in nanopore reads. Unlike existing error correction tools that iteratively correct nanopore reads, we develop a two-step progressive method for nanopore read correction. In the first step, NECAT corrects the low-error-rate subsequences (LERS) of a read; in the second step, it corrects the HERS. This progressive approach allows NECAT to correct nanopore reads quickly and with high accuracy. Evaluated on seven data sets, NECAT obtains low average error rates (1.99–9.01%) and high percentages (45.85–99.34%) of corrected reads with an error rate < 5%, both of which are significantly better than the results of Canu. To take full advantage of nanopore read length, we present a two-stage assembler in NECAT, which constructs contigs using corrected nanopore reads and then bridges the contigs using the original raw reads. We also use an adaptive selection mechanism to choose high-quality supporting reads for each read to be corrected (template read) during error correction, and to select high-quality overlaps for each read during the read-overlap step. Compared with correction-then-assembly assemblers, such as Canu and Canu + Smartdenovo (Canu + S), NECAT is 2.5–258 times faster while maintaining a similar quality of assemblies. Compared with assembly-then-correction assemblers, such as Flye, NECAT reports significantly fewer misassemblies for the assemblies of complex genomes.

Results

Analysis of sequencing errors in nanopore reads

We analyzed sequencing errors in nanopore reads of E. coli, S. cerevisiae, A. thaliana, D. melanogaster, C. reinhardtii, O. sativa, S. pennellii, and H. sapiens (NA12878) (Supplementary Notes 1–5 and Supplementary Tables 1–2). As shown in Supplementary Table 3, the average error rates of nanopore reads for these eight species ranged from 12.0% (S. cerevisiae) to 20.1% (A. thaliana). Although the average error rates of nanopore reads are similar to those of PacBio SMRT reads, error rates in nanopore reads are more broadly distributed than those of PacBio SMRT reads. The error rates of raw reads in the eight data sets used in our study were broadly distributed between 7 and 50% and concentrated between 10 and 20% (Fig. 1A and Supplementary Table 3).

Fig. 1: Error characteristics of eight nanopore raw read data sets.
figure 1

A Error rate distribution of raw reads. B Error rates of subsequences in a nanopore read (top) and illustration of a high-error-rate subsequence in the read (bottom). C Percentage of raw reads containing high-error-rate subsequences (HERS; error rate > 50% in 500-bp windows), plotted against read length.

Next, we analyzed sequencing errors within each nanopore read. We first aligned each read onto the reference genome, partitioned each read into 500-bp subsequences, and computed the alignment error rate of each subsequence. Our results show that the error rates within each read are also broadly distributed (Fig. 1B). Furthermore, on average, 3–23% of raw reads longer than 10 kb contain HERS with error rates > 50% (Supplementary Table 3). Overall, nanopore reads produced by ultra-long library preparation techniques have a higher percentage of reads with HERS than those produced by normal library preparation techniques (23 vs. 3–11%). Additionally, the percentage of raw reads with HERS increased as read length increased (Fig. 1C). In particular, for reads produced by ultra-long library preparation techniques, up to 45% of raw reads longer than 45 kb contain HERS (Fig. 1C). The HERS in nanopore reads usually force error correction tools to break long reads into shorter fragments, which eliminates the advantage of using long nanopore reads for de novo assembly.
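As a concrete illustration of this windowed analysis, the sketch below computes per-window error rates and flags HERS from a read-to-reference alignment represented as CIGAR-style operation tuples. The representation and helper functions are illustrative assumptions, not the exact pipeline used in this study.

```python
# Illustrative sketch: per-window error rates of an aligned read and HERS
# detection. The alignment is assumed to be a list of CIGAR-style operations
# ('=', 'X', 'I', 'D', length); '=', 'X', 'I' consume read bases, and
# 'X', 'I', 'D' count as errors.

def window_error_rates(cigar, window=500):
    """Return the error rate of each non-overlapping `window`-bp read window."""
    counts = {}                                   # window index -> [errors, bases]
    read_pos = 0
    for op, length in cigar:
        for _ in range(length):
            bucket = counts.setdefault(read_pos // window, [0, 0])
            if op in ('=', 'X', 'I'):             # operation consumes a read base
                bucket[1] += 1
                if op != '=':
                    bucket[0] += 1
                read_pos += 1
            elif op == 'D':                       # deletion: error, no read base
                bucket[0] += 1
    return [counts[w][0] / max(counts[w][1], 1) for w in sorted(counts)]

def has_hers(cigar, window=500, cutoff=0.5):
    """A read contains a HERS if any window's error rate exceeds `cutoff`."""
    return any(rate > cutoff for rate in window_error_rates(cigar, window))

# Example: a mostly clean read with one noisy stretch around 1.2-1.6 kb.
cigar = [('=', 950), ('X', 50), ('=', 200), ('X', 400), ('D', 100), ('=', 300)]
print(window_error_rates(cigar), has_hers(cigar))
```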

Furthermore, error rates of nanopore reads sampled from different genome locations shared the same distribution except for those of A. thaliana, which showed slight variations among genome locations (Supplementary Fig. 1). These results indicated that nanopore sequencing errors did not show genome-location bias. Therefore, a nanopore data set can contain both low- and high-error-rate reads from the same location in a genome.

In summary, our analysis indicates that, unlike PacBio reads, nanopore reads can contain HERS (especially in ultra-long raw reads) and show a broad error-rate distribution among reads and among subsequences within a read.

Adaptive selection of supporting reads for error correction

To correct a nanopore read, we first collected supporting reads that overlap with it and then constructed the corrected read from the consensus of a multiple sequence alignment of the overlapping reads. An overlapping-error-rate threshold is usually set to select supporting reads. Due to the broad distribution of sequencing error rates among nanopore reads, it is difficult to select supporting reads using a single global overlapping-error-rate threshold. Setting a low overlapping-error-rate threshold, such as the 0.3 used for PacBio reads, does not generate enough supporting reads to correct nanopore reads with high error rates (>20%); consequently, numerous nanopore reads cannot be corrected. Conversely, setting a high overlapping-error-rate threshold (such as 0.6) markedly increases the number of false supporting reads, which raises computational cost and reduces the accuracy of corrected reads. Furthermore, a high overlapping-error-rate threshold can increase the number of high-error-rate supporting reads for low-error-rate template reads. Correcting low-error-rate templates with high-error-rate supporting reads greatly reduces the accuracy of the corrected low-error-rate reads.

To overcome the broad error-rate distribution of nanopore reads, we used two overlapping-error-rate thresholds to select supporting reads after filtering via distance difference factor (DDF) scoring11 and k-mer chaining22 (see “Methods”). First, we used a global overlapping-error-rate threshold to select supporting reads: for all template reads, the overlapping error rates of their supporting reads must be less than this global threshold, which maintains the overall quality of supporting reads. Based on the observed error-rate distribution of raw reads (Fig. 1A), we set the global overlapping-error-rate threshold to 0.5. Then, for each template read, we set an individual overlapping-error-rate threshold: we selected the 50 candidate reads with the top DDF scores and recorded the alignment differences between these candidate reads and the template read. The individual overlapping-error-rate threshold was then set to the average of the alignment differences minus five times their standard deviation (see “Methods”).

Candidate reads were filtered out if their alignment error rates were greater than either the global or the individual overlapping-error-rate threshold. For low-error-rate template reads, the individual overlapping-error-rate threshold is less than the global threshold; conversely, for high-error-rate template reads, the individual overlapping-error-rate threshold is greater than the global threshold. Using both global and individual overlapping-error-rate thresholds, we were able to maintain the quality of supporting reads for both low- and high-error-rate template reads, thereby improving the accuracy of corrected reads. High-error-rate template reads that did not have enough supporting reads were discarded without correction.

Progressive error correction of nanopore reads

The supporting reads for error correction are selected according to the average error rate of each template read. Since error rates of the subsequences within each nanopore read are also broadly distributed (Fig. 2A), the overlapping error rate between supporting reads and HERS can exceed the global overlapping-error-rate threshold of 0.5, which can affect the accuracy of the corrected subsequences. Therefore, we developed a progressive method that corrects error-prone nanopore reads in two steps (see “Methods”). We first corrected the LERS in a template read (Fig. 2B) and then corrected the HERS (Fig. 2C). In the first step, both corrected and uncorrected subsequences were output as a corrected read for the next step. After the first step, most nanopore reads had been corrected to high accuracy. This allowed us to obtain more low-error supporting reads for high-error subsequences in the second step, thereby helping to correct those subsequences. After the second step, we output only the corrected subsequences. If a subsequence in a template read could not be corrected in the second step, it had either a high error rate or low coverage. Thus, one template read could be broken into multiple corrected reads after the second step.

Fig. 2: Illustration of progressive error correction and two-stage assembly methods of NECAT.
figure 2

A A template raw read for correction, in which the black boxes indicate errors and the red rectangle indicates a high-error-rate region. B Error correction of low-error-rate subsequences; only low-error-rate subsequences have supporting reads. C Error correction of high-error-rate subsequences. D Contig assembly using corrected reads. E Contig bridging using raw nanopore reads. F Output of final contigs.

After filtering the candidate reads using the overlapping-error-rate thresholds, we selected at most 200 candidate reads with the top DDF scores for each template read. Then, we performed local alignments of the candidate reads to the template read (see “Methods”). However, performing local alignments of supporting reads to long template reads is computationally expensive. Although we selected up to 200 supporting reads for each template read, it is unnecessary to align all of them when enough reads are already available for error correction. Based on our observations, 12 supporting reads are enough to correct a template read. Thus, we used a coverage count array (CCA) to record the number of supporting reads covering each base of the template read during the local alignments. Once a region of the template read was covered by 12 supporting reads, we skipped the local alignments of the remaining supporting reads to that region (see “Methods”).

Progressive assembly of nanopore reads

The long length of nanopore reads is a significant advantage for de novo genome assembly. However, HERS inside long nanopore reads usually fail to be corrected, leading to the splitting of long nanopore reads into several shorter corrected reads. Using only corrected reads for genome assembly forfeits the advantage conferred by the long length of nanopore reads. In this study, we developed a two-step progressive genome assembler for nanopore reads. In the first step, we generated high-quality contigs using the corrected reads (Fig. 2D). In the second step, we bridged the contigs using the original nanopore reads to generate the final scaffolds (Fig. 2E). The contiguity lost in the first step because of HERS in raw reads is thereby recovered in the second step, so genome contiguity is improved by maximizing the use of all raw reads. Our two-step assembly process is similar to the process of using SMS reads for scaffolding23.

Meanwhile, even after error correction, the sequencing error rates of corrected nanopore reads (1.5–9%) are still higher than those of corrected PacBio reads (<1%). Reads with high error rates may lead to low-quality overlaps, characterized by lower identity, long overhangs, and low-coverage subsequences. Such low-quality overlaps increase the difficulty of assembly and introduce errors into assembly results. Therefore, to obtain high-quality contigs, we needed to select high-quality overlaps between corrected reads. Moreover, the error rates of corrected reads also show a relatively broad distribution (Supplementary Note 6 and Supplementary Table 4). Similar to the procedure used for selecting supporting reads for error correction, we employed both global and individual thresholds to overcome the broad error-rate distribution when filtering low-quality overlaps (see “Methods”).

Performance of NECAT error correction

We assessed the performance of NECAT error correction using nanopore raw reads of nine data sets: E. coli, S. cerevisiae, D. melanogaster, A. thaliana, C. reinhardtii, O. sativa, S. pennellii, NA12878 (rel3,4), and NA12878 (rel6) with respect to correction speed, corrected data size, accuracy and continuity of corrected reads, as well as the number of reads with HERS in corrected reads (Supplementary Note 6). As shown in Table 1, NECAT correction speeds were 2.1–16.5 times faster than those of Canu for nanopore reads of these first seven data sets. The sizes of corrected reads for E. coli, S. cerevisiae, D. melanogaster, A. thaliana, C. reinhardtii, O. sativa, and S. pennellii were 102.2%, 83.4%, 90.6%, 92.5%, 100.3%, 100.7%, and 91.2% of their raw reads, respectively, while Canu only corrected the longest 40X raw reads and obtained 15.9%, 39.8%, 57.7%, 84.1%, 31.1%, 24.0%, and 28.3% corrected reads from their raw reads, respectively. For NA12878 (rel3,4) and NA12878 (rel6), the sizes of NECAT-corrected reads were 95.1 and 79.5% of their raw reads.

Table 1 Performance comparison of nanopore read error correction.

NECAT was able to obtain high-accuracy corrected reads. After the first step, the average error rates for the E. coli, S. cerevisiae, D. melanogaster, A. thaliana, C. reinhardtii, O. sativa, and S. pennellii data sets were 4.27%, 3.08%, 7.03%, 11.35%, 4.40%, 6.45%, and 9.23%, respectively; these were lower than the average error rates of reads corrected by Canu, which were 7.06%, 3.13%, 8.15%, 12.05%, 5.35%, 7.99%, and 9.69%, respectively. After the second step, the average error rates for the seven data sets were further reduced to 2.23%, 1.53%, 4.89%, 9.01%, 1.99%, 4.66%, and 6.45%, respectively. For NA12878 (rel3,4) and NA12878 (rel6), the average error rates of NECAT-corrected reads after the first step were 7.38 and 6.46%, while those after the second step were 7.38 and 6.28%.

The maximum overlapping error rate between corrected reads is usually set to 10% during assembly. Thus, the higher the percentage of corrected reads with an error rate < 5%, the more reads can be used for assembly. As shown in Table 1, the percentages of NECAT-corrected reads with an error rate < 5% for the first seven data sets were 99.34%, 95.04%, 72.03%, 45.85%, 95.18%, 74.62%, and 63.04%, respectively, which were significantly higher than those of reads corrected by Canu. For the two NA12878 data sets, the percentages of NECAT-corrected reads with an error rate < 5% were 77.60% and 75.45%, respectively.

The progressive correction strategy in NECAT also allowed us to correct more HERS and maintain the contiguity of reads. The N50s of NECAT-corrected reads for the first seven data sets were 105.1%, 90.5%, 98.0%, 100.9%, 103.7%, 100.4%, and 96.0% of the N50s of their corresponding raw reads, respectively, indicating that NECAT could preserve the contiguity of raw reads. Conversely, the N50s of reads corrected by Canu were 91.9%, 30.4%, 85.8%, 91.8%, 99.0%, 97.7%, and 87.3% of those of the corresponding raw reads, which were lower than those of NECAT-corrected reads. Further evidence that the progressive correction strategy improves the correction of HERS is the reduced number of reads containing HERS. After the two-step correction with NECAT, the numbers of reads containing HERS in the seven corrected data sets were 1, 268, 3481, 7158, 278, 3511, and 5445, respectively, while the Canu-corrected data sets had 1, 4820, 6523, 8722, 726, 4413, and 5511 reads containing HERS, respectively. These results indicate that NECAT outperformed Canu in correcting sequencing errors in nanopore raw reads. For the two NA12878 data sets, the N50s of NECAT-corrected reads were 106.7% and 121.7% of those of their raw reads, and the numbers of reads containing HERS were 53,130 and 64,210, respectively.

Performance of NECAT de novo assembler

We compared NECAT with two widely used correction-then-assembly pipelines, Canu and Canu + S, as well as an assembly-then-correction pipeline, Flye, for de novo assembly of nanopore reads (Supplementary Note 7). We assembled the genomes of E. coli, S. cerevisiae, A. thaliana, D. melanogaster, C. reinhardtii, O. sativa, and S. pennellii using the longest 40X reads of each data set, and assembled 35X nanopore data for the human NA12878 genome using NECAT only.

As shown in Table 2, NECAT was 6.5–258.2 times faster than Canu overall, with a 6.3–577.5 times speedup in the assembly step. Canu employs a high overlapping-error threshold (14.4%) in its overlapInCore tool for nanopore reads (a low threshold of 6% is used for assembling PacBio reads), which may greatly increase the time cost of local alignments. The Canu + S pipeline replaces the assembly step of Canu with Smartdenovo, which significantly reduces running time; NECAT was still 2.5–57.0 times faster than Canu + S on the seven data sets. The high accuracy of the corrected reads output by NECAT allowed us to use a more rapid overlapping approach. Compared with Flye, NECAT was 1.1–1.8 times faster on the A. thaliana, D. melanogaster, C. reinhardtii, O. sativa, and S. pennellii data sets and 224.8 and 21.2 times faster on the E. coli and S. cerevisiae data sets, respectively, but Flye was 3.8 times faster than NECAT on the NA12878 (rel6) data set.

Table 2 The quality and performance of long-read assembly with NECAT.

We then assessed the quality of the assembled contigs with respect to assembly size, number of contigs, NG50, NGA50, number of misassemblies, QV, and gene completeness. As shown in Table 2, the four assembly pipelines had similar overall performance, and each assembler had its strong points. Canu + S reported a significantly smaller number of contigs for A. thaliana, D. melanogaster, and S. pennellii, while NECAT reported a significantly smaller number of contigs for O. sativa and human. NECAT reported significantly higher NG50 for D. melanogaster, O. sativa, and S. pennellii, and Flye reported the higher NG50 for NA12878 (rel6). NECAT reported higher NGA50 for E. coli, O. sativa, and S. pennellii, and Flye reported the higher NGA50 for NA12878 (rel6). Both NECAT and Flye reported much higher NG50 and NGA50 for NA12878 than those reported by Canu. NECAT reported significantly smaller numbers of misassemblies for both human assemblies, while Canu + S and Flye reported smaller numbers of misassemblies for D. melanogaster and Canu reported a significantly smaller number of misassemblies for A. thaliana.

For most species, the four pipelines reported similar values of QV and gene completeness. For O. sativa, the low gene completeness of all four assemblies was due to their low quality, even after polishing. Canu reported higher gene completeness for NA12878 (rel3,4) (86.3%). Although the QVs of the polished assemblies were similar, the error distributions differed. For example, the numbers of mismatches and indels per 100 kbp in the Canu NA12878 (rel3,4) assembly were 202 and 153, respectively, while those in the NECAT NA12878 (rel3,4) assembly were 134 and 210.

Next, we assessed the effect of contig bridging in NECAT assembly. As shown in Table 3, the number of contigs was significantly reduced in the assemblies of the A. thaliana, D. melanogaster, C. reinhardtii, O. sativa, S. pennellii, and H. sapiens genomes after contig bridging with raw reads. For S. cerevisiae, D. melanogaster, C. reinhardtii, S. pennellii, and human NA12878, contig bridging also increased the NG50s and NGA50s of the assemblies. These results indicate that contig bridging can improve the contiguity of assembly.

Table 3 Performance of de novo assemblies before and after the bridging step of NECAT.

We further compared the NECAT assembler with other widely used assembly-then-correction assemblers: miniasm, Smartdenovo, wtdbg2, Flye, Raven, and Shasta (Supplementary Text 1 and Supplementary Note 7). NECAT has time costs similar to those of the assembly-then-correction assemblers but obtains better assembly results, especially for complex genomes (Supplementary Text 1). We also validated our assemblies by comparing them to the reference genomes. The quality of NECAT-generated assemblies was comparable to that of the other correction-then-assembly pipelines and better than that of the assembly-then-correction assemblers (Supplementary Text 2). We also tested hybrid strategies to demonstrate the correctness and effectiveness of the error correction and assembly steps of NECAT. These strategies combine either the error correction step of NECAT with the assembly step of Canu, Smartdenovo, or Flye, or the error correction step of Canu with the assembly step of Smartdenovo, Flye, or NECAT. The results of these comparisons are described in Supplementary Text 3.

De novo genome assembly of retinoblastoma cell line WERI

To further evaluate the performance of NECAT in large-genome assembly, we sequenced the WERI cell line, which is derived from a human retinoblastoma24. We generated 210 Gb (82-fold coverage) of raw reads from three flow cells using the Nanopore PromethION platform. The WERI genome assembled by NECAT has an N50 of 28 Mb. To the best of our knowledge, this is the best N50 value reported for an assembly of the human genome from standard-library nanopore sequencing data.

We aligned the WERI assembly to the human reference genome hg38 using MUMmer (v4.0)25. The dot-plot shows that the WERI assembly is structurally consistent with the reference genome except for minor structural variations (Supplementary Note 8 and Supplementary Fig. 2), and the tiling figure shows the continuity of the assembly (Fig. 3). We also used bowtie226 to align an Illumina data set for the WERI cell line onto the WERI assembly and the hg38 human reference genome. The mapping rate for the WERI assembly (99.1%) was better than that for the hg38 human reference genome (98.0%).

Fig. 3: Continuity analysis of the assembly of WERI cell line using nanopore reads.
figure 3

Human chromosomes are painted with assembled contigs using the ColoredChromosomes package. Alternating shades indicate adjacent contigs (each vertical transition from gray to black represents a contig boundary or alignment breakpoint).

We then identified and validated structural variants (SVs) in the WERI assembly. We detected 11,725 SVs (≥10 bp) in the WERI assembly by aligning it to the hg38 human reference genome using MUMmer (v4.0). We also detected SVs from the raw nanopore long reads and the Illumina short reads of the WERI cell line using Sniffles27 and LUMPY28, respectively (Supplementary Note 8). In all, 7210 SVs were detected by both the WERI assembly and the raw nanopore reads, while only 1117 SVs were detected by both the WERI assembly and NGS (Supplementary Fig. 3 and Supplementary Data 1). Furthermore, 90% of the small SVs (<1000 bp) uniquely detected from raw nanopore reads could be found in the WERI assembly, indicating that the assembly can reduce false positives for small SVs (<1000 bp) (Supplementary Data 1).

Next, we examined genes associated with the identified SVs. We found 2843 annotated genes associated with the 7210 SVs identified by both the WERI assembly and the raw nanopore reads. Two hundred and nine of the 2843 genes are reported in Phenolyzer29 as associated with retinoblastoma (Supplementary Data 2). Among the 209 genes, eight (AATF, PRKCB, PRMT2, FRK, PIK3R1, CUX1, RAC2, and IGF1) have a Phenolyzer score > 0.5, and six of these eight genes are associated with retinoblastoma as reported in PubMed. In particular, PRKCB, with a Phenolyzer score as high as 0.8901, has been reported to be involved in retinoblastoma protein phosphorylation30. These results indicate that NECAT can provide high-quality assemblies for the reliable identification of SVs.

Discussion

Currently, applying nanopore reads in genomic studies is difficult because of the complex errors within these reads. In this study, our analyses have shown that nanopore reads contain HERS and that errors are broadly distributed both among nanopore reads and among the subsequences of a read. This broad error distribution complicates the selection of supporting reads during error correction. In traditional error correction methods, the threshold used to select supporting reads can be set too strictly or too leniently; the former fails to select enough supporting reads for correction, while the latter admits too many low-quality supporting reads, which reduces the accuracy of the corrected reads. Furthermore, traditional error correction methods cannot correct the high-error subsequences in nanopore reads and generally break nanopore reads into multiple short corrected reads.

In this study, we developed NECAT, which includes methods such as progressive error correction, adaptive selection of supporting reads and alignments, and two-stage assembly to overcome the error characteristics of nanopore reads. The error correction tool in NECAT, which is 2.1–16.5 times faster than that of Canu, can correct nanopore reads to high accuracy while maintaining their contiguity. The whole pipeline is at least 2.5 times faster than other correction-then-assembly pipelines, with enhanced or comparable assembly performance. The high performance of NECAT suggests that the high error rate of nanopore reads can be overcome by developing algorithms tailored to their error characteristics.

Structural variations identified via raw nanopore reads usually have a high false-positive rate. Here, we show that these false positives can be reduced considerably by using a high-quality assembly of nanopore reads for detection of structural variation. Our results show that NECAT is a useful tool for error correction and assembly of nanopore reads, and for the detection of structural variation.

Methods

The architecture of NECAT

The NECAT pipeline was designed as a high-performance assembler for nanopore reads. To overcome the high error rate of nanopore reads, we developed several methods, including progressive error correction, adaptive selection of supporting reads and read overlaps, and two-step assembly. The NECAT pipeline contains four modules (Supplementary Fig. 4): preprocessing, correction, trimming, and assembly. The preprocessing module filters short and ill-formed reads. The correction module uses a progressive strategy to correct nanopore reads in two steps. The trimming module removes low-quality subsequences from corrected reads. The assembly module builds a string graph to assemble the genome in two steps. These four modules can be run in series to complete an assembly or operated independently. Currently, NECAT is the most efficient tool for assembling large genomes using nanopore reads.

Progressive error correction of nanopore reads

The broad distribution of sequencing error rates among nanopore reads, and within a single nanopore raw read, is the reason why traditional iterative error correction methods usually fail on nanopore data. In this study, we developed a progressive method for correcting nanopore reads in two steps. First, we correct the LERS in a read; then, we correct the HERS in that read using a more sensitive approach. Both steps include the same four sub-steps: (1) selection of candidate reads, (2) determination of the alignment-quality threshold, (3) selection of matched reads, and (4) correction of the read. Sub-steps 1, 2, and 4 are identical in the two steps; only sub-step 3, the selection of matched reads for each read to be corrected (template read), differs. In the first step, we use a strict selection method to choose matched reads for the low-error-rate portions of the template read. In the second step, we use a lenient method to choose matched reads for the high-error-rate portions of the template read.

The selection of candidate reads

For each template read, we selected candidate reads that overlapped with it. For each pair of reads, we first used the DDF11 to select the seed k-mer pair with the highest score, which served as a reliable start position for local alignment. However, the wide distribution of error rates decreases the sensitivity of the DDF score for two k-mer pairs that are far apart, which may introduce false positives (Supplementary Fig. 5A). To remove false positives, we gathered all k-mer pairs that supported the seed k-mer pair during DDF scoring. We sorted all k-mer pairs, including the seed k-mer pair, with respect to their positions and then chained them together22. The chaining process examines the relative positions of the k-mer pairs and helps to filter out false positives (Supplementary Fig. 5B). We then updated the DDF score of the seed k-mer pair using the remaining k-mer pairs, which further improved the sensitivity of candidate selection. We recorded the positions of the first and last k-mer pairs in the chain as the approximate mapped positions of the candidate read. These two positions, together with the DDF score of the seed k-mer pair, were used for further filtering of redundant candidates and for identifying HERS.
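The chaining step can be pictured as finding the largest set of k-mer pairs whose positions increase consistently on both reads. The sketch below is a simplified, quadratic-time co-linear chaining routine for illustration only; the maximum-gap parameter and the data layout are assumptions rather than NECAT's actual implementation.

```python
# Illustrative sketch: co-linear chaining of k-mer pairs (template_pos,
# candidate_pos). Pairs whose positions do not increase consistently on both
# reads (spurious matches) are excluded from the chain.

def chain_kmer_pairs(pairs, max_gap=2000):
    """Return the longest co-linear chain of (template_pos, candidate_pos) pairs."""
    if not pairs:
        return []
    pairs = sorted(pairs)                       # sort by template position
    best_len = [1] * len(pairs)                 # best chain length ending at i
    prev = [-1] * len(pairs)
    for i, (ti, ci) in enumerate(pairs):
        for j, (tj, cj) in enumerate(pairs[:i]):
            # a predecessor must precede on both reads and not be too far away
            if tj < ti and cj < ci and ti - tj <= max_gap and ci - cj <= max_gap:
                if best_len[j] + 1 > best_len[i]:
                    best_len[i], prev[i] = best_len[j] + 1, j
    i = max(range(len(pairs)), key=lambda k: best_len[k])
    chain = []
    while i != -1:
        chain.append(pairs[i])
        i = prev[i]
    return chain[::-1]

pairs = [(100, 120), (600, 640), (650, 5000), (1200, 1260), (1800, 1900)]
chain = chain_kmer_pairs(pairs)
print(chain)                 # the inconsistent pair (650, 5000) is dropped
print(chain[0], chain[-1])   # approximate mapped start and end positions
```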

Determination of individual alignment-quality threshold for each template read

We selected high-quality supporting reads for the correction of each template read. However, the broad error-rate distribution makes it difficult to use a single global threshold for the selection of supporting reads. Besides setting the global overlapping-error-rate threshold to 0.5, we also computed an individual overlapping-error-rate threshold for each template read. For each template read, we used the 50 candidate reads with the top DDF scores for local alignments. If a local alignment covered more than 60% of the template or candidate read length, we recorded the alignment and the alignment difference between the template and the candidate read. Given \(n\) \((0 \le n \le 50)\) recorded alignments with differences \(d_1, d_2, \ldots, d_n\), we computed their average difference \(d_0 = \sum_{i=1}^{n} d_i/n\) and standard deviation \(D = \sqrt{\sum_{i=1}^{n} (d_i - d_0)^2/n}\). Then, we set the alignment-quality threshold to \(d = d_0 - 5D\). This threshold provided a lower alignment-quality bound for low-error template reads.
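A minimal sketch of this per-template cutoff is given below, following the formula and filtering rule stated above (mean alignment difference minus five standard deviations, applied together with the global cutoff of 0.5); the function and parameter names are illustrative.

```python
# Illustrative sketch: per-template alignment-quality threshold d = d0 - 5D,
# combined with the global overlapping-error-rate threshold of 0.5.

import math

GLOBAL_THRESHOLD = 0.5

def individual_threshold(differences, k=5.0):
    """differences: alignment differences (error rates) of the recorded alignments."""
    n = len(differences)
    if n == 0:
        return None                     # no recorded alignments for this template
    d0 = sum(differences) / n           # average alignment difference
    dev = math.sqrt(sum((d - d0) ** 2 for d in differences) / n)
    return d0 - k * dev

def keep_candidate(diff, indiv):
    """Keep a candidate only if it passes both the global and the individual cutoff."""
    return diff <= GLOBAL_THRESHOLD and (indiv is None or diff <= indiv)
```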

Selection of matched reads

For each template read, we selected at most 200 candidate reads with the top DDF scores for local alignment. We used different alignment methods in the first and second steps. In the first step, we used a blockwise alignment algorithm to align supporting reads to the template read. We performed local alignment from the seed k-mer pair in both directions; this yielded two semi-global alignments, which were then merged into one. Starting from the seed k-mer pair, we partitioned both the template and the candidate read into equal-sized blocks of 500 bp. We then used the Edlib algorithm31 to successively align each pair of blocks. The alignment process was terminated if the alignment error rate between a pair of blocks exceeded 50%, or if the alignment reached the end of the template or candidate read. Because blockwise alignment terminates when a block from either the template or the candidate has a high error rate, we could only obtain alignments between LERS in this step.
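The sketch below illustrates the blockwise extension in one direction from a seed position, using the edlib Python bindings. It is a simplification of the procedure described above; in particular, it ignores the coordinate drift caused by indels between blocks.

```python
# Illustrative sketch: blockwise extension to the right of a seed position.
# Successive 500-bp blocks of the template and the candidate read are globally
# aligned with edlib; extension stops at a block pair with >50% errors or at
# the end of either read.

import edlib   # pip install edlib

def blockwise_extend(template, candidate, t_start, c_start,
                     block=500, max_block_error=0.5):
    """Return the template length aligned by extending right from the seed."""
    t_pos, c_pos = t_start, c_start
    while t_pos + block <= len(template) and c_pos + block <= len(candidate):
        t_block = template[t_pos:t_pos + block]
        c_block = candidate[c_pos:c_pos + block]
        result = edlib.align(c_block, t_block, mode="NW", task="distance")
        if result["editDistance"] / block > max_block_error:
            break                               # hit a high-error-rate region
        t_pos += block
        c_pos += block
    return t_pos - t_start
```

In NECAT, the same extension is also performed to the left of the seed k-mer pair, and the two semi-global alignments are merged into one.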

In the second step, we combined several alignment methods to obtain long alignments between templates and candidate reads. We first used the blockwise approach to align a candidate read to the template. If the blockwise alignment terminated early because of a high-error-rate region inside the template or candidate read, we used the DALIGN algorithm32 to realign the candidate read to the template. However, alignments produced by DALIGN running with a large difference threshold of 0.5 were usually too coarse. To refine the DALIGN result, we then used the Edlib algorithm to perform a global alignment on the mapped subsequences output by DALIGN, obtaining a more accurate alignment.

Performing local alignments of supporting reads to a template is computationally expensive, especially for long template reads. Usually, only dozens of alignments are enough for error correction, so it is unnecessary to align all 200 candidate reads when enough supporting reads are already available. Here, we used a CCA, an integer array of the same length as the template read, to record the number of candidate reads covering each base of the template read. Before aligning a candidate read to the template read, we examined the values of the CCA elements between the approximate start and end mapped positions of the candidate read on the template. If all of these values were greater than a user-set threshold C, the corresponding region of the template read was already covered by enough candidate reads, and the local alignment of this candidate read was skipped. Otherwise, we aligned the candidate read; if the alignment difference was less than the alignment-quality threshold d, we increased every value of the CCA between the start and end mapped positions on the template by 1. We used a default value of 12 for the threshold C.
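A minimal sketch of this bookkeeping is shown below. The `align_candidate` callable stands in for the expensive local alignment and is an assumption for illustration; it is expected to return the alignment difference and the aligned template interval, or None on failure.

```python
# Illustrative sketch: coverage count array (CCA) used to skip unnecessary
# local alignments once a template region is covered by enough candidates.

def correct_candidates_with_cca(template_len, candidates, align_candidate,
                                coverage_threshold=12, quality_threshold=0.5):
    cca = [0] * template_len                 # supporting-read coverage per base
    kept = []
    for cand in candidates:                  # candidates sorted by DDF score
        start, end = cand["map_start"], cand["map_end"]
        # Skip the costly alignment if the mapped region is already covered
        # by enough supporting reads.
        if start < end and min(cca[start:end]) >= coverage_threshold:
            continue
        result = align_candidate(cand)       # expensive local alignment
        if result is None:
            continue
        diff, a_start, a_end = result
        if diff <= quality_threshold:        # passes the alignment-quality threshold d
            kept.append(result)
            for i in range(a_start, a_end):
                cca[i] += 1
    return kept
```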

Correction of nanopore reads

After selecting the matched candidate reads, we used the FALCON-sense consensus algorithm9 to correct each subsequence of the template read that was covered by enough candidate reads. In the first step, we replaced these subsequences with the corrected subsequences and output the whole template, including both corrected and uncorrected subsequences, as a corrected read for the next step; the HERS were corrected in that next step. In the second step, we output only the corrected subsequences, meaning that one template may produce more than one corrected read. If a subsequence in a template read could not be corrected in the second step, it had either too high an error rate or too low coverage.

Trimming of low-quality subsequences

Long nanopore reads may still contain HERS even after error correction, which can greatly affect the quality of assembly. Thus, low-quality subsequences need to be trimmed before assembly. We selected only the longest corrected reads up to 40X coverage for trimming and subsequent assembly. First, we performed pairwise alignment of the selected nanopore reads using the trimming module of MECAT11. Because even corrected nanopore reads can have a relatively high error rate, we replaced the original diff algorithm in the MECAT trimming module with the more sensitive DALIGN algorithm before performing local alignments. After pairwise alignment, we gathered high-quality overlaps with more than 90% identity for each read. If every residue of a read was covered by at least one overlap, the read was designated a complete read. Otherwise, if a read contained subsequences without overlap coverage, we trimmed it to its longest covered subsequence, producing a trimmed read.
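The trimming rule can be summarized as an interval-merging problem, as in the sketch below (illustrative only; overlap coordinates are assumed to be 0-based, half-open intervals on the read).

```python
# Illustrative sketch: decide whether a read is "complete" (every residue
# covered by at least one high-quality overlap) or must be trimmed to its
# longest covered subsequence.

def trim_read(read_len, overlaps):
    """overlaps: list of (start, end) intervals on the read."""
    if not overlaps:
        return None                                   # no usable subsequence
    merged = []
    for start, end in sorted(overlaps):
        if merged and start <= merged[-1][1]:         # extend the current block
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    if len(merged) == 1 and merged[0] == [0, read_len]:
        return ("complete", 0, read_len)              # every residue covered
    start, end = max(merged, key=lambda iv: iv[1] - iv[0])
    return ("trimmed", start, end)                    # longest covered subsequence

print(trim_read(10000, [(0, 4000), (3500, 6000), (7000, 9000)]))
# -> ('trimmed', 0, 6000)
```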

After trimming, the reads would usually be subjected to another round of pairwise alignment. Our experiments showed that <10% of corrected reads were trimmed; therefore, it was unnecessary to re-align the >90% of reads that remained complete against each other. Thus, we stored complete reads and trimmed reads separately after trimming, and pairwise alignments were performed only between complete reads and trimmed reads, and between trimmed reads. The results of these pairwise alignments, together with the complete reads, the trimmed reads, and the results of the original pairwise alignments between complete reads, were fed into the assembly module.
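The scheduling logic amounts to re-aligning only those pairs that involve at least one trimmed read, as in this illustrative sketch:

```python
# Illustrative sketch: after trimming, only complete-vs-trimmed and
# trimmed-vs-trimmed pairs need new pairwise alignments; complete-vs-complete
# alignments from the first round are reused.

from itertools import combinations, product

def realignment_pairs(complete_reads, trimmed_reads):
    """Yield the read pairs that must be re-aligned after trimming."""
    yield from product(complete_reads, trimmed_reads)   # complete vs trimmed
    yield from combinations(trimmed_reads, 2)           # trimmed vs trimmed

print(list(realignment_pairs(["r1", "r2", "r3"], ["r4", "r5"])))
```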

De novo assembly of nanopore reads

Although the long length of nanopore reads helps improve genome assembly, their relatively high error rate renders genome assembly difficult. Here, we developed an assembly tool that is particularly suited to nanopore reads because it can overcome their high error rate. The assembly module in NECAT consists of three steps: filtering of low-quality read overlaps, contig assembly, and contig bridging. We used multiple quality-control measures to filter out low-quality overlaps between nanopore reads. Then, we constructed a directed string graph and resolved the graph to generate contigs. Finally, we bridged the contigs using the original reads to generate the final scaffolds.

Filtering of low-quality read overlaps

Low-quality overlaps complicate assembly and introduce errors into assembly results. In NECAT, we used multiple thresholds to control the identity, overhang, and coverage of overlaps and thereby filter out low-quality overlaps. For each read, we determined the coverage of each base according to its overlaps. We then calculated the minimum base coverage (cmin), the maximum base coverage (cmax), and the difference between the maximum and minimum coverage (cdiff). If cmin was less than a predefined threshold, min_coverage, or cmax was larger than a predefined threshold, max_coverage, or cdiff was larger than a predefined threshold, max_diff_coverage, the read and its overlaps were removed. Details on the coverage threshold settings are provided in Supplementary Note 9. Because of the broad error distribution among reads, we used both a global and a local threshold, instead of a single global threshold, for quality control of overlap identity and overhang. For a high-quality read, the average quality of its overlaps is higher than the global average; we therefore set a local threshold to filter out its relatively low-quality overlaps. For a low-quality read, the average quality of its overlaps is lower than the global average; we then used the global threshold to filter out its low-quality overlaps. This strategy allowed us to filter out relatively low-quality overlaps for each read while maintaining the overall quality of all the overlaps. Details on setting the global and local thresholds for overlap identity and overhang are provided in Supplementary Note 9.
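The coverage-based part of this filter is sketched below; the threshold values are placeholders, not NECAT's defaults (the actual settings are described in Supplementary Note 9).

```python
# Illustrative sketch: per-read coverage statistics (cmin, cmax, cdiff) derived
# from the read's overlaps, compared against min_coverage, max_coverage, and
# max_diff_coverage thresholds.

def coverage_filter(read_len, overlaps, min_coverage=2,
                    max_coverage=200, max_diff_coverage=100):
    """overlaps: list of (start, end) intervals of this read's overlaps."""
    coverage = [0] * read_len
    for start, end in overlaps:
        for i in range(start, end):
            coverage[i] += 1
    cmin, cmax = min(coverage), max(coverage)
    cdiff = cmax - cmin
    keep = (cmin >= min_coverage and cmax <= max_coverage
            and cdiff <= max_diff_coverage)
    return keep, (cmin, cmax, cdiff)

keep, stats = coverage_filter(1000, [(0, 600), (400, 1000), (0, 1000)])
print(keep, stats)   # -> True (2, 3, 1) with these placeholder thresholds
```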

Contig assembly

Next, we constructed a directed string graph and removed transitive edges using Myers' algorithm33. We marked the best out-edge and the best in-edge of each node based on the overlap lengths of the edges; edges that were not marked as a best out-edge or a best in-edge were removed34. We also removed ambiguous edges (tips, bubbles, and spurious links) from the graph. We then identified linear paths in the graph and generated contigs. When there was a branch, we broke the path to generate multiple contigs, which reduces the possibility of misassembly.
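The best-edge marking can be sketched as follows (illustrative only; edge weights here are overlap lengths, and the transitive-reduction and ambiguous-edge removal steps are omitted).

```python
# Illustrative sketch: keep only edges that are the best out-edge of their
# source or the best in-edge of their destination, ranked by overlap length.

def keep_best_edges(edges):
    """edges: list of (src, dst, overlap_len) in a directed string graph."""
    best_out, best_in = {}, {}
    for src, dst, olen in edges:
        if src not in best_out or olen > best_out[src][2]:
            best_out[src] = (src, dst, olen)
        if dst not in best_in or olen > best_in[dst][2]:
            best_in[dst] = (src, dst, olen)
    marked = set(best_out.values()) | set(best_in.values())
    return [e for e in edges if e in marked]

edges = [("a", "b", 5000), ("a", "c", 1200), ("d", "c", 4000), ("d", "b", 900)]
print(keep_best_edges(edges))
# -> [('a', 'b', 5000), ('d', 'c', 4000)]; ('a','c') and ('d','b') are removed
```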

Contig bridging

During error correction, long reads with high-error subsequences are cut into multiple shorter reads, which eventually leads to discontinuity in the contigs. It is possible to relink such contigs using long raw reads23,35. First, we aligned the long raw reads to the contigs. Two contigs may share a low-quality overlap that was filtered out before the construction of the string graph. A raw read could either fill the gap between two contigs (a gap read) or span the overlapping region of two contigs (an overlapped read). For each raw read, we recorded the gap or overlap length between its mapped positions on the ends of the two contigs. For each pair of contigs, the raw reads connecting them were grouped by whether they connected the contigs in the same orientation or in different orientations. Within each orientation group, we clustered the raw reads based on their gap/overlap lengths: if the difference between the gap/overlap lengths of two raw reads was less than a threshold (default 1000 bp), we assigned them to the same cluster. We also assigned a score to each raw read, defined as the sum, over its overlaps with the two contigs, of the product of overlap identity and overlap length. The read cluster with the largest sum of scores was chosen as the link for the contig pair.
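The link selection for one contig pair and one orientation group can be sketched as follows (illustrative data layout; `gap_len` holds the gap or overlap length between the read's mapped positions on the two contigs).

```python
# Illustrative sketch: cluster bridging reads of one orientation group by their
# gap/overlap length, score each read by sum(identity * overlap_length) over
# its overlaps with the two contigs, and pick the highest-scoring cluster.

def choose_link(bridging_reads, length_tolerance=1000):
    """bridging_reads: dicts with 'gap_len' and 'overlaps' = [(identity, length), ...]."""
    reads = sorted(bridging_reads, key=lambda r: r["gap_len"])
    clusters = []
    for read in reads:
        read["score"] = sum(ident * length for ident, length in read["overlaps"])
        if clusters and read["gap_len"] - clusters[-1][-1]["gap_len"] <= length_tolerance:
            clusters[-1].append(read)
        else:
            clusters.append([read])
    # the cluster with the largest total score becomes the link; one of its
    # reads can later be used to fill the gap between the two contigs
    return max(clusters, key=lambda c: sum(r["score"] for r in c))

reads = [
    {"gap_len": 1200, "overlaps": [(0.96, 8000), (0.95, 6000)]},
    {"gap_len": 1700, "overlaps": [(0.94, 7000), (0.96, 5000)]},
    {"gap_len": 9000, "overlaps": [(0.90, 3000)]},
]
best = choose_link(reads)
print(len(best), round(sum(r["score"] for r in best)))   # -> 2 24760
```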

After identifying links between contig pairs, we created a string graph in which contigs were nodes, and links between the contigs were edges. The weight of each edge was set to the link score. We simplified the graph again by removing transitive edges. Then, we traversed the graph and identified linear paths as the final contigs. A raw read from the link was selected to fill the gap between contigs.

Error distribution analysis

We analyzed the error distribution in the nanopore data sets of E. coli, S. cerevisiae, A. thaliana, D. melanogaster, C. reinhardtii, O. sativa, and S. pennellii. Our results indicated that the sequencing error rate of nanopore reads is high (10–20%), which helped us refine the algorithms in NECAT and provided insights into why existing correction algorithms are not suitable for the correction of nanopore reads. Details are provided in Supplementary Note 5.

Evaluation

We compared our error correction tool with those provided in Canu. We also systematically evaluated the assembly tools provided in NECAT by comparing them with those of Canu, Canu + S, and Flye. Details of these comparisons are reported in Supplementary Notes 6, 7, and 10.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.