Scalable long read self-correction and assembly polishing with multiple sequence alignment

Third-generation sequencing technologies make it possible to sequence long reads of tens of kbp, which are expected to solve various assembly problems. However, they display high error rates, currently around 10%. Self-correction is thus regularly used in long-read analysis projects. We introduce CONSENT, a new self-correction method that relies both on multiple sequence alignment and local de Bruijn graphs. To ensure scalability, the multiple sequence alignment computation benefits from a new and efficient segmentation strategy, allowing a massive speedup. CONSENT compares well to the state-of-the-art, and performs better on real Oxford Nanopore data. Specifically, CONSENT is the only method that efficiently scales to ultra-long reads, and allows a full human dataset, containing reads reaching up to 1.5 Mbp, to be processed in 10 days. Moreover, our experiments show that error correction with CONSENT improves the quality of Flye assemblies. Additionally, CONSENT implements a polishing feature, allowing raw assemblies to be corrected. Our experiments show that CONSENT is 2-38x faster than other polishing tools, while providing comparable results. Furthermore, we show that, on a human dataset, assembling the raw data and polishing the assembly consumes fewer resources than correcting and then assembling the reads, while providing better results. CONSENT is available at https://github.com/morispi/CONSENT.

Comparison of the results produced by CONSENT, with and without our segmentation strategy, as reported by ELECTOR. Using the segmentation strategy allows a 47x speedup, while producing a slightly higher quality correction.
Longest anchors chain

Figure S1: Computation of the longest anchors chain for a set of sequences. Here, we compute this chain with the second property set to T = 3, to simplify the figure. The longest chain computed here is thus S = (red, green, blue, orange, pink). If we only considered the first four sequences, the chain S' = (red, green, blue, yellow, orange, pink), which is longer than S, could be selected. Indeed, although R3 does not contain the yellow anchor, S' follows the two properties in sequences R1, R2, and R4. However, in R5, the yellow anchor appears before the blue anchor, and thus invalidates the first property. S' thus cannot be selected as the longest chain for the set of sequences R1-R5, and S is chosen instead. Moreover, R5 does not contain the whole set of anchors that appear in the longest anchors chain S (the orange anchor is missing). As a result, R5 will be filtered out during the MSA and consensus computation step.

Figure S2: Segmentation strategy for the consensus computation of a window. The anchors defining the longest anchors chain are used to segment the MSA and consensus computation. Local, independent consensuses are thus computed on the subsequences delimited by these anchors. These local consensuses, along with the anchors, are then concatenated in order to rebuild the global consensus, at the scale of the original sequences.

On the 30x datasets, LoRMA performed worse than all the other methods, displaying 4x to 11x lower numbers of bases. This can be explained by the fact that it requires deep long-read coverage, as it is the only method purely relying on a k-mer strategy, thus fully avoiding overlap computation. Indeed, LoRMA's number of bases was much higher, and comparable to that of the other methods, on the 60x datasets.
However, it produced corrected reads displaying extremely short average lengths, barely reaching more than 800 bp on the 60x coverage datasets, and reaching less than 300 bp on the 30x coverage datasets. As for memory consumption, LoRMA consumed 32 GB, the maximum available amount of memory, on all the datasets.

Table S5: Statistics of the real long reads, before and after correction with LoRMA. Runtime and memory consumption are reported for the whole correction pipeline.
In terms of alignment identity, LoRMA outperformed all the other tools. However, the N50 of the reads was extremely short, and did not even reach 700 bp, which is consistent with the short read lengths observed in Supplementary Table S3. This confirms that LoRMA tends to aggressively split the reads, and only manages to correct small portions of them. This small N50 thus explains the high alignment identities: LoRMA tends not to correct complex regions of the reads. As a result, LoRMA also displayed the lowest genome coverage among all the tools, especially on the H. sapiens dataset, where the genome coverage was less than 30%. Assembly results for LoRMA-corrected reads are thus not presented, since Miniasm could not assemble these reads.

Canu
We used Canu v2.0. The following command lines were used.
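For reference, a typical Canu v2.0 correction invocation is sketched below; the prefix, output directory, genome size, and read file are illustrative placeholders, not the exact parameters used in our experiments.

```shell
# Illustrative Canu v2.0 self-correction run; all values are placeholders.
canu -correct \
  -p corrected -d canu_out \
  genomeSize=4.6m \
  -pacbio raw_reads.fastq    # use -nanopore instead for ONT reads
```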

Daccord
Daccord requires running a series of scripts and other programs. We provide the versions of each subprogram and the complete suite of command lines below. ConverToPacBio_q2a.py, fasta2DB, and DBsplit are provided within the DAZZ_DB suite.

Command lines
The following command lines were used for both PacBio and ONT data.
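As an illustration, a minimal daccord pipeline over the DAZZ_DB and daligner tools might look like the following sketch; block sizes, database names, and file names are placeholders, not the parameters used in our experiments.

```shell
# Illustrative daccord pipeline sketch; all names and sizes are placeholders.
fasta2DB reads reads.fasta                      # build a Dazzler database from the reads
DBsplit -s200 reads                             # split the database into blocks
daligner reads reads                            # all-vs-all overlapping, produces .las files
daccord reads.las reads.db > corrected.fasta    # consensus-based self-correction
```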

FLAS
We used FLAS commit 053c19b. The following command line was used for both PacBio and ONT data.

LoRMA
We used LoRMA v0.4. The following command line was used for both PacBio and ONT data.
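For reference, LoRMA is invoked through its lorma.sh wrapper script; a typical run is sketched below, with the thread count and read file as illustrative placeholders.

```shell
# Illustrative LoRMA v0.4 run; thread count and file name are placeholders.
lorma.sh -threads 16 reads.fasta
```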

MECAT
We used MECAT commit d04bfa8. The following command lines were used.
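As an illustration, MECAT self-correction proceeds in two steps, pairwise overlapping followed by consensus correction; the sketch below uses placeholder file names and thread counts, not the exact parameters of our experiments.

```shell
# Illustrative MECAT self-correction; all values are placeholders.
mecat2pw -j 0 -d reads.fasta -o candidates.txt -w wrk_dir -t 16   # pairwise overlapping (-j 0)
mecat2cns -i 0 -t 16 candidates.txt reads.fasta corrected.fasta   # consensus correction
```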

Miniasm
We used Miniasm v0.3-r179. The following command line was used for the assembly of both PacBio and ONT data.
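For reference, a typical Miniasm assembly first computes all-vs-all overlaps with minimap2, then performs the layout step; file names and thread counts below are illustrative placeholders.

```shell
# Illustrative Miniasm assembly; use -x ava-ont instead of -x ava-pb for ONT reads.
minimap2 -x ava-pb -t 16 reads.fastq reads.fastq > overlaps.paf   # all-vs-all overlaps
miniasm -f reads.fastq overlaps.paf > assembly.gfa                # layout step
# Extract contig sequences from the GFA if a FASTA is needed:
awk '/^S/ {print ">"$2"\n"$3}' assembly.gfa > assembly.fasta
```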