DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies

The highly anticipated transition from next generation sequencing (NGS) to third generation sequencing (3GS) has been difficult primarily due to high error rates and excessive sequencing cost. The high error rates make the assembly of long erroneous reads of large genomes challenging because existing software solutions are often overwhelmed by error correction tasks. Here we report a hybrid assembly approach that simultaneously utilizes NGS and 3GS data to address both issues. We gain advantages from three general and basic design principles: (i) Compact representation of the long reads leads to efficient alignments. (ii) Base-level errors can be skipped; structural errors need to be detected and corrected. (iii) Structurally correct 3GS reads are assembled and polished. In our implementation, preassembled NGS contigs are used to derive the compact representation of the long reads, motivating an algorithmic conversion from a de Bruijn graph to an overlap graph, the two major assembly paradigms. Moreover, since NGS and 3GS data can compensate for each other, our hybrid assembly approach reduces both of their sequencing requirements. Experiments show that our software is able to assemble mammalian-sized genomes orders of magnitude more quickly than existing methods without consuming a lot of memory, while saving about half of the sequencing cost.

[Optional] Preparations: We have provided code to help you select a subset of the reads: https://github.com/yechengxi/AssemblyUtility The utility functions can be compiled in the same way as the main programs. After compilation, you can use the following command to select a subset of reads from fasta/fastq files. Note that longest 0 is used here, if you set it to 1 it will select the longest reads. And you can use the following command to evaluate an assembly.

./AssemblyStatistics contigs YourAssembly.fasta
The program will generate two txt files containing essential statistics about your assembly.
Step1. Use a DBG-assembler to construct short but accurate contigs. Please make sure they are the raw DBG contigs without using repeat resolving techniques such as gap closing or scaffolding. Otherwise you may have poor final results due to the errors introduced by the heuristics used in short read assembly pipelines.
./SparseAssembler LD 0 k 51 g 15 NodeCovTh 1 EdgeCovTh 0 GS 12000000 In this test run, the N50 is 29 kbp. As we have selected the beginning part of the sequencing file, which usually is of lower quality, the next step may help to improve the assembly quality. [Miscellaneous] For other more complex genomes or a different coverage, the first run may not generate the best result. The previous computations can be loaded and two parameters can be fine-tuned to construct a cleaner de Bruijn/ k-mer graph: ./SparseAssembler LD 1 NodeCovTh 2 EdgeCovTh 1 k 51 g 15 GS 12000000 The N50 is improved to 32kbp in my run.
The output Contigs.txt will be used by DBG2OLC.
Step2. Overlap and layout. Feed DBG2OLC with the contig file in fasta format from the previous step (Contigs.txt in this example). In our test run, the N50 is 583kbp.
There are three major parameters that affect the assembly quality: M = matched k-mers between a contig and a long read.
AdaptiveTh: adaptive k-mer matching threshold. If M < AdaptiveTh* Contig_Length, this contig cannot be used as an anchor to the long read.
KmerCovTh: fixed k-mer matching threshold. If M < KmerCovTh, this contig cannot be used as an anchor to the long read.
MinOverlap: minimum overlap score between a pair of long reads.
For each pair of long reads, an overlap score is calculated by aligning the compressed reads and score with the matching k-mers. [Miscellaneous] At this point, the parameters may be fine-tuned to get better performance. As with SparseAssembler, LD 1 can be used to load the compressed reads/anchored reads. Contigs: the fasta contigs file from existing assembly.
RemoveChimera: remove chimeric reads in the dataset, suggest 1 if you have >10x coverage.
For high coverage data (100x), there are two other parameters: ChimeraTh: default: 1, set to 2 if coverage is ~100x.
These two are used in multiple alignment to remove problematic reads and false contig anchors. When we have high coverage, some more stringent conditions shall be applied as with the suggested parameters.
Step 3. Call consensus. Install Blasr and the consensus module (Sparc/PBdagcon). Make sure they are in your path variable.

Illumina-only Assembly
With the rapid advancement of sequencing technology, the length of the accurate NGS reads has also become increasingly longer. As a prototype, we demonstrate that the approach can be extended to existing Illumina data. Existing DBG based assembly algorithms resorted to stretches of perfectly matching k-mers and are not robust to sequencing errors. Algorithms required growing k-mer sizes to find increasingly perfect overlaps and extensive iterations to exploit the long read information. A costly error correction module was critical in finding these perfect overlaps. In contrast, our work can quickly and reliably find the best read overlaps without per-base level correction. Since our approach of utilizing the long read information is certainly not restricted to low quality ones, we compared the performance of our assembler on longer Illumina reads with several popular assemblers, including SGA 1 , SOAPdenovo2 2 , SPAdes3 3 .
Interestingly, for the relatively longer short NGS reads generated from the latest NGS technology, traditional de Bruijn graph with a fixed k-mer size have already exposed its shortage and produces a nonoptimal assembly: smaller k-mer fails to resolve repetitive regions, while using a large k requires high quality and high coverage data. This pair of contradictory requirements makes it hardly possible to obtain the optimal assembly with limited computational resources. Iterative de Bruijn graph may partially deal with the problem at the expense of more computational time, as well as more intricate algorithm design and implementations. For example, an error correction procedure (with an exponential-complexity graph search) would be necessary to produce long and correct k-mers. SGA utilized the FM-index 4 to find exact matches, which also poses restriction on the quality of the data. In contrast, our algorithm is robust and poses loose restriction on the quality of sequence reads, and hence can find overlaps efficiently and correctly. A dataset of 50X 150bp Illumina Miseq reads of E. coli K-12 MG1655 (Accession no: SRA073308) is used here as a test case. SparseAssembler 5 with k = 31 is called to assemble the initial contigs and it reaches an N50 of 13.5 kbp. The compressed reads were calculated using the contigs and raw Illumina reads. Our overlap graph construction took only around 1 second on this dataset while the compressing is taking most of the computational time, which is roughly two minutes. The total computational time of SparseAssembler and DBG2OLC is 5 minutes. DBG2OLC exhibits excellent adaptability and overall performance compared with leading assemblers for the new types of the NGS data. There are four critical parameters: k: k-mer length (max size: 31). Figure S13.The alignment of DBG2OLC assembly of the E. coli genome (Y-axis) to the reference genome (X-axis). 30x Oxford Nanopore reads and 50x Illumina reads are used. Celera assembler specification files can be found in a different attachment.