Introduction

A principal component of computational genomics is sequence assembly, constructing the original genome sequence by combining reads, or fragments thereof, obtained from sequencing machines. De novo assembly is a sequence assembly technique that does not use a reference genome during reconstruction, and hence can facilitate the biological understanding of new or uncharacterized species. The process of mapping the sequenced reads for de novo assembly is complicated because of the lack of a reference sequence to which the sequenced reads can be aligned. Factors such as massive read sets, distinct error profiles introduced by sequencing machines, repeats in the original genome, and uneven sampling of the reference genome make this process computationally intensive. These exacting factors are exacerbated in metagenomics and single-cell sequencing datasets in tandem with the explosive growth in genomics data—240 exabytes by 2025, just by taking into account human genomes1. This “genomical” data race has created an urgent need to speed up de novo assembly algorithms.

Popular assemblers such as Velvet2, ABySS3, and ALLPATHS-LG4 use a de Bruijn Graph (DBG)5 to perform de novo assembly. A DBG is a directed graph whose vertices are k-mers, or length-k substrings of the reads. An edge exists between two vertices if they are consecutive k-mers in a read and they share an overlap of a (k − 1)-mer6. To obtain contigs, or long contiguous genomic sequences, the DBG is traversed to identify maximal paths i.e., paths in which all vertices have an in-degree and out-degree equal to 1, except for the terminal vertices. These contigs are further assembled into longer regions or scaffolds based on their relative order and orientation.

The k-value chosen for DBG construction influences its structure. A small k-value cannot distinguish repeats or duplication due to erroneous reads (of length greater than k), and connects k-mers with other (false-positive) k-mers. This results in a branched DBG, with vertices having out-degree higher than 1, and terminating maximal paths, resulting in smaller-sized contigs. A large k-value, on the other hand, can differentiate among smaller repeats (of length less than k), and hence, reduces the number of branches. However, due to low or non-uniform sampling, some k-mers that introduce vertices and edges in the DBG are missed, resulting in a fragmented DBG with dead-end paths. This occurs if reads covering consecutive k-mers are missing, with increasing k-values exacerbating the fragmentation problem. Thus, selecting the correct value for the k parameter in DBG algorithms is crucial. Striking the correct balance between the branching and fragmentation problems is key to high assembly performance.

Based on this insight, several assemblers such as IDBA (Iterative DBG Assembler)6, IDBA-UD7, SOAPdenovo28 and SPAdes9 use several different k-values during assembly. Intuitively, contigs from a smaller k-valued graph can be used to “patch up” gaps in the larger k-valued graph, while contigs from the larger k-valued graph can be used to resolve “branches” or conflicts in the smaller k-valued graph. SPAdes follows an iterative graph construction process using multiple k-values to construct a multi-sized DBG. IDBA iterates from small to large values of k, maintaining an accumulated DBG to carry useful information forward as it moves on to higher k-values. These iterative approaches establish that DBGs built with multiple k-values generate finer-quality assemblies than a DBG built with a single (be it large or small) k-value6,7,8,9. IDBA-UD is an improved version of IDBA, and in the rest of the paper, we only refer to IDBA-UD for our experiments.

Motivation for ScalaDBG

While leveraging multiple k-values during the assembly improves its quality, the time taken to perform the assembly process also increases significantly. As described in Table 1 of our previous work10, the total time taken by IDBA-UD to assemble a medium-complexity CAMI metagenomics dataset increases linearly in proportion to the number of k-values used. Furthermore, among the different stages in IDBA-UD assembly, reading the sequence file (Stage 1), processing with multiple k-values (first, building the graph, and then iterating over the graph with several different k-values (Stage 2)), and finally scaffolding (Stage 3) to get the final assembly, the iterative graph construction process contributes to 96.1% of the total execution time, as shown in Fig. 1 of our previous work10. Notably, the graph-construction step (Stage 2), consisting of building an accumulated DBG by iterating over several different k-values, is the bottleneck in the assembly workflow.

Table 1 Read Sets used in the Experiments.
Figure 1
figure 1

Original Genome Sequence: AATGCCGTACGTACGAA, Input Reads: AATGC, ATGCC, GCCGT, TGCCG, CGTAC, TACGT, ACGTA, TACGA, ACGAA. Figure shows the effect of using a small k (\(k=3\)) and a larger k (\(k=4\)) during DBG construction. Sub-figure (a) shows the graph constructed from read set with \(k=3\). The vertices are 3-mers of the read set. They are connected to each other if they have a 2-mer overlap and if they are consecutive 3-mers in a read. This graph has branching at vertex ACG due to a repeating region in the genome ACGT and ACGA. The contig set generated by identifying maximal paths in the graph is {AATGCCGT, ACGAA, ACGT, CGTACG}. As the value of k is increased to 4 (sub-figure (b)), the branch disappears as the higher k-value can now distinguish between the repeat region in ACGT and ACGA. However, some reads such as CCGTA and GTACG are not sampled from a contiguous genome sequence and so vertices and edges in the graph are missed. For e.g., GTAC and TACG cannot be connected. While they do share a 3-mer overlap, they are not part of the same read. If the read GTACG would have been part of the read set, we could have connected them. The final graph (sub-figure (c)) can be created by filling in some of the gaps in the \(k=4\) graph with contigs from the \(k=3\) graph. The vertices for which new edge is added (sub-figure (a)) are circled. The final contig set corresponds to contigs in this graph.

Our ScalaDBG System

To address this concern, we propose ScalaDBG, a new parallel assembly algorithm that parallelizes Stage 2 of the assembly workflow, the iterative DBG construction process with multiple k-values. The key insight behind ScalaDBG is that the graphs for multiple k-values need not be constructed serially. Instead, each graph construction can be done independently and in parallel. Accumulating the graph for the higher k-value, such as in IDBA-UD, introduces an apparent dependency on the graph with lower k-values. We remove this dependency, by introducing a patching technique, which can patch the higher k-valued graph (k2) with contigs from the lower k-valued graph (k1). Crucially, the first stage of graph construction of the k1 and k2 graphs can proceed in parallel and the relatively shorter stage of patching the k2 graph with the contigs from the smaller k-valued, k1 graph, happens subsequently. Thus, the more fragmented, higher k-valued graphs are cemented from the contigs of the lower k-valued graphs, with the branches of the lower k-valued graph being simultaneously removed.

ScalaDBG first performs graph construction in parallel for each k-value. Next, for each pair of graphs, the higher k-valued graph is patched using the lower k-valued graph to generate a single graph. Note that there are several independent patch processes, for a long chain of k-values, and they execute in parallel. This process is performed recurrently, until there is only a single graph, similar to a parallel tree-reduction model. Thus ScalaDBG breaks a sequentially executed chain of k-values, executing both construction and patching in parallel, with higher parallelism for a longer chain of k-values. ScalaDBG is the first assembler to parallelize multi-k-valued DBG construction. We show in our evaluation that there is no statistical difference in the assembly quality of ScalaDBG and IDBA-UD.

Removing the Assembly Kernel’s Time-Complexity Bottleneck in Genome Analyses Pipelines Using ScalaDBG

Genome analysis pipelines have several repetitive “kernels” or algorithmic modules as we have seen in the process of building our own domain-specific language (DSL) for genomics11. In a typical genome analysis pipeline, generation of genomics reads from sequencing machines is followed by the kernel(s) of genome assembly, and subsequently, analysis of the assembled sequences using a repertoire of metrics. Tremendous improvements in sequencing machines have made the genome assembly kernel the bottleneck in the process of extracting meaningful insights from raw genomics datasets. A high-latency genome assembly kernel negatively affects the subsequent analysis kernels and overall pipeline performance, especially in high-use genomics pipelines, such as MG-RAST12,13, reducing the throughput of production-scale high-performance computing pipelines14,15. ScalaDBG can be deployed on modest hardware e.g., multiple nodes in a supercomputer or commercially available cloud infrastructure (e.g., AWS Amazon cloud, Google cloud) to speed up assembly for a much lower cost. In contrast, the scalability of IDBA-UD is severely limited by the memory and compute capacity of server nodes, which are expensive to upgrade.

Multi-Grained Parallelism of ScalaDBG

ScalaDBG leverages parallelism at two levels—first: constructing several different k-valued graphs in parallel, and second: parallelizing processes such as k-mer counting, indexing, and lookup that occur within each graph-construction step. This strategy of ScalaDBG enables us to leverage the hybrid MPI-Open MP parallel programming model. While each MPI process can independently perform graph construction on different nodes in a cluster, Open-MP threads can exploit all cores on a single node. Thus, ScalaDBG utilizes both vertical scaling or scaleup in a powerful server node with a higher core-count, and horizontal scaling or scale-out in a cluster with multiple nodes. ScalaDBG completes the assembly of a SAR metagenomic dataset with a set of k-values in the range of 20–50, with a step size of 2, 6.8X faster than IDBA-UD, reducing the execution time from ~2 hours to 17 minutes.

We make the following technical contributions in this paper:

  1. 1.

    We break the dependency in DBG creation for multiple k-values—from a purely serial process to one where the most time-consuming part (the DBG creation for individual k-values) is parallelized. This innovation can be applied out-of-the-box to most DBG-based assemblers.

  2. 2.

    We develop a divide-and-conquer strategy for handling a long sequential chain of k-values, which improves the quality of assembly

  3. 3.

    We develop a software package called ScalaDBG that uses Open MP for the scale-up operation within one server and MPI for the scale-out process across multiple servers. The software package is available at https://github.com/purdue-dcsl/Scaladbg.

Related Work

Several effective de-novo assembly applications have been put forward to deal with the deluge in genomic sequences2,3,4,6,7,8,9,16,17,18,19,20,21. However, these applications are restricted to scaling up on a multi-core machine, or do not use several k-values during assembly. To the best of our knowledge, there has been no prior work on distributed and parallelized DBG construction for multiple k-values.

In previous work, Ray16, ABySS3, PASHA17, and HipMer18 can distribute the task of DBG construction to different nodes in a cluster. Metagenomics assemblers, such as Meta-velvet19 also do not apply multiple k-values. However, this approach performs poorly for datasets with uneven sequencing depths, such as in metagenomics and single-cell datasets. ScalaDBG employs multiple k-values to deal with such datasets. On the other hand, SGA21, Velvet2, SOAPdenovo8, ALLPATHS-LG4 are limited to scaling up on a multi-core node. Additionally, while IDBA, IDBA-UD, and SPAdes can operate on several k-values, their scaling is restricted to multiple cores on a single node. In contrast, ScalaDBG is a distributed and parallelized assembler operating on multiple k-values.

IDBA-UD as our algorithm for benchmarking

IDBA-UD is a an iterative k-value DBG-based assembler that runs through a range of k values from \(k={k}_{min}\) to \(k={k}_{max}\), with a step-wise increment of s. It maintains an accumulated DBG Hk at each step. In the first step, a DBG \({G}_{{k}_{min}}\) is generated from the input reads. For \(k={k}_{min}\), Hk is equivalent to \({G}_{{k}_{min}}\). After DBG construction, contigs for graph Hk are generated by considering all maximal paths in graph Hk. All vertices in any maximal path have an in-degree and out-degree equal to 1, except the vertices at the start and end of the path. Subsequently, reads from the input set that are substrings of these contigs are removed. This generally reduces the size of the input read set. Note that, a read of length r generates \(r-k+1\) vertices. Thus, as k is increased, each read introduces fewer vertices. This reduction in size of the input read set, coupled with the fact that there are fewer vertices for larger k-values, makes subsequent graph-construction steps less time consuming. For the next iteration, where \(k={k}_{min}+s\), the graph Hk, the remaining reads, and the contigs from Hk are fed as inputs. Every s-length path in Hk is upgraded to a vertex. A \((k+s+1)\)-mer in either the remaining reads or the contigs of Hk is used to connect vertices in Hk. The next set of iterations continue this process until \(k={k}_{max}\) is reached. Observe that in this algorithm, at iteration i, graph \({H}_{{k}_{min}+i\ast s}\) depends on graph \({H}_{{k}_{min}+(i-1)\ast s}\), the reduced read set, and the contigs obtained at the previous iteration (i − 1). This dependency compels IDBA-UD to work sequentially on the chain of k-values, irrespective of the length of the chain. This is the essence of the problem that we tackle in ScalaDBG.

Methods - Design of ScalaDBG

Our technique, ScalaDBG, consists of two phases—the build phase (building a DBG) and the patch phase (patching a partial DBG with contigs from a lower k-valued DBG). In addition, it employs an efficient scheduler to fully exploit all nodes in a cluster.

Build phase

We unravel ScalaDBG, by describing our protocol for only two distinct k-values first, k1 and k2, and without loss of generality, we assume \(k1 < k2\). Figure 2 shows the workflow of ScalaDBG in this scenario.

Figure 2
figure 2

High Level Architecture Diagram of ScalaDBG. This shows the graph construction with only two different k values, k1 and k2 with \(k1 < k2\). The graph Gk2 is “patched” with contigs from Gk1 to generate the combined graph Gk1–k2, which gives the final set of contigs. Different modules in ScalaDBG are highlighted by different colors.

First, DBGs Gk1 and Gk2 are built for each k-value in parallel. k1-mers and k2-mers of read set I correspond to vertices in Gk1 and Gk2 respectively. |Gk1| and |Gk2| represent the number of vertices in Gk1 and Gk2 respectively. Since \(k1 < k2\), Graph Gk1 will typically be larger in size, in terms of vertices and edges, i.e., \(|{G}_{k1}| > |{G}_{k2}|\). DBG construction consumes the maximum amount of time in the entire workflow, with the construction of Gk1 being the dominant part. Notably, ScalaDBG creates DBGs for the two different k-values in parallel, unlike in all prior approaches. Observe that branching in Gk1 will be greater than Gk2, while the number of gaps or holes will be higher in Gk2, relative to Gk1. ScalaDBG produces contigs Ck1 from Gk1 by detecting maximal paths similar to IDBA’s algorithm, but does not yet create Ck2 from Gk2.

Patch phase

Graph Gk2 has gaps relative to graph Gk1, ans hence our fundamental strategy is to patch graph Gk2, i.e., fill the gaps in Gk2 using the contigs Ck1, since they would have the required information. ScalaDBG generates k2-mers from Ck1, and inserts them as vertices back into Gk2. These freshly added vertices could lead to additional edges, if there exists an overlap of a (k2 − 1)-mer in the contig set Ck1 or the read set. The resulting graph obtained by filling gaps of Gk2 using Ck1 is represented as Gk1−k2, and is used for the final contig set generation.

Example for ScalaDBG

Consider an example to illustrate the build and patch phases of ScalaDBG with k-values of 3 and 4. The graphs G3 and G4 are obtained by \({k}_{1}=3\), and \({k}_{2}=4\), and are shown in Fig. 1(a,b). In ScalaDBG, during the build phase, G3 and G4 are constructed in parallel. Contig set C3 obtained from graph G3 is {AATGCCGT, ACGAA, ACGT, CGTACG}. In the patch phase of ScalaDBG, graph G4 is patched with contigs in C3. 4-mers obtained from C3 are {AATG, ATGC, TGCC, GCCG, CCGT, ACGA, CGAA, ACGT, CGTA, GTAC, TACG}. Of these 4-mers, if two consecutive 4-mers are substrings of a sequence in the contig set, they are inserted as vertices in G4 and are connected by an edge. Vertices GTAC and TACG are connected by an edge in G4 to get G3–4. G3–4 is shown in Fig. 1(c). Final contig set C3–4 generated from G3–4 is \(\{TACGTACG,TACGAA,AATGCCGT\}\).

Patching multiple k-values in parallel

ScalaDBG provides two options in the patch phase when the total number of k-values used in the DBG creation is greater than 3. Figure 3 shows the serial patching process, and Fig. 4 shows the parallel patching process when there are four different k-values: k1, k2, k3, k4, and \(k1 < k2 < k3 < k4\). As explained earlier, individual DBGs corresponding to each of the k-values are constructed in parallel. In serial patching, contigs are first generated for the lowest k-valued (k = k1) graph Gk1. They are used to patch the graph corresponding to the next higher k-value (k2). In this manner, sequentially and repeatedly, contigs generated from a lower k-valued patched graph are used to patch the next higher k-valued graph. The ultimate set of contigs are obtained from the final patched graph. To summarize, in the serial variant, ScalaDBG performs graph building for individual k-values in parallel, but patching and contig generation processes are serial. While this simple approach is easier to implement, unfortunately it reduces the benefits of parallelism since the number of serially executed patch operations grow linearly with number of k-values. The total serial patch time is a significant portion of the total execution time of ScalaDBG for a long chain of k-values. To overcome this problem, ScalaDBG intelligently selects multiple patch operations to be executed in parallel. Rather than adopting an ad-hoc approach to patch graph-pairs, ScalaDBG uses a disciplined policy to patch pairs of adjacent graphs to generate a single graph, since average distance in k-values for each DBG is the lowest in this configuration. Prior work has shown that smaller jumps in k-values results in better quality aggregated DBGs6. Thus, multiple patch operations are performed in parallel akin to a reduction tree model. ScalaDBG’s parallel patching is demonstrated in Fig. 4, where graphs Gk2 patched from contigs Ck1 of graph Gk1, and Gk4 patched from contigs Ck3 of graph Gk3 are processed in parallel. In this manner, ScalaDBG accomplishes both graph building and patching in parallel, and reduces the growth of serialized patching steps to only a logarithmic factor of the number of different k-values. ScalaDBG achieves higher scope for parallelism when higher number of k-values are applied in graph construction. In the remaining paper, we denote ScalaDBG employing serial patching as ScalaDBG-SP (serial-patch ScalaDBG) and ScalaDBG performing parallel patching as ScalaDBG-PP (parallel-patch ScalaDBG). Since ScalaDBG-SP and ScalaDBG-PP merge different pairs of graphs together in the intermediate steps, the final contigs generated by the two methods may differ.

Figure 3
figure 3

Workflow of ScalaDBG using serial patching, called ScalaDBG-SP for 4 k-values, with \(k1 < k2 < \)\(k3 < k4\).

Figure 4
figure 4

Workflow of ScalaDBG using parallel patching, called ScalaDBG-PP for 4 k-values, with \(k1 < k2 < \)\(k3 < k4\).

More concretely, here is an example to show why the contigs from serial and parallel patch can vary. Say, we are using 8 k-values namely 10, 12, 14, 16, 18, 20, 22, 24, with a step size of 2. In ScalaDBG-SP, the graphs will be patched in order, in this manner: G10–12, followed by G10–14, G10–16G10–24. In other words, the difference between k-values associated with adjacent graphs, or step size remains 2, and the contigs generated by the previous graph have lengths close to the next graph. On the other hand, in ScalaDBG-PP, graphs created in stage 1 will be G10–12, G14–16, G18–20, and G22–24, step size is 2. In stage 2, graphs created will be G10–16, G18–24, step size is 4. In stage 3, graph created will be G10–24, step size is 8. Peng et al.6 show that a longer contig in the graph obtained with step size s, is not present in the graph obtained with step size s′, where s < s′. Since maximum step size in ScalaDBG-PP is higher than in ScalaDBG-SP, (8 > 2), some contigs in ScalaDBG-SP might be absent in ScalaDBG-PP, and the final contig quality in ScalaDBG-PP might drop. Intuitively, the reason that contigs of graphs with close k-values should be patched, is that contigs generated from previous k-values might not be long enough to connect any nodes in the next graph, when step is large. Say, we are using 4 k-values, k1, k2, k3, k4 where \(k1 < k2 < k3 < k4\) and the minimum distance between any two k-values is >=2. In the sequential version, we first build Gk1 and get contigs. Gk1 could have contigs of length k2 + 1, and they could be used to patch Gk2. However, they cannot be used to patch graph Gk3 since they would not be long enough to connect vertices in graph Gk3. In addition, bubble merging and dead-end removal phases prune incorrect vertices and edges in the graphs based on their multiplicity information which could be different in ScalaDBG-SP and ScalaDBG-PP. With deeper trees, the difference in quality of the PP and SP variants is possibly going to be larger and there is a tradeoff between compute efficiency and quality. Empirically we find that the difference is not statistically significant, as shown in Section 6.

Efficient scheduling of multiple k-values

ScalaDBG’scheduler adopts a greedy strategy to perform assembly for a given number of k-values and set of compute nodes, such that utilization of nodes in the cluster is maximized. We discuss the operation of the scheduler for the tree reduction pattern of ScalaDBG-PP (ScalaDBG-SP has a simpler form of a scheduler and is omitted here for space). The generated schedule consists of a set of rounds as shown in Fig. 5. Each round is the assignment of a task to a node, where a task means the node builds a graph or patches an existing graph with contigs from another graph, or the node is idle. The rounds continue until the final graph is obtained.

Figure 5
figure 5

Schedule created by the ScalaDBG Scheduler for 8 k-values and 4 nodes. Different computational nodes in the cluster execute different tasks in each round of the workflow.

The scheduler uses the following observation in creating the schedule: building graph for k1 will take longer than building graph for k2, where \(k1 < k2\) because \(|{G}_{k1}| > |{G}_{k2}|\). Hence the node processing k2 will get done earlier, and asynchronously send graph Gk2 to the node building graph Gk1, thus hiding the latency of the communication. This node is then free to take up the next task, either creating a new graph or patching an existing graph. The scheduler consistently overlaps computation with communication in this manner. Similarly, nodes patching higher k-valued graphs will finish their tasks earlier than nodes patching lower k-valued graphs. Hence these nodes will asynchronously send the patched graphs, and start building the graph for the next k-value. The tasks are assigned such that the number of idle nodes in any round is reduced. The amount of work done in each round is not necessarily the same across the nodes, e.g., in round 1, node 4 does less work than node 3, which does less work than node 2, which does less work that node 1. We use the following example to explain the ScalaDBG scheduler. In this example, there are four available nodes (n1, n2, n3, n4) and 8 input k-values, k1–k8, with \(k1 < k2 < \ldots < k7 < k8\). We explain the processing done at each node during each round with the representation shown in Fig. 5. In round 1, all nodes build new graphs from the k-values, k1–k4. Once done with the graph construction, node n2 sends graph Gk2 to node n1 and node n4 sends graph Gk4 to node n3. In round 2, node n1 does the patching to generate graph Gk1–k2 and similarly node n3 does a patching while nodes n2 and n4 build new graphs. This way the different rounds continue till the final graph \({G}_{k1-k8}\) is assembled in node n1. Note that in the later rounds, some of the nodes become idle as there are no more tasks to schedule. In this manner, arbitrary number of k-values are scheduled to run on a set of nodes by ScalaDBG’s scheduler.

Methods - Implementation of ScalaDBG

Algorithms 1 and 2 provides the pseduo code listing for ScalaDBG-SP and ScalaDBG-PP respectively.

n denotes the total number of k-mers used for assembly, mink denotes minimum k-mer size, maxk denotes the maximum k-mer size, and step represents the increment between each k-mer graph built. Each process computes the DBG corresponding to its designated kmer_size by reading the input read files. In ScalaDBG-SP (Algorithm 1), N MPI processes are used to construct the DBGs in parallel, and they then send them to a single process (the Master process) which receives the constructed DBGs. The Master process patches the graphs in sequential order (similar to Fig. 3), to create the final graph. In ScalaDBG-PP (Algorithm 2), after constructing the individual DBGs, the process are divided into two roles - either send a DBG or recieve a DBG. This role division is done in an equitable fashion. The receiving process then completes the patching of the DBG. This process is then used further for contig generation. For the subsequent stages, this process of role division and contig generation is repeated, until only one process receives the final graph. MPI implementation is done in such a way, that this process is always the Master process, and it applies the final patching and generates the ultimate contigs.

Discussion: Correctness, Implications, and Generality of ScalaDBG

In this Section, we first establish the equivalence of graphs obtained by the build and serial patching processes of ScalaDBG and the iterative build process of IDBA-UD. We then discuss the implications of the parallel patching and the graph-simplification procedures on the output of ScalaDBG.

Correctness of the ScalaDBG Methodology (essentially showing the equivalence between the final graphs of ScalaDBG-SP and iterative IDBA-UD)

For a fixed iteration set of k-values, starting from k = kmin to k = kmax, the final graphs obtained by ScalaDBG-SP and IDBA-UD are identical. We request the reader to refer to the proof of Theorem 4.1 in our previous work10. The proof uses Mathematical Induction to establish the equivalence between the final graphs using the two methods, namely: ScalaDBG-SP and iterative IDBA-UD.

Algorithm 1
figure a

ScalaDBG with serial patching. N MPI processes have ranks (identifiers) from 0 to N − 1. MPI process with rank 0 is referred to as the Master process.

Algorithm 2
figure b

ScalaDBG with parallel patching. N MPI processes have ranks (identifiers) from 0 to N − 1. MPI process with rank 0 is referred to as the Master process.

Implications of ScalaDBG’s methods

While the build and patch processes for iterative IDBA-UD and ScalaDBG are the same, the assembly metrics for the two variants of ScalaDBG and IDBA-UD are different. This can be attributed to two processes:

  • Out-of-order patching process for ScalaDBG –PP

  • Graph simplification processes in IDBA-UD, such as dead-end removal (to prune incorrect vertices and edges based on their multiplicity information and path length) and bubble merging (merging paths of similar length with the same start and end vertex). In IDBA-UD, the graph simplification process is applied to each graph during the build and after traversal, while in ScalaDBG, it is only applied after each patching.

ScalaDBG-PP builds graphs in a different order (than ScalaDBG-SP and IDBA-UD) when the number of k-values is greater than 3. IDBA-UD sequentially builds graphs where the k-value used in the next iteration differs from the k-value used in the previous iteration by step-size. ScalaDBG-PP performs a pair-wise reduction to patch graphs at every level. Lower-depths in the parallel reduction tree have graphs with k-value difference greater than step-size. Hence, the order of patching graphs is different in ScalaDBG-PP and IDBA-UD.

In addition, for the graphs built by ScalaDBG and IDBA-UD, vertices and edges have different multiplicity information. Hence, the graph-simplification procedures generate different contig sets with different assembly metrics for ScalaDBG-SP, ScalaDBG-PP and IDBA-UD. However, in our evaluation section, we will show that the difference is not statistically significant.

Generality of ScalaDBG’s methods

We demonstrate how we can leverage an out-of-the-box DBG-based assembler and parallelize DBG construction for a chain of k-values. ScalaDBG modularizes different stages in assembly such as building the graph, patching the graph with contig sets, and the generation of contigs. Hence, these modules can be reused to combine the contigs of a single k-value assembler, such as Velvet. We use an example to explain this application, as shown in Fig. 6. There are four different k-values, in increasing order: k1, k2, k3, and k4. Initially, a single k-value assembler is run for each of these k-values in parallel to generate the contig sets. The obtained contig sets are used to patch the graph associated with the contig set of the next higher k-value (k2). Graphs Gk2 and Gk4 are built from the contig sets Ck2 and Ck4, respectively, according to the definition of DBG, considering the contig set as the input read set. Then, the graphs Gk2 and Gk4 are patched using contig sets Ck1 and Ck3, respectively. After this stage, the method follows the standard parallel-patch workflow of ScalaDBG. Note that we can also employ serial patching to get the final contig set. (i.e., ScalaDBG-SP).

Figure 6
figure 6

General assembler used in conjunction with ScalaDBG’s technique. Contigs Ck1, Ck2, Ck3, and Ck4 are obtained using a general assembler.

Evaluation and Results

Evaluation setup and datasets

We used an Intel Xeon Infiniband cluster for our experiments, with each node having Intel Xeon E5-2670, 2.6 GHz, with 16 cores per node and 32 GB of memory and the nodes connected with QDR Infiniband. We used the latest version of IDBA-UD (1.1.1)7. The datasets are enlisted in Table 1. The S. aureus and SAR 324 single-cell datasets are obtained from22 and the CAMI benchmark datasets comprise our metagenomics datasets23. For ScalaDBG, the number of nodes were equal to the number of k-values while IDBA-UD can only run on a single node. Existing scaffolding techniques can be applied to output contigs that are obtained from ScalaDBG to get the final assembly. We only focus on the outputs at the contig generation stage rather than after the scaffolding process because ScalaDBG’s novelty in this implementation is restricted to the contig generation process.

Relevance of datasets

A single cell is the ultimate denomination in a multicellular organism. For example, the human body consists of roughly 37.2 trillion cells living in harmony. However, in cancer, this harmonious equilibrium is lost and this is where even one single cell can wreak havoc by evolving into a malignant tumor mass, wherein the lineages diverge and form distinct populations giving rise to what is known as clonal diversity. While in the past, technological limitations required micrograms of input tissue mass resulting in an average signal being emanated from a complex mass of heterogeneous cell types, single-cell sequencing (SCS) methods can now revolutionize the understanding of cancer biology, affording insights into the role of rare cells in the evolution of cancer. In the case of metagenomic and single cell sequencing datasets, sequencing depths of different regions of a genome, or genomes from different organisms are exceedingly uneven. Hence multiple k-values are required for accurately assembling the datasets. So we evaluate ScalaDBG and IDBA-UD using these relevant datasets.

Performance tests

Figures 7, 8, 9, 10, 11 and 12 compare the execution time of IDBA, ScalaDBG-SP, and ScalaDBG-PP to generate contigs for the read sets mentioned in Table 1. We generated different k-value configurations by varying the number, range and step sizes. For the metagenomics dataset, distinct step sizes of 28, 12, and 6 in the 40–124 range generated 3 different configurations. Similarly, step sizes of 14, 6, and 3 for the 29–71 range generated 3 different configurations for the single-cell sequencing datasets. The single-cell datasets have a narrower range because of the shorter length of the sequenced reads. To vary the number of k-values, we used step sizes of 10, 5, and 2 in the range of 20–50 for the SAR324 dataset to obtain 4, 7, and 16 k-values, respectively.

Figure 7
figure 7

Execution Time comparison for IDBA-UD, ScalaDBG-SP, ScalaDBG-PP on RM1 dataset. ScalaDBG runs on a cluster with the number of nodes being equal to the number of k-values.

Figure 8
figure 8

Execution Time comparison for IDBA-UD, ScalaDBG-SP, ScalaDBG-PP on RM2 dataset.

Figure 9
figure 9

Execution Time comparison for IDBA-UD, ScalaDBG-SP, ScalaDBG-PP on the SC-E. coli dataset. Speed up w.r.t. IDBA-UD running on the same k value configuration is shown.

Figure 10
figure 10

Execution Time comparison for IDBA-UD, ScalaDBG-SP, ScalaDBG-PP on the SC-S. aureus dataset.

Figure 11
figure 11

Execution Time comparison for IDBA-UD, ScalaDBG-SP, ScalaDBG-PP on the SC-SAR324 dataset.

Figure 12
figure 12

Execution Time comparison for IDBA-UD, ScalaDBG-SP, ScalaDBG-PP on the SC-SAR324 dataset for range (20–50).

These different k-value configurations are meant to evaluate the effect on quality and execution time of assembly for ScalaDBG and IDBA-UD. ScalaDBG is deployed on a cluster with number of nodes equal to the number of distinct k-values in an experiment, while IDBA-UD can run only a single node. Speedup over IDBA and assembly quality of ScalaDBG increases with increase in k-values for all the read datasets. In fact, speedup of ScalaDBG completely depends on the corresponding value of k. For example, for the SAR324 dataset, with the k-value range between 20 and 50 and a step size of 2, speedup of ScalaDBG-PP is 6.8X while the speedup of ScalaDBG-SP is 3.1X relative to IDBA-UD. Of all the remaining read sets and configurations, ScalaDBG-PP achieves a maximum speedup of 3.3X for the SC-SAR324 readset in the {29–71}, with a step size of 3. ScalaDBG-SP achieves a maximum speedup of 1.6X for RM1, RM2, SC-S. aureus, and SC-SAR324 readsets in the configurations processing 15k values. For all the datasets and configurations, ScalaDBG is faster than IDBA-UD. Further, speedup is higher for the larger read datasets of RM1, RM2, SAR 324, and S. aureus. Finally, as would be expected ScalaDBG-PP is always faster than the serial version.

Accuracy

Tables 2, 3 and 4 show the quality metrics for assembling the datasets in Table 1 using ScalaDBG. We used the QUAST tool24 to compare the assemblies obtained using ScalaDBG-SP, ScalaDBG-PP, and IDBA-UD. For the metagenomic datasets, Table 2 reports number of contigs, N50, and max contig length, since the reference assemblies contained multiple genomes instead of a single genome. For SAR324, we did not have access to the reference genome, so Table 4 denotes the coverage, NGA50, and number of misassemblies as NA. Table 3 reports both N50 and NGA50 since we had a single reference genome for the datasets. The most common metric to assess assembly quality is N50. N50 is defined as the length of the smallest contig above which 50% of an assembly would be represented (or smallest scaffold if it is applied after scaffold construction), a higher N50 indicating improved assembly. In presence of the reference genome, NGA50 provides more insights into the assembly quality. NGA50 is defined as the contig length such that using equal or longer length contigs that have been aligned to the reference produces 50% of the length of the reference genome; again, a higher NGA50 indicates an improved assembly.

Table 2 Accuracy Comparison for Assembler Performance for datasets RM1 and RM2., for specified set of k-values and step size configurations.
Table 3 Accuracy Comparison for Assembler Performance for SC-E. coli and SC-S. aureus datasets., for specified set of k-values and step size configurations.
Table 4 Accuracy Comparison for Performance Tests on SC-SAR 324 datasets, for specified set of k-values and step size configurations.

The table entries reveal that ScalaDBG and IDBA have comparable accuracy metrics in all cases. While the actual numbers for ScalaDBG-SP, ScalaDBG-PP, and IDBA-UD differ due to out-of-order patching and graph simplification process, we confirmed using the t-test that these differences are not statistically significant.

Time distribution for phases of ScalaDBG

ScalaDBG’s time is spent executing three major tasks: (1) the DBG construction (Build), (2) graph patching (Patch), and (3) contig generation (Contig). We profiled ScalaDBG to determine the contribution of each function to the total execution time of ScalaDBG. We present here the results of assembling the SAR 324 dataset for the k-value range of {20–50}, with a step size of 2. The experiment was run on a 16-node cluster. Figures 13 and 14 show the different sub-tasks within each task for ScalaDBG on the Y-axis while the X-axis shows the actual execution time in seconds. Note that in ScalaDBG-SP (Fig. 13), only the DBG construction executes in parallel, while in ScalaDBG-PP (Fig. 14), in addition to build, the patch and contig-generation phases execute in parallel as well. Patch tasks take less time compared to the contig-generation tasks.

Figure 13
figure 13

Execution time breakdown of ScalaDBG-SP for SAR 324 dataset, k-value range{20–50}, step size 2.

Figure 14
figure 14

Execution time breakdown of ScalaDBG-PP for SAR 324 dataset, k-value range {20–50}, step size 2.

As seen in Figs 13 and 14, ScalaDBG-PP overlaps the execution of certain graph-construction, patching, and contig-generation tasks, while in ScalaDBG-SP, all the patching and contig generation tasks are serialized. This detailed profile can be used to optimize ScalaDBG further.

Unlike IDBA-UD, ScalaDBG-SP and ScalaDBG-PP do not update the input read set at each iteration. Instead, each graph construction in ScalaDBG starts with the original read set. IDBA-UD updates the read set at each iteration, scanning all reads in the read set, and removing the ones that are already contained in any of the contigs in the contig set. However, if the reduction in the input read set is not significant at each iteration, then the overhead of updating the read set for IDBA-UD starts to dominate. Especially for lower k-values, excessive branching can lead to less reduction in the read set, increasing the overhead for IDBA-UD. In addition, the patch and contig generation occurs only logarithmic number of times in ScalaDBG-PP as compared to IDBA-UD and ScalaDBG-SP.

If we serialize the execution time of the parallel processes in ScalaDBG-SP and ScalaDBG-PP, shown in Figs 13 and 14, the serial execution time for ScalaDBG-SP is 6455 seconds and for ScalaDBG-PP is 6457 seconds. The execution time for IDBA-UD is 6897 seconds. Out of this total time, 86% of the overall execution time can be parallelized. Hence, the maximum speedup for ScalaDBG-PP is 6.8X. ScalaDBG-SP performs the patch and contig generation serially, hence its speedup drops to 3.1X. For this dataset and k-value range configuration, the additional work done by ScalaDBG is offset by the work done by IDBA-UD in updating of the read set.

Comparison with the distributed assembler ABySS

ABySS3 is a distributed assembler and it parallelize the execution of DBG construction for a single k-value on multiple nodes using MPI. Hence, we compared distributed ScalaDBG and state-of-the-art distributed assembler ABySS. We measured the execution time and quality of ScalaDBG and ABySS for completing the assembly of SC-SAR-324 dataset.

Both, ScalaDBG and ABySS were deployed on a cluster of 4 nodes, and could utilize all cores on the nodes. ScalaDBG was input k-value range of 20–50, with a step size of 10, while ABySS was executed using a median value of k = 35 to present a fair quality comparison. As shown in Table 5, ScalaDBG is significantly faster (2.5X) and produces better quality assembly than ABySS. Since ScalaDBG uses multiple k-values, namely, 20, 30, 40, and 50, as opposed to ABySS that just uses a single k-value of 35, ScalaDBG has higher N50 and maximum-contig length as compared to ABySS.

Table 5 Accuracy and Performance comparison on SC-SAR 324 datasets for ScalaDBG-PP and ABySS. ScalaDBG-PP has higher accuracy and is significantly faster than ABySS.

Scalability tests

To evaluate the scaling out for ScalaDBG, we used the SAR324 dataset. We varied ScalaDBG’s k-values, ranging from {20–50}, with a step size of 2, which translates to 16 k-values. The speedup is measured when scaling from 1 to 16 nodes, as shown in Fig. 15, resulting in a 6.8X speedup in relation to the baseline when running on a single node. ScalaDBG scales at nearly constant efficiency, as judged from the slope of the speedup curve. The speedup demonstrates that ScalaDBG can scale out efficiently on a multi-core and multi-node cluster.

Figure 15
figure 15

Speedup for ScalaDBG assembling SAR 324 dataset k-value range {20–50}, step size 2, speedup shown w.r.t. ScalaDBG running on 1 node.

Conclusion

The rapid progress of sequencing instruments and algorithms have resulted in the need for faster and more efficient assembly algorithms. Further, advances in single-cell sequencing and metagenomics domains for assessing cancer heterogeneity and the microbiome, respectively, are hindered by the time needed for genome assembly. Existing iterative DBG assemblers, such as IDBA-UD, generate longer contigs at the cost of significantly longer graph construction times due to the serial construction process for a set of k-values. In ScalaDBG, we break the serial graph construction process into multiple parallel processes. Further, our technique is also extensible in that it can be applied to other DBG-based assemblers.