Completing bacterial genome assemblies: strategy and performance comparisons

Determining the genomic sequences of microorganisms is the basis and prerequisite for understanding their biology and functional characterization. While the advent of low-cost, extremely high-throughput second-generation sequencing technologies and the parallel development of assembly algorithms have generated rapid and cost-effective genome assemblies, such assemblies are often unfinished, fragmented draft genomes as a result of short read lengths and long repeats present in multiple copies. Third-generation, PacBio sequencing technologies circumvented this problem by greatly increasing read length. Hybrid approaches including ALLPATHS-LG, PacBio corrected reads pipeline, SPAdes, and SSPACE-LongRead, and non-hybrid approaches—hierarchical genome-assembly process (HGAP) and PacBio corrected reads pipeline via self-correction—have therefore been proposed to utilize the PacBio long reads that can span many thousands of bases to facilitate the assembly of complete microbial genomes. However, standardized procedures that aim at evaluating and comparing these approaches are currently insufficient. To address the issue, we herein provide a comprehensive comparison by collecting datasets for the comparative assessment on the above-mentioned five assemblers. In addition to offering explicit and beneficial recommendations to practitioners, this study aims to aid in the design of a paradigm positioned to complete bacterial genome assembly.

Koren et al. proposed a hybrid approach that uses short, high-fidelity reads to reduce the error rate of single-molecule long sequencing reads; they increased the accuracy of long reads from 80% to higher than 99.9%, and the corrected sequences were then assembled de novo [7]. This hybrid approach is called the "PBcR pipeline" [14]. Meanwhile, Bashir et al. provided a hybrid assembler (AHA) that scaffolds contigs from an assembly of second-generation sequence data using PacBio long reads. The authors reconciled the long reads with the short reads to either correct errors or fill gaps in the AHA scaffolds, and demonstrated this hybrid assembly in completing bacterial genomes [8]. SSPACE-LongRead was recently proposed by Boetzer and Pirovano for scaffolding draft assemblies using PacBio long reads; the authors showed that SSPACE-LongRead is better able than AHA to produce nearly complete bacterial genomes [10]. Additionally, SPAdes 3.0 added support for taking short and long reads together as inputs [9,14], allowing hybrid assembly. This study compares the hybrid approaches, including ALLPATHS-LG, PBcR pipeline, SPAdes and SSPACE-LongRead, for bacterial genome completion. However, these hybrid approaches require the preparation of at least two different sequencing libraries. A more efficient strategy is a simple workflow requiring only one library and one sequencing method; non-hybrid approaches have been proposed in this vein.
Non-hybrid approaches, including the hierarchical genome-assembly process (HGAP) and the PBcR pipeline via self-correction (abbreviated to "PBcR pipeline(S)" hereafter), were developed to use a single, long-insert shotgun DNA library in conjunction with the PacBio single-molecule, real-time (SMRT) sequencing platform for completing microbial genome assemblies [4,11]. In contrast to hybrid approaches, HGAP and PBcR pipeline(S) do not require highly accurate short reads for error correction, but they do require 80-100X of PacBio sequence coverage for self-correction [14]. The key component of HGAP is a consensus algorithm that exploits the inherent advantages of SMRT sequencing quality values to preassemble long and highly accurate overlapping sequences by correcting errors on the longest reads using shorter reads from the same library [11]. The authors who proposed the PBcR pipeline (Koren et al. [7]) improved the correction algorithm to perform self-alignment and correction, and released this version of the error correction algorithm at http://www.cbcb.umd.edu/software/PBcR/closure/ [4]. The PBcR pipeline is therefore also capable of self-correction and non-hybrid assembly of PacBio-only reads; we refer to this non-hybrid approach as PBcR pipeline(S). The HGAP software is, in fact, a derivative of the PBcR pipeline and is implemented in SMRT Analysis 2.0 or higher. The single-molecule sequencing data provided in these two publications were downloaded and analyzed by both non-hybrid approaches separately.
This article reviews strategies and provides a performance comparison among the various methods for completing bacterial genome assemblies. While the algorithms were documented in individual research papers, each work used different datasets for evaluation, which makes cross-comparison difficult. We therefore collected datasets for a comparative assessment of the above-mentioned assembly approaches. We used QUAST [15] along with the NCBI reference sequences to assess the quality of the assemblies generated by the various approaches. We also used r2cat [16] to generate dot plots of each assembly against the reference genome in order to evaluate the assembly's accuracy. We aim to highlight the experimental design of library preparation for each method and to provide explicit guidance for practitioners. The detailed procedures, along with the analyzed data and thoroughly evaluated results, are available online (http://sb.nhri.org.tw/comps).
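Since N50 is the contiguity statistic quoted throughout the comparison, a minimal sketch of how it can be computed from a list of contig lengths may be helpful. This is an illustration only, not QUAST's code; the function name `n50` is an assumption introduced here.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty).

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0
```

A single-contig assembly has an N50 equal to the contig length, which is why a complete bacterial chromosome shows an N50 close to the genome size.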

Methods
Assemblers. ALLPATHS-LG is fully automated and requires minimal operator intervention. Prior to executing ALLPATHS-LG (release 44837) [6], we prepared the data for import into the pipeline: we gathered the read data in the appropriate formats and provided the two information files, in_groups.csv and in_libs.csv. SPAdes 3.1 was used for assembly of short reads and for hybrid assembly of short and long reads [17]. SSPACE-LongRead 1.1 was used to scaffold the contigs that SPAdes assembled from short reads, using PacBio long reads [10]. The pre-compiled PBcR pipeline was downloaded from http://www.cbcb.umd.edu/software/PBcR/closure/; it is identical to the one used for PBcR pipeline(S) [4]. The latest PBcR pipeline, implemented in Celera Assembler 8.2 (wgs-8.2), was also downloaded and used for hybrid and non-hybrid assembly [18]. The HGAP programs are implemented in SMRT Analysis 2.0 or higher; we downloaded and installed SMRT v2.0.1 in order to run HGAP and Quiver [11]. PacBio produces data in HDF5 format (*.h5); the corresponding input file of SMRT Analysis is a bas.h5 file or an associated bax.h5 file. All the other assemblers, including ALLPATHS-LG, PBcR pipeline, SPAdes and SSPACE-LongRead, expect filtered subreads in fasta or fastq format as input. We executed SMRT Analysis to produce subreads by trimming and filtering the raw reads with the following parameters: minSubReadLength = 50, readScore = 0.75 and minLength = 50. We used QUAST 2.3 [15] and r2cat [16] to evaluate assemblies. We performed all analyses on a server with Intel Xeon E7-4820 processors (8-core, 2.00 GHz) and 256 GB of RAM.
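To make the effect of the filtering thresholds concrete, the following sketch applies a length cut-off of 50 and a read-score cut-off of 0.75 to in-memory records. It is not the SMRT Analysis implementation, which operates on bas.h5/bax.h5 files; the `Subread` record, the function name and the choice of inclusive thresholds are all assumptions made for illustration.

```python
from typing import Iterable, List, NamedTuple


class Subread(NamedTuple):
    """Hypothetical in-memory record; not a SMRT Analysis data type."""
    name: str
    sequence: str
    read_score: float  # score of the parent read, between 0 and 1


def filter_subreads(subreads: Iterable[Subread],
                    min_subread_length: int = 50,
                    min_read_score: float = 0.75) -> List[Subread]:
    """Keep subreads that meet both thresholds (assumed inclusive)."""
    return [
        s for s in subreads
        if len(s.sequence) >= min_subread_length
        and s.read_score >= min_read_score
    ]
```

The retained records could then be written out in fasta or fastq format for the assemblers that expect filtered subreads.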
Data. To evaluate the assemblers on bacterial genome completion, we collected available sequencing data mainly from the three studies that introduced ALLPATHS-LG, PBcR pipeline(S) and HGAP [4,6,11], selecting datasets for which reference genomes exist. Because the PacBio RS machine has been upgraded to the PacBio RS II, we also downloaded the data from a single SMRT cell produced by the latest system. The nine datasets from the five bacterial species employed in this study are summarized, with brief descriptions of the libraries, in Table 1. For example, for dataset 1 of E. coli, we downloaded the three-library sequencing data and used NC_000913 as the reference genome. Because R. sphaeroides 2.4.1 has two chromosomes and five plasmids, the seven reference sequences NC_007488, NC_007489, NC_007490, NC_007493, NC_007494, NC_009007 and NC_009008 were used for assembly evaluation. Please note that, according to the definition of microbial genome complexity described in Koren et al.'s publication [4], M. ruber DSM 1279 has a class III genome (the maximum repeat size is greater than 7 Kbp), while the other four species have class I genomes (few repeats other than the rDNA operon, sized 5-7 Kbp). The first type of hybrid approach, designed for ALLPATHS-LG, combines two short-read libraries (short overlapping reads and jumping reads) with one long-read library (Table 1, D1-D3). The other type of hybrid approach combines one short-read library with one long-read library (D4 and D5 in Table 1). In contrast to the hybrid approaches, the non-hybrid approach requires only single-library long reads. We therefore employed five datasets (D5-D9 in Table 1), covering the three species E. coli, M. ruber and P. heparinus, for non-hybrid assembly evaluation.

Results and discussion
Five assemblers, namely ALLPATHS-LG, SPAdes, SSPACE-LongRead, PBcR pipeline and HGAP, were used and compared in this study. As evaluated in GAGE-B, a single library of short reads could not be assembled de novo into finished genomes by various assemblers, and a jumping library was still necessary to produce large scaffolds [19]. In addition, two recent publications have demonstrated that hybrid assemblies combining 454 reads with two paired Illumina libraries (fragment reads and jumping reads) did not produce complete genomes [14,20]. To provide a strategy for bacterial genome completion, we surveyed the assemblers that are able to utilize PacBio long reads. As illustrated in Figure 1, ALLPATHS-LG and SPAdes are the two hybrid assemblers that take short and long reads as inputs to perform de novo assembly [6,14]. SSPACE-LongRead is designed to scaffold pre-assembled contigs using long reads [10]. The PBcR pipeline uses short reads to correct long reads and then assembles the corrected PacBio long reads (PBcR) de novo [4,7,14]. In addition to the hybrid approaches, the non-hybrid approaches HGAP [11] and PBcR pipeline(S) [4] were used in this study. As summarized in Table 1, nine datasets were used to evaluate the five assemblers on bacterial genome completion. Ribeiro et al. employed ALLPATHS-LG to assemble 16 bacterial samples and generated nearly perfect genome assemblies in some cases [6]. To evaluate how well ALLPATHS-LG reproduces these bacterial genome assemblies, we executed it on the same sequence data (Table 1, D1-D3). Some of these datasets were also hybrid assembled using SPAdes. For the other type of hybrid approach, which combines one short-read library with one long-read library, we used PBcR pipeline, SPAdes and SSPACE-LongRead to assemble the reads from Dataset 4 and Dataset 5.
Besides, HGAP and PBcR pipeline(S) were conducted for non-hybrid assemblies on the long reads of the three species (Table 1, D5-D9).
www.nature.com/scientificreports | SCIENTIFIC REPORTS | 5 : 8747 | DOI: 10.1038/srep08747
ALLPATHS-LG completed bacterial genomes under a well-controlled coverage. In addition to running ALLPATHS-LG on the read datasets available on Ribeiro's ftp site (see Table 1, D1-D3), we downloaded the raw short reads directly from the Sequence Read Archive (SRA). The assembly results are given in Table 2 in terms of number of contigs and N50; the results generated from the website data strongly corroborate those obtained in the previous study [6]. The details of the assemblies evaluated by QUAST are shown in Additional file 1: Table S1-S3. Single contigs were generated for E. coli and S. pneumoniae, while 11 contigs were generated for R. sphaeroides. Furthermore, the assemblies from the raw reads matched those from the website data when the fractions of reads used by ALLPATHS-LG were set to be identical to the website data, i.e. the fractions of fragment reads and jumping reads were set to 0.088 and 1 (for E. coli), 0.384 and 1 (for R. sphaeroides), and 0.187 and 1 (for S. pneumoniae), respectively. These fractions are equivalent to approximately 52X, 191X and 100X genome coverage for the fragment libraries and 79X, 87X and 100X genome coverage for the jumping libraries of the three species. However, the assemblies were less accurate when all of the raw data were used in ALLPATHS-LG (over 497X genome coverage in the fragment libraries), which suggests that the effect of coverage on the assembly must be explored in further detail. We therefore first ran ALLPATHS-LG on data at 50X genome coverage, following the laboratory formula described in its publication [6]. For E. coli, ALLPATHS-LG generated a single contig with nearly perfect accuracy. It assembled two contigs for S. pneumoniae but was unable to produce an assembly for R. sphaeroides at a coverage as low as 50X. As discussed by Ribeiro et al., coverage is difficult to control because of sample-to-sample variability; we therefore also ran ALLPATHS-LG on data at 100X genome coverage to ensure that stable assemblies could be obtained. As is evident from the results (Table 2 and Additional file 2: Figure S1-S3), ALLPATHS-LG run with customized coverage settings is able to complete accurate bacterial genomes.
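The relation between the fraction of reads used and the resulting genome coverage is simple arithmetic, sketched below under the assumption that coverage scales linearly with the fraction of reads retained. The function names are hypothetical and are not ALLPATHS-LG parameters.

```python
def full_coverage(observed_coverage, fraction_used):
    """Coverage of the whole library, given the coverage obtained
    when only `fraction_used` of its reads are taken."""
    if not 0 < fraction_used <= 1:
        raise ValueError("fraction_used must be in (0, 1]")
    return observed_coverage / fraction_used


def fraction_for_target(library_coverage, target_coverage):
    """Fraction of reads to keep to reach `target_coverage` (capped at 1)."""
    if library_coverage <= 0:
        raise ValueError("library_coverage must be positive")
    return min(1.0, target_coverage / library_coverage)
```

For instance, a fraction of 0.384 yielding about 191X implies a full fragment library of roughly 497X, and reaching 50X from such a library would require keeping about a tenth of the reads.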
ALLPATHS-LG produced accurate but gapped assemblies in the absence of long reads. ALLPATHS-LG has been proposed for completing bacterial genomes, for which it explicitly requires a minimum of two short-read libraries (fragment and jumping libraries) [6]. However, to the best of our knowledge, few bacterial genomes have been completed using this strategy [12,21]. We speculate that the need to combine three data types generated on Illumina and Pacific Biosciences platforms limits the applicability of ALLPATHS-LG. Although Ribeiro et al. examined ALLPATHS-LG closely, they did not evaluate its performance limits when long reads are withheld, stating that "the omission of long reads cuts at the heart of the method and would be expected to have deleterious effects". Nevertheless, ALLPATHS-LG is often used without long reads [13,21,22]. We therefore assessed the performance of ALLPATHS-LG without supplying long reads (Table 2). Although ALLPATHS-LG could produce nearly complete genome assemblies for E. coli, the number of uncalled bases (N's), which represent gaps in the scaffolds, increased substantially in the absence of long reads (e.g., from 0 to 533 per 100 Kbp in the assemblies obtained from the website data), consistent with the role of long reads in filling gaps [6] (see Additional file 1: Table S1-S3). Regarding the effect of coverage, similar results can be found in Table 2 (comparing assemblies with and without PacBio reads), i.e. extremely high coverage is not necessary for optimal assembly. Evidently, the lack of long reads prevented ALLPATHS-LG from generating complete genomes; although diagonal-like dot plots against the reference genomes were observed (Additional file 2: Figure S4-S6), the accurate assemblies were gapped and sometimes fragmented.
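The gap statistic quoted above (N's per 100 Kbp) can be sketched as follows. This is an illustration of the definition rather than QUAST's code, and the function name is an assumption.

```python
def ns_per_100kbp(sequences):
    """Number of uncalled bases (N or n) per 100,000 assembled bases."""
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    n_count = sum(s.upper().count("N") for s in sequences)
    return n_count * 100_000 / total
```

A value of 0 therefore means that the scaffolds contain no gaps, whereas 533 means that roughly half a percent of the assembled length is unresolved.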
SPAdes did not fully utilize the data designed for ALLPATHS-LG. It assembled short reads with long reads efficiently. We applied SPAdes to hybrid assemble the datasets originally designed for ALLPATHS-LG (Table 1, D1-D3), but obtained unsatisfactory assemblies (Table 2). Dozens of contigs were generated even when the PacBio long reads were used, and the N50 values obtained from SPAdes were as low as one tenth of the values obtained from ALLPATHS-LG. We speculate that the library required by ALLPATHS-LG, a short fragment library whose insert lengths are slightly shorter than twice the read length, is not optimal for SPAdes. Additionally, the PacBio long reads used with ALLPATHS-LG are 1-3 Kbp, which may not be long enough for SPAdes to perform efficient hybrid assembly. When we replaced the long reads of Dataset 1 with the data of a single SMRT cell from Dataset 5, the assembly obtained from SPAdes clearly improved (N50 from 1 Mbp to 3 Mbp; Additional file 1: Table S1), which suggests that a 10 Kbp long-read library helps SPAdes produce high-quality assemblies. Although SPAdes was unable to produce a single-contig assembly with the additional long reads from a single SMRT cell, it generated the highest N50 statistics in comparison with the results from PBcR pipeline and SSPACE-LongRead when a small amount of long reads was used (one and two SMRT cells in Table 3). In addition, SPAdes generated assemblies within 3 hours. Moreover, SPAdes was capable of reconstructing the genome of E. coli when the latest PacBio RS II long reads (Dataset 9) were hybrid assembled with the short reads (Dataset 4) (see Additional file 1: Table S4 and Additional file 2: Figure S7 for details).
Figure 1 | Comparisons of the assemblers conducted in this study. SSPACE-LongRead is a scaffolder that uses single-molecule long reads to upgrade pre-assembled contigs constructed from short reads. ALLPATHS-LG and SPAdes are hybrid assemblers that take short reads and long reads as inputs. PBcR pipeline uses short reads to correct long reads by pacBioToCA, and then assembles the corrected long reads (PBcR) by Celera Assembler (runCA). Hierarchical genome-assembly process (HGAP) and PBcR pipeline via self-correction (PBcR pipeline(S)) take long reads as input to produce non-hybrid assemblies.
Hybrid assembly from one short and one long library was inefficient to complete bacterial genome. To assemble the hybrid data (one short-read and one long-read library), we ran PBcR pipeline, SPAdes and SSPACE-LongRead on the combined dataset D4 + D5. We found that several factors influence the assembly results generated by PBcR pipeline, such as read depth, whether the genome size is specified, and the Celera Assembler parameters. Detailed descriptions are provided in Additional file 2: Supporting data. In short, the expected genome size should be specified for long-read correction (pacBioToCA), the 25X longest PBcR should be used for assembly (runCA), and contigs with fewer than 100 mapped PacBio corrected reads should be discarded. We followed the PBcR pipeline procedure carefully; nevertheless, we did not obtain single-contig assemblies, even when the large body of long reads from all 17 SMRT cells was used (Table 3). Because the latest PBcR pipeline was recently released in Celera Assembler wgs-8.2, we also used it to hybrid assemble Datasets 4 and 5. Unlike the PBcR pipeline available at cbcb, the latest PBcR pipeline provides a single command (PBcR) to perform long-read correction and assembly. Although the updated PBcR pipeline reduced the running time and increased the N50 statistics, it was unable to produce a single-contig assembly even when long reads from four SMRT cells (from D5) were used along with Dataset 4. By contrast, the reads from the same four SMRT cells were successfully non-hybrid assembled into a single contig using the identical pipeline (PBcR pipeline(S) in wgs-8.2). Because SSPACE-LongRead requires pre-assembled contigs, we used SPAdes to assemble the Illumina short reads of Dataset 4, and then scaffolded the assembly with long-read data from one to four SMRT cells and from 17 SMRT cells of Dataset 5. The QUAST-evaluated results are shown in Table 3 (see Additional file 1: Table S4 for details).
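Two of the rules of thumb above, taking the longest corrected reads up to 25X and discarding contigs supported by fewer than 100 corrected reads, can be sketched as follows. This is not the PBcR or Celera Assembler implementation; the function names, the use of plain read lengths and the dictionary of per-contig read counts are assumptions for illustration.

```python
def select_longest_reads(read_lengths, genome_size, target_coverage=25):
    """Pick the longest reads until their summed length reaches
    target_coverage * genome_size (the read crossing the threshold is kept)."""
    target_bases = target_coverage * genome_size
    selected, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bases:
            break
        selected.append(length)
        total += length
    return selected


def keep_supported_contigs(mapped_read_counts, min_reads=100):
    """Drop contigs with fewer than `min_reads` mapped corrected reads."""
    return {name: n for name, n in mapped_read_counts.items()
            if n >= min_reads}
```

Selecting by length rather than at random favours the reads most likely to span repeats, which is the stated purpose of using long reads for assembly.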
With the addition of long reads from a single SMRT cell, the assembly N50 increased from 139 Kbp to 2.4 Mbp using SPAdes or SSPACE-LongRead, which shows that PacBio long reads are highly capable of upgrading a draft assembly constructed from short reads. In terms of running time, SPAdes and SSPACE-LongRead produced the hybrid assemblies in a couple of hours. As described in the previous paragraph, SPAdes reconstructed the genome of E. coli, with the largest contig over 4.6 Mbp, when either the PacBio RS II data (D9) or the 17 SMRT cell data (D5) were hybrid assembled with the short reads (D4) (Table 3). Nevertheless, with the given data, SSPACE-LongRead did not scaffold the SPAdes-assembled contigs into a single contig (see Additional file 1: Table S4 and Additional file 2: Figure S7-S8 for more details). Several studies have investigated the effect of coverage on genome assemblies and found that the N50 length reaches a plateau at 75X coverage [23]. We therefore sub-sampled 75X of short reads from Dataset 4 and hybrid assembled them with long reads from Dataset 5 using SPAdes. Although the N50 lengths (compared with Table 3) increased from 1.2 Mbp and 1.7 Mbp to 2.5 Mbp and 3.8 Mbp when long reads from three and four SMRT cells were used, respectively, SPAdes was not able to complete the E. coli genome. Taken together, incorporating long reads (15X-40X) with short reads using SPAdes is a promising way to improve the continuity of incomplete draft assemblies constructed from short reads; however, such a hybrid approach was inefficient in producing complete bacterial genomes.
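Sub-sampling reads to a target coverage, such as the 75X used above, can be sketched as independent random retention of each read. The tool actually used for sub-sampling is not stated here, so this is only one plausible approach; the function name and the fixed seed are assumptions, and each element is treated as one unit, so paired reads would need to be passed as pairs to stay together.

```python
import random


def subsample_to_coverage(reads, genome_size, target_coverage=75, seed=0):
    """Randomly keep reads so that expected coverage is `target_coverage`.

    `reads` is a sequence of read strings; the result is approximate
    because each read is kept independently with the same probability.
    """
    total_bases = sum(len(r) for r in reads)
    if total_bases == 0:
        return []
    fraction = target_coverage * genome_size / total_bases
    if fraction >= 1:
        return list(reads)  # already at or below the target coverage
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [r for r in reads if rng.random() < fraction]
```

Fixing the seed makes the sub-sample reproducible, which matters when assemblies from different tools are compared on the same reduced dataset.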
Non-hybrid approaches required as few as one single PacBio RS II SMRT cell to complete bacterial genome. With SMRT Analysis v2.0.1, we were able to run the HGAP procedure and the Quiver algorithm for non-hybrid de novo assembly of bacterial genomes [11]. Similar to HGAP, the PBcR pipeline is also capable of self-correction and non-hybrid assembly of PacBio reads when sufficient coverage is available [4,14]. Although Koren et al. recommended 150X sequencing depth to facilitate the completion of an accurate microbial genome [4], Chin et al. demonstrated that as few as three SMRT cells (RS I system, equivalent to 90X) are sufficient to produce a single contig [11]. As is apparent from that work, read length and depth determine assembly continuity; we therefore ran HGAP and PBcR pipeline(S) on various numbers of SMRT cells, ranging from 4 to 17 XL-C2 SMRT cells generated on the PacBio RS I system (Table 1, D5-D8), and on a single SMRT cell generated with the PacBio RS II system and P4-C2 chemistry (Table 1, D9). The detailed procedures and QUAST-evaluated assembly results (see Additional file 1: Table S5-S6) are provided on our website. As can be seen from Table 4, HGAP or PBcR pipeline(S) is capable of producing single contigs except for dataset D7, which has a sequencing coverage of 124X. Nevertheless, Chin et al. generated a single contig from three SMRT cells of dataset D7 and stated that the assembly from the four SMRT cells contained one misassembly with respect to the reference genome [11]. The misassemblies are underlined in Table 4, and the dot plots of the sequence assemblies against the reference genome are displayed in Additional file 2: Figure S9-S10. Interestingly, employing more sequencing reads does not always guarantee a perfect assembly. For example, 4 SMRT cells (70X-77X) of dataset D5 were sufficient to produce a single contig, but several of the 6 and 8 SMRT cell subsets did not result in perfect assemblies.
Similarly, PBcR pipeline(S) successfully assembled the genome of E. coli into a single contig using reads from 6 SMRT cells, but produced two misassembled contigs (the large contig could not be aligned correctly to the reference genome) when applied to the 8 SMRT cell data of dataset D6 (Additional file 2: Figure S10). We therefore recommend starting with a small number of SMRT cells produced on the PacBio RS I system (e.g. 3 or 4); additional SMRT cells can be added gradually, if necessary. Moreover, the current upgrade (PacBio RS II) increases the throughput and read length yielded by a single SMRT cell. Those sequencing reads (Dataset 9) were successfully assembled by both HGAP and PBcR pipeline(S) into a single contig, as indicated in Table 4. Recently, a complete genome of C. autoethanogenum DSM10061 was published, obtained by using HGAP to analyze single-molecule reads produced by PacBio RS II without the need for manual finishing [20], which supports the assertion that PacBio single-molecule technology will be valuable in future studies. Although the runtimes of HGAP and PBcR pipeline(S) on Dataset 9 were 16 and 31 hours, respectively (Additional file 1: Table S5), it took only 24 minutes and 2.3 hours using the latest versions of PBcR pipeline (wgs-8.2) and HGAP 3.0 (under SMRT v2.3.0), respectively, to reconstruct the genome of E. coli from the identical dataset (D9) (Table S7 and S9). Both non-hybrid approaches are capable of completing microbial genomes; nevertheless, each seems to possess unique advantages in finishing various bacterial genomes. On the whole, we suggest that practitioners perform both non-hybrid approaches to assemble a bacterial genome de novo, or at least ask a PacBio sequencing provider to run HGAP.
The latest version of non-hybrid approaches rapidly produced accurate and complete bacterial genome. As illustrated in the previous paragraph, the running time of the non-hybrid approaches was greatly reduced (from over 10 hours to 30 minutes). The latest Celera Assembler wgs-8.2, which includes the PBcR pipeline, incorporates a novel probabilistic overlapper (named MHAP) for self-correction [24]; this implementation speeds up the assembly process. While we found that the genome size should be specified for hybrid assembly (Additional file 2), the effect of the genome size setting on the non-hybrid approach remained unclear. We performed the various non-hybrid assemblies with the expected genome size using PBcR pipeline(S) at cbcb and HGAP 2.0; the results are shown in Table 4. However, an exact genome size is usually unavailable for a bacterial genome that has not yet been determined. We therefore ran Celera Assembler wgs-8.2 with different genome size settings (no genome size, and ratios to the true genome size from 0.8 to 1.2) on datasets D5-D9 to examine the completeness and accuracy of the resulting assemblies. Note that we discarded contigs with fewer than 25 mapped long reads from the assemblies obtained by wgs-8.2. As can be seen in Table 5, while long-read coverage is crucial for producing a complete genome, the genome size setting is no longer an issue (see Additional file 1: Table S7 and S8 for the detailed QUAST-evaluated results). The latest PBcR pipeline in wgs-8.2 was able to generate a single-contig assembly, even for the class III M. ruber (maximum repeat size >7 Kbp), whenever the coverage was over 75X, except when applied to Dataset 6. We ascribe this exception to a higher coverage bias (a couple of low-coverage regions were identified; details are available on our website).
Besides, the latest PBcR pipeline produced accurate assemblies with no large structural errors (as evaluated by r2cat), although it was unable to resolve large repeats in P. heparinus (>5 Kbp), which led to misassemblies (Additional file 2: Figure S11). As PBcR supports two alternative consensus modules, we ran the faster algorithm by specifying -pbCNS on the data; the assembly results are summarized in Table 5. To resolve the misassemblies in P. heparinus, we used the default consensus module, PBDAGCON, and obtained accurate assemblies in all cases but one. However, this consensus module doubled the running time (from 30 minutes to 1 hour). Nevertheless, the upgraded RS II system increased the average read length to 5 Kbp (in Dataset 9) and is expected to provide average read lengths in excess of 10 Kbp with the new chemistry (P6-C4), which should allow the full closure of most bacterial genomes. Because we were unable to load the old HDF5 format generated by the RS I system into the latest HGAP 3.0, we ran HGAP 3.0 only on Dataset 9, with different genome size settings. Similar to the results obtained from the latest PBcR pipeline (wgs-8.2), HGAP 3.0 produced accurate single-contig assemblies regardless of the genome size setting; however, it took more than 2 hours to generate an assembly (Additional file 1: Table S9). Note that the assemblies are polished by Quiver in HGAP 3.0 to provide high consensus accuracy, and the running time increases correspondingly. When we polished the assembly produced by the latest PBcR pipeline on D9, the consensus accuracy improved from 99.96% to 99.9997% at the cost of an extra hour of running time. Taken together, the latest algorithms implemented in the non-hybrid approaches successfully produced accurate and complete bacterial genomes in a reasonable time.
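The improvement in consensus accuracy from 99.96% to 99.9997% is easier to appreciate on the Phred scale, where the quality value is QV = -10 log10(1 - accuracy). The sketch below applies this standard conversion; expressing the accuracies as QVs is an addition for illustration, and the function name is an assumption.

```python
import math


def accuracy_to_qv(accuracy_percent):
    """Convert a consensus accuracy in percent to a Phred quality value."""
    error_rate = 1.0 - accuracy_percent / 100.0
    if error_rate <= 0:
        return math.inf  # no observed errors
    return -10.0 * math.log10(error_rate)
```

Under this conversion, 99.96% corresponds to roughly QV 34 (about one error in 2,500 bases) and 99.9997% to roughly QV 55 (about one error in 330,000 bases).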

Conclusions
With continuing advances in PacBio single-molecule real-time sequencing, read length and throughput keep increasing. The non-hybrid approach, which relies on a single library preparation, is the preferred way to assemble and thereby complete bacterial genomes de novo. Although it took us more than a day to perform a non-hybrid bacterial genome assembly using both HGAP 2.0 and the PBcR pipeline(S) at cbcb, the latest version of Celera Assembler (wgs-8.2, including PBcR pipeline) was capable of producing a bacterial genome assembly within 30 minutes. In view of the inefficiency of hybrid assemblies (in terms of multiple library preparations and the ability to produce single contigs), we recommend that practitioners sequence their bacterial genomes exclusively with the PacBio RS II system. Combined with the latest non-hybrid approaches, bacterial genomes can then be efficiently reconstructed by either PBcR pipeline(S) in wgs-8.2 or HGAP 3.0. It is anticipated that future advancements in PacBio chemistry and technology will further extend the range of microbial genomes that can be sequenced and completed.