De novo yeast genome assemblies from MinION, PacBio and MiSeq platforms

Long-read sequencing technologies such as Pacific Biosciences and Oxford Nanopore MinION are capable of producing long sequencing reads with average fragment lengths of over 10,000 base-pairs and maximum lengths reaching 100,000 base-pairs. Compared with short reads, the assemblies obtained from long-read sequencing platforms have much higher contig continuity and genome completeness, as long fragments are able to extend paths into problematic or repetitive regions. Many successful assembly applications of the Pacific Biosciences technology have been reported, ranging from small bacterial genomes to large plant and animal genomes. Recently, genome assemblies using Oxford Nanopore MinION data have attracted much attention due to the portability and low cost of this novel sequencing instrument. In this paper, we re-sequenced a well-characterized genome, the Saccharomyces cerevisiae S288C strain, using three different platforms: MinION, PacBio and MiSeq. We present a comprehensive metric comparison of assemblies generated by various pipelines and discuss how the platform-associated data characteristics affect the assembly quality. With a given read depth of 31X, the assemblies from both Pacific Biosciences and Oxford Nanopore MinION show excellent continuity and completeness for the 16 nuclear chromosomes, but not for the mitochondrial genome, whose reconstruction still represents a significant challenge.

• Miniasm. Miniasm is a fast assembler developed for long and error-prone reads. An assembly graph is generated from overlapping reads found by Minimap, a MinHash-sketch-based aligner 1. Small bubbles are collapsed and unitigs are built from the graph, without any error-correction step or consensus generation from the aligned reads. We ran Minimap version r122 and Miniasm version r104 using default settings and parameters.
• Racon. Racon aligns the long reads to a low-accuracy draft assembly and improves the quality of the assembly by generating a consensus from the aligned reads. We tested Racon (GitHub commit 28980bec3e98189853ed919764d5a8a9e6291264) on a Miniasm assembly generated as in the previous point. The time and memory consumption reported for Racon include the resources used to generate the initial Miniasm assembly. The raw reads (reads.fasta) were aligned to the Miniasm assembly (contigs.fasta) using GraphMap 2:
    $ graphmap align -a anchor --rebuild-index -B 0 -r contigs.fasta \
        -d reads.fasta -o output.sam --extcigar -t 8
  and the consensus was generated with Racon in the following way:
    $ racon -M 5 -X -4 -G -8 -E -6 --bq 10 -t 8 contigs.fasta \
        output.sam consensus.fasta
  We ran a second iteration of consensus generation by realigning the raw reads against the consensus from the first iteration of Racon and generating a new consensus in the same way as in the first iteration.
• ABruijn. ABruijn is the only long-read assembler considered here that is based not on the OLC paradigm but on a generalized and more flexible de Bruijn graph approach called A-Bruijn, which can accommodate and assemble error-prone reads. From the generated A-Bruijn graph, an error-prone draft assembly is built, the long reads are aligned against it with BLASR, and finally partial order alignment 5 is used to correct the draft assembly. We ran ABruijn 0.4b with default parameters.
• npScarf. The npScarf pipeline is a scaffolding tool provided within the Japsa package (https://github.com/mdcao/japsa). It takes as input an initial NGS-based draft assembly from SPAdes and a BAM file of alignments between the long reads and the draft assembly, obtained with BWA 6. The BAM alignments are then used to scaffold contigs and resolve repeat regions. npScarf can make use of the MinION real-time sequencing feature, as it can be fed long reads from a stream. We used npScarf from the Japsa package version 1.6-08a with default parameters in its non-real-time mode. To generate the BAM file, we used bwa mem (version 0.7.12) with the following parameters: -x ont2d -a -Y for ONT data, and -x pacbio -a -Y for PacBio data.

Table S1. Summary statistics for the 2D-All and 2D-Pass Oxford Nanopore (ONT) datasets for the S288C strain.
• SMIS. SMIS (https://github.com/fg6/smis.git), or the Single Molecule Integrated Scaffolding pipeline, is in development at the Wellcome Trust Sanger Institute and aims to be a comprehensive pipeline for long-read exploitation, from scaffolding of fragmented NGS-based assemblies to structural-variation detection. We assessed the capabilities of SMIS as a scaffolding tool, and present its scaffolding results when given as input the long reads and the SPAdes assembly generated as described above. From each long read, SMIS creates fake-mate sequences with fixed length and fixed insert length (2000 bp and 200 bp, respectively). These fake mates are then aligned against the SPAdes assembly with BWA. If enough fake mates bridge multiple contigs, the contigs are scaffolded together, and the gap size is estimated from the initial fixed insert and filled with 'N's.
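
The two rounds of Racon polishing described in the Racon entry of the list above can be written down as an explicit sequence of commands. The following Python sketch only builds the command lines; it copies the GraphMap and Racon options verbatim from the commands quoted above, which apply to the tool versions used in this study and may not be accepted by other releases. The intermediate file names (round1.sam, consensus1.fasta, and so on) and the function names are hypothetical and chosen for illustration only.

```python
import shlex


def graphmap_cmd(draft, reads, sam, threads=8):
    """GraphMap alignment of the raw reads to a draft (options as quoted in the text)."""
    return ["graphmap", "align", "-a", "anchor", "--rebuild-index", "-B", "0",
            "-r", draft, "-d", reads, "-o", sam, "--extcigar", "-t", str(threads)]


def racon_cmd(draft, sam, out, threads=8):
    """Racon consensus from the alignments (options as quoted in the text)."""
    return ["racon", "-M", "5", "-X", "-4", "-G", "-8", "-E", "-6", "--bq", "10",
            "-t", str(threads), draft, sam, out]


def polishing_plan(draft, reads, rounds=2, threads=8):
    """Return the ordered commands for `rounds` iterations of align + consensus.

    Each round realigns the raw reads against the consensus of the previous
    round. Output file names are hypothetical placeholders.
    """
    plan = []
    current = draft
    for i in range(1, rounds + 1):
        sam = f"round{i}.sam"
        consensus = f"consensus{i}.fasta"
        plan.append(graphmap_cmd(current, reads, sam, threads))
        plan.append(racon_cmd(current, sam, consensus, threads))
        current = consensus
    return plan


if __name__ == "__main__":
    # Print the commands instead of running them.
    for cmd in polishing_plan("contigs.fasta", "reads.fasta"):
        print(shlex.join(cmd))
```

To execute the plan rather than print it, each command list can be passed to subprocess.run(cmd, check=True), provided that the matching GraphMap and Racon versions are installed.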

Extraction of the 31X ONT-Emu PacBio subset
We provide Python code to select a subset of reads with a desired depth from an initial fasta/fastq file: https://github.com/fg6/random_subreads. The subset can be extracted completely at random, or following a Gaussian distribution centred on a desired read length. A randomly selected subset will have a read length distribution similar to that of the initial dataset. This is because each read has the same probability of being picked, and there are more reads with lengths around the peak of the initial distribution. To modify the shape of the distribution, for instance to move the peak to a different position, we can make the probability of selecting a read depend on its length by assigning a weight: reads with lengths around the initial distribution peak are given a smaller weight (i.e. a smaller probability of being picked), while reads with lengths around the new, desired peak are given a higher weight. The reads are then selected (pseudo-)randomly, taking the assigned weights into account. The weight to assign to each read can be tricky to determine: it depends on how different the initial and the desired shapes are, and also on how strongly we want to subsample, since the larger the final subset, the more difficult it is to change the original shape.
For the PacBio ONT-Emu datasets, we generated PacBio subsamples with shapes similar to that of the ONT 2D-Pass datasets. For this particular case, we created a new branch of the above repository, called "YeastStrainsStudy", that incorporates the heuristically optimized weights to be assigned to each read according to its length. Because the selection is partially random, each generated subsample contains a different group of reads. The exact group of reads used in this study for the 31X ONT-Emu subsample can be obtained using the scripts available from GitHub: https://github.com/fg6/YeastStrainsStudy.git.
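
The length-weighted selection described in this section can be illustrated with a short Python sketch. This is not the code from the random_subreads repository and does not use the heuristically optimized weights of the "YeastStrainsStudy" branch. It assumes a simple Gaussian weight centred on the desired peak, and draws reads without replacement using the Efraimidis-Spirakis key method, which is a technique chosen here for illustration. All function and parameter names are hypothetical.

```python
import math
import random


def gaussian_weight(length, peak, sigma):
    """Illustrative weight: largest for reads whose length is close to `peak`."""
    return math.exp(-0.5 * ((length - peak) / sigma) ** 2)


def subsample(read_lengths, genome_size, depth, weight=lambda n: 1.0, seed=None):
    """Select read ids until their summed length reaches depth * genome_size.

    read_lengths maps read id -> read length. With the default constant
    weight every read is equally likely to be picked, so the subset keeps the
    shape of the original length distribution. Reads with zero weight are
    never selected.
    """
    rng = random.Random(seed)
    keyed = []
    for rid, length in read_lengths.items():
        w = weight(length)
        if w <= 0:
            continue
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is defined
        # Efraimidis-Spirakis key in log form: larger keys are drawn first.
        keyed.append((math.log(u) / w, rid))
    keyed.sort(reverse=True)

    target = depth * genome_size
    chosen, total = [], 0
    for _, rid in keyed:
        if total >= target:
            break
        chosen.append(rid)
        total += read_lengths[rid]
    return chosen


if __name__ == "__main__":
    # Toy dataset: 200 reads of 1,000 to 20,900 bp on a 10 kb "genome".
    lengths = {f"read{i}": 1000 + 100 * i for i in range(200)}
    uniform = subsample(lengths, genome_size=10_000, depth=31, seed=1)
    shaped = subsample(lengths, genome_size=10_000, depth=31, seed=1,
                       weight=lambda n: gaussian_weight(n, peak=8000, sigma=2000))
    print(len(uniform), "reads selected uniformly")
    print(len(shaped), "reads selected with a peak near 8 kb")
```

Fixing the seed makes a subsample reproducible, which mirrors the need noted above to record the exact group of reads used for a given subsample.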

Oxford Nanopore: S288c 2D-Pass versus 2D-All data
Here, we compare the de novo assemblies from ONT data when using only the best 2D reads, i.e. the 2D-Pass reads, with those obtained when using all the 2D reads, i.e. the 2D reads from both the 'Pass' and the 'Fail' directories (2D-All). While the 'Fail' reads have lower accuracies than the 'Pass' reads, they might include longer reads, which could improve the contiguity of the assemblies. The read statistics for the 2D-All and 2D-Pass datasets are summarized in Supplementary Table S1, while the statistics of the related assemblies are shown in Supplementary Table S2. The assemblies from the 2D-All and 2D-Pass samples do not differ significantly but, except for a couple of cases, the assemblies typically have longer reference coverage and slightly higher accuracy when run on the higher-quality 2D-Pass dataset; also, most pipelines are able to reconstruct more genes in the 2D-Pass case. The assembly contiguity, though, appears higher on the 2D-All sample, as shown by the assembly Na50s. From the resource point of view, the inclusion of the 'Fail' data increased the depth from 31X to 61X, and this resulted in a 2-3 fold longer running time for almost all assemblers except Miniasm and PBcR. For the latter, the running time was about 2 times longer when running on the 2D-Pass data, probably because of the additional higher-sensitivity parameters used for the lower-depth case (see Supplementary Note). When running on the smaller dataset (2D-Pass), the maximum memory requirement slightly decreased or remained the same for all the pipelines except Miniasm and Canu, for which the memory needed was 2 and 4 times lower than in the 2D-All case, respectively.
Even though the Na50s are longer for the 2D-All data, we decided to present our assessment studies using the 2D-Pass based assemblies because of their higher accuracies.
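
The contiguity metric used in this comparison can be computed with a few lines of Python. The sketch below implements the standard N50 statistic on a list of lengths. Na50 is not defined in this section; assuming it follows the common NA50 convention, it would be the same statistic computed on the lengths of the aligned blocks obtained after breaking contigs at misassemblies, rather than on the contig lengths themselves. The function name is hypothetical.

```python
def n50(lengths, fraction=0.5):
    """Return the length L such that blocks of length >= L hold at least
    `fraction` of the total bases. Returns 0 for an empty input.

    Pass contig lengths for N50, or (by assumption) aligned-block lengths
    for an Na50-like value.
    """
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for n in ordered:
        running += n
        if running >= target:
            return n
    return 0


if __name__ == "__main__":
    # Total is 290, half is 145: the two longest blocks (80 + 70) reach it.
    print(n50([80, 70, 50, 40, 30, 20]))
```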

Depth study II: PacBio samples at 120X, 80X, 61X, 31X, 20X and 10X
Statistics for the assemblies based on the whole S288c PacBio dataset and its randomly selected subsets are shown in Supplementary Table S5. From 10X to 31X the performance is similar to that observed for the ONT-Emu samples in Table 4, with PBcR-MiSeQ providing the longest and most accurate assembly at 10X, and the other Celera-based pipelines catching up quickly, already at 20X, although with a lower accuracy. At 31X Canu and PBcR-Self generated the highest accuracies among the non-hybrid pipelines, around 99.9%, second only to the 99.97% accuracy of the hybrid pipeline PBcR-MiSeQ. PBcR-Self produced the most contiguous assembly, with an Na50 of 740 kb, while SMARTdenovo generated the assembly with the longest reference coverage and the highest number of genes reconstructed, although with a slightly lower accuracy than Falcon or ABruijn. The accuracy kept improving slightly for all pipelines up to 61X depth and remained basically unchanged afterwards, while the Na50 kept increasing up to 80X, but did not change or got slightly worse when approaching a depth of 120X. In conclusion, Canu and PBcR-Self were the best-performing pipelines for the datasets we analyzed in this study, providing assemblies with high reference coverage, high accuracy and high Na50s already at 31X depth. Increasing the depth beyond 31X improved their accuracy from 99.9% up to 99.97-99.98%, a level commonly reached by Illumina-only assemblies. Unlike the Illumina-only assemblies, they achieved quite long Na50s: 549 kb for Canu and 740 kb for PBcR-Self.

Falcon and SMARTdenovo also reached long Na50s (740 kb and 667 kb at 120X depth, respectively), but their accuracies remained at a lower 99.9%; a similar accuracy was obtained with ABruijn, which produced a slightly lower Na50 (546 kb at 120X) but was able to reconstruct more genes. The highest number of genes (6,608 out of 6,615) was reconstructed by Canu when run on the 80X and the 120X samples.

Other Strains Assemblies from PacBio Data
For the N44, CBS432, and SK1 strains no reference genome exists. We de novo assembled the 148X depth N44, the 135X depth CBS432, and the 248X depth SK1 PacBio data with the same assembly pipelines used for the S288C strain, and obtained very contiguous assemblies whose statistics are summarized in Supplementary Table S4. While we cannot directly estimate the accuracy, contiguity and possible misassemblies of these assemblies, we can expect them to have an accuracy similar to that obtained for the 120X depth S288c PacBio data, up to 99.98% for the Celera-based assemblers. Table S4 also shows that Canu is the pipeline that reconstructs the highest number of genes for each strain. It also suggests that the SK1 strain has a higher number of genes in common with the reference strain than CBS432 and N44, as expected. A comprehensive structural-variation analysis of these strains, using the same PacBio datasets used here, can be found in 7.