New insights on Pseudoalteromonas haloplanktis TAC125 genome organization and benchmarks of genome assembly applications using next and third generation sequencing technologies

Pseudoalteromonas haloplanktis TAC125 is among the most commonly studied bacteria adapted to cold environments. Aside from its ecological relevance, P. haloplanktis has potential for biotechnological applications. Due to its importance, we decided to take advantage of next generation sequencing (Illumina) and third generation sequencing (PacBio and Oxford Nanopore) technologies to resequence its genome. The availability of a reference genome, obtained using whole genome shotgun sequencing, allowed us to study and compare the results obtained by the different technologies and draw useful conclusions for future de novo genome assembly projects. We found that assembly polishing using Illumina reads is needed to achieve a consensus accuracy over 99.9% when using Oxford Nanopore sequencing, but not when using PacBio sequencing. However, the dependency of consensus accuracy on coverage is lower for Oxford Nanopore than for PacBio, suggesting that a cost-effective solution might be the use of low coverage Oxford Nanopore sequencing together with Illumina reads. Despite the differences in consensus accuracy, all sequencing technologies revealed the presence of a large plasmid, pMEGA, which was undiscovered until now. Among the most interesting features of pMEGA is the presence of a putative error-prone polymerase regulated through the SOS response. Aside from the characterization of the newly discovered plasmid, we confirmed the sequence of the small plasmid pMtBL and uncovered the presence of a potential partitioning system. Crucially, this study shows that the combination of next and third generation sequencing technologies gives us an unprecedented opportunity to characterize our bacterial model organisms at a very detailed level.


Supplementary Information
Supplementary Text
pMEGA similarity to P. haloplanktis TAC125 chromosomes
Nucleotide similarity between pMEGA and the P. haloplanktis TAC125 chromosomes is scarce (Supplementary Table S5), and most of it falls in intergenic regions, with the exception of two regions. The first region is found in chromosome I: the pMEGA genes PSHA_p00039-PSHA_p00041 show 97.6% identity to the chromosome I genes PSHA_RS02020-PSHA_RS02030. This region is similar to the IS679 insertion sequence 1, which belongs to the IS66 family 2, and contains three ORFs: tnpA (PSHA_RS02020), tnpB (PSHA_RS02025) and tnpC (PSHA_RS02030). The tnpC gene is 1,563 bp long and its predicted product is presumably a transposase, since it includes a DDE motif (the triad of acidic amino acids that defines a classical transposase active site). The function of the tnpA (300 bp) and tnpB (348 bp) genes is unknown 2. The arrangement of the three reading frames in IS679 elements suggests translational coupling. Relative to tnpA, tnpB is typically found in the -1 translational reading frame, and its initiation codon overlaps the termination codon of tnpA. The tnpC initiation codon is located downstream of tnpB. A similar organization is also present in the corresponding regions of chromosome I and pMEGA. IS679 members usually include relatively well-conserved imperfect terminal inverted repeats (IR) of about 20 bp, and putative IR sequences were also identified in the analysed DNA sequences (data not shown).
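The overlap described above (a downstream start codon in the -1 frame that overlaps the upstream stop codon) can be checked from gene coordinates alone. The following is a minimal sketch, not the procedure used in this study. The `Orf` type, the helper names, and the toy sequence are hypothetical, and both genes are assumed to lie on the same strand with 0-based, end-exclusive coordinates that include the stop codon.

```python
from typing import NamedTuple


class Orf(NamedTuple):
    """Hypothetical ORF record: 0-based start, exclusive end (stop codon included)."""
    name: str
    start: int
    end: int


def relative_frame(upstream: Orf, downstream: Orf) -> int:
    """Reading frame of `downstream` relative to `upstream` (0, +1 or -1)."""
    shift = (downstream.start - upstream.start) % 3
    return {0: 0, 1: +1, 2: -1}[shift]


def start_overlaps_stop(upstream: Orf, downstream: Orf) -> bool:
    """True if the downstream start codon shares a base with the upstream stop codon."""
    stop_first, stop_last = upstream.end - 3, upstream.end
    start_first, start_last = downstream.start, downstream.start + 3
    return max(stop_first, start_first) < min(stop_last, start_last)


# Toy sequence (invented for illustration): the two ORFs share the
# tetranucleotide ATGA, where ATG is the downstream start and TGA the upstream stop.
toy = "ATGGCTGCATGACCAAATAA"
orf_a = Orf("tnpA_like", 0, 12)   # ATG GCT GCA TGA
orf_b = Orf("tnpB_like", 8, 20)   # ATG ACC AAA TAA

print(relative_frame(orf_a, orf_b))       # -1
print(start_overlaps_stop(orf_a, orf_b))  # True
```

With real annotations, the coordinates would come from the GenBank or GFF records of the three ORFs. Genes on the reverse strand would need their coordinates mirrored before applying the same test.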
The second region of similarity is found in chromosome II: PSHA_p00006 displays 96.7% identity to the chromosome II gene PSHA_RS16255. This DNA region contains a non-coding RNA named HEARO (HNH Endonuclease-Associated RNA and ORF) 3 and a gene coding for an HNH endonuclease (PSHA_RS16255). HNH endonucleases are a family of homing endonucleases, which are frequently embedded within group I and group II introns and are responsible for the transfer of these elements 4. These enzymes are commonly involved in the transposition of a variety of mobile genetic elements 5. HEARO representatives are found in species from ten different bacterial phyla, predominantly Firmicutes, Proteobacteria, and Cyanobacteria 3. This pattern of distribution is strongly indicative of its function as a selfish genetic element. Thus, the HEARO RNA together with its associated HNH endonuclease gene probably forms a mobile genetic element. HEARO typically integrates upstream of a RUGA motif (ATGA or GTGA) 3. Comparing the genomic sequence flanking the HEARO element in pMEGA and in chromosome II with sequences at the corresponding location in different organisms allowed the identification of the conserved RUGA motif at a possible integration site (ATGA) (Supplementary Fig. S9).
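Since R denotes a purine (A or G), candidate RUGA sites in DNA can be located with a simple pattern search. The sketch below is illustrative only. The function name and the example sequence are hypothetical, and the motif is written in DNA letters (ATGA or GTGA) as in the text above.

```python
import re

# RUGA written as DNA: R = A or G, so the motif is ATGA or GTGA.
# The lookahead lets overlapping occurrences be reported as well.
RUGA_PATTERN = re.compile(r"(?=([AG]TGA))")


def find_ruga_sites(sequence: str) -> list[tuple[int, str]]:
    """Return (0-based position, motif) for every RUGA occurrence on the given strand."""
    return [(m.start(), m.group(1)) for m in RUGA_PATTERN.finditer(sequence.upper())]


# Invented flanking sequence, for illustration only.
flank = "CCATGATTGTGACC"
print(find_ruga_sites(flank))  # [(2, 'ATGA'), (8, 'GTGA')]
```

A four-base motif occurs frequently by chance, so hits like these are only meaningful when combined with the position of the HEARO element and with conservation across the aligned flanking sequences.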
Protein similarity searches revealed that, aside from the ones mentioned above, some pMEGA proteins show homology to chromosomal proteins (Supplementary Table S5). Chromosome I hosts a type II toxin-antitoxin system HipA family toxin (homologous to PSHA_p00008), an integrase (homologous to PSHA_p00026), a serine protease (homologous to PSHA_p00029), an endonuclease (homologous to PSHA_p00033) and type I restriction-modification system subunits R, S and M (homologous to PSHA_p00046, PSHA_p00048 and PSHA_p00049). However, homology is low, with a maximum identity of 37%. Chromosome II harbours five proteins with homology to pMEGA proteins: the chromosome partitioning protein ParA (homologous to PSHA_p00001), a TonB receptor (homologous to PSHA_p00010), the DNA replication terminus site-binding protein (homologous to PSHA_p00014) and the DNA PolV subunits UmuC and UmuD (homologous to PSHA_p00030 and PSHA_p00031), with a maximum identity of 68%. pMEGA and chromosome II share a similar genetic organization (partitioning protein ParA and replication initiator protein), which further supports a unidirectional mechanism of chromosome II replication, given the clear plasmid origin of these protein functions 6.
pMEGA similarity to other bacteria
Regions of similarity shared between pMEGA, P. arctica and P. nigrifaciens contain the ParA (PSHA_p00001) and ParB (PSHA_p00002) proteins, the DNA replication terminus site-binding protein (PSHA_p00014), a hypothetical protein (PSHA_p00025), an integrase (PSHA_p00026), the Type I restriction-modification system (PSHA_p00045-PSHA_p00049), the restriction endonuclease Mrr (PSHA_p00050), a hypothetical protein (PSHA_p00051) and the RepB family plasmid replication initiator protein (PSHA_p00052). Although these regions are found in both P. arctica and P. nigrifaciens, P. nigrifaciens shows a higher percentage of identity with pMEGA, suggesting that the P. nigrifaciens plasmid is the sequence most closely related to pMEGA. Additionally, P. nigrifaciens shows homology to the RepB family plasmid replication initiator protein (PSHA_p00027) and to the DNA PolV operon (PSHA_p00030-PSHA_p00033), and P. arctica to the NYN domain-containing protein (PSHA_p00004) and the Type II toxin-antitoxin HipBA system (PSHA_p00008-PSHA_p00009). Taking advantage of the high level of nucleotide sequence conservation among pMEGA and the P. arctica and P. nigrifaciens plasmids, a multiple alignment of the nucleotide sequence encompassing the functions involved in replication initiation (repB) and plasmid partitioning (parAB) allowed us to formulate some hypotheses concerning the regulation of these functions in the psychrophilic plasmids. repB and the parAB operon are transcribed from two divergent (likely overlapping) promoters located in the 279 bp region that separates the RepB and ParA translational start sites (this distance is 270 bp and 269 bp in the P. nigrifaciens and P. arctica plasmids, respectively). This organization suggests a common (negative) regulation of both promoters through the binding of ParA when its concentration rises as a result of a higher plasmid copy number 7.
pMEGA RepB is a Rolling Circle Replication (RCR) initiator protein belonging to the Rep_3 superfamily (PF01051). Its capacity to bind specific DNA sequences (the bind site) and to exert a topoisomerase-like function allows the enzyme to cleave a specific DNA sequence (the nick site) and to release a free 3'-OH end while it remains bound to the 5'-P end by a phosphotyrosine link 8. A careful inspection of the sequence downstream of the repB gene reveals two direct repeats, located 45 bp from a potential hairpin-forming sequence, which may represent the nick site 8.
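A potential hairpin-forming sequence of the kind mentioned above is an inverted repeat: a stem, a short loop, and the reverse complement of the stem. The following brute-force scan is a minimal sketch under stated assumptions and is not the inspection method used in this study. The function names, the stem length, the loop-size range, and the example sequence are all hypothetical, and only perfect stems are reported.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_hairpins(seq: str, stem: int = 6, min_loop: int = 3, max_loop: int = 8):
    """Return (stem start, stem length, loop length) for perfect inverted repeats.

    Assumed parameters: a fixed stem length and a loop of min_loop..max_loop bases.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        target = reverse_complement(seq[i:i + stem])
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop          # start of the candidate right arm
            if j + stem > len(seq):
                break
            if seq[j:j + stem] == target:
                hits.append((i, stem, loop))
    return hits


# Invented example: stem GCGCAT, loop AAAA, then the reverse complement ATGCGC.
example = "TTGCGCATAAAAATGCGCTT"
print(find_hairpins(example))  # [(2, 6, 4)]
```

Real nick-site hairpins often contain mismatches, so a practical search would tolerate imperfect stems or use a dedicated secondary-structure tool. This sketch only illustrates the underlying idea.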

Supplementary Figures
Supplementary Figure S1. Multi-panel plots of average base quality per read vs. read length. (A) PacBio raw reads. (B) ONT raw reads. Within each plot, the read length histogram is shown in the top panel and the histogram of average base quality per read in the side panel. Plots were generated using NanoPlot 9.
The PacBio read quality distribution had a single peak centred around the average quality score, while the ONT read quality distribution had multiple peaks and higher variability.
Supplementary Figure S2. Venn diagrams showing the number of residual SNPs (A) and InDels (B) that overlapped between the drafts assembled from the three technologies. The total number of residual InDels per draft is lower than that listed in Table 2 because homopolymer insertions were collapsed when drawing the Venn diagram, whereas they are counted as multiple insertions in Table 2.
Supplementary Figure

Supplementary Tables
Supplementary Table S1