Background & Summary

Spironucleus salmonicida (“the salmonid killer”) causes systemic infections in farmed Atlantic salmon, Chinook salmon and Arctic char1,2, thus poses a threat to sustainable aquaculture. Outbreaks of spironucleosis in farmed Atlantic salmon, Salmo salar, is a recurring problem and causes mass mortality and economical loss in for example Northern Norway. Salmon infected with S. salmonicida develops internal haemorrhaging, splenomegaly and granulomatous lesions in the liver and spleen, and drug treatment is not possible. This makes studies of the parasite important in order to develop alternative strategies to manage the parasite3.

S. salmonicida belongs to diplomonads, a group of unicellular protists with two diploid nuclei bearing different life styles. There are parasitic diplomonads like S. salmonicida, for example Giardia species which cause diarrhoea in various animals including humans4. All members of the Giardia genus are strictly intestinal parasites, while S. salmonicida is a well-adapting pathogen that can colonize different sites in the host5. It was shown in our previous study that S. salmonicida possesses an extended metabolic repertoire and more extensive gene regulation, probably making it more adapted to cope with environmental fluctuations6. There are also free-living diplomonads like Trepomonas sp. PC1, and comparative genomics have shown that the free-living life style most likely is a secondary adaptation and evolved from its parasitic ancestor7,8.

We recently published two Giardia genome assemblies, Giardia intestinalis WB9 and Giardia muris10, in near-complete chromosomes using Pacbio reads alone or in combination with optical maps. With similar sequencing and assembly strategy, we obtain a high-quality reference genome of S. salmonicida in near-complete chromosomes. Diplomonad genomes in near-complete chromosomes make it possible to study gene organization at the chromosomal level, and provide ground for studying chromosomal evolution.

Methods

DNA preparation and sequencing

S. salmonicida (ATCC 50377), previously known as Spironucleus barkhanus2, was isolated from a muscle abscess in Atlantic salmon grown in Vesterålen Sea in northern Norway. Cells were obtained from American Type Culture Collection (ATCC) in 2008 and grown axenically in slanted polypropylene tubes using a modified liver digest yeast extract (LYI) medium11 at 16 °C. Stocks of the S. salmonicida ATCC isolate were cryopreserved in liquid nitrogen. DNA extraction was performed in 2015, new batch of cells were taken up from the cryopreserved stock, cultured until confluent and DNA extracted directly to ensure minimal accumulation of mutations. Around 108 cells were harvested for DNA extraction using a phenol-chloroform extraction. The DNA was then purified using the Qiagen Genomic Tip 100/G according to the manufacturer protocol. The purified DNA was quantified using a Qubit fluorometer and quality checked using a NanoDrop and agarose gel electrophoresis. DNA was then stored at −20 °C and 40 μg of gDNA was sent directly for sequencing at the Uppsala Genome Center hosted by the Science for Life Laboratory (Uppsala University). A 10 kbp PacBio library was generated following the standard SMRT bell construction protocol according to the manufacturer recommended protocol. The library was sequenced on 6 SMRT cells of the Pacbio RS II instrument using the P6-C4 chemistry, which generated 267,495 reads in 2.6 billion bases (Table 1) with an N50 length of 14.6 kbp.

Table 1 Comparison of the old and the new S. salmonicida genome assemblies.

Genome assembly

The genome was assembled using the same method described in the G. intestinalis WB genome publication9. HGAP12 was used for de novo genome assembly, followed by consensus sequence calling with Quiver12. Both programs are part of SMRT Analysis (v2.3.0) pipeline12 from Pacbio. This yielded 61 contigs.

The contigs were then mapped to the optical maps of the nine chromosomes (Fig. 1, obtained for the old genome assembly in 20116) using MapSolver (v3.2.0) from OpGen, and the mapping information was used to stitch together neighboring contigs into scaffolds, as described in the G. intestinalis WB genome publication9. To further close the gaps in the nine scaffolds, PBJelly (v15.8.24)13 was run using the Pacbio reads, and the resulting scaffolds were polished with Quiver12. Canu (v1.4)14 was used to assemble reads that failed to map to the scaffolds, which generated 33 contigs with sizes <  = 36 kbp. These small contigs do not map to the optical maps because they are below the size limitation for a sequence to uniquely map to an optical map. Another round of Quiver polishing was applied on Canu contigs combined with the nine scaffolds. To further improve per base quality, DNA Illumina paired-end reads (SRR94859415) were mapped to the draft assembly using BWA-MEM (v0.7.15)16 and the resulting BAM file was fed to Pilon (v1.21)17 to update indels and SNPs. The final assembly has a size of 14.7 Mbp, and is distributed in 42 scaffolds with nine major ones representing 96.5% of the total size (Table 1).

Fig. 1
figure 1

Nine near-complete chromosomes. Restriction enzyme (NheI) maps of the nine chromosomes aligned with the genomic sequences digested with NheI in silico. Each vertical line inside boxes represents a restriction enzyme cutting site. Gaps in the genomic sequences are represented with a horizontal line outside of boxes.

Heterozygosity estimation

DNA Illumina reads (SRR94859415,) were re-mapped to the base-corrected sequences using BWA (v0.7.15)16 and samtools (v1.8)18 mpileup result was generated and parsed. Sites with at least 20X base coverage and an alternative base in at least 10% of the reads were called as SNP (or allelic sequence heterozygosity (ASH)) sites.

Genome annotation

Annotation from the old S. salmonicida genome assembly6 was transferred to the new one using RATT (v0.95)19. 541 RATT transferred genes were selected to train GlimmerHMM (v3.0.1)20, which together with Prodigal (v2.6.3)21 were used for gene structural prediction. Functional annotation was performed using a combination of similarity information from BLASTP22 search against NR database and domain information from Conserved Domain (CD) search23. Transferred and predicted genes were then merged, and the inconsistent ones were manually inspected. RNA-seq reads (SRR94859524) mapped to the genome assembly were used as a guideline for structural annotation during manual examination, and the mapping was carried out with BWA (v0.7.15) since the old genome assembly contained only four introns. Searching for the conserved AC-repeat intron motif6 revealed no new intron in the new genome assembly.

The functional annotation was further improved by mining metabolic genes in Pathway Tools v21.525 as described in G. muris genome publication10. An updated and curated G. intestinalis WB genome assembly9 has been published since the previous S. salmonicida genome was annotated. Genes shared with this assembly (BLASTP e-value <1e-10) but annotated differently were double checked to incorporate the updated annotations when applicable.

There are in total 8,661 protein-coding genes (plus an additional six partial and 194 pseudo genes) annotated in the new genome assembly (Table 1). Although the number of genes increased compared to the old genome assembly, with a bigger genome size and a bigger mean intergenic region size, coding density of the new genome turns out to be slightly smaller (Table 1).

5 S ribosomal RNAs (rRNAs) were predicted using rnammer (v1.2)26, and there are 40 copies of them in tandem array located on chromosome 9 (Fig. 2). 18 S, 28 S and 5.8 S rRNA annotations were transferred from the old genome assembly, and there is one copy of each, all found in a single small contig just like in the old genome assembly6. In agreement with the old assembly, ribosomal RNA contig has a 15 times higher Pacbio read coverage compared to the genomic average, indicating the contig is most likely a collapse of repetitive reads. tRNAs were identified using tRNAscan-SE (v1.23)27, and there are more duplicated copies of tRNAs annotated in the new genome assembly (Table 1).

Fig. 2
figure 2

Circular plot of the nine chromosomes. Chromosomal sequences are represented in grey at the outermost circle with gaps in white bands and telomeres in red. Inner tracks are arranged as: GC%, 5 S rRNA/reverse transcriptase/CRMP1, CRMP2/Histone H4, coding density, SNPs density, regions with similarity. Regions with similarity represent BLASTN matches against itself with > = 95% sequence identity and > = 2000 bp in size, and two repetitive regions of 64 kbp in size are highlighted red. The circular plot was drawn with R package circlize (v0.4.8)34.

No sign of contamination has been observed at assembly nor annotation level.

Data Records

This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession AUWU00000000. The version described in this paper is version AUWU0200000028, and the GenBank assembly accession is GCA_000497125.229. Raw DNA sequence reads from Pacbio are deposited at NCBI Sequence Read Archive (SRA) under accession number SRP02856530. Pacbio reads mapped CRAM file (BAM file converted to CRAM file using samtools view command to reduce file size) is provided in the Figshare database under Digital Object Identifier (DOI) code31, and Illumina reads mapped BAM file is available in Figshare database under DOI code32.

Technical Validation

Genome completeness

Restriction enzyme (NheI) maps of the nine chromosomes align well with the genomic sequences digested with NheI in silico (Fig. 1). In fact, 95.2% of the optical maps are covered by the assembled genomic sequences. Interestingly, the right ends of chromosome 2 and 3 were reported as incomplete maps because of low coverage. However, long-read assembly extends well beyond the incomplete right ends of the optical maps, and the right end of chromosome 3 ends in telomeric repeats (TAGG)n. In fact, ten out of 18 chromosome ends are assembled into telomeric repeats, specially, chromosome 1, 5, 7 and 8 have both telomeric ends complete (Fig. 2). There are terminal gaps (Ns) at four out of the eight chromosome ends missing telomeric repeats, and the sizes were determined by the alignment to the optical maps. Three internal gaps also present in the new genome assembly, and the largest of all seven gaps is the internal gap located on chromosome 2 at a size of 87.9 kbp (Fig. 2). The new genome assembly is highly contiguous compared to the old fragmented genome assembly with 232 gaps distributed in 233 scaffolds, and preserves the synteny when compared to the old genome sequences (Fig. 3).

Fig. 3
figure 3

Dotplot of the nine chromosomes (new) vs. the old genome sequences. Blue represents forward matches while red represents reverse complement matches. MUMmer (v3.23)35 was used for the dotplot. DNA sequence alignment was generated using nucmer, and the alignment delta file was fed into mummerplot with ‘–layout’ turned on so that sequences are ordered and oriented in a way that the largest hits cluster near the main diagonal. Mplotter36 was then used to generate dots, lines and ticks from mummerplot output for drawing dotplot with R package ggplot237.

The new genome assembly is also better at resolving duplicated regions, which is defined as BLASTN matches against itself with >  = 2000 bp and 95% identity. In fact, 2.7 Mbp (19.2%) of the new genome sequences are characterized as duplicated regions compared to 1.5 Mbp of the old genome sequences. Regions involved in duplications are most commonly observed in short stretches of sequences with just one gene. However, we see also multi-gene regions involved in duplications up to 64 kbp in size (Fig. 2, ribbons highlighted in red in the innermost circle), which might be the largest repetitive regions the assembly could resolve due to the limitation of the read length. The largest BLASTN match is on chromosome 9, which is 33 kbp in size overlapping itself for 2 kbp resulting in a tandem repeat of 64 kbp in size (Fig. 2, red ribbon at chromosome 9 in the innermost circle). Manual examination of the Pacbio long read pileup and coverage at the two largest regions involved in duplications revealed the read coverage is in line with the neighbouring regions as well as the average coverage, indicating repetitive regions were correctly resolved. Genes found in duplicated regions are often members of multi-gene families. Duplicated regions also maintain higher GC contents with a mean GC level at 44.7%, while the rest of the genome sequence has a mean GC level at 30.7% and the whole genome sequence has a mean GC level at 33.5% (Table 1). Better resolved duplicated regions probably contribute to the observation of lower ASH level in the new genome assembly (Table 1).

Annotation improvement

With this near-complete reference genome, we are able to present here a more complete annotation. Among the 8,661 full length protein-coding genes, 5,016 genes have exact copies as in the old genome assembly, 1,273 genes have the same sequences but updated descriptions, and 287 genes are the same but with SNPs. Start codons of 976 genes were adjusted, based on the alignments of orthologous genes and A-rich motifs at the start codons6. 3′ end sequences of 156 genes were updated due to differences of short insertions and deletions between the two genome sequences. 303 genes share sequence similarity with genes already annotated, but no distinct orthologous gene could be assigned. There are 650 genes which are unique to the new genome assembly, most of them are hypothetical proteins, but there are also genes with putative functional annotations. The most interesting unique gene is Histone H4, which was completely missing in the old genome assembly, but is presented in ten copies (plus five pseudo copies) in the new genome assembly with nine (plus the five pseudo copies) of them arranged in tandem array interspersed by four copies of reverse transcriptase (RT) on chromosome 3 (Fig. 2).

Better genome assembly leads also to better annotation of protein families. S. salmonicida harbors a large group of cysteine-rich membrane proteins, which were divided into cysteine-rich membrane protein 1 (CRMP1) and cysteine-rich membrane protein 2 (CRMP2)6. CRMP1s resemble variant-specific surface proteins (VSPs) in G. intestinalis structurally with a transmembrane domain at the 3′ end followed by a pentapeptide motif6, and VSPs are known to be expressed on the cell surface of the parasite to assist the parasite to bypass host immune system33. We find more CRMP1s and CRMP2s in the new genome assembly, now 138 CRMP1s (plus 22 pseudo and 1 partial) compared to 125 in the old genome assembly and 248 CRMP2s (plus 30 pseudo and 1 partial) compared to 195.

With nine near-complete chromosomes, we could visualize how these cysteine-rich membrane proteins are organized along the chromosomes (Fig. 2). The gene families are enriched in arrays on certain chromosomes, especially chromosome 2 (42 CRMP1s and 99 CRMP2s) and chromosome 9 (59 CRMP1s and 41 CRMP2s). This organization of gene families is different from what has been observed in the Giardia genomes. Cysteine-rich protein family specifically VSPs are enriched at terminal ends in G. muris genome10, while they are all over the chromosomes in G. intestinalis WB genome9. We also noticed that the regions around the cysteine-rich membrane proteins as well as the chromosomal ends tend to be gene-poor (Fig. 2, green dots in coding% track), and these regions tend to have higher allelic sequence variation (Fig. 2, red dots in SNPs track) and GC level. Similar patterns have been observed in the G. intestinalis WB genome9, suggesting that gene conversion mechanisms might be involved in the expansion of the antigenic surface protein repertoire.