Background & Summary

Diatoms (Bacillariophyta) are unicellular algae with silicified cell walls and represent one of the most ecologically important phytoplankton groups1,2. Diatoms are estimated to contribute approximately 20% of global primary production and up to 40% of marine primary production3. Diatoms are also considered the most species-rich class of microalgae, with estimates ranging from 12,000 to 30,000 species4,5,6. To date, chromosome-level genome assemblies have been constructed for only a handful of diatom species, including Thalassiosira pseudonana7, Phaeodactylum tricornutum8, Fistulifera solaris9,10 and Skeletonema marinoi11. This limited number of high-quality genome assemblies severely hinders in-depth research on the internal phylogeny and evolutionary adaptation of diatoms.

Skeletonema is one of the most common diatom genera and dominates most coastal waters; some of its species often form harmful algal blooms (HABs)12,13,14,15. Among the Skeletonema species, S. marinoi is the dominant phytoplankton species in colder waters (high-latitude ocean regions, and temperate ocean regions during the winter-spring seasons)12,16. Interestingly, S. tropicum has a dramatically different temperature preference: it appears in tropical ocean regions and in temperate ocean regions during the summer-autumn seasons12,16,17. Despite the ecological importance of Skeletonema species, genomic information on them is rather limited. To date, organelle genomes, including mitochondrial genomes (mtDNAs)18 and chloroplast genomes (cpDNAs)19, have been constructed for five Skeletonema species: S. marinoi, S. tropicum, S. grevillei, S. pseudocostatum and S. costatum. The genetic structures of these organelle genomes are conserved among Skeletonema species and therefore cannot explain their mechanisms of ecological adaptation. The first chromosome-level genome assembly of a Skeletonema species, S. marinoi, was recently constructed11. The availability of this genome assembly led to the discovery of a substantial expansion of light-harvesting gene and photoreceptor gene families, which might help S. marinoi adapt to low-light conditions during the winter-spring seasons. However, the whole genome of S. tropicum was still lacking, which hampered comparative genomic analysis among the Skeletonema species.

In this study, we report the first chromosome-level genome assembly of the high-temperature-preferring Skeletonema species S. tropicum (Fig. 1A). The contig-level genome assembly of S. tropicum, generated using PacBio single-molecule DNA sequencing technology20, was 78.69 Mb in size with a contig N50 of 606.27 Kb. To obtain a high-quality genome assembly at the chromosome level, high-throughput chromatin conformation capture (Hi-C)21 was used and the contigs were clustered into 23 chromosomes, corresponding to 91.10% of the total contig length. The final assembled genome size of S. tropicum was 78.78 Mb with a scaffold N50 of 3.17 Mb. A total of 20,613 putative protein-coding genes (PCGs) were predicted in S. tropicum, of which 86.14% were annotated in publicly available databases. The chromosome-level genome assemblies of the high-temperature-preferring species S. tropicum and the low-temperature-preferring species S. marinoi provide a valuable platform for elucidating the mechanisms of temperature adaptation for surviving adverse environments.

Fig. 1 Construction of the first chromosome-level genome assembly of S. tropicum. (A) Circos plot of the S. tropicum genome assembly. From the outer to the inner layers: chromosomes (a), repetitive elements (b), gene densities (c) and GC contents (d). The innermost layer shows the collinear gene pair blocks. (B) Hi-C intra-chromosomal contact map of the S. tropicum genome assembly.

Methods

Strain isolation and genome sequencing

The S. tropicum strain (CNS00166) analysed in this study was isolated using the single-cell capillary method from seawater collected in Jiaozhou Bay, China, in October 2019. The CNS00166 strain was purified repeatedly using sterilized seawater. The strain is maintained and available at the Key Laboratory of Marine Ecology and Environmental Science, Institute of Oceanology, Chinese Academy of Sciences. The axenic culture of this strain was maintained in L1 medium22. To ensure low bacterial contamination, a penicillin and streptomycin stock solution was added to the culture. The culture conditions, including culture seawater, temperature, salinity and irradiance intensity, were described previously18. The S. tropicum cells for sequencing were collected by centrifugation and stored in liquid nitrogen. The mtDNA and cpDNA of S. tropicum strain CNS00166 have been reported previously18,19.

High-quality, long-fragment DNA (≥40 Kb) was extracted using a magnetic-bead-based protocol11 and used for library preparation. For genome survey analysis, short reads were obtained by MGI short-read sequencing: the MGI sequencing library (DNBSEQ) was constructed and sequenced on the MGISEQ-2000-PE150 platform. A total of 40.91 Gb (519X sequencing depth) of short reads were obtained for the genome survey and genome assembly (Table 1). For chromosome-level genome assembly, a PacBio continuous long read (CLR) sequencing library was constructed and sequenced using a PacBio Sequel SMRT Cell 1 M. As a result, 10.04 Gb (127X sequencing depth) of PacBio long reads were obtained (Table 1). The N50 length and maximum length of the PacBio sequencing reads were 18.18 Kb and 215.08 Kb, respectively. For the Hi-C analysis, algal samples were processed as previously described11 and the Hi-C library was sequenced on the MGISEQ-2000-PE150 platform. This process yielded a total of 50.93 Gb of raw data for predicting the spatial proximity of chromatin loci. For transcriptome sequencing, three replicate samples of S. tropicum in the exponential growth phase were collected by centrifugation. High-quality RNA was extracted using the cetyltrimethylammonium bromide (CTAB) method11, followed by RNA quality checking using an Agilent 2100 Bioanalyzer and a NanoDrop. The short-read and full-length transcriptome libraries were sequenced on the MGISEQ-2000-PE150 platform and a PacBio Sequel SMRT Cell 1 M, respectively.

Table 1 Statistics of S. tropicum genome assembly and annotation.
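
The sequencing depths reported above can be checked with a short calculation. The sketch below assumes that depth was computed as total sequenced bases divided by the final assembly size of 78.78 Mb; the text does not state which genome size was used as the denominator, so this choice is an assumption, and the function name is illustrative only.

```python
def sequencing_depth(total_bases_gb: float, genome_size_mb: float) -> float:
    """Return fold coverage: total sequenced bases divided by genome size."""
    return (total_bases_gb * 1000.0) / genome_size_mb


# Assumption: the final assembly size (78.78 Mb) is the denominator.
ASSEMBLY_SIZE_MB = 78.78

mgi_depth = sequencing_depth(40.91, ASSEMBLY_SIZE_MB)     # MGI short reads
pacbio_depth = sequencing_depth(10.04, ASSEMBLY_SIZE_MB)  # PacBio CLR reads

print(f"MGI short reads:  {mgi_depth:.0f}X")     # prints 519X
print(f"PacBio CLR reads: {pacbio_depth:.0f}X")  # prints 127X
```

Under this assumption the reported 519X and 127X values are reproduced from the 40.91 Gb and 10.04 Gb data volumes.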

Genome survey and genome assembly

The genome survey was conducted based on the k-mer distribution of the short reads using Jellyfish V2.1.423 (k-mer size = 21) and GenomeScope V1.024. The estimated genome size of S. tropicum (strain CNS00166) was 73.10 Mb, with a heterozygosity ratio of 0.73% and a repeat ratio of 48.60% (Table 1).
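
The principle behind a k-mer-based genome size estimate can be illustrated with a simplified calculation: the total number of k-mer occurrences divided by the depth of the main peak of the k-mer histogram. The sketch below is not the mixture model fitted by GenomeScope; the function name, the error cutoff `min_depth` and the toy histogram are hypothetical and serve only to show the idea.

```python
def estimate_genome_size(histogram: dict[int, int], min_depth: int = 5) -> float:
    """Estimate genome size as (total k-mer occurrences) / (main peak depth).

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored.
    """
    usable = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The main peak is the depth shared by the largest number of distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram: low-depth error k-mers, a main peak at depth 20,
# and a small repeat peak at depth 40.
toy_histogram = {1: 900, 2: 100, 18: 40, 19: 80, 20: 120, 21: 80, 22: 40, 40: 10}
print(estimate_genome_size(toy_histogram))  # 380.0
```

In a heterozygous genome the histogram usually has separate heterozygous and homozygous peaks, which is one reason a model-based tool is preferred over this naive peak-based estimate.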

The PacBio long-read data were used for de novo genome assembly with MECAT225. The primary assembly was polished with Arrow (https://github.com/PacificBiosciences/GenomicConsensus) using the PacBio long reads and with Pilon26 using the short reads. Purge Haplotigs27 was used to remove redundancy from the assembled genome. The size of this genome assembly was 78.69 Mb, which was similar to the genome size estimated from the k-mer analysis. The assembled genome consisted of 376 contigs with an N50 of 606.27 Kb. The completeness and quality of this genome assembly were evaluated with BUSCO v5.4.328 against the stramenopiles_odb10 data set. Among the BUSCO orthologous groups, 96.00% were identified as complete in the assembled genome (Table 2).

Table 2 Summary of BUSCO analysis of genome assembly and annotation in S. tropicum.
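
For readers unfamiliar with the N50 statistic used for contigs and scaffolds, the following minimal sketch shows its standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The contig lengths are toy values, not data from this study.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be positive")


toy_contigs = [10, 8, 6, 4, 2]  # total = 30, half = 15
print(n50(toy_contigs))  # 8, because 10 + 8 = 18 >= 15
```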

A total of 50.93 Gb of raw Hi-C sequencing data were obtained (Table 1) and subjected to quality control with HiC-Pro v2.5.029. The contigs were anchored onto chromosome-level scaffolds using Juicer v1.630 and 3D-DNA31. As a result, 23 chromosome-level scaffolds were obtained with an anchoring rate of 91.10% (Fig. 1), ranging in length from 1558 Kb to 5738 Kb (Table 3). The anchoring rate was somewhat low, probably because of the high heterozygosity and repeat content of S. tropicum: the final assembled contigs might contain some redundant, highly heterozygous allelic sequences, and only one copy of each such sequence was anchored into the genome assembly with the help of the Hi-C data. Finally, the size of the genome assembly was 78.78 Mb with a scaffold N50 of 3.17 Mb.

Table 3 Statistics of chromosome length in S. tropicum.

Genome annotation

Genome annotation comprised three parts: annotation of repetitive elements, non-coding RNAs and PCGs. Homology-based repetitive elements were predicted with RepeatMasker v4.0.732 and RepeatProteinMask v4.0.7 (http://www.repeatmasker.org/cgibin/RepeatProteinMaskRequest) based on the RepBase v21.12 database33. For de novo prediction, a de novo repetitive element database was first generated with RepeatScout34, Piler35 and LTR_FINDER v1.0736, and repetitive elements were then predicted with RepeatMasker using this database. Combining the homology-based and de novo approaches, a total of 38.73 Mb of transposable elements (TEs) were identified, accounting for 49.17% of the assembled genome (Table 4). DNA, LINE, SINE and LTR elements accounted for 5.39%, 5.18%, 0.065% and 25.26% of the genome, respectively. In addition, tandem repeats were annotated with Tandem Repeats Finder (TRF v4.09)37, and a total of 6.08 Mb of tandem repeats were obtained, accounting for 7.72% of the genome.

Table 4 Statistics of transposable elements (TEs) in S. tropicum.
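
When repeat annotations from homology-based and de novo approaches are combined, overlapping annotations need to be merged so that no base is counted twice in the total repeat length. The text does not describe how this was done in the study; the sketch below shows one common way, assuming half-open (start, end) intervals on a single sequence, with toy coordinates and a hypothetical function name.

```python
def merged_length(intervals: list[tuple[int, int]]) -> int:
    """Total number of bases covered by half-open (start, end) intervals."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Toy annotations on a 1,000 bp sequence (hypothetical coordinates).
homology_based = [(0, 100), (250, 300)]
de_novo_based = [(50, 150), (400, 450)]

repeat_bp = merged_length(homology_based + de_novo_based)
print(repeat_bp)                      # 250
print(f"{repeat_bp / 1000:.2%}")      # 25.00%
```

Summing the two annotation sets without merging would give 300 bp in this toy case, overstating the repeat fraction.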

Non-coding RNAs were annotated by type, including tRNA, rRNA, snRNA and miRNA. The tRNAs were predicted with tRNAscan-SE38. The rRNAs were annotated with Blast v2.2.3139 using the reference sequences of S. marinoi. The snRNAs and miRNAs were identified using INFERNAL against the RFAM database40.

The PCGs were annotated using an integrated approach combining de novo, homology-based and transcriptome-based evidence. The de novo prediction was conducted using AUGUSTUS41 and SNAP42, which yielded 24,008 and 31,109 genes, respectively. For the homology-based prediction, the PCG sequences of closely related or model species, including S. marinoi11, T. pseudonana7, Fragilariopsis cylindrus43, Seminavis robusta44, P. tricornutum8 and Arabidopsis thaliana45, were aligned against the S. tropicum genome using Blast v2.2.31, and the gene structures were then predicted from these alignments with Exonerate v2.2.046. A total of 84,803 homologous genes were obtained. For the transcriptome-based prediction, the RNA-Seq short-read data were aligned to the assembled genome with HISAT2 v2.1.047 and then assembled and corrected with StringTie v1.3.448 and Pasa_lite (https://github.com/PASApipeline/PASA_Lite). Full-length non-chimeric reads were obtained from the Iso-Seq long-read data using the SMRT Analysis System. A total of 334,554 genes were predicted from the RNA-Seq and Iso-Seq data, with some redundancy. Finally, gene models from these strategies were merged into a consensus gene set using MAKER249, yielding 20,613 predicted PCGs with an average gene length of 1675.09 bp and exon length of 750.84 bp (Table 1). The statistics of gene models, including gene length, intron length, exon number and exon length, in S. tropicum were comparable to those of S. marinoi (Fig. 2).

Fig. 2 Composition of gene elements in S. tropicum and other closely related species. (A) Distribution of gene length. (B) Distribution of exon number. (C) Distribution of intron length. (D) Distribution of exon length.

For functional prediction, these PCGs were searched against public databases, including GenBank Nr, SwissProt, Kyoto Encyclopedia of Genes and Genomes (KEGG), eukaryotic orthologous groups (KOG), TrEMBL, InterPro and gene ontology (GO), using Blast v2.2.31 with an e-value cutoff of 1e-5. Among all the PCGs, 17,757 genes (86.14%) were functionally annotated in at least one database, and 6544 genes (31.74%) were annotated in at least five databases (Table 5, Fig. 3).

Table 5 Gene function annotation statistics in S. tropicum.
Fig. 3 Venn diagram of the PCG annotation of S. tropicum in five databases: NR, InterPro, KEGG, SwissProt and KOG.
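
The e-value filtering used for functional annotation can be sketched as follows. The example assumes that the BLAST results are in the standard 12-column tabular format (`-outfmt 6`), in which the query identifier is the first column and the e-value is the eleventh; the output format actually used in the study is not stated, and the gene and protein names are toy values.

```python
def annotated_queries(blast_lines, max_evalue: float = 1e-5) -> set[str]:
    """Return the set of query IDs with at least one hit below max_evalue.

    Assumes BLAST tabular format (-outfmt 6): column 1 is the query ID
    and column 11 is the e-value.
    """
    hits = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if float(fields[10]) < max_evalue:
            hits.add(fields[0])
    return hits


# Toy BLAST records (hypothetical identifiers and scores).
toy_blast = [
    "gene1\tprotA\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t300",
    "gene2\tprotB\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.01\t25",
    "gene3\tprotC\t60.0\t150\t60\t0\t1\t150\t1\t150\t2e-10\t120",
    "gene1\tprotD\t85.0\t180\t27\t0\t1\t180\t1\t180\t3e-20\t210",
]
annotated = annotated_queries(toy_blast)
print(sorted(annotated))  # ['gene1', 'gene3']

# The reported annotation rate follows directly from the gene counts.
print(f"{17757 / 20613:.2%}")  # 86.14%
```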

Data Records

The sequencing data (including DNA short-read sequencing data, DNA PacBio long-read sequencing data, Hi-C sequencing data, RNA short-read sequencing data and RNA PacBio long-read sequencing data) are deposited in the NCBI SRA database under the accession numbers SRR2685725650, SRR2685725550, SRR2839313951, SRR2685725350 and SRR2685725250. The genome assembly and annotation results are available in the figshare database52. The genome assembly has also been deposited in NCBI under the accession number JAWZXG00000000053.

Technical Validation

Low bacterial contamination ratio

Low bacterial contamination in the axenic diatom culture was critical for obtaining a high-quality genome assembly. To check for bacterial contamination, 1 Mb of clean short-read data was randomly selected and searched against the NCBI NT database using BLAST. The results showed that the bacterial contamination of S. tropicum was as low as 0.26% (Table 6). The top 20 species to which reads were assigned in the NT database comprised Skeletonema species and other closely related species, indicating negligible bacterial contamination in this project (Table 7). In addition, the short-read DNA data were mapped to the PacBio assembled genome using BWA v0.7.1054 to evaluate GC content and sequencing depth in 1 Kb windows (Fig. 4). Almost all windows had a GC content of approximately 45%, indicating no contamination from exogenous species. The sequencing depth of many windows was close to 0, probably because of the high repeat content of the S. tropicum genome: reads from repetitive regions usually map to multiple locations in the genome assembly and were filtered out by alignment score in the BWA alignment, so the sequencing depth at some locations appeared to be 0. Taken together, these results suggest that the genome assembly of S. tropicum was not contaminated by bacteria or other species.

Table 6 Statistics of clean reads of short DNA sequences annotated to NT database.
Table 7 Top 20 species of reads annotated to NT database (length ≥ 100 bp).
Fig. 4 Distribution of GC ratio and sequencing depth. Histograms on the top and right show the frequency distributions of GC ratio and sequencing depth, respectively.
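
The windowed GC-content statistic used in the contamination check can be illustrated with a short sketch. The script used in the study is not described, so the function below is only one plausible way to compute GC content in fixed-size, non-overlapping windows; the window size of 8 bp and the sequence are toy values chosen for readability (the study used 1 Kb windows).

```python
def gc_content_by_window(sequence: str, window: int = 1000) -> list[tuple[int, float]]:
    """Return (window start, GC percentage) for non-overlapping windows.

    Ambiguous bases such as N are excluded from the denominator, and
    windows that contain no A/C/G/T bases are skipped.
    """
    results = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window].upper()
        called = sum(chunk.count(base) for base in "ACGT")
        if called == 0:
            continue
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, 100.0 * gc / called))
    return results


toy_sequence = "GCATGCAT" + "GGGGCCCC" + "ATAT"
print(gc_content_by_window(toy_sequence, window=8))
# [(0, 50.0), (8, 100.0), (16, 0.0)]
```

Windows from a contaminating organism with a different base composition would appear as a separate cluster away from the main GC value, which is the pattern the analysis looks for.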

Evaluating genome assembly and annotation completeness

In this study, 519X of MGI short reads and 127X of PacBio long reads were used, which helped ensure the quality of the genome assembly. The quality of the genome assembly and annotation was assessed by BUSCO analysis (Table 2). The results showed that 96.00% and 86.00% of the orthologs were identified as complete for the genome assembly and the PCG annotation, respectively, indicating the high quality of this genome. Thus, despite the high heterozygosity and repeat content, a high-quality genome assembly was obtained. The Hi-C heatmap shows a well-organized interaction pattern within the chromosomal regions (Fig. 1), and the assembly resulted in 23 chromosome-level scaffolds. Collinearity analysis of the amino acid sequences of PCGs between S. tropicum and its congener S. marinoi was conducted (Fig. 5A): homologous PCGs were identified using Blast v2.2.31 with an e-value less than 1e-05, and the results were analysed and visualized with WGDI55 and Circos56. Collinearity analysis of the DNA sequences (Fig. 5B) was also conducted using MUMmer 3.057 with a minimum alignment length of 1000 bp and many-to-many alignment allowing for rearrangements. The results showed that almost all chromosomes of S. tropicum displayed high homology with the chromosomes of S. marinoi. The strong collinearity between these two closely related species indicates the high quality of the sequencing and assembly of S. tropicum. Taken together, these results support the accuracy of the genome assembly and annotation.

Fig. 5 Collinearity analysis between S. tropicum and S. marinoi based on amino acid sequences (A) and DNA sequences (B).