Background & Summary

Neosalanx taihuensis, a member of the Salangidae family of the Osmeriformes, is an economically important aquaculture fish in China with a transparent body and feeds on zooplankton1,2,3. N. taihuensis is endemic to fresh and brackish waters widely distributed in China and has been artificially introduced to numerous lakes and reservoirs due to its high commercial value4. The natural population is not only distributed in the estuary area from the Yellow and Bohai Seas to the coast of the South China Sea but also in the main streams of the Yangtze River, Huai River and Yellow River and their subsidiary lakes5,6. Among these sites, the Yangtze River basin and Taihu Lake are the core habitats of the natural population of N. taihuensis7,8,9. The natural N. taihuensis population size has declined seriously due to overfishing and habitat destruction10. Fortunately, artificial translocation activities for N. taihuensis have greatly expanded the spatial-geographic distribution and population diversity of the species7,11,12,13. Translocation activities in waters such as the Erhai Sea, Fuxian Lake, Dianchi Lake and the Three Gorges Reservoir have resulted in the formation of stable populations of N. taihuensis in these new habitats8,11,14,15. The study of genetic diversity between translocated and natural populations has become an interesting issue for researchers, and a variety of molecular markers, including COI, microsatellites, and Cytb, have been developed7,9,11,16,17. The analysis of these markers has shown that the genetic diversity of the translocated population of N. taihuensis was higher than that of the natural population and has preliminarily revealed the molecular mechanism of N. taihuensis adaptation to the environment7,9,11,16,17.

More specifically, translocated N. taihuensis also exhibit plasticity in their reproductive biology. In natural habitats such as the Yangtze River basin and Taihu Lake, N. taihuensis commonly has two breeding groups, a spring breeding group and an autumn breeding group18,19,20,21,22,23, with the spring breeding group being the main source of population supplementation21,24. In contrast, the reproductive pattern of the translocated population of N. taihuensis has changed. The reproductive behavior of the translocated N. taihuensis population in Erhai shows only one spawning period, i.e., from late autumn to early winter22. In contrast, the translocated population of N. taihuensis in Dianchi has formed three reproductive groups, namely, the winter group, the autumn group and the spring group23. The translocated population has a longer reproductive period than the natural population, showing a more obvious adaptation in reproductive strategy. The current research on the differentiation of reproductive populations of N. taihuensis is limited to the description and statistics of epigenetic phenomena, and few studies have investigated the relevant genetic mechanism and molecular evolution.

Until now, molecular biology and genomic research on N. taihuensis has been rare due to the lack of a reference genome. The lack of information on the N. taihuensis genome greatly limits the study of N. taihuensis phylogeny and genetic differentiation. Likewise, it is not possible to explore the adaptation and reproductive strategies of N. taihuensis at the genomic level.

In this study, we report a gap-free genome assembly for N. taihuensis combining short reads, PacBio HiFi long reads, Nanopore ultralong reads and Hi-C data. The assembled N. taihuensis genome was approximately 397.29 Mb with a contig N50 of 15.61 Mb. Gene annotation yielded 20,024 protein-coding genes, and 98.16% of the predicted genes were annotated in publicly available biological databases, including NR, GO, KOG, KEGG, TrEMBL, Interpro and SwissProt. This high-quality, gap-free assembled genome will provide an important resource for studying the reproductive biology and ecological adaptability of N. taihuensis.

Methods

Ethics declarations

This work was approved by the Bioethical Committee of Freshwater Fisheries Research Center (FFRC) of the Chinese Academy of Fishery Sciences (CAFS) (FEH20200807, 2020/08/07). Sampling was performed in strict accordance with Freshwater Fisheries Research Center Experimental Animal Ethics Guidelines.

Sample collection

Muscle tissue samples were collected from adult N. taihuensis for this study (Fig. 1). The collection site was located at Taihu Lake, Huzhou, Zhejiang Province (coordinates: E120°5′0.999996″, N31°0′59.999976″). Sampling was performed in strict accordance with relevant Chinese laws and experimental ethical guidelines. After the muscle tissue samples were collected, they were rapidly frozen in liquid nitrogen and stored at −80 °C until DNA extraction.

Fig. 1
figure 1

Demonstration image of N. taihuensis.

DNA and RNA extraction, library construction, sequencing, assembly, and bioinformatics analyses in this study were performed using standard experimental and analytical protocols from BGI Genomics (Shenzhen, China).

RNA isolate, cDNA library construction and sequencing

For gene structure annotation, RNA was isolated from the muscle tissue samples using the TRIzol Total RNA Isolation Kit (Takara, USA) following the manufacturer’s protocols25. Then, the RNA was sheared and reverse transcribed using random primers to obtain cDNA, which was used for library construction. The library quality was determined using a Bioanalyzer 2100. Subsequently, these libraries underwent paired-end sequencing with a read length of 150 bp on the BGISEQ sequencing platform (BGI).

WGS library construction, sequencing and genome survey

Extracted DNA from N. taihuensis muscle tissue using hypervariable minisatellite probe (MZ 1.3), along with locus-specific minisatellite probes (g3, MS1, MS43). Fragmented this DNA between 50 and 800 bp using a Covaris E220 ultrasonicator, following manufacturer guidelines, creating a short insert whole-genome shotgun (WGS) library. Built and sequenced a library with fragments between 300 and 400 bp on the MGISEQ platform. Generated 45.69 Gb DNBSEQ data for short inserts, offering insights into the N. taihuensis genome (Table 1). Utilized the FastQC (v0.1)26 to remove low-quality or adapter-linked reads. From the refined data, determined the K-mer frequency distribution using Jellyfish (v2.2.6)27 and analyzed with GenomeScope (v1.0)28. Determined the N. taihuensis genome to be around 356 Mb with a heterozygosity rate of 0.77% (Fig. 2 and Table 2).

Table 1 Sequencing data used for the genome N. taihuensis assembly.
Fig. 2
figure 2

K-mer analysis of N. taihuensis genome.

Table 2 The information of genome survey analysis.

PacBio library construction, sequencing and de novo assembly

DNA from N. taihuensis muscle tissue was extracted using a QIAGEN Blood & Cell Culture DNA Midi Kit (QIAGEN, Germany). A PacBio library with an insert size of around 20 kb was then prepared using the SMRTbell Express Template Prep Kit 2.0 from PacBio (Pacific Biosciences, USA). It was sequenced on a PacBio Sequel II SMRT cell in CCS mode. After processing with the SMRT Link (v8.0.0)29 CCS algorithm with parameters “--minPasses 3 --minPredictedAccuracy 0.99 --minLength 500”, 25.88 Gb HiFi reads were obtained, excluding adaptors and less accurate reads. The reads had an N50 length of 15.17 kb and an average length of 14.91 kb (Table 1). The initial genome de novo assembly was done using Hifiasm (v0.15.1)30 with standard settings, and any redundant sequences were later purged using the Purge-Haplotigs31 program with the parameters “-j 80 -s 80 -a 75”.

Hi-C library preparation, sequencing and chromosome anchoring

A Hi-C library was created using the Mbo I restriction enzyme32. Muscle tissue samples underwent 1% formaldehyde treatment at room temperature for 10–30 minutes to crosslink chromatin-interacting proteins. Post-digestion with Mbo I restriction enzyme (NEB, Ipswich, USA), fragment ends were flattened, repaired, biotin-labeled, and ligated to form loops using T4 DNA ligase (Thermo Scientific, USA). After protein removal and ultrasound disruption of the loops, the Hi-C library was sequenced on an MGISEQ platform. For the chromosome-level assembly, 69.21 Gb of Hi-C sequencing data were produced, leading to the clustering, ordering, and orientation of contigs into 28 pseudochromosomes using Juicer (v1.5)33 and 3D-DNA (v180922)34 pipelines (Table 1). Scaffolding errors were later reviewed and curated using Juicebox (v1.11.08)33.

Oxford Nanopore Technologies library preparation, sequencing and assembly

An ultralong library was created using Oxford Nanopore Technologies (ONT). Genomic DNA from N. taihuensis muscle tissue was extracted via the CTAB method, focusing on fragments over 5 kb, with SageHLS HMW library system. This DNA was processed using the Ligation sequencing 1D kit (SQK-LSK109) and sequenced on the Promethion platform at the Genome Center of Grandomics (Wuhan, China). ONT ultralong reads were refined, discarding those shorter than 5 kb or with QV below 7. This yielded around 15.01 Gb of ultralong reads with an N50 of 32.49 kb. Errors in these reads were corrected using the Necat pipeline (v 20200119)35. The revised ultralong reads then filled gaps in the N. taihuensis assembly using three iterations of LR_Gapcloser (v1.0) with the parameter “--max_distance 1000000 – coverage 0.8 – tolerance 0.2” and one round of TGSgapcloser (v 1.0.1) pipeline with the parameter “--min_idy 0.3”36,37. Finally, a gap-free chromosome-level assembly was generated with a genome size of 397.29 Mb and a contig N50 of 15.61 Mb (Table 3). The concluding N. taihuensis assembly was generated with PacBio, ONT and HiC data and formed 28 contigs representing 28 chromosomes (Fig. 3, Table 4).

Table 3 The statistics of length and number for the de novo assembled of N. taihuensis, P. chinensis and P. hyalocranius genomes.
Fig. 3
figure 3

Characteristics of the N. taihuensis genome. Hi-C chromatin interaction map of the N. taihuensis assembly.

Table 4 Statistics of chromosomal level assembly of N. taihuensis genome.

Repetitive sequence annotation

Following genome assembly, repetitive sequences were annotated (Fig. 4). Using RepeatModeler (v1.0.4)38 and LTR-FINDER (v1.0.7)39 with default parameters, repetitive elements and long terminal repeats were identified. By merging these findings, a de novo repeat sequence library was constructed. This library was then used to screen for interspersed repeats and low-complexity sequences via RepeatMasker (v4.0.7)40. For homolog-based prediction based on the Repbase database41, DNA and protein transposable elements (TEs) were detected by RepeatMasker (v4.0.7) and RepeatProteinMasker (v4.0.7)40, respectively. Tandem repeats were identified with Tandem Repeat Finder (v4.10.0)42. In total, 149.89 Mb (~37.73%) of repetitive sequences were recognized. For the predominant categories of transposable elements (TEs), long interspersed nuclear elements (LINEs) constituted 11.64% of the N. taihuensis genome, DNA hAT transposon (DNA/hAT) elements constituted 7.13%, long terminal repeats Copia (LTRs/Copia) constituted 0.36%, long terminal repeats Gypsy (LTRs/Gypsy) constituted 4.34%, and short interspersed nuclear elements (SINEs) constituted 3.05% (Table 5).

Fig. 4
figure 4

Circos plot of the N. taihuensis genome. The rings from inside to outside indicate (a) pseudochromosome length of the N. taihuensis genome, (b) gene frequency, (c) gene density, (d) TE density, and (e) GC density; b-d were drawn in 500-kb sliding windows.

Table 5 Statistics of repetitive sequences in the N. taihuensis genome.

Protein-coding gene annotation and functional annotation

To predict protein-coding genes, three strategies, including transcriptome-based annotation, homology-based annotation and ab initio prediction, were conducted. For the transcriptome-based annotation, 75.09 Gb RNA-seq data were mapped to the N. taihuensis assembly with Hisat2 (v2.1.0)43, and then the transcriptome information in BAM alignments was produced. The BAM alignments were further assembled into transcripts using Stringtie (v1.3.5)44 and validated by PASA (v2.5.2) (https://github.com/PASApipeline/PASApipeline). Subsequently, coding sequences were identified by TransDecoder (v5.5.0) (https://github.com/TransDecoder/TransDecoder) with default parameters. Assemblies and gene annotation files of four Actinopterygii species (Danio rerio, Oryzias latipes, Protosalanx hyalocranius and Salmo salar) were downloaded from a public database (Table 6). According to previous studies5, P. hyalocranius and P. chinensis are actually the same species. Gene annotation files, combined with the RNA-seq BAM alignments and the homolog assemblies, were utilized to conduct homology-based prediction with GeMoMa (v1.8)45. Based on the protein homology information, August (v3.2.1)46 and SNAP (v2006-07-28) (https://github.com/KorfLab/SNAP) were used to train the predictors. Then, ab initio prediction was generated by the August and SNAP programs with the self-training parameters. The EVidenceModeler (EVM) pipeline (v 1.1.1)47 was used to integrate all the protein-coding genes predicted by the above three strategies. Finally, the protein-coding genes that were only derived from ab initio prediction were filtered out. Overall, 20,400 protein-coding genes were obtained with an average gene length of 8,921 bp and an average CDS length of 1,673 bp. The average exon number per gene was 9, with an average exon length of 177 bp and an average intron length of 858 bp (Table 7).

Table 6 The genome information of four actinopterygii species.
Table 7 Gene annotation of N. taihuensis genome via three methods.

The final gene models predicted above were then annotated using the NCBI nonredundant (NR) protein database (97.3%) and the Swissprot48 (88.08%), KEGG49 (85.63%), KOG50 (76.36%), TrEMBL49 (97.64%), InterPro51 (90.93%) and Gene Ontology (GO)52 (67.44%) databases. In total, 20,024 (98.16%) gene models were annotated for at least one homologous hit by searching against these public databases (Table 8). Of 20,024 functional proteins, 14,776 (~72.4%) were supported by the data of five databases (InterPro, KEGG, NR, KOG, SwissPort) (Fig. 5).

Table 8 Functional annotation statics.
Fig. 5
figure 5

Venn diagram of the number of genes with homology or functional classification by each method. The Venn diagram shows the shared and unique annotations among InterPro, KEGG, KOG, NR and SwissProt.

Data Records

All the raw data for the whole genome have been deposited into the National Center for Biotechnology Information (NCBI) SRA database (Accessions for SRR22936158 to SRR22936161) under BioProject accession number PRJNA91581953. The Whole Genome Shotgun project has been deposited at GenBank under accession JARGSH00000000054.

The files for N. taihuensis gene structure annotation, gene functional annotation and repeat annotation have been deposited at Figshare55.

Technical Validation

Evaluation of the genome assembly

To compare the assembled metrics for N. taihuensis and the other Salangidae species, the assembly in this study was to the gap-free chromosome-scale assembly level (Table 3). The contig N50 of our assembly was 15.61 Mb, while that of P. chinensis56 was 103.01 Kb and that of P. hyalocranius57 was 17.74 Kb. The contig number for our assembly was 137, while that of P. chinensis was 11,196 and that of P. hyalocranius was 19,755. These statistics indicated that our assembly had reached a higher contiguous level (Table 3).

The completeness was evaluated using BUSCO58 analysis. BUSCO analysis revealed that 91.5% (single-copied gene: 90.0%, duplicated gene: 1.5%) of 3,640 single-copy orthologs (in the actinopterygii_odb10 database) were successfully identified as complete, 1.8% were fragmented and 6.7% were missing in the assembly (BUSCO v5.1.0). The accuracy rate was evaluated by mapping the sequencing data to the assembled genome. The mapping rates were 94.63%, 99.8% and 100% for the DNBSEQ, PacBio data and Nanopore data, respectively.

Evaluation of the gene annotation

The completeness and accuracy of the gene structure annotation were evaluated using three different strategies. First, BUSCO analysis revealed that 90.9% (single-copy gene: 88.5%, duplicated gene: 2.4%) of 3,640 single-copy orthologs (in the actinopterygii_odb10 database) were successfully identified as complete, while 1.6% were fragmented and 7.5% were missing in the assembly (BUSCO v5.1.0) (Table 9). Second, to determine if there was evidence of de novo annotation, homolog-based annotation and transcripts, we calculated the CDS overlap content between the final gene sets with the prediction results from the above three different methods. The results showed that more than 99.78% of genes were occupied by these three prediction results with a CDS overlap ratio greater than 80% (Table 10). Moreover, we compared the length distribution of genes, coding sequences (CDS), exons and introns among the D. rerio, O. latipes, P. hyalocranius and S. salar genomes and found similar distributions of these parameters (Fig. 6).

Table 9 BUSCO Evaluation.
Table 10 The evidence supporting gene models of the N. taihuensis genome.
Fig. 6
figure 6

The composition of gene elements in the N. taihuensis genome compared to the genomes of other species. (a) mRNA length distribution and comparison with other species. (b) Exon length distribution and comparison with other species. (c) CDS length distribution and comparison with other species. (d) Intron length distribution and comparison with other species. (e) Exon number distribution and comparison with other species.