Chromosome genome assembly and annotation of the spiny red gurnard (Chelidonichthys spinosus)

Chelidonichthys spinosus, a secondary economic fish, is increasingly being exploited and valued in China. However, overfishing has led to it being recognized as one of the most depleted marine species in China. In this study, we generated a chromosome-level genome of C. spinosus using PacBio, Illumina, and Hi-C sequencing data. Ultimately, we assembled a 624.7 Mb genome of C. spinosus, with a contig N50 of 13.77 Mb and scaffold N50 of 28.11 Mb. We further anchored and oriented the assembled sequences onto 24 pseudo-chromosomes using Hi-C techniques. In total, 25,358 protein-coding genes were predicted, of which 24,072 (94.93%) genes were functionally annotated. The dot plot reveals a prominent co-linearity between C. spinosus and Cyclopterus lumpus, indicating a remarkably close phylogenetic relationship between these two species. The assembled genome sequences provide valuable information for elucidating the genetic adaptation and potential molecular basis of C. spinosus. They also have the potential to provide insight into the evolutionary investigation of teleost fish and vertebrates.

www.nature.com/scientificdata www.nature.com/scientificdata/ evolution of C. spinosus, as its genome has not been sequenced. Genomic information, as an essential conservation and management tool [10][11][12][13][14] , is necessary for the protection and long-term survival of marine species. Genomic data and resources can greatly enhance our understanding of the species' diversity and adaptive evolution and can provide a strong foundation for the implementation of effective conservation measures.
In this study, a high-quality chromosome-level genome assembly of C. spinosus was generated by Illumina short reads, PacBio long reads, and Hi-C techniques (Fig. 1). The final assembled genome size of C. spinosus was 624.7 Mb with an N50 contig length of 13.77 Mb and scaffold N50 length of 28.11 Mb. Furthermore, 99.29% of the initially assembled sequences were anchored on 24 chromosomes. The genome contained 35.96% repeat sequences and 8,235 noncoding RNAs. A total of 25,358 protein-coding genes were predicted, of which 24,072 genes (94.93%) were functionally annotated. The assembled genome sequences can provide valuable information for elucidating the genetic adaptation and potential molecular basis of C. spinosus, which can be used to establish more effective management and conservation strategies for this species. Additionally, these genomic data can be also used for future comparative genomics and phylogenetic studies, which may shed light on the genomic evolution and phylogeny of Scorpaeniformes and other teleost fish and vertebrates.

Methods
Sample collection and sequencing. Biopsy material for DNA and RNA isolation was obtained from a juvenile spiny red gurnard (Chelidonichthys spinosus) collected from Yangtze Estuary, China (Fig. 2). Owing to the immature status of the gonad, the sex could not be determined. All experiments were performed by relevant guidelines and regulations established by the Institutional Animal Care and Use Committee of the Institute of Oceanology, Chinese Academy of Science.
The same specimen was used for genome sequencing (muscle), construction of the Hi-C library (muscle), and transcriptome sequencing of spleen, eyes, brain, kidney, liver, stomach, pelvic fins, and heart. Approximately 500 mg of tissue was dissected and stored in liquid nitrogen before being delivered in dry ice. DNA extraction was performed using an SDS-based analysis method, followed by purification using chloroform. High-quality DNA was used for library preparation and high-throughput sequencing.
For short-read sequencing, one hundred nanograms (ng) of genomic DNA (gDNA) were used to prepare the library. Briefly, gDNA was sonicated to a fragment size of 350 bp by an ultrasonicator and then the library was prepared by using an NEB Ultra DNA library prep kit (NEB, UK) following the manufacturer's instructions. The library was sequenced on Illumina Novaseq6000 platform (Illumina, Inc., San Diego, CA, USA) using the run configuration 3 × 350 bp for paired-end (PE150) sequencing. A total of 105.55 Gb (165.53 × coverage) of short sequencing reads (clean data) were generated and retained for genome survey on the Illumina sequencing platform (Table 1).
Long-read sequencing was performed using a PacBio II sequencer (Pacific Biosciences, Menlo Park, CA, USA) following the manufacturer's protocols. Briefly, g DNA was sheared into fragments using the Covaris g-TUBE device, followed by damage-repair, and end-repair using the SMRTbell Damage Repair Kit (PacBio) 15 on interrupted DNA fragments. The dumbbell connector was then attached, and the fragments were digested by exonuclease. The sequencing library was obtained by screening the target fragments using BluePippin (Sage Science, MA, USA). For the PacBio platform, a total of 164.08 Gb (262.51 × coverage) PacBio long sequencing reads (clean data) with N50 read length of 1,786 bp were obtained after removing adaptors in polymerase reads ( Table 1).
The chromosome-level assembly of the C. spinosus genome was accomplished by preparing a Hi-C library as per protocol 16 and sequencing it using the Illumina Novaseq6000 platform. This procedure began by digesting purified DNA from a fresh muscle sample with the HindIII restriction enzyme. Subsequently, this DNA was labeled with Biotin-14-dATP (Thermo Fisher Scientific, USA) through an incubation process, followed by  www.nature.com/scientificdata www.nature.com/scientificdata/ ligation using T4 DNA Ligase. Post an overnight incubation to facilitate reverse cross-linking, the ligated DNA was sheared into fragments ranging between 300 and 700 base pairs. DNA fragments harboring interaction relationships were selectively captured using streptavidin magnetic beads, allowing for the construction of the library. Following library construction, the Qubit 2.0 and Agilent 2100 systems were employed to assess the library's concentration and insert size. To ensure the library's quality, its effective concentration was precisely measured using the Q-PCR method. In conclusion, the prepared Hi-C libraries were quantified and sequenced on the Illumina NovaSeq6000 platform (Illumina, USA) using a PE-150 module. This process yielded a total of 108.42 Gb of clean data, determined using the same filter criteria as for short reads (Table 1).
For substantiating transcripts to annotate the genome structure, we carried out RNA-seq on muscle, spleen, eye, brain, kidney, liver, stomach, pelvic fin, and heart samples. This procedure was undertaken by the standard protocol supplied by Oxford Nanopore Technologies (ONT). RNA was extracted from all samples using the Illustra RNAspin Mini RNA Isolation Kit (GE Healthcare, UK), and DNA contamination was eliminated through DNase I treatment. After assessing the quality of a NanoPhotometer ® spectrophotometer (Implen,   Table 3. The result of k-mer analysis.  www.nature.com/scientificdata www.nature.com/scientificdata/ USA), we constructed the RNA-seq libraries in line with the provided protocol. The libraries were then sequenced on the ONT platform (ONT, UK), resulting in a total of 20.42 Gb of clean transcriptome data, as shown in Table 2.

Type Contig length (bp) Scaffold length (bp) Contig number Scaffold number Gap Num Gap Len (bp)
Genome assessment and assembly of C. spinosus. The k-mer method was used to survey the genomic features of C. spinosus. The k-mer count histogram (k = 21) was obtained from 105.55 Gb Illumina paired-end   www.nature.com/scientificdata www.nature.com/scientificdata/ sequencing data using Jellyfish v2.9927 17 and the genome size was estimated using the formula with amendment: G = N k-mer /Daverage k-mer , where G is genome size, N k-mer is the total number of k-mers, Daverage k-mer is the average depth of k-mers. After removing the k-mers with abnormal depth, a total of 91,430,990,545 k-mers were obtained with a k-mers peak at a depth of 134 (Fig. 3). The genome size of C. spinosus was estimated to be 637.64 Mb, and the estimated heterozygosity rate was approximately 0.24% (Table 3).
The 164.08 Gb PacBio long-read data were used for genome assembly using wtdbg2 18 followed by Quiver 19 and Pilon 20 polishing using the 105.55 Gb of Illumina HiSeq clean reads, which produced a 624.7 Mb genome assembly, consisting of 406 contigs with a contig N50 size of 13.77 Mb. In addition, the number and length of the gap are 113 and 11,300 bp (Table 4).
We employed Hi-C technology to facilitate the chromosome-level genome assembly of the C. spinosus. A robust 108.42 Gb of clean data (Table 1) was aligned to the preliminary genome assembly utilizing BWA v0.7.10 21 . Subsequent steps involved the removal of duplications, sorting, and quality control, all carried out via HiC-Pro v2.8.0 22 . For in-depth analysis, we used exclusively those read pairs that were uniquely mapped and valid. Using LACHESIS 23 , the contigs were clustered, ordered, and oriented, resulting in a chromosomal-scale assembly. The parameters deployed in LACHESIS included CLUSTER_MIN_RE_SITES at 59, CLUSTER_MAX_LINK_ DENSITY at 2, ORDER_MIN_N_RES_IN_TRUNK at 53, and ORDER_MIN_N_RES_IN_SHREDS at 52. The resulting assembled sequences were further anchored and oriented onto 24 pseudo-chromosomes, employing Hi-C data. These pseudo-chromosomes, ranging in size from 10.45 to 36.62 Mb ( Fig. 4 and Table 5), encompassed approximately 99.29% of the total genome. The final C. spinosus genome assembly was presented with 293 scaffolds, spanning a cumulative length of 624,711,834 base pairs, a contig N50 of 13.77 Mb, and a scaffold N50 of 28.11 Mb (Table 4). Furthermore, we have produced dot plots depicting the high-quality chromosome assembly genomes of C. spinosus and its three closest related species (Anoplopoma fimbria, Gasterosteus aculeatus, Cyclopterus lumpus). These dot plots serve to visually demonstrate the co-linearity relationship among them. Specifically, the dot plot highlights a pronounced co-linearity between C. spinosus and Cyclopterus lumpus, suggesting a remarkably close phylogenetic relationship between these two species. (Fig. 5). repetitive sequence annotation. A combined strategy using homology alignments and de novo searches to identify whole-genome repeats was applied in our repeat annotation pipeline. Transposon element (TE) and tandem repeat were annotated by the following workflows. Firstly, RepeatModeler2 (v2.0.1) 24 was adopted for ab initio prediction, mainly using two ab initio prediction software RECON (v1.0.8) 25 and RepeatScout (v1.0.6) 26 . Then full-length long terminal repeat retrotransposons (fl-LTR-RTs) were identified using both LTRharvest (v1.5.9) 27 (-minlenltr 100 -maxlenltr 40000 -mintsd 4 -maxtsd 6 -motif TGCA -motifmis 1 -similar 85 -vic 10 -seed 20 -seqids yes) and LTR_finder (v1.1) 28 (-D 40000 -d 100 -L 9000 -l 50 -p 20 -C -M 0.9). The high-quality intact fl-LTR-RTs and non-redundant LTR library were then produced by LTR_retriever (v2.8) 29 . Non-redundant species-specific TE library was constructed by combining the denovo TE sequences library above with the known Repbase (v 19.06) 30 , REXdb (V3.0) 31, and Dfam (v3.2) 32 database and RepeatClassifier 24 was used to classify the prediction results. Final TE sequences in the C. spinosus genome were identified and classified by homology search against the library using RepeatMasker (v4.10) 33 . Tandem repeats were annotated by Tandem Repeats Finder and MIcroSAtellite identification tool (MISA v2.1) 34 . The results showed that 35.96% of the C. spinosus genome was annotated as repetitive elements, of which transposable elements (TE) were a total length of 183.01 Mb, accounting for 29.28% of the whole genome. Tandem repeats were a total length of 41.75 Mb and represented 6.68% of the whole genome (Table 6).
Protein-coding gene prediction and annotation. We integrated three approaches, namely, de novo prediction, homology search, and transcript-based assembly, to annotate protein-coding genes in the genome. www.nature.com/scientificdata www.nature.com/scientificdata/ Firstly, the de novo gene models were predicted using two ab initio gene-prediction software tools, Augustus (version 2.4) 35 and SNAP (2006-07-28) 36 . Secondly, GeMoMa (v 1.7) 37 was performed for prediction based on homologous species. The protein sequences of C. lumpus, D. rerio, L. tanakae, and S. umbrosus were downloaded from GenBank and Ensembl database 38 . The protein sequences were aligned against the genome assembly using TBLASTN v2.2.2631 39 (E-value ≤ 1e-5), and then matching proteins were aligned to the homologous genome sequences for accurate spliced alignments with GeneWise v2.4.132 40 . Thirdly, for the transcript-based prediction, RNA-sequencing data were mapped to the reference genome using Hisat (v 2.0.4) 41 and assembled by Stringtie (v 1.2.3) 42 . GeneMarkS-T (v 5.1) 43 was used to predict genes based on the assembled transcripts. The PASA (v 2.0.2) 44 software was used to predict genes based on the unigenes (and full-length transcripts from the PacBio (ONT) sequencing) assembled by Trinity (v2.11) 45 . Finally, gene models from these different approaches were  Table 6. The repeat sequence statistics of the assembled genome. Note: Type: the type of repetitive sequence (Class I: retrotransposons; Class II: DNA transposon; Class III: Tandem repeats); Number: the number of repetitive sequences; Length: the total length of predicted repetitive sequences; Rate (%): the proportion of repetitive sequences in the total genome; the "0" represents a ratio that is less than 0.01.
www.nature.com/scientificdata www.nature.com/scientificdata/ combined using the EVM software (v 1.1.1) 46 and updated by PASA. The final gene models were annotated by searching the GenBank Non-Redundant (NR, 20200921), TrEMBL (202005), Pfam (33.1), SwissProt (202005), eukaryotic orthologous groups (KOG, 20110125), gene ontology (GO, 20200615) and Kyoto Encyclopedia of Genes and Genomes (KEGG, 20191220) databases. Overall, A total of 25,358 protein-coding genes were predicted by integrating the prediction of ab initio, homology-based, and RNA-seq strategies (Table 7), with an average gene length of 15,136 bp, exon length of 2,314 bp, coding sequence of 1,678 bp and intron length of 12,822 bp ( Table 8). The statistics of gene models, including coding length, gene length, intron length, and exon length in C. spinosus were comparable to those for close-related species (Fig. 6). Ultimately, 24,072 genes (94.93% of the total) were successfully annotated GO, KEGG, KOG, Pfam, Swissprot, TrEMBL, EggNOG and NR database (Table 9).
Noncoding rNas annotation and Pseudogene prediction. Non-coding RNAs are usually divided into several groups, including miRNA, rRNA, tRNA, snoRNA, and snRNA. The tRNAscan-SE (v 1.3.1) 47 was used to predict tRNA with eukaryote parameters. Identification of the rRNA genes was conducted by Barrnap (v 0.9) 48 based on Rfam (v 12.0) database 49 . miRNA genes were identified by searching miRBase (v 21) databases 50 . The snoRNA and snRNA genes were predicted using INFERNAL (v 1.1) 51 based on the Rfam database. Finally, a total of 6,326 tRNAs, 962 rRNAs, and 947 miRNAs were predicted (Table 10).
Pseudogenes typically exhibit sequences similar to functional genes but may lose their biological function due to genetic mutations, such as insertions and deletions. To identify such pseudogenes, we performed a comprehensive genome scan using the GenBlastA program (v 1.0.4) 52 after excluding predicted functional genes. Subsequently, we employed GeneWise (v 2.4.1) 53 to search for immature mutations and frameshift mutations to analyze the putative candidate genes. As a result, a total of 82 pseudogenes were identified, encompassing a combined length of 335,565 base pairs, with an average length of 4,092 base pairs (Table 10).

Data records
The sequencing dataset and genome assembly were deposited in public repositories. Illumina, PacBio, Hi-C and RNA-seq sequencing data used for Genome assembly have been deposited in the Genome Sequence Archive (GSA) at the National Genomics Data Center (NGDC)/China National Center for Bioinformation (CNCB) under accession number CRA009849 54 . This Whole Genome Shotgun project has been deposited at GenBank   www.nature.com/scientificdata www.nature.com/scientificdata/ under the accession JARXKY000000000 55 . The version described in this paper is version JARXKY000000000.1 55 . Moreover, the genomic annotation results have been deposited in the figshare database 56 .

Technical Validation
The assembly was evaluated using three criteria: the mapping of Illumina reads, core gene integrity, and BUSCO assessment. The reads from the short-insert library were re-mapped onto the assembly using BWA 21 . The assembly completeness was evaluated using Core Eukaryotic Genes Mapping Approach (CEGMA) software (version 2.5) 57 and Benchmarking Universal Single-Copy Orthologs (BUSCO) software (version 2.0) 58 . The Illumina reads fully (98.94%) mapped to the assembled genome, including 97.23% of paired-end reads (Table 11).