A chromosome-level reference genome of the Antarctic blackfin icefish Chaenocephalus aceratus

The blackfin Icefish (Chaenocephalus aceratus) belongs to the family Channichthyidae and the suborder Notothenioidei which lives in the Antarctic. We corrected the mis-scaffolds in the previous linkage map results by Hi-C analysis to obtain improved results for chromosome-level genome assembly. The final assembly analysis resulted in a total of 3,135 scaffolds, a genome size of 1,065.72 Mb, and an N50 of 33.46 Mb. 820.24 Mb, representing 88.88% of the total genome, is anchored to 24 chromosomes. The final gene set of 38,024 genes, including AFGPs, was annotated using RNA evidence, proteins, and ab-initio predictions. The complete percentage of BUSCO analysis is 92.7%. In this study, we aim to contribute to the study of polar fishes by improving the genome sequences of the blackfin icefish with the AFGP genes belonging to the Notothenoidei.


Background & Summary
The Antarctic Ocean is a very cold and difficult place for any species to survive.The seawater temperature is at subzero levels even in summer, and the intertidal ecosystem does not function because ice covers the shoreline and coastal waters to depths ≥30 m.However, some species can survive in these extreme environments.The Antarctic marine fish fauna consists of approximately 275 species, 95 of which belong to the perciform suborder Notothenioidei.Some species have unusual adaptations, such as the presence of antifreeze glycoprotein (AFGP) in their blood or the absence of hemoglobin, to survive under these frigid conditions 1,2 .The blackfin icefish is a species of crocodile icefish belonging to the family Channichthyidae and the suborder Notothenioidei.Its natural habitat ranges from Southern Georgia to the northern part of the Antarctic Peninsula in the Atlantic sector of the Southern Ocean and Bouvetøya Island.It is found in shelf waters to a depth of 450-770 m 3 .Blackfin icefish species have thin, highly vascularized, scaleless skin; elongated bodies; and a weaker skeleton in comparison with most red-blooded notothenioid species.Their body structure makes them extremely vulnerable to injury 4 .Icefish, also known as white-blooded fish, belong to a unique family in that they are the only known vertebrates to lack hemoglobin.Consequently, their blood oxygen-carrying capacity is just 10% of that of other teleosts.The blood of the blackfin icefish Chaenocephalus aceratus has significantly fewer erythrocytes.The blood sample of C. aceratus does not have a trace of red color.Instead, it has a translucent whitish color.The plasma is clear.The cell mass at the bottom of a centrifuged hematocrit tube has been reported to be creamy white, accounting for approximately 1% of the blood content 5 .The 15 known species of the notothenioid family Channichthyidae, including C. aceratus, have the same diploid number of chromosomes (2n = 48), predominantly acrocentric chromosomes 6 .
A previous study 7 reported the genome assembly of the blackfin icefish and published its genetic linkage map.However, its chromosome-level genome assembly remains unknown.Here, we report the upgraded chromosome-level whole-genome assembly of the blackfin icefish using the Hi-C approach with the tissue of the same individual used in the previous study.The genome assembly was highly consistent with the genetic linkage map at the chromosome level, and some mis-scaffolding in the genetic linkage map was rectified.We compared the chromosome-level genome sequence with that of another icefish, the South Georgia icefish (Pseudochaenichthys georgianus), to verify chromosomal conformity.For assessing chromosomal sta bility, we compared the sequences with those of medaka (Oryzias latipes), torafugu (Takifugu rubripes), and www.nature.com/scientificdatawww.nature.com/scientificdata/stickleback (Gasterosteus aculeatus).Moreover, to perform gene prediction more accurately, we reconstructed the annotation process using the integrated process of GeneMark 8 and PASA pipeline 9 with EVidenceModeler 10 .Using the customized prediction process, we predicted the functions of 10 copies of trypsinogen genes, nine copies of antifreeze glycoprotein (AFGP) genes, and two copies of AFGP/trypsinogen-like protease chimeric genes, and a trypsinogen-like protease gene with high tandem duplication at intron and exon levels.

Methods
Hi-C sequencing.Tissue sample of blackfin Icefish from the same individuals used in the previous study 7 were used for Hi-C analysis.The Dovetail TM Hi-C library was prepared using the Dovetail TM Hi-C Library Kit (Dovetail Genomics, Santa Crus, CA, USA), according to the manufacturer's instructions.Ground tissue (250 mg) was crosslinked with PBS/formaldehyde; the chromatin sample was then prepared with SDS and wash buffer.After normalizing the chromatin sample, 800 ng of chromatin was used to prepare the library.The chromatin was picked up using chromatin capture beads and then digested using a restriction enzyme.The end was labeled with biotin and ligated to form intra-aggregated DNA.After cross-link reversal, 200 ng of DNA was sheared using the Covaris system (Covaris Inc., Woburn, MA, USA).Sheared DNA fragments were end-repaired and ligated using an Illumina adapter.Ligated DNA was purified using streptavidin magnetic beads.Purified DNA was then amplified via PCR to enrich the fragments.Capillary electrophoresis verified the amplified libraries' quality (Bioanalyzer System, Agilent Technologies, Palo Alto, CA, USA).Sequencing was performed using the Illumina NovaSeq 6000 system (Illumina Inc., San Diego, CA, USA), according to the protocols provided for 2 × 150 sequencing 11 .
Hi-C analysis with previous draft assembly.HiRise software, a pipeline for performing scaffolding analysis using proximity ligation data produced using the draft genome assembly, and Dovetail Hi-C technology were used for chromosome-level genome assembly 12 .The Hi-C reads were aligned to the draft assembly using SNAP.The positions of the mapped read pairs were used to construct a likelihood model of the genomic distance between read pairs.Genomic linking information between contigs was generated using the model and misjoins were corrected to construct a pseudomolecule-level scaffold genome.Juicer v.1.5.7 13,14 was used to generate a hic file containing contact matrices with duplicate removal from the linking data.The Hi-C raw sequence data were aligned using BWA-MEM 15 .A contact map plot was drawn in detail using Juicebox v.1.5 13, with the Juicer output being a hic file.Dovetail TM HiRise allowed the upgrade from draft genome assembly to chromosome-level genome assembly within 24 chromosomal sequences (Fig. 1a).The longest scaffold length was 48 Mb, and the scaffold N50 value was 33 Mb (Table 1).We confirmed that there were 24 scaffolds of ≥10 Mb, consistent with the number of chromosomes in the blackfin icefish (2n = 48).Moreover, the total size of unplaced scaffolds was 262.76 Mb (Table 2).

Comparative genomics analysis.
To compare genome sequences at the chromosome level, nucmer in the MUMmer software package v.4.02b 16 was used with the parameters -c 1000 -l 1000 and add--mum for unique matching and avoiding repeat regions.For a clear chromosome comparison, only long sequences corresponding to chromosomes were extracted and compared; unordered contig or scaffold sequences were excluded.Circos 17 is a useful tool for comparing genome sequences based on homogeneous coordinates.In our study, a custom script was used to convert the coordinate data obtained through nucmer into a readable format in Circos.The results of chromosome comparison between two genomes were diagrammed using Circos.For visualizing detailed structural variation, GenomeRibbon 18 was used to assess the coordinate data obtained through nucmer.To confirm the chromosomal stability of the Hi-C assembly, 24 chromosomes of the South Georgia icefish (P.georgianus) 19 and medaka (O.latipes) 20 genomes were compared with 24 chromosomes of the Hi-C assembly to assess their similarity.Each chromosome of the blackfin icefish was exclusively linked to each chromosome of the South Georgia icefish and medaka, thereby reconfirming the chromosomal stability of the scaffolds from the Hi-C assembly and verifying the integrity of the analysis (Fig. 2a,b).Antarctic fishes, including icefish species, diverged from the stickleback lineage approximately 77 million years ago 7 .For comparison with the chromosomes of the blackfin icefish, 21 stickleback chromosomes were aligned with the chromosome-level assembly.The results revealed that three chromosomes of the stickleback (G.aculeatus) 21 were split into six chromosomes of the blackfin icefish (Fig. 2c).Moreover, 22 chromosomes of the pufferfish (T.rubripes) 22 were compared with 24 chromosomes of the blackfin icefish.Pufferfish diverged from the Antarctic fish and stickleback lineages approximately 122 million years ago.Four chromosomes of the blackfin icefish (CAv2_00041, CAv2_00320, CAv2_00011, and CAv2_00012) were found to align with two chromosomes (chromosome 1:NC_042285.1 and chromosome 8: NC_042292.1) of pufferfish (Fig. 2d).

L G 1 4 L G 7
LG 10 LG2 1 LG17 LG15 LG24 LG8 LG 22 LG 11 LG 12 LG 19 LG 20 LG 23 LG www.nature.com/scientificdatawww.nature.com/scientificdata/The predicted genes were annotated by aligning them to the NCBI non-redundant protein (nr) database 35 using NCBI BLAST v.2.9.0 36 with a maximum e-value of 1e-5.To obtain protein domain information, InterProScan v.5.44.79 37 was used with a protein sequence translated from a transcript.Moreover, Trinotate 38 was used for the comprehensive annotation of transcriptome sequences, and TransDecoder v.5.5 with eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) and KEGG (Kyoto Encyclopedia of Genes and Genomes) were used for decoded peptide sequences.Protein signal peptide prediction was performed using SignalP v.5.0 39 , and transmembrane domain prediction was performed using TMHMM v2.0 40 .Gene Ontology (GO) terms 26 were assigned to the genes using the BLAST2GO pipeline v.4.0 41 .A total of 38,024 genes and 39,889 coding sequences (CDSs) were analyzed in the C. aceratus genome.The average length of CDSs was 1,248 bp, and the average number of exons per gene was 7.9 (Table 4).Consequently, a total of 39,889 CDSs were annotated from a minimum of 17.51% to a maximum of 90.31% in seven databases for functional annotation.In one or more databases, 79.03% of CDSs were annotated (Table 5).To confirm the gene prediction results, BUSCO was used in transcriptome mode with CDSs.The percentage of complete BUSCOs was 80.7%, while that of missing was 13.4% (Table 6).
Annotation of AFGP genes.The regions containing AFGP and trypsinogen genes were extracted from the whole-genome sequence using NCBI BLAST v.2.9.0 36 against transcript and protein sequences of the Antarctic toothfish 42 .AFGP genes were predicted using Exonerate v.2.4 with the following specific parameters:--model  www.nature.com/scientificdatawww.nature.com/scientificdata/protein2genome--minintron 20--maxintron 10000--score 250--percent 60 from the extracted region sequence.The final AFGP gene set was identified based on identity, similarity, and alignment length and was integrated into the final gene prediction data.The sequence encoding AFGP, which is similar to the long repetition of simple sequences, is very repetitive and is not assembled in the short sequence of the next-generation sequence despite their high throughput sequences.We identified that genes encoding AFGP were tandemly duplicated in the Cav2_00055 scaffold from 34,915,108 bp to 35,620,009 bp.The AFGP-trypsinogen locus was located between genes encoding mitochondrial 39 S ribosomal protein L17 (mrpl17) and E3 ubiquitin-protein ligase CBL-C isoform X1 (cbl), as reported in a previous study.However, in this study, 10 copies of trypsinogen genes, nine copies of AFGP genes, two copies of AFGP/trypsinogen-like protease chimeric genes, and a trypsinogen-like protease gene were predicted at the exon/CDS level (Fig. 3).AFGP genes evolved from trypsinogen genes in Antarctic fishes 43 .The prediction of gene features of AFGP genes is too difficult by the normal automated prediction method because the AFGP gene sequence has a high incidence of tandem repeats.We developed a customized process to predict complete AFGP gene features and analyzed exons and CDSs of AFGP and trypsinogen genes.www.nature.com/scientificdatawww.nature.com/scientificdata/Our results were consistent with previous results, except in the case of one AFGP gene.Moreover, we obtained tandemly duplicated AFGP gene sequences.Using our developed method, further analysis of the AFGP genes of other Antarctic fishes can be performed.

Data Records
The final genome assembly of the blackfin icefish was deposited at GeneBank (accession GCA_023974075.1) 44 .Also, the Hi-C raw data were deposited NCBI Sequence Read Archive (SRA) with accession number SRR24715329 11 .

technical Validation
We assessed the completeness of genome assembly using Benchmarking Universal Single-Copy Orthologs (BUSCO) 45 v.5.4.4 with the Actinopterygii lineage dataset with default parameters.A total of 3,375 (92.7%)BUSCOs were identified as complete.Of these, 3,241 (89.0%) were single-copy and 134 (3.7%) were duplicated.The numbers of partially matched and missing were 48 (1.3%) and 217 (6.0%), respectively (Table 7).The k-mer completeness and quality value (QV) were evaluated by Merqury v1.3 46 .Merqury analysis were QV of 29.96 and completeness of 88.29 (Table 8).On comparing the Hi-C scaffolds and linkage groups, high concordance www.nature.com/scientificdatawww.nature.com/scientificdata/CAV2_00055 Fig. 3 Antifreeze glycoprotein (AFGP) gene family for the blackfin icefish.AFGP gene family which has 22 genes was found on the blackfin icefish genome.It was identified in the region from 34,957,786 to 35,607,986 in the scaffold CAv2_00055 and contains 10 trypsinogen genes and 9 AFGP genes.
was noted; however, some inconsistencies remained.In particular, mis-scaffolding was noted between LG14 and LG17.Assessment of the Hi-C scaffold confirmed that the 3.26M-sized sequence located in LG14 (LG14: 6,694,531-9,957,046) was transferred to the middle of LG7 (CaV2_00044: 12,160,643-15,423,158).Moreover, the CaV2_00044 scaffold, which was consistent with LG7, was completely scaffolded on the Hi-C contact map (Fig. 1b).These results confirmed that the mis-scaffold on the linkage group was corrected through Hi-C analysis.Moreover, the Hi-C scaffold was verified with the contact map.Many linkage group-based genome assembly results have been improved or finalized for several years through Hi-C analysis 47,48 .

Fig. 1
Fig. 1 Summary of the final genome assembly results.(a) Contact map plot of the blackfin icefish genome.The Hi-C raw read pairs were aligned with the genome sequences.The x and y axes indicate their positions.The red dots indicate the position of the read pairs, and a high density of red dots denotes that they are located on the same chromosome.(b) Correction of mis-scaffolding of the linkage group in the blackfin icefish genome by Hi-C analysis.Mis-scaffolding of the LG14 linkage group was confirmed by Hi-C analysis.The 3.26M-sized sequence of LG14 was located on part of LG7, and the high density of linkage (red dot) was confirmed on the contact map at the position.(c) Overview of the blackfin icefish genome.The features are arranged in the order of gene density, repeat density, GC contents, and GC skew from outside to inside at 1-Mb intervals across the 24 chromosomes.

Fig. 2
Fig. 2 Chromosomal comparison with the blackfin icefish.P. georgianus (a) and O. latipes (b) which have the same number of chromosomes (2n = 48) were compared with the blackfin icefish.Chromosomal comparison of the blackfin icefish with G. aculeatus (c, 2n = 42) and T. rubripes (d, 2n = 44) which have less than the number of chromosomes.

Table 2 .
Summary of chromosome length of the blackfin icefish.

Table 1 .
Summary of the blackfin icefish genome assembly.

Table 3 .
Summary of annotated transposable elements of the blackfin icefish.

Table 4 .
Summary of gene predictions of the blackfin icefish.

Table 5 .
Summary of functional annotation of the blackfin icefish.

Table 6 .
Assessment of the blackfin icefish transcriptome and protein using BUSCO.