Chromosome-level genome assembly of humpback grouper using PacBio HiFi reads and Hi-C technologies

The humpback grouper (Cromileptes altivelis), a medium-sized coral reef teleost, is a naturally rare species distributed in the tropical waters of the Indian and Pacific Oceans. It has high market value, but artificial reproduction and breeding remain limited and need to be improved. Here, we assembled a 1.08 Gb genome with a contig N50 of 43.78 Mb. Using Hi-C technology, 96.59% of the assembly was anchored to 24 pseudochromosomes. The assembly contained 24,442 protein-coding sequences, of which 99.3% were functionally annotated. The completeness of the assembly was estimated to be 97.3% using BUSCO. Phylogenomic analysis suggested that the humpback grouper should be classified into the genus Epinephelus rather than Cromileptes. Comparative genomic analysis revealed that gene families related to circadian entrainment were significantly expanded. This high-quality reference genome provides a useful genomic resource for the humpback grouper and supports future functional genomic studies of this species.

it could cause errors and challenges in taxonomy. The groupers are closely related in evolutionary terms. To better understand their evolutionary relationships and taxonomy, a molecular approach is needed. In addition, a high-quality reference genome could provide an effective tool for genetic improvement and germplasm conservation. At present, both long-read and short-read sequencing technologies are applied to genome assembly and can yield highly contiguous assemblies; in particular, circular consensus sequencing (CCS) has improved the accuracy of PacBio SMRT sequencing. The resulting HiFi reads combine long read length with high base quality, which has markedly improved genome assembly.
In 2021, a humpback grouper genome was constructed with an assembly size of 1.013 Gb (contig N50 of 18.09 Mb) 10. In this study, we present a chromosome-scale genome assembly and annotation of the humpback grouper generated with PacBio HiFi and Hi-C sequencing technologies. A genome of approximately 1.08 Gb was assembled with a contig N50 of 43.78 Mb. BUSCO analysis showed that 97.3% of BUSCOs were complete in the final assembly. Overall, this high-quality reference genome provides a valuable basis for further genetic improvement and for understanding the functional genes and molecular mechanisms of the humpback grouper.

Methods
DNA sample collection, library construction, and sequencing. A female humpback grouper was collected from Hainan Chenhai Aquatic Co., Ltd. Muscle tissue was collected for DNA extraction and library construction. Genomic DNA was extracted with the QIAamp DNA purification kit (Qiagen, USA). The short-fragment library was generated using the TruSeq Nano DNA HT Sample Preparation Kit (Illumina, USA) with an insert size of 350 bp and sequenced on the Illumina NovaSeq 6000 platform. For HiFi read generation, DNA fragments > 30 kb were selected using the BluePippin System (Sage Science, USA). The library was generated using the SMRTbell Template Prep Kit 2.0 (PacBio, USA) and sequenced in CCS mode on the PacBio Sequel II platform. The Hi-C library was constructed following the standard protocol described previously with certain modifications 11, and it was sequenced on the Illumina NovaSeq 6000 platform. A total of 53.1 Gb of Illumina data, 21.5 Gb of PacBio data, and 96 Gb of Hi-C data were obtained after trimming low-quality reads and adaptor sequences from the raw data.
RNA sample collection, library construction, and sequencing. Samples from eight embryonic development stages (one cell, morula, high blastula, low blastula, gastrula, somite, neurula, and before the hatching stage) were collected for RNA extraction using TRIzol reagent (Invitrogen, USA). RNA-seq libraries were constructed using the Illumina TruSeq Stranded mRNA Library Prep Kit (Illumina, USA) and sequenced on the Illumina NovaSeq 6000 platform. In addition, RNA extracted from the embryonic samples was pooled for Iso-seq. The Iso-seq library was constructed and sequenced on the PacBio Sequel II platform. Clean data were obtained by removing reads containing adapters, reads containing poly-N, and low-quality reads from the raw data. Around 55.6 Gb of RNA-seq data and 69.1 Gb of Iso-seq data were generated for genome annotation.
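The clean data volumes above can be put in context by converting them to approximate sequencing depth. The following is a minimal sketch, not part of the original pipeline; it assumes the 1.08 Gb assembly size reported in this paper as the denominator, and the names `CLEAN_DATA_GB`, `GENOME_SIZE_GB`, and `depth` are illustrative.

```python
# Sketch: approximate sequencing depth = clean data volume / genome size.
# Using the reported assembly size (1.08 Gb) as the genome size is an
# assumption of this example.

CLEAN_DATA_GB = {
    "Illumina": 53.1,
    "PacBio HiFi": 21.5,
    "Hi-C": 96.0,
}
GENOME_SIZE_GB = 1.08


def depth(data_gb: float, genome_gb: float) -> float:
    """Return fold coverage for a data volume and genome size, both in Gb."""
    if genome_gb <= 0:
        raise ValueError("genome size must be positive")
    return data_gb / genome_gb


for library, gb in CLEAN_DATA_GB.items():
    print(f"{library}: ~{depth(gb, GENOME_SIZE_GB):.1f}x")
```

Substituting the k-mer-based genome size estimate for the assembly size would change the values only slightly.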
Genome assembly and quality assessment. Genome characteristics were estimated from the Illumina short-read data by 17-mer analysis. The estimated genome size was 1,091.59 Mb, the heterozygosity rate was approximately 0.19%, and the repeat content was 45.81%. A draft genome was assembled using SOAPdenovo2 with k-mer set at 41 bp 12. The gaps were filled with GapCloser. Then, the draft genome was corrected and re-assembled using HiFi long reads by Hifiasm 0.12-r304 with the parameters "-t 30 -D 10" 13. The resulting genome assembly was 1.08 Gb, with a contig N50 of 43.78 Mb (Fig. 1A). To obtain the chromosome-level genome, we applied the ALLHiC pipeline to link the mapped contigs into 24 pseudochromosomes 14. Finally, 96.59% of the assembly was anchored to the 24 pseudochromosomes (Fig. 1B).
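For readers less familiar with the contig N50 statistic quoted above, the sketch below shows how it is conventionally computed from a list of contig lengths. It is an illustration only: the function name `n50` and the toy lengths are hypothetical and are not taken from the humpback grouper assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return ordered[-1]  # not reached for non-empty, positive input


# Toy contig lengths (hypothetical values for illustration).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

A larger N50 means that half of the assembly sits in longer contigs, which is why the statistic is commonly used to compare assembly contiguity.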
BUSCO was applied to evaluate the completeness of the genome assembly. Of a total of 3,354 BUSCO genes, 3,263 were complete (3,230 single-copy and 33 multi-copy), 47 were fragmented, and 44 were missing, accounting for 97.3% (96.3% and 1.0%), 1.4%, and 1.3% of the BUSCO set, respectively (Table 1).
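The BUSCO percentages can be checked directly from the category counts. The sketch below is a sanity check rather than part of the original analysis; it assumes that complete, fragmented, and missing genes are mutually exclusive, so that their sum (3,354) is the denominator, and the reported percentages are reproduced with that value.

```python
# Counts as reported in the text for the BUSCO assessment.
counts = {
    "complete": 3263,
    "single_copy": 3230,
    "multi_copy": 33,
    "fragmented": 47,
    "missing": 44,
}

# Complete, fragmented, and missing are treated as mutually exclusive,
# so their sum is used as the denominator (an assumption of this sketch).
total = counts["complete"] + counts["fragmented"] + counts["missing"]

# Single-copy and multi-copy genes should add up to the complete count.
assert counts["single_copy"] + counts["multi_copy"] == counts["complete"]

percentages = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)
print(percentages)
```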
Repeat and noncoding RNA annotation. Repeat sequences of the humpback grouper genome were identified using a combination of homology-based and de novo approaches. For the de novo method, RepeatModeler (v2.0.1) 15, RepeatScout (v1.0.5) 16, and LTR_finder (v1.0.6) 17 were used to build a custom repeat database for the humpback grouper. For the homology-based method, the Repbase database 18 was used to identify repeats with RepeatMasker and RepeatProteinMask. Repetitive elements accounted for 44.38% of the humpback grouper genome (Fig. 2C). DNA transposons represented the most abundant class of repeats (17.85% of the genome), followed by long interspersed elements (LINEs, 15.20%), long terminal repeats (LTRs, 5.38%), and short interspersed elements (SINEs, 1.11%) (Table 2).

Data Records
The genome assembly and the raw reads from genome and transcriptome sequencing of the humpback grouper were deposited in the Sequence Read Archive under accession number SRP322594 39. The genome assembly was deposited at GenBank under accession number GCA_019925165.1 40. In addition, the assembled genome, predicted peptide, CDS, and GO term files are available in the figshare database with the DOI https://doi.org/10.6084/m9.figshare.24145230.v2 41.
Technical Validation
Evaluation of the genome assembly and annotation. To evaluate the integrity and accuracy of the genome assembly, the completeness of the final assembly was assessed using BUSCO (v4.0) 42 with the lineage database vertebrata_odb10 and CEGMA (v2.5) 43. The assembly contained 97.3% complete and 1.4% fragmented conserved single-copy orthologous genes, and 94.35% of the 248 core eukaryotic genes. When the Illumina sequencing reads were aligned to the genome using BWA (v0.7.8) 44, the read mapping rate and the coverage rate were 99.68% and 99.91%, respectively, indicating high mapping efficiency and comprehensive coverage. Taken together, these results indicate that a high-quality genome of the humpback grouper was obtained.
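The CEGMA percentage can be converted back into a gene count as a quick consistency check. The sketch below is illustrative only; the variable names are hypothetical, and the gene count it derives is inferred from the reported percentage rather than stated in the text.

```python
CEGMA_CORE_GENES = 248    # size of the core eukaryotic gene set, as stated in the text
REPORTED_PERCENT = 94.35  # percentage reported in the text

# Nearest whole number of genes consistent with the reported percentage.
genes_found = round(CEGMA_CORE_GENES * REPORTED_PERCENT / 100)

# Recompute the percentage from that whole-number count.
recomputed_percent = round(100 * genes_found / CEGMA_CORE_GENES, 2)

print(genes_found, recomputed_percent)
```

If the recomputed percentage matches the reported one, the figure is consistent with a whole number of core genes detected in the assembly.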

Fig. 1 Genome assembly of the humpback grouper. (A) Genomic features. From inner to outer tracks: A, distribution of DNA TEs across the genome; B, distribution of RNA TEs across the genome; C, gene density across the genome; D, GC content across the genome; E, humpback grouper chromosomes. (B) Hi-C contact map of the humpback grouper genome. The blocks represent the contacts between one location and another. The color illustrates the contact density from red (high) to orange (low).

Fig. 2 The structural and functional annotation of the humpback grouper genome. (A) Comparisons of the predicted gene models between the humpback grouper genome and other teleosts, including CDS length, exon length, exon number, gene length, and intron length. (B) The functional annotation of the humpback grouper genome using different databases. (C) The percentage of different types of repetitive elements in the humpback grouper genome.

Table 1. BUSCO evaluation results for the humpback grouper genome.

Table 2. Statistics of the different types of annotated repeat content.

Table 3. Summary statistics of noncoding RNA.