Chromosome-level haplotype-resolved genome assembly for Takifugu ocellatus using PacBio and Hi-C technologies

Takifugu species serve as a model system for evolutionary studies due to their compact genomes and diverse phenotypes. The ocellated puffer (Takifugu ocellatus), characterized by special colouration, is a scarce anadromous species in the genus Takifugu. As an ornamental and tasty fish species, T. ocellatus has moderate economic value. However, the available genomic resources for this pufferfish are still limited. Here, a chromosome-level reference genome, as well as two haploid genomes, was constructed by PacBio HiFi long sequencing and Hi-C technologies. The total length of the reference genome was 375.62 Mb with a contig N50 of 11.55 Mb. The assembled sequences were anchored to 22 chromosomes with an integration efficiency of 93.78%. Furthermore, 28,808 protein-coding genes were predicted. The haplotype-resolved reference genome of T. ocellatus provides a crucial resource for investigating the explosive speciation of the Takifugu genus, such as elucidating evolutionary histories, determining the genetic basis of trait evolution, and supporting future conservation efforts.


Background & Summary
The genus Takifugu belongs to the family Tetraodontidae, which inhabits the northwest Pacific Ocean around the coastal area of east Asia 1 . Takifugu is composed of approximately 25 species 2 , which are well known for their inflation behaviour and potent neurotoxins. Meanwhile, the group exhibits diverse morphological characteristics and different ecological habits, as well as a compact genome, providing a great model for investigating species radiation. Four high-quality chromosome-level genomes of T. rubripes 3 , T. bimaculatus 4 , T. flavidus 5 , and T. obscurus 6 have been completed in the genus Takifugu since the first teleost genome of T. rubripes was published in 2002 1 , among which T. obscurus and T. ocellatus have the capacity for hypotonic adaptation 7 . The ocellated puffer Takifugu ocellatus in this study is harboured in China and Vietnam, and is commonly utilized as an ornamental fish species for culture. T. ocellatus exhibits saddle-shaped black dots profiled with orange in the dorsal region, in addition to featuring the capacity for euryhaline acclimation, both making it favour aquarium fish. Despite its deadly toxicity, it is also considered a delicacy in East Asia. Therefore, the species has considerable commercial value due to its ornamental value and edibleness. As an anadromous fish, T. ocellatus shares the same spawning sites and a similar diet to T. obscurus. They acclimate to a broad spectrum of saline water and migrate into freshwater to spawn, while the larvae remain there before emigrating to the seawater. Despite the abovementioned similarities, these two pufferfishes employ different reproductive strategies 7 . In addition, recent phylogenetic analyses in the Takifugu genus have shown that they belong to different sister groups, implying that they may be independent of each other in the evolutionary process of adapting to freshwater 8 . Therefore, a high-quality reference genome of T. ocellatus is essential to elucidate the speciation process during adaptive radiation, including clarifying the evolutionary histories and adaptation strategies.
In this work, we constructed a chromosome-level genome of T. ocellatus by combining PacBio HiFi (high fidelity) reads and Hi-C sequencing data. The genome assembly spanned 375.62 Mb consisting of 163 contigs with a contig N50 length of 11.55 Mb. After chromosome-level anchoring, 22 chromosomes with a total length of 352.28 Mb (93.78% of the draft assembly) were constructed corresponding to the karyotype. Moreover, 66.65 Mb (17.74% of the assembly) of repeat elements, and 28,808 protein-coding genes were annotated. Additionally, two chromosome-level haplotype genome assemblies were also constructed, which would serve as a baseline for future studies on allele-specific expression or conservation genomics. A contiguous and accurate reference genome is essential for basic genetic research and will facilitate evolutionary studies on this euryhaline and anadromous species. In addition, phylogenetic analysis indicated that T. ocellatus speciated from the common ancestor of Takifugu at approximately 21.4 (mya; 15.3-27.6 mya). We identified 789 gene families with expansion, 1,970 families with contraction, 1,034 rapidly evolving genes, and 767 positively selected genes in T. ocellatus. These results will help us to further explore the genetic basis of the freshwater adaptability of T. ocellatus and the explosive speciation mechanism of Takifugu species.

Methods
Sample collection and nucleic acid extraction. Healthy female T. ocellatus were collected from Fujian Takifugu Breeding Station in Zhangzhou, Fujian Province, China. Muscle, eye, skin, gill, kidney, liver, intestine, spleen, gonad, heart and stomach were sampled and frozen in liquid nitrogen immediately and then transferred to −80 °C for storage. Genomic DNA (gDNA) of T. ocellatus was extracted from enough muscle tissues following the manufacturer's protocol by an AMPure bead cleanup kit (Beckman Coulter, High Wycombe, UK), while total RNA was extracted from all tissues by a TRIzoL kit and mixed equally for transcriptome sequencing. The quality of gDNA and RNA was detected by 1.5% agarose gel electrophoresis and DNA was quantified by a Qubit fluorometer (Thermo Fisher Scientific, Waltham, MA).
Library construction and sequencing. According to the manufacturer's instructions, paired-end libraries for genome surveys with a 350 bp insert size were constructed using gDNA. Then, we sequenced this library with a strategy of 2 × 150 bp on the Illumina HiSeq. 2500 platform and obtained the raw data. For HiFi read generation, high-molecular-weight (HMW) gDNA was sheared to 15 Kbp before preparing a PacBio HiFi library. The genomic library was sequenced in CCS mode on the PacBio Sequel II system at Novogene (Tianjin). Subsequently, HiFi reads were generated from raw subreads using CCS workflow 9 (v4.2.0) with a recommended setting. Finally, 31.20 Gb of CCS reads were yielded with a mean read length of 16.1 Kb resulted in 84-fold coverage of the T. ocellatus genome. The coverage was sufficient for haplotype-resolved assembly according to recommendations 10 . For Hi-C library construction, the MboI restriction enzyme was used to digest the cross-linked high molecular weight (HMW) gDNA. After 5′ overhang biotinylated and blunt-end ligation, the DNA was physically sheared into 300-500 bp fragments. Finally, the Hi-C library was sequenced with a strategy of 2 × 150 bp on the Illumina HiSeq. 2500 platform. In addition, 29.16 Gb of paired-end clean reads were generated from the Hi-C library. The RNA-seq library was constructed using the Illumina standard protocol (San Diego, CA, United States) and sequenced on the Illumina HiSeq. 6000 platform. In total, 32.47 Gb of paired-end short clean reads were generated from the RNA-seq library ( Table 1).
Genome survey and assembly. Before assembly, the adapter sequences and low-quality reads generated from the Illumina platform were filtered using fastp (v. 0.23.1) software 11 ., and the remaining reads were used for subsequent genome survey and assembly. To estimate the major characteristics of the genome, such as genome size, heterozygosity, and repeatability, genome surveys were performed using SOAPec (v. 2.01) and GenomeScope (v. 2.0) software with 17 K-mer frequencies. With a dominant peak depth of 124.92, the estimated genome size of T. ocellatus was 369.48 Mb, and the heterozygosity and repetitive sequence content were approximately 0.47 and 27.29%, respectively (Supplementary Table 1 and Supplementary Fig. 1). The estimated genome size is slightly smaller than that of other Takifugu species that were assembled by PacBio (previously reported; 373∼404 Mb) 4,6 . Then, the HiFi long reads along with paired-end Hi-C short reads were provided to HiFiasm 12 (v0.16.1) to generate the monoploid and a pair of haplotype-resolved assembly contig graphs with default parameters. Using the Hi-C integrated algorithm, HiFiasm takes full advantage of phased graphs and long-range information to generate a haplotype-resolved assembly. Finally, three preliminary assemblies, including one monoploid assembly and two haploid assemblies, were yielded, which spanned 375.62 Mb (monoploid), 373.25 Mb (Haploid-1) and 372.15 Mb (Haploid-2), with a contig N50 length of 11.55 Mb, 4.86 Mb and 4.87 Mb, respectively ( Table 2). The genome assembly was slightly larger than the estimated genome size of 369.48 Mb (Table 1) because some repeat fragments could be assembled by high-precision CCS reads 13 . Juicer 14 and 3D-DNA 15 were implemented to obtain the Protein-coding gene finding and function annotation. For noncoding RNA (ncRNA) annotation, RNAmmer (v1.2) and tRNAScan (v1.3) were executed for rRNA and tRNA prediction, respectively. Other noncoding RNAs were detected by alignment against the Rfam database. Four types of noncoding RNAs, including 1,000 miRNAs, 810 tRNAs, 1,573 rRNAs and 775 sRNAs, were identified from the T. ocellatus genome (Table 4). Structural annotation of the protein-coding genes was conducted using ab initio, homology-based and RNA-seq-based approaches, after all repeat sequences in the T. ocellatus genome were soft-masked. For homology-based gene prediction, the protein sequences of D. rerio 20 , O. latipes 21 , T. rubripes 22 , T. flavidus 23 and T. bimaculatus 24 were downloaded from the European Nucleotide Archive and provided to GenomeThreader (v.1.7.0) 25 . In addition, the RNAseq clean data were de novo assembled using Trinity software (v.2.10.0). Braker2 26 was employed to perform ab initio gene prediction using the transcripts assembled from RNAseq and known genes of D. rerio 20 , O. latipes 21 , T. rubripes 22 , T. flavidus 23 and T. bimaculatus 24 . The optimal parameters were obtained after two rounds of model training. For another gene prediction approach, RNA-seq data were  www.nature.com/scientificdata www.nature.com/scientificdata/ aligned to the T. ocellatus genome to assemble the transcriptome via hisat2 27 and stringtie 28 (v2.1.4). Then, TransDecoder (v5.5.0) was adopted to predict the open reading frame (ORF) region. Last, a comprehensive gene set was produced by EvidenceModeler and annotated for protein-coding gene structure by PASA (v2.4.1) 29 .
For functional annotation of protein-coding genes, Diamond (v2.0.6) was applied to align protein-coding genes to the NR, TrembBL (http://www.uniprot.org/) and Swiss-Prot (http://www.uniprot.org/) protein databases with E-values < 1*10-5. The annotation of GO and KEGG pathways was performed using InterProScan  Table 2). Protein sequences shorter than 30 amino acids were filtered out in the above 13 proteome sets and provided to Orthofinder 30 (v2.5.2) to construct orthologous groups. To reveal the phylogenetic relationships among T. ocellatus and 12 other species, single-copy orthologous genes were identified and used for the construction of the phylogenetic tree (Supplementary Table 3). The single-copy orthologues were further aligned using MUSCLE (v3.8.31). Then, RAxML 31 (v8.2.12) with 1000 bootstrap replicates was executed to generate phylogenetic trees. The divergence time was estimated using MCMCTREE (PAML 32 package) based on the molecular clock data in the TimeTree 33 Table 2). We employed the software PRANK-MSA (v140110) 34 with the parameters gaprate = 0.025 and gapext = 0.75 for coding sequence alignment of each homologous group. To examine the selective constraints on the genes, we estimated the dN/dS ratio (ω) using PAML (v4.4b) 32 . We tested three hypotheses: (1) H0, all branches have the same ω; (2) H1, the branch leading to T. ocellatus has a different ω, whereas the other branches have the same ω; and (3) H2, all branches have an independent ω. We used likelihood values and degrees of freedom of the three hypotheses to perform a likelihood-ratio test (LRT). We selected genes whose likelihood values for H1 were significantly larger (adjusted LRT p value of < 0.05) than those for H0 and genes whose likelihood values of H2 were not significantly larger than those of H1. In addition, we also ran branch-site models (model = 2; NSsite = 2) to detect the genes with positively selected sites in T. ocellatus. For the null hypothesis, we set 'fix_omega = 1; omega = 1' , whereas for the alternative hypothesis, we set 'fix_omega = 0; omega = 1.   www.nature.com/scientificdata www.nature.com/scientificdata/ In this study, a high-quality reference genome and two haplotype genomes of T. ocellatus were generated, which could contribute to further research on the genetic mechanism of freshwater adaptability and anadromous characteristics. The comparison between the genomes of freshwater-adapted T. obscurus and T. ocellatus will help us to understand whether there is convergent evolution for freshwater adaptation between these two species. In addition, as the first haplotype-level genome of Takifugu species, the T. ocellatus genome assembly constructed in this study will facilitate the wide use of T. obscurus as a valuable model species to investigate the evolutionary process of adaptive radiation and genetic mechanisms hidden within the compact genome. Combining such information with gene expression data and Hi-C data from different Takifugu species, we could deeply explore whether allele-specific gene expression and the 3D structure of the genome would accelerate www.nature.com/scientificdata www.nature.com/scientificdata/ speciation. Finally, the genome of T. ocellatus, as a potential freshwater aquaculture fish, will build a foundation for breeding projects, whose goal is excellent growth traits and freshwater breeding.

Data records
The raw sequencing reads of all libraries are available from NCBI via the accession number of SRP407984 35 . The assembled genome is available in the NCBI with the accession number JAPVLW000000000 via the project PRJNA901637 36 . Besides, the assembled genome and sequence annotations are available in the figshare database with the DOI number: https://doi.org/10.6084/m9.figshare.20128412.v1 37 . technical Validation evaluating the completeness of the genome assembly and annotation. To verify the integrity and accuracy of these assemblies, the completeness of the final genome assembly was assessed using Benchmarking Universal Single-Copy Orthologues (BUSCO) 38 with the lineage database Actinopterygii_odb10. From 3,640 single-copy orthologues, ∼97.5% were fully discovered in the monoploid genome, ∼97.3%, and ∼97.1% were fully found in the Haploid-1 and Haploid-2 genomes (Supplementary Table 4). In addition, the Illumina short reads used for the genome survey were mapped to the genome using BWA 39 and counted for mapping ratio determination using SAMtools 40 Table 4. Classification of repetitive sequences and ncRNAs of the T. ocellatus genome. www.nature.com/scientificdata www.nature.com/scientificdata/ and the genome coverages of the three assemblies were 99.84%, 99.86% and 99.84%, respectively (Supplementary Table 5). The consensus quality value (QV) of genomes representing per-base consensus accuracy was estimated by Merqury 41 , and that of all three assemblies exceeded 45 (Supplementary Table 4). In addition, a total of 28,808 nonredundant protein-coding genes were successfully produced by combining de novo, homologous searching and transcriptome-assisted predictions. A total of 22,531 genes were successfully functionally annotated (Fig. 2b & Table 5). The number of genes of T. ocellatus (27,015) predicted through de novo prediction and homolog annotation was slightly greater than that of other species of Takifugu, such as T. bimaclatus (21,117) 4 and T. obscurus (22,105) 6 , but slightly lower than that of T. flavidus (29,416) 5 . Hence, the high integration efficiency, mapping ratio, recognition rate of single-copy orthologues and gene number showed that three assemblies of T. ocellatus were of high quality.
To verify the accuracy of the contig anchoring, three chromosome-level assembled genomes (monoploid genome and two haploid genomes) were first aligned and named after the chromosome number of the published T. rubripes genome. Then, the monoploid assembly was aligned to 3 other species in the genus Takifugu, including T. bimaculatus, T. flavidus and T. rubripes. Two haploid assemblies were aligned mutually with a unit of 1 Kbp. The 22 chromosomes we identified in the T. ocellatus genome aligned exactly against the chromosomes of the other three Takifugu species, which suggested a high degree of concordance among them (Fig. 1). The haplotypes also showed strongly reciprocal collinearity (Supplementary Fig. 3).   www.nature.com/scientificdata www.nature.com/scientificdata/ Phylogenetic and evolutionary analysis. A total of 21,446 orthologous gene families were identified from the 13 related species (Supplementary Table 3). A total of 2,698 single-copy orthologous gene families in a 1:1:1 manner was identified and used for phylogenetic analysis (Supplementary Table 3). Phylogenetic analysis indicated that T. ocellatus speciated from the common ancestor of Takifugu at approximately 21.4 (mya; 15.3-27.6 mya) (Fig. 3), which was located at the base of the phylogenetic tree of the Takifugu genus, which was consistent with the previous phylogenetic relationship of the Takifugu genus based on mitochondrial and whole-genome resequencing 8 . In addition, our results showed that the divergence time between the Takifugu genus and the other freshwater Tetraodontidae species was 38.4 mya (27.9-51.0 mya). In addition, the phylogenetic relationship between Tetraodontiformes and other fish was also consistent with previous taxonomic studies 4,5 . Moreover, we uncovered 789 T. ocellatus gene families with expansion and 1,970 families with contraction (Fig. 3). GO enrichment analysis showed that the expanded gene families were mainly involved in the extracellular region (GO:0005576), lipid transport (GO:0006869), single-organism transport (GO:0044765) and growth factor activity (GO:0008083) ( Supplementary Fig. 4 and Supplementary Table 6). On the other hand, the