The first chromosomal-level genome assembly and annotation of white suckerfish Remora albescens

Remora albescens, also known as the white suckerfish, is recognized for its distinctive suction-cup attachment behavior and medicinal significance. In this study, we produced a high-quality chromosome-level genome assembly of R. albescens by integrating 23.87 Gb of PacBio long reads, 64.54 Gb of T7 short reads, and 88.63 Gb of Hi-C data. We first constructed a contig-level genome assembly totaling 605.30 Mb with a contig N50 of 23.12 Mb. Using Hi-C data, approximately 99.68% (603.38 Mb) of the contig-level genome was then assigned to 23 pseudo-chromosomes. By combining homology-based prediction, ab initio prediction, and RNA-sequencing evidence, we identified 22,445 protein-coding genes, of which 96.36% (21,629 genes) were functionally annotated. The genome assembly achieved an estimated completeness of 98.1% according to BUSCO analysis. This work promotes the applicability of the R. albescens genome, laying a solid foundation for future investigations into the genomics, biology, and medicinal importance of this species.


Background & Summary
Remora albescens, commonly known as the white suckerfish or white remora, belongs to the family Echeneidae (order Carangiformes) and inhabits warm seas (Fig. 1). Like other members of the Echeneidae, the white suckerfish has evolved a sucking disc from its front dorsal fin; the disc extends from the top of the head to the tips of the pectoral fins and consists of 13-14 plates [1]. This adaptation enables the fish to adhere to smooth surfaces through suction, and it spends the majority of its life clinging to a host animal, such as a manta ray or a shark [2]. It frequently attaches to the host's body, as well as inside the gill chamber and the mouth [2]. The relationship between a white suckerfish and its host is typically considered a form of commensalism, specifically phoresy. Besides its unique biological characteristics, the white suckerfish is used in traditional Chinese medicine for its reported positive effects on lung and spleen-stomach health [3], which gives it considerable medicinal and commercial value.
High-quality reference genomes are instrumental in identifying the genetic basis of, and the variation linked to, important traits. This knowledge helps us understand and make use of the biological characteristics of a species. To date, the genome of the white suckerfish has not been sequenced, which impedes exploration of the genetic basis of its biological features and behaviours. A high-quality chromosome-level reference genome will therefore contribute to understanding the genetic mechanisms responsible for the medicinal value of R. albescens.
In this study, by integrating PacBio high-fidelity (HiFi) long reads, T7 paired-end short reads and high-throughput chromatin capture (Hi-C) sequencing data (Table 1), we present the first chromosomal-level genome assembly of R. albescens. The assembly yielded a genome of 605.30 Mb, composed of 158 contigs, with a contig N50 length of 23.12 Mb. In total, 603.38 Mb, covering 99.68% of the contig-level genome, were anchored onto 23 chromosomes using Hi-C data. BUSCO analysis indicated that the final assembly contained 3,571 (98.1%) complete BUSCOs. In conclusion, this high-quality chromosomal-level reference genome establishes a valuable foundation for understanding the biological characteristics of R. albescens and for further research into its medicinal value.

Fish sample collection and preparation.
A single fish, measuring 18 centimeters in length, was obtained from the northern South China Sea in June 2022 (Fig. 1). The fish was collected in accordance with the guidelines and regulations of the Animal Care and Use Committee of the Fisheries College of Zhejiang Ocean University (Animal Ethics no. 1067). Tissues from R. albescens were collected and preserved in liquid nitrogen until DNA or RNA extraction. Of these, muscle and liver tissues were used for DNA sequencing for the genome assembly, and kidney, spleen, fin, gill and sucker tissues were used for RNA sequencing.
WGS BGISEQ library and PacBio library construction, sequencing and contig-level assembly. Genomic DNA was extracted from muscle tissue using a standard phenol/chloroform protocol and used to prepare the whole-genome sequencing (WGS) libraries.
To obtain BGISEQ short reads, the DNA sample was evaluated by 1% agarose gel electrophoresis and the Pultton DNA/Protein Analyzer (Plextech). A paired-end library with an insert size of 300 bp to 350 bp was then constructed following the BGISEQ standard protocol. The DNA sample was purified, quantified, and sequenced from both ends on the BGISEQ-T7 sequencing platform, yielding a total of 66.21 Gb of raw reads (Table 1). After filtering with fastp v0.23.2 [4] under default parameters to remove low-quality reads, short reads, adapters and redundant sequences, a total of 64.54 Gb of clean reads was obtained (Table 1). K-mer analysis was then performed with GCE v1.0.0 [5] to estimate the genome size and heterozygosity of R. albescens, which were 563 Mb and 0.63%, respectively (Fig. 2).
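The k-mer estimate rests on a simple relation: genome size is approximately the total number of k-mers divided by the depth of the main peak in the k-mer frequency histogram. The Python sketch below illustrates only that relation. It is not GCE's actual model, which also accounts for heterozygosity and repeats, and the function name, the error cutoff `min_depth` and the histogram values are hypothetical toy assumptions rather than data from this study.

```python
def estimate_genome_size(histogram, min_depth=3):
    """Estimate genome size from a k-mer frequency histogram.

    histogram: dict mapping k-mer depth -> number of distinct k-mers at that depth.
    min_depth: depths below this are treated as sequencing errors (assumed cutoff).
    Returns (estimated genome size, depth of the main peak).
    """
    usable = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Main peak: the depth shared by the largest number of distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences contributed by the retained depths.
    total_kmers = sum(depth * count for depth, count in usable.items())
    return total_kmers / peak_depth, peak_depth


# Hypothetical toy histogram (not data from this study).
toy_histogram = {1: 5000, 2: 800, 18: 200, 19: 600, 20: 1000, 21: 600, 22: 200, 40: 50}
size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated size = {size:.0f} k-mers")
```

In this simplified form, repeats (here the k-mers at depth 40) contribute in proportion to their depth, which is why the low-depth error k-mers have to be excluded before the sum is taken.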
To obtain PacBio long reads, the DNA sample was first evaluated using Nanodrop, Qubit and agarose gel electrophoresis. A library with a fragment size of 20 kb was then prepared using the SMRTBell template preparation kit 1.0 following the manufacturer's instructions, and sequenced on the PacBio Sequel II platform in Circular Consensus Sequence (CCS) mode. After removing low-quality sequences using the CCS v6.0.0 algorithm with default parameters, a total of 23.87 Gb of high-precision reads with an N50 of 18.88 kb was obtained. From these HiFi reads, the initial contigs were assembled using Hifiasm v0.16.1 [6] and purge_haplotigs [7] with default settings. The assembly yielded a 605.30 Mb genome with a maximum contig size of 51.46 Mb.
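The read and contig N50 values quoted in this section follow the usual definition: the length L such that sequences of length L or longer together cover at least half of the total bases. A minimal Python sketch of that definition is given below. The function name and the toy lengths are illustrative assumptions, and this is not the code used by the assembly tools themselves.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together account for at least half of the total bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical toy contig lengths (not data from this study).
toy_contigs = [10, 8, 6, 4, 2]
print(n50(toy_contigs))
```

Applied to the lengths of the 158 contigs in an assembly FASTA file, the same function would give the contig N50 in base pairs.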

Hi-C library preparation, sequencing and chromosomal-level assembly. The contigs obtained in the previous step were anchored onto chromosomes using Hi-C data. Briefly, 1 g of liver tissue from R. albescens was treated with 1% formaldehyde for 20 minutes at 20-25 °C to cross-link proteins involved in chromatin interactions. DNA was then digested with MboI, the overhangs of the resulting restriction fragments were labeled with biotinylated nucleotides, and the fragments were ligated within a confined volume. After cross-link reversal, the ligated DNA was purified and fragmented to a size range of 300-500 bp. Ligation junctions were then captured with streptavidin beads and sequenced from both ends on the BGISEQ-T7 sequencing platform, producing a total of 88.75 Gb of raw data (Table 1). After removing low-quality sequences and adapters with fastp v0.23.2 [4], and retaining only read pairs in which both reads were longer than 50 bp, a total of 88.63 Gb of clean data was obtained (Table 1). We used the HiCUP pipeline [8] to obtain a credible, nonredundant contig interaction matrix, and then anchored the contigs onto chromosomes using the 3D-DNA pipeline [9]. Juicebox Assembly Tools [10] was used for manual correction of chromosome inversions and translocations. Finally, 603.38 Mb (~99.68%) of the contig-level assembled sequences were positioned onto 23 pseudo-chromosomes (Fig. 3A).
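The anchoring rate and the fraction of Hi-C data retained after filtering follow directly from the sizes reported in this study. The short Python check below recomputes both from the published values; the helper name `percent` is an illustrative assumption.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Values reported in the text (Mb for assembly sizes, Gb for Hi-C data).
anchored_mb, contig_assembly_mb = 603.38, 605.30
hic_clean_gb, hic_raw_gb = 88.63, 88.75

anchoring_rate = percent(anchored_mb, contig_assembly_mb)
hic_retained = percent(hic_clean_gb, hic_raw_gb)

print(f"anchored onto pseudo-chromosomes: {anchoring_rate:.2f}%")
print(f"Hi-C data retained after filtering: {hic_retained:.2f}%")
```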
RNA library construction and sequencing. Total RNA was extracted from five tissues of R. albescens (kidney, spleen, fin, gill and sucker) using TRIzol reagent (Invitrogen). RNA quality was evaluated using the NanoDrop ND-1000 spectrophotometer (Labtech) and the 2100 Bioanalyzer (Agilent Technologies). Paired-end reads were sequenced on the BGISEQ-T7 platform. Overall, 6.01 Gb of clean data were obtained after filtering with fastp v0.23.2 [4] under default settings to remove low-quality and short reads and to trim adapters and polyG tails (Table 1).

Data Records
The raw sequencing data for R. albescens generated in this study are available from the Sequence Read Archive (SRA) under BioProject number PRJNA1036795, which includes WGS T7 sequencing data (SRR26831100 [29]), PacBio HiFi sequencing data (SRR26831099 [30]), Hi-C sequencing data (SRR26831098 [31]), and RNA sequencing data (SRR28537587 [32]). The assembled genome of R. albescens has been deposited in GenBank under accession JAXCVL000000000 [33]. Additionally, files containing the assembled genome, protein-coding gene annotation, non-coding RNA prediction, and repeat annotation of R. albescens have been made available in the Figshare database [34].

Table 1. Statistics of sequencing data for the Remora albescens genome assembly and annotation.

Table 2. Comparison of the R. albescens genome assembly metrics with those of E. naucrates.

Table 3. Statistics on transposable elements in the R. albescens genome.

Table 4. Statistics of gene predictions in the R. albescens genome.

Table 5. Summary of functional annotations for predicted genes of the R. albescens genome.

Table 6. Statistics of ncRNA in the R. albescens genome.

Table 7. Statistics of T7 and PacBio data remapped to the R. albescens genome.

Table 8. Statistics of BUSCO assessment in the R. albescens genome.