Gapless genome assembly of East Asian finless porpoise

In recent years, conservation efforts have increased for rare and endangered aquatic wildlife, especially cetaceans. However, the East Asian finless porpoise (Neophocaena asiaeorientalis sunameri), which has a wide distribution in China, has received far less attention and protection. The lack of a chromosome-level reference genome for this endangered small cetacean limits our understanding of its population genetics and conservation biology. To address this issue, we combined PacBio HiFi long reads and Hi-C sequencing data to generate a gapless genome of the East Asian finless porpoise that is approximately 2.5 Gb in size over its 21 autosomes and two sex chromosomes (X and Y). A total of 22,814 protein-coding genes were predicted, of which ~97.31% were functionally annotated. This high-quality genome assembly of the East Asian finless porpoise will not only provide new resources for the comparative genomics of cetaceans and the conservation biology of threatened species, but also lay a foundation for further studies of speciation, ecology, and evolution.
Measurement(s): Neophocaena asiaeorientalis sunameri • Gapless genome assembly • sequence annotation
Technology Type(s): MGISEQ-2000 • PacBio HiFi Sequencing • Hi-C
Sample Characteristic - Organism: Neophocaena asiaeorientalis sunameri
Sample Characteristic - Environment: seawater
Sample Characteristic - Location: Yellow Sea near Lianyungang City, Jiangsu Province, China


Background & Summary
The finless porpoises (Neophocaena spp.) are a group of small toothed whales that are mainly distributed in southern and eastern Asia. Their distribution includes the coastal waters of the western Pacific Ocean, the Indian Ocean, and the Sea of Japan; in Chinese waters they also occur in the Bohai Sea, Yellow Sea, East China Sea, South China Sea, and the middle and lower reaches of the Yangtze River 1,2 . Since Cuvier first named the species Delphinus phocaenoides in 1829, the taxonomy and nomenclature of the finless porpoise have been controversial 3,4 . For decades, the finless porpoise was considered to be a single species consisting of three subspecies 5-7 , until Wang and Jefferson et al. concluded, based on morphological and genetic characteristics, that the genus Neophocaena can be divided into two separate species: the Indo-Pacific finless porpoise (N. phocaenoides) and the narrow-ridged finless porpoise (N. asiaeorientalis). The narrow-ridged finless porpoise can in turn be divided into two subspecies, the Yangtze finless porpoise (N. a. asiaeorientalis) and the East Asian finless porpoise (N. a. sunameri) 8,9 , and this classification has been generally accepted. In 2018, Zhou et al. performed de novo genome sequencing of the Yangtze finless porpoise and re-sequenced three geographic populations in Chinese waters to investigate the freshwater adaptation mechanisms of the Yangtze finless porpoise 10 . Their results showed that the genetic differentiation between the Yangtze finless porpoise and the East Asian finless porpoise reached the interspecific level, which supports their classification as independent species 10 .
Regarding conservation, the IUCN Red List of Threatened Species categorized the Yangtze finless porpoise as "critically endangered" in 2013 11 , and the narrow-ridged finless porpoise as "endangered" in 2017 12 . However, the East Asian finless porpoise was not listed separately. The East Asian finless porpoise was listed in the Second Class of the National Key Protected Wild Animals List in China announced on February 5, 2021. Similar to other small cetaceans throughout the world, the East Asian finless porpoise population faces many critical threats, such as marine pollution, fishing-related injury, loss of important habitat, and the decline of fish resources under the dual influence of global climate change and human activities 13 . Ultimately, the prospects for population growth of the East Asian finless porpoise are not optimistic, and there is an urgent need for more conservation efforts for this species. sequenced and compared the renal transcriptomes between the Yangtze finless porpoise and the East Asian finless porpoise to investigate the mechanism of osmotic pressure regulation with adaptation to their different habitats 18 . Additionally, Li et al. used a single hydrophone to record and analyze the echolocation signals of East Asian finless porpoises in Liaodong Bay and conducted a comparative study with Yangtze finless porpoises 19 . Further, Dong et al. concluded that the migration pattern of the East Asian finless porpoise population is mainly related to the migratory distribution of its preferred fish 20 , and finless porpoises have a broad diet that largely consists of fish, shrimp, and cephalopods 21,22 . Although these studies help to explain the migration behavior and adaptation of finless porpoises, more research needs to focus on improving conservation efforts for the East Asian finless porpoise.
In China, conservation research on the East Asian finless porpoise is less active than that on the Yangtze finless porpoise. There is a serious lack of basic research on the East Asian finless porpoise, especially on its current population size, distribution characteristics, migration patterns, and key habitats. To date, its population size and distribution pattern in China have not been systematically assessed, and little is known about its key habitats. Consequently, its conservation biology should receive more attention, because it is an endangered marine mammal that is also listed as a second-class key protected wild animal in China.
The goal of this study was to assemble a gapless genome for the East Asian finless porpoise to aid in the conservation of this species. Here, we report a gapless cetacean genome that was generated by combining PacBio HiFi long reads and Hi-C sequencing data. We sequenced and analyzed the genome of the East Asian finless porpoise at the chromosomal level to gain a deeper understanding of its genetic background and evolutionary characteristics. The assembled genome size is approximately 2.5 Gb. Consequently, only 28 gaps were retained in our assembly for filling in a next step. As shown by the telomere-to-telomere assembly of the human genome published this year 23 , ultra-long (>100-kbp) nanopore reads can span complex repeats and enable complete assembly of the centromeres and telomeres. Gene annotation yielded 22,814 protein-coding genes, and 97.31% of the predicted genes were annotated in publicly available biological databases, including NR, GO, KOG, KEGG, TrEMBL, InterPro and Swiss-Prot. This high-quality assembled genome will provide rich research resources for conservation biology and phylogenetic studies on the East Asian finless porpoise, as well as for research on the genetic differentiation and adaptive evolution of other small toothed whales, such as the Yangtze finless porpoise.

Methods
Sample collection. A muscle sample was collected from a male specimen of East Asian finless porpoise that died in the Yellow Sea near Lianyungang City, Jiangsu Province, China, in 2019 (Fig. 1). No ethical issues were considered in this study. The muscle sample was washed 3 times with phosphate-buffered saline (PBS), quickly frozen in liquid nitrogen, and stored at −80 °C until DNA extraction.
WGS library construction and genome size estimation. DNA was extracted from the muscle specimen of the East Asian finless porpoise using MZ 1.3 (hypervariable minisatellite probe), as well as locus-specific minisatellite probes (g3, MS1 and MS43). For the short-insert WGS library, DNA was sheared into fragments between 50 and 800 bp using a Covaris E220 ultrasonicator (Covaris, Brighton, UK) according to the manufacturer's instructions. Fragments between 300 and 400 bp were selected to generate a single-stranded circular DNA library. The DNA library was sequenced on an MGISEQ-2000 platform. A total of 232.16 Gb of raw short reads were generated, and 182.87 Gb of clean data were retained after adapter removal and filtering of low-quality reads with SOAPnuke (v2.0) 24 using the parameters "-n 0.01 -l 20 -q 0.1 -i -Q 2 -G 2 -M 2 -A 0.5" (Supplementary Table S1).
We used KmerGenie (v1.7051) 25 to estimate the genome size with varied k-mer sizes from 21 to 121 (Fig. 2a). According to the smooth curves of estimated genome size, we obtained the predicted optimal k-mer size of 91 and the predicted genome size of 2,475,638,739 bp (Fig. 2b). The predicted genome size of the East Asian finless porpoise is consistent with that of the Yangtze finless porpoise (2.49 Gb) found in a previous study 10 .
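The caption of Fig. 2 states the relation behind this estimate: genome size = k-mer number / peak depth. The following Python sketch illustrates that calculation on a toy histogram. It is not KmerGenie's implementation; the function name, the `min_depth` cut-off for discarding error k-mers, and the example counts are assumptions made only for illustration.

```python
from typing import Dict, Tuple


def estimate_genome_size(histogram: Dict[int, int],
                         min_depth: int = 5) -> Tuple[int, int]:
    """Return (peak_depth, genome_size) from a k-mer depth histogram.

    `histogram` maps a depth to the number of distinct k-mers observed
    at that depth. Depths below `min_depth` are treated as sequencing
    errors and ignored (an assumption of this sketch).
    """
    solid = {d: c for d, c in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # Peak depth: the depth at which the most distinct k-mers occur.
    peak_depth = max(solid, key=solid.get)
    # K-mer number: total k-mer occurrences, i.e. depth times count.
    kmer_num = sum(d * c for d, c in solid.items())
    # Genome size = K-mer num / Peak depth.
    return peak_depth, kmer_num // peak_depth


# Toy histogram: an error peak at low depth and a main peak at depth 28.
toy = {1: 1000, 2: 200, 26: 50, 27: 80, 28: 100, 29: 80, 30: 50}
peak, size = estimate_genome_size(toy)
print(peak, size)
```

Read in the other direction, a peak depth of approximately 28 and an estimated size of 2,475,638,739 bp imply on the order of 6.9 × 10^10 k-mer occurrences in the 91-mer spectrum.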
PacBio library preparation, sequencing, and de novo assembly using HiFi reads. DNA was extracted from the same muscle specimen using a QIAGEN Blood & Cell Culture DNA Midi Kit following the manufacturer's instructions (QIAGEN, Germany). After DNA extraction, two sequencing libraries were prepared according to the "Using SMRTbell Express Template Prep Kit 2.0 With Low DNA Input" protocol from PacBio with an insert size of approximately 20 kb (Pacific Biosciences, USA). The libraries were then sequenced on PacBio Sequel II SMRT Cells in circular consensus sequencing (CCS) mode. A total of 5 SMRT Cells were sequenced. In total, 2,397 Gb of subreads were processed using the CCS algorithm of SMRTLink (v8.0.0) 26 with the parameters "-minPasses 3 -minPredictedAccuracy 0.99 -minLength 500", yielding 154 Gb of PacBio high-fidelity (HiFi) long reads (Supplementary Table S1). From the HiFi reads, the primary contigs were assembled with Hifiasm (v0.15.1) 27 using default parameters. Afterwards, the Purge-Haplotigs 28 program was used to remove redundant sequences with the parameters "-j 80 -s 80 -a 30", which yielded a contig assembly with a size of approximately 2.50 Gb and a contig N50 of 84.69 Mb (Table 1).
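Contig N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. As a minimal illustration of how such a value is derived from a list of contig lengths, a Python sketch follows; the function name and the toy lengths are hypothetical and are not taken from the assembly reported here.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no lengths given")
    half = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy contig lengths (arbitrary units); the total is 300, so half is 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70, because 80 + 70 reaches 150
```

In practice the lengths would be read from the FASTA index of the contig assembly, and the same function applies unchanged to scaffold lengths.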
Hi-C library preparation, sequencing, and chromosome anchoring. The same muscle specimen was fixed with 1% formaldehyde for 10-30 min at room temperature to cross-link the proteins that are involved in chromatin interactions in the genome. The restriction enzyme Mbo I (NEB, Ipswich, USA) was then added to digest the DNA, and fragments with blunt or sticky ends were obtained. The ends were blunted and repaired, and then labeled with biotin. The interacting fragments were ligated with T4 DNA ligase (Thermo Scientific, USA) to form a loop. Proteins that connected the DNA fragments were then digested to release the cross-linked fragments, and the DNA was sheared again by sonication. A Hi-C library was made by capturing the biotin with magnetic beads and was sequenced on an MGISEQ-2000 instrument. A total of 219.2 Gb of clean data were obtained from 263.87 Gb of sequencing data using SOAPnuke (v2.0) 24 with parameters "-n 0.01 -l 20 Table S1).
To anchor contigs onto chromosomes, the Hi-C clean data were mapped to the assembled contigs using BWA (v0.7.12) 29 , and then erroneous mappings (MAPQ = 0) and duplicates were filtered out by the Juicer pipeline (v1.5) 30 to obtain the interaction matrix. Subsequently, approximately 625.70 Mb reads (~77.77%) were used to anchor the contigs into chromosomes with the 3D-DNA pipeline (v180922) 31 . The 3D-DNA pipeline 31 was also used to remove selected short contigs using default parameters. The Hi-C contact maps were then reviewed with Juicebox Assembly Tools (v2.15.07) 30 . These processes generated a final genome assembly with a genome size of approximately 2.50 Gb and a contig N50 of 84.69 Mb. Remarkably, 52 contigs were linked onto the 21 autosomes, two sex chromosomes, and one mitochondrial sequence (Fig. 3, Tables 1 and 2).
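The contig and sequence counts above can be related to the 28 gaps mentioned in the Background & Summary by simple arithmetic: a sequence built from m ordered contigs contains m - 1 joins. The short Python sketch below makes this explicit. It assumes that all 52 contigs were placed and that the mitochondrial sequence is counted among the anchored sequences; the function name is hypothetical.

```python
def count_gaps(n_contigs: int, n_sequences: int) -> int:
    """Number of joins (gaps) when n_contigs are ordered into n_sequences.

    Assumes every contig is placed and every sequence holds at least one
    contig, so a sequence made of m contigs contributes m - 1 gaps.
    """
    if n_sequences <= 0 or n_contigs < n_sequences:
        raise ValueError("need at least one contig per sequence")
    return n_contigs - n_sequences


# Counts taken from the text: 21 autosomes, X and Y, one mitochondrial sequence.
autosomes, sex_chromosomes, mitochondrion = 21, 2, 1
sequences = autosomes + sex_chromosomes + mitochondrion
print(count_gaps(52, sequences))  # 28
```

Under these assumptions the result agrees with the 28 gaps reported as retained for later filling.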
Identification of Y chromosome sequences. Generally, assembling the Y chromosome is challenging due to its complex repetitive nature. Here, we assembled all the PacBio HiFi reads into contigs with Hifiasm 27 , and then filtered the redundant sequences using Purge-Haplotigs 28 . Finally, we anchored the non-redundant contigs into scaffolds with the Hi-C data. To identify the Y sequence, we mapped the scaffold sequences onto the Y chromosome of Tursiops truncatus 32 . Additionally, we mapped the contig sequences of the East Asian finless porpoise genome onto the Y chromosome of the Tursiops truncatus genome using RagTag (v2.1.0) 33 with default parameters. Both methods yielded a candidate Y sequence with a high degree of similarity, and we selected the Y sequence assembled by the first method. The newly assembled Y chromosome sequence is 11.02 Mb in length and contains 82 intact protein-coding gene models. Of these
Fig. 2 (caption): (a) ... Gb of high-quality data was used to generate 16 different k-mer depth distribution curves with KmerGenie. The k-mer size was automatically set by the software from 21 to 121. The x-axis indicates k-mer size, while the y-axis is the number of genomic k-mers at that k-mer size. (b) 91-mer depth frequency distribution. The x-axis is depth (X), while the y-axis is the proportion that represents the frequency at that depth divided by the total frequency of all depths. The genome size was estimated using the following formula: Genome size = K-mer num/Peak depth. The peak depth is approximately 28 and the estimated genome size is 2,475,638,739 bp.

Data records
The DNA sequence reads of the East Asian finless porpoise (experiment of DNA sequencing data from the genome survey library: SRR21047154 63 ; experiments of DNA sequencing data from the Hi-C library: SRR20760935 64 -SRR20760936 65 ; experiments of DNA sequencing data from the PacBio HiFi library: SRR20997931-SRR20997935 66-70 ) have been deposited in the Sequence Read Archive (SRA) under project number SRP389529 71 . The Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession JANJGR000000000 72 . Files of the assembled genome, gene structure annotation, repeat predictions and gene functional annotation of the East Asian finless porpoise were deposited at the Figshare database under DOI code 73 .

Technical Validation
Evaluation of the genome assembly. Compared with the assembly metrics of other cetacean species, our assembly is substantially improved, with longer contigs and scaffolds, which indicates that it is highly contiguous. Our gapless genome assembly improved the contiguity metrics 941-fold (by contig N50) or 921-fold (by the number of contigs) compared to a previously reported Yangtze finless porpoise assembly 10 . Among the public cetacean genomes, our assembly had the longest contig N50 length and the smallest number of gaps, which suggests that our East Asian finless porpoise genome is of high quality (Table 3).
To assess the completeness of our East Asian finless porpoise genome, we performed an analysis using Benchmarking Universal Single-Copy Orthologs (BUSCO, v5.1.0) 74 with the mammalia_odb10 database. The results showed that 95.4% of the expected mammalian genes (including 93.7% single-copy and 1.7% duplicated ones) had complete gene coverage, 1.1% were identified as fragmented, and 3.5% were considered missing in our East Asian finless porpoise genome. Nevertheless, the completeness of the East Asian finless porpoise genome is superior to that of other current public cetacean genomes (Table 3).
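The BUSCO categories are mutually exclusive, so the complete fraction should equal the sum of the single-copy and duplicated fractions, and complete, fragmented, and missing should together account for 100%. The Python sketch below checks the percentages quoted above for internal consistency; the function name and the rounding tolerance are assumptions, and no BUSCO output file is parsed.

```python
from typing import Tuple


def check_busco(single: float, duplicated: float, fragmented: float,
                missing: float, tol: float = 0.05) -> Tuple[float, bool]:
    """Return (complete %, whether the four categories sum to 100 %).

    Values are percentages rounded to one decimal place, so `tol`
    (a hypothetical tolerance) absorbs rounding in the reported figures.
    """
    complete = round(single + duplicated, 1)
    total = complete + fragmented + missing
    return complete, abs(total - 100.0) <= tol


# Genome-level BUSCO values quoted in the text.
complete, consistent = check_busco(single=93.7, duplicated=1.7,
                                   fragmented=1.1, missing=3.5)
print(complete, consistent)  # 95.4 True
```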
To evaluate the telomere sequences assembled in the East Asian finless porpoise genome, we used the Telomere Identification toolkit (Tidk, v0.2.0) (https://github.com/tolkit/telomeric-identifier) to search for telomere repeats (TTAGGG) along the genome sequence. Telomere sequences were detected on at least one end of 23 chromosomes, such as Chr5 and Chr11. Some sequences contained only partial telomere sequences, which should be further investigated and optimized.
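The idea behind such a search can be sketched in a few lines of Python: count copies of the TTAGGG repeat, on either strand, within a window at each end of a sequence. This is only an illustration of the principle and not how Tidk is implemented; the function name, the 1,000-bp default window, and the toy sequence are assumptions.

```python
import re
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def telomere_repeat_counts(seq: str, window: int = 1000,
                           motif: str = "TTAGGG") -> Tuple[int, int]:
    """Count motif copies (either strand) in the first and last `window` bases."""
    motif = motif.upper()
    # Reverse complement of the motif, e.g. TTAGGG -> CCCTAA.
    revcomp = motif.translate(_COMPLEMENT)[::-1]
    pattern = re.compile(f"{motif}|{revcomp}")
    seq = seq.upper()
    head, tail = seq[:window], seq[-window:]
    # findall returns non-overlapping matches, i.e. repeat units.
    return len(pattern.findall(head)), len(pattern.findall(tail))


# Toy chromosome: telomeric repeats at both ends around a motif-free body.
toy_chromosome = "CCCTAA" * 50 + "ACGT" * 500 + "TTAGGG" * 50
print(telomere_repeat_counts(toy_chromosome, window=300))  # (50, 50)
```

A sequence end with a count near zero would correspond to a chromosome end without a detected telomere, and a low but non-zero count to a partial telomere of the kind noted above.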
To compare the genome consistency between the East Asian finless porpoise and the Yangtze finless porpoise, we used MUMmer (v4.0.0) 75 with the parameters "-mum -c 500 -l 40" to identify similar regions at the genome level. Additionally, we used BLAST 60 and WGDI (https://github.com/SunPengChuan/wgdi) to search for synteny blocks with at least ten gene pairs at the gene level. These two analyses revealed that the two genomes are highly consistent (Fig. 5).
Evaluation of the gene annotation. We performed BUSCO 74 analysis with the mammalia_odb10 database to assess the completeness of the coding sequences for the East Asian finless porpoise. The results showed that 97.9% of the expected mammalian genes (including 96.7% single-copy and 1.2% duplicated ones) had complete gene coverage, only 0.5% were identified as fragmented, and 1.6% were considered missing in our East Asian finless porpoise genome. Compared with other completeness evaluations of protein-coding genes, the East Asian finless porpoise annotation has a high degree of integrity (Table 4).

Code availability
No specific code was developed for this work. The data analyses were performed according to the manuals and protocols provided by the developers of the corresponding bioinformatics tools that are described in the Methods section together with the versions used.