A chromosome-level genome assembly of a deep-sea starfish (Zoroaster cf. ophiactis)

Understanding of adaptation and evolution of organisms in the deep sea requires more genomic resources. Zoroaster cf. ophiactis is a sea star in the family Zoroasteridae occurring exclusively in the deep sea. In this study, a chromosome-level genome assembly for Z. cf. ophiactis was generated by combining Nanopore long-read, Illumina short-read, and Hi-C sequencing data. The final assembly was 1,002.0 Mb in length, with a contig N50 of 376 Kb and a scaffold N50 of 40.4 Mb, and included 22 pseudo-chromosomes, covering 92.3% of the assembly. Completeness analysis evaluated with BUSCO revealed that 95.91% of the metazoan conserved genes were complete. Additionally, 39,426 protein-coding genes were annotated for this assembly. This chromosome-level genome assembly represents the first high-quality genome for the deep-sea Asteroidea, and will provide a valuable resource for future studies on evolution and adaptation of deep-sea echinoderms.

DNA extraction, library preparation and sequencing. High molecular weight (HMW) genomic DNA was extracted from frozen tissues using the SDS method and then purified with the QIAGEN® Genomic kit (QIAGEN) following the manufacturer's instructions. DNA quality was assessed by 1% agarose gel electrophoresis and with a NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA), with OD 260/280 of 1.8-2.0 and OD 260/230 of 2.0-2.2. DNA quantity was measured with a Qubit® 3.0 Fluorometer (Invitrogen, USA). DNA libraries for Illumina sequencing were prepared using the TruSeq Nano DNA HT Sample Preparation Kit (Illumina, USA) according to the manufacturer's protocols. The libraries were sequenced on the Illumina HiSeq 4000 platform, yielding 150-bp paired-end reads with an insert size of ~350 bp. In total, ~103 Gb of Illumina raw reads were obtained.
For the Oxford Nanopore library preparation, genomic DNA fragments > 20 kb were selected using the BluePippin system (Sage Science, USA). Approximately 2 µg of HMW DNA was used as input for the Ligation Sequencing Kit SQK-LSK109 (Oxford Nanopore Technologies, UK), according to the manufacturer's instructions. Sequencing was performed on a Nanopore PromethION sequencer (Oxford Nanopore Technologies, UK). A total of ~60 Gb of Nanopore raw reads were generated.
A high-throughput chromatin conformation capture (Hi-C) method was applied to generate a chromosome-level genome assembly. Briefly, the frozen arm tissues were crosslinked with 2% formaldehyde and then digested with the restriction enzyme MboI (400 units). The DNA ends were tagged with biotin-14-dCTP, and fragments were sheared to 200-600 bp. The resulting Hi-C library was sequenced on the Illumina HiSeq 4000 platform (paired-end 150 bp reads). In total, ~72 Gb of raw Hi-C reads were obtained.
RNA extraction and transcriptome sequencing. Total RNA was isolated from the frozen arm tissue using Trizol (Invitrogen, Carlsbad, CA, USA), following the manufacturer's instructions. The concentration of the isolated RNA was measured using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, USA), and its quality was evaluated by 1.5% agarose gel electrophoresis. RNA integrity was quantified with the Agilent 5400 fragment analyzer (Agilent, USA). RNA-seq libraries were constructed with the NEBNext® Ultra™ RNA Library Prep Kit (NEB, USA) following the manufacturer's instructions. Libraries were then sequenced on an Illumina HiSeq 4000 platform (paired-end 150 bp reads). In total, ~8 Gb of raw reads were generated and used for gene prediction.
Genome assembly. Genome size, the proportion of repetitive sequences, and heterozygosity were estimated from the Illumina short-read data by k-mer analysis with Jellyfish v2.3.0 [26]. Based on the ~103 Gb of Illumina data and the 19-mer frequency distribution, a total of 78,106,733,386 k-mers were obtained after removing k-mers with abnormal depth, and the main 19-mer peak was at a depth of 74. The genome size of Z. cf. ophiactis was therefore estimated as 78,106,733,386/74 ≈ 1,055 Mb, with a heterozygosity of about 0.32% and a repetitive sequence proportion of roughly 69.85% (Fig. 1).
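The genome size estimate above is the total k-mer count divided by the depth of the main peak. The following is a minimal sketch of that arithmetic, not the pipeline used in the study: the helper names, the `min_depth` cutoff, and the toy histogram are assumptions made for illustration, and only the two constants come from the text.

```python
from typing import Dict


def find_main_peak(histogram: Dict[int, int], min_depth: int = 5) -> int:
    """Return the depth with the highest k-mer frequency.

    Depths below `min_depth` are ignored because low-depth k-mers are
    dominated by sequencing errors (the cutoff here is an assumption).
    """
    candidates = {d: c for d, c in histogram.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no depths at or above min_depth")
    return max(candidates, key=candidates.get)


def estimate_genome_size(total_kmers: int, peak_depth: int) -> float:
    """Estimate genome size in bp as total k-mers / main peak depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return total_kmers / peak_depth


# Hypothetical toy histogram (depth -> number of distinct k-mers).
toy_histogram = {1: 900, 2: 300, 3: 50, 72: 80, 73: 95, 74: 100, 75: 93}
toy_peak = find_main_peak(toy_histogram)

# Values reported in the text for the 19-mer analysis.
TOTAL_KMERS = 78_106_733_386
PEAK_DEPTH = 74

genome_size_mb = estimate_genome_size(TOTAL_KMERS, PEAK_DEPTH) / 1e6
print(f"Estimated genome size: {genome_size_mb:.1f} Mb")
```

With the reported values the estimate is about 1,055 Mb, which is close to the final assembly length of 1,002.0 Mb.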
Fig. 1 The 19-mer frequency distribution of the Zoroaster cf. ophiactis genome. The x-axis is the k-mer depth, and the y-axis represents the corresponding frequency of the k-mer at a given depth.
The Nanopore long-read data were used to generate a contig-level assembly of the Z. cf. ophiactis genome. A preliminary assembly was generated with WTDBG2 v2.5 [27] (parameters: -p 19 -AS 2 -s 0.05 -L 5000 -t 36 -fo starfish). Three rounds of polishing were then carried out with the ~103 Gb of Illumina reads using NextPolish v1.2.0 [28]. Hi-C data were used for the chromosome-level genome assembly.
Raw Hi-C paired reads were trimmed with Fastp v0.20.0 [29] and aligned to the draft assembly with Juicer v1.5.7 [30] using default settings. Contigs were scaffolded using the 3D-DNA pipeline v180114 [31] with all valid Hi-C reads. The chromosome-scale scaffolds were adjusted manually in Juicebox v1.11.0812 [32] with the aid of the Hi-C contact map, whereby redundant contigs were removed and misjoins were corrected. All corrections were incorporated into the assembly using the 3D-DNA post-review pipeline. Ultimately, the contigs were anchored to 22 pseudo-chromosomes, accounting for 92.3% of the total genome (… [43] with an e-value ≤ 1e-5. For the transcriptome-based annotation, clean RNA-seq reads were aligned to the Z. cf. ophiactis genome assembly using HISAT2 v2.2.1 [44], and a gene set was predicted using the PASA v2.3.2 pipeline [45]. Finally, results from ab initio prediction, homology-based prediction, and transcript-based prediction were integrated using EvidenceModeler v1.1.1 [46] to generate a consensus, non-redundant gene set. Overall, 39,426 protein-coding genes were annotated for the Z. cf. ophiactis genome by combining the three methods, with average exon and intron lengths of 217.7 bp and 1,952.8 bp, respectively (Table 3). The average length and number of the genes, exons, and introns of the Z. cf. ophiactis genome were comparable to those reported in other sea stars [24].
Functional annotation of the predicted protein-coding genes was performed against six public databases, Kyoto Encyclopedia of Genes and Genomes (KEGG), Gene Ontology (GO), NCBI-NR (non-redundant protein database), Swiss-Prot, SMART and InterProScan, using BLASTP v2.2.23 [47] with an e-value cutoff of 1e-5. In total, 36,557 (92.72%) of the predicted genes were annotated by at least one public database (Table 4).
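"Annotated by at least one public database" amounts to taking the union of the gene sets with a hit in each database. The sketch below illustrates that bookkeeping and recomputes the reported percentage. The function name and the toy gene identifiers are hypothetical; only the counts 36,557 and 39,426 come from the text.

```python
from typing import Dict, Set, Tuple


def annotation_summary(hits_by_db: Dict[str, Set[str]],
                       total_genes: int) -> Tuple[int, float]:
    """Return (genes with a hit in >= 1 database, percent of all genes)."""
    annotated = set().union(*hits_by_db.values())
    percent = 100.0 * len(annotated) / total_genes
    return len(annotated), percent


# Hypothetical toy input: gene IDs with a hit in each database.
toy_hits = {
    "KEGG": {"g1", "g2"},
    "GO": {"g2", "g3"},
    "Swiss-Prot": {"g1", "g4"},
}
toy_count, toy_percent = annotation_summary(toy_hits, total_genes=5)

# Recompute the percentage reported in the text from its two counts.
reported_percent = 100.0 * 36_557 / 39_426
print(f"Annotated by at least one database: {reported_percent:.2f}%")
```

A gene found in several databases is counted once, which is why the union, not the sum of per-database counts, gives the figure in Table 4.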

Data records
All raw Illumina, Nanopore, and Hi-C sequencing data obtained in this study have been deposited in the NCBI Sequence Read Archive (SRA) database with the accession numbers SRR22953576-SRR22953579, and SRR24759671 under the BioProject PRJNA891479 [48]. The final genome assembly has been deposited in the Science Data Bank of Chinese Academy of Sciences [49].
… Table 1). Notably, the Z. cf. ophiactis genome assembly is much larger than the genomes reported for other asteroids, including species in the order Forcipulatida (402-561 Mb) [21][22][23][24] and those in the order Valvatida (384-608 Mb) [20,25,52]. In addition, 22 pseudo-chromosomes were generated for the Z. cf. ophiactis genome assembly. This chromosome number is consistent with previous karyotyping studies on some asteroids, including species from Forcipulatida [53]. It is further supported by recent genome studies on several starfish species in which 22 pseudo-chromosomes were identified by the Hi-C method [22][23][24].
The completeness of the Z. cf. ophiactis genome assembly was assessed using the conserved metazoan gene set "metazoan_odb10" with Benchmarking Universal Single-Copy Orthologs (BUSCO) v4.0 [54]. The genome assembly showed a high level of completeness (95.91%). Of the 954 single-copy orthologs, 95.28% were complete and single-copy, 0.63% were complete and duplicated, 0.84% were fragmented, and 3.25% were missing (Table 5). In addition, the clean Illumina short reads used for the genome survey were aligned back to the Z. cf. ophiactis genome assembly with the Burrows-Wheeler aligner (BWA) v0.7.17-r1198 [55], and 99.35% of the short reads mapped to the genome. Together, these results indicate the high quality of the Z. cf. ophiactis genome assembly.
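The BUSCO percentages above can be checked for internal consistency by converting them back to ortholog counts. This is a small sketch under the assumption that each percentage was rounded to two decimals from an integer count out of 954; the dictionary keys and function name are illustrative and are not BUSCO output fields.

```python
from typing import Dict

TOTAL_BUSCOS = 954  # size of the metazoan gene set stated in the text

# Percentages reported for the genome assembly.
assembly_pct = {
    "complete_single": 95.28,
    "complete_duplicated": 0.63,
    "fragmented": 0.84,
    "missing": 3.25,
}


def percentages_to_counts(pct: Dict[str, float], total: int) -> Dict[str, int]:
    """Convert rounded percentages back to the nearest integer counts."""
    return {name: round(value * total / 100) for name, value in pct.items()}


counts = percentages_to_counts(assembly_pct, TOTAL_BUSCOS)

# Completeness is single-copy plus duplicated complete orthologs.
complete = counts["complete_single"] + counts["complete_duplicated"]
complete_pct = 100.0 * complete / TOTAL_BUSCOS

print(counts)
print(f"Complete: {complete} of {TOTAL_BUSCOS} ({complete_pct:.2f}%)")
```

The four categories recover counts that sum to 954, and the two complete categories together reproduce the reported 95.91%.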
Chromosome synteny. Syntenic relationships among the genomes of Z. cf. ophiactis and two other Forcipulatida starfish, Asterias rubens (GCF_902459465.1) [56] and Plazaster borealis (GCA_021014325.1) [24], were inferred and visualized using BLASTP and NGenomeSyn v1.37 [57]. The three starfish showed highly conserved syntenic relationships, with every chromosome matching its counterparts well (Fig. 3). This finding provides new evidence of a high level of synteny conservation in the order Forcipulatida [24].

Gene annotation validation.
To evaluate the completeness of the annotated gene set, we performed a BUSCO analysis using the conserved metazoan database "metazoan_odb10". The results showed that 97.07% of the conserved single-copy orthologs were complete (96.23% single-copy and 0.84% duplicated), 0.73% were fragmented, and 2.2% were missing (Table 5). Additionally, functional annotation of the predicted genes revealed that 92.72% of them were annotated by at least one public database (… [62] was used to produce the ML trees with the following parameters: -m GTRGAMMA -x 12345 -N 100. The phylogenetic tree was reconstructed with 1,316 single-copy orthologs (Fig. 4). Zoroaster cf. ophiactis clustered with A. rubens and P. borealis (both in the family Asteriidae); all three belong to the order Forcipulatida. This clade was then grouped with two starfish species (A. planci and P. miniata) from the order Valvatida. Expansion and contraction of gene families were evaluated with CAFE v5 [63] at a p-value threshold of 0.05. A total of 1,162 gene families were expanded while 55 were contracted in the deep-sea starfish, Z. cf. ophiactis (Fig. 4).

Code availability
No specific code was used in this study. All commands and pipelines used in the data processing were run according to the manuals and protocols of the corresponding bioinformatics software, with parameters described in the Methods section. Where no parameters are given for a program, default parameters were used.