Background & Summary

The Aplacophora is a particular understudied molluscs that is evolutionarily and ecologically important in marine benthic fauna. As early-branching molluscs, Aplacophora is unusual as it is with worm-shaped, shell-less body plan, and covered by cuticle and calcareous sclerites. Two groups Caudofoveata and Solenogastres are often collectively referred to as Aplacophora1,2. The pedal groove does not exist in the ventral site of Caudofoveata, which distinguishes it from Solenogastres. Besides, Caudofoveata has gills at the tail of the body and is absence of a foot1. Caudofoveata has a worldwide distribution in benthic marine habitats and lives by burrowing in marine soft sediment feeding on organic contents or foraminiferans and small particles2,3. Due to the collection difficulty, Caudofoveata is one of the least known classes of Mollusca, with only 142 species (World Register of Marine Species, 2023). To data, a series of studies have researched their taxonomy4,5,6, phylogeny2,7,8,9,10,11, ecology3,12,13, and evolutions14,15,16,17.

Caudofoveata had been thought to be the earliest extant offshoots in Mollusca based on its unique body plan and shell-less morphological characters18,19. Phylogenomic analyses revealed that Mollusca included two clades, Aculifera (Caudofoveata, Solenogastres and polyplacophorans) and Conchifera (Gastropoda, Bivalvia, Cephalopoda, Scaphopoda and Monoplacophora)10,11. The fossil Kimberella quadrata was thought to be a stem-group mollusc and had certain traits similar to Aculifera, which indicated Caudofoveata was an early-branching Mollusca18,20,21. Overall, Caudofoveata is central to understand the origin and evolutionary history of molluscs, which is the second most diverse metazoan animal group.

Aplacophora is of particular interest from the benthic fauna of the deep sea especially regarding to its diversity and adaptation. With deep sea expeditions increasing in the Atlantic Ocean and the Northwest Pacific, more and more new Aplacophoran molluscs were described and studied3,22,23,24,25. Almost 86% of the Aplacophora were found at depths of more than 200 m1 and some species showed high abundances at deep-sea benthos26. Prochaetoderma yongei, a widespread deep sea Caudofoveata species in Atlantic, was thought to be successful due to its omnivorous and rapid development ability3,26. Helicoradomenia spp. which is a Solenogastres species has been found in the sulfide-based chemosynthetic hydrothermal vent with epi- and endocuticular bacterial symbionts27. By investigating the food sources and anatomy of 200 individuals within 60 candidate deep-sea Solenogastres species, researchers revealed a high degree of food specialization with modifications in the radula, foregut, and glands morphologies28. Considering their great ecological importance and diverse adaptation strategies to deep-sea environment, Aplacophora could be an ideal group to study deep-sea adaptation.

Despite the evolutionary and ecological importance, the studies of Caudofoveata are hampered by the lack of genomic resources. Here, we generated a high-quality chromosome-level genome of Chaetoderma sp. for the first time in the Caudofoveata based on PacBio long reads, Illumina short reads, and high-resolution chromosome conformation capture (Hi-C) sequencing reads. The final assembly of Chaetoderma sp. was 2.45 Gb, consisting of 17 chromosomes with scaffold N50 length of 141.46 Mb. We predicted 23,675 protein-coding genes from the genome of Chaetoderma sp. by integrating de-novo, homologous, and transcriptome annotation methods as well as manual correction. Through the analysis of intestinal microbial composition of Chaetoderma sp., we discovered that SUP05, a group of chemoautotrophic bacteria was the dominant bacterial community in the gut, indicating a potential symbiotic relationship between them. The resulting genome assembly, annotation, and report of symbiotic bacteria by 16S rRNA gene amplicon sequencing will provide a valuable resource for further studies of the Caudofoveata and for molluscan evolution and deep-sea ecology in general.

Methods

Sample collection and sequencing

The Chaetoderma sp. specimens were collected from Site F methane seep29 (also known as Formosa Ridge) by the TV grab in the South China Sea during the voyage of the scientific research ship KEXUE from 2020 to 2022. When the samples were collected by the TV grab onto the ship, they were flash-frozen in liquid nitrogen immediately and stored in −80 °C refrigerator. The same frozen specimen of Chaetoderma sp. was used to perform the Illumina, PacBio and Hi-C sequencing. The total genomic DNA was extracted from the body wall by SDS method and followed by chloroform purification, examination of the quantity and quality through Qubit and Agilent bioanalyzer instrument. The qualified genomic DNA was used to construct libraries.

Firstly, in order to estimate genome complexity, we used physical breaking method to break up the genome DNA to 350 bp fragment, and then built the small fragment sequencing library which was applied to an Novaseq 6000 platform (Illumina, Inc., San Diego, CA, USA). A total of 57.80 Gb 150 bp paired-end sequencing reads were obtained (Table 1). Secondly, the PacBio library was constructed by following the standard protocol of manufacturer (Pacific Biosciences, Menlo Park, CA, USA), including using g-TUBE to break up DNA, the repair of DNA, the connection of dumbbell connector, the digestion of exonuclease and the filtration of target DNA fragment by BluePippin. A total of three SMART cells and 62.4 Gb clean long HiFi reads with 26.01X coverage were sequenced through circular consensus sequencing (CCS) model (Table 1). Thirdly, we applied high-throughput chromatin conformation capture (Hi-C) method to generate a chromosome-level genome. As for Hi-C library, formaldehyde was used to fix cells, and DpnII restriction endonuclease was used to digest cells. Using the terminal repair mechanism, DNA was labelled and cycled. The Hi-C library was built by using streptavidin magnetic beads to selectively capture DNA fragments containing interaction relationships and was evaluated for quality through Qubit 2.0, Agilent 2100 systems and Q-PCR method. Illumina NovaSeq 6000 platform (Illumina, USA) was performed to execute Hi-C sequencing. We obtained 257.18 Gb clean reads in total (Table 1).

Table 1 Statistics of sequencing data.

To better annotate the genome assembly, we performed transcriptome sequencing of Chaetoderma sp. using the frozen body. The total RNA was extracted using Trizol (Invitrogen). Qubit and agarose gel electrophoresis were further applied to examine the concentration and quality of RNA. VAHTS® Universal V8 RNA-seq Library Prep Kit for Illumina (Vazyme #NR605) was used to constructed RNA-seq library. The sequencing library was further sequenced on Novaseq 6000 platform in 150 bp paired-end mode (Illumina, Inc., San Diego, CA, USA). A total of 6.5 Gb raw reads were obtained.

Genome assembly and Hi-C scaffolding

Genome size, repetitive sequence ratio, and heterozygosity were first estimated based on Illumina short-read data. We used jellyfish v2.3.030 and GenomeScope v1.031 to analyse the K-mer frequency (k = 21). Based on Illumina reads, the K-mer analysis showed that the genome size of Chaetoderma sp. was 2.2 Gb and the heterozygosity was 1.39%. HiFi-asm v0.1632 was applied to assemble the genome based on PacBio long-read data. Pilon v1.2333 was then used to polish the assembly with the Illumina short-read data. Purge_dups v1.2.534 was used to remove duplication. The assembly of Chaetoderma sp. genome is 2.45 Gb, consisting of 5,603 contigs with contig N50 of 1.06 Mb (Table 2). The BUSCO assessment value is 92.03% (metazoan_odb10) and the GC content of the genome assembly is 40.93%. As for Hi-C scaffolding, at first Juicer v1.635 was used to deal with the Hi-C sequencing data and obtain the input file for the next analysis. Then, 3D-DNA v20100836 as the core software was used to scaffold the contigs under default settings. Juicebox v1.11.0835 was used to visualize chromosome assembly results, choosing the best result from 3D-DNA output, and marking the correct boundary of the chromosome according to the interaction heatmap. Finally, we reran 3D-DNA using corrected assembly result and exported the final chromosome assembly genome. After Hi-C scaffolding, 94.83% genome reads were anchored to 17 chromosomes with scaffold N50 length of 141.46 Mb and BUSCO assessment value of 89.52% (Fig. 1, Tables 2, 4, and 6). The 17 chromosomes were exhibited clearly in the interaction heatmap (Fig. 2) and also had a conserved collinearity relationship with the chromosomes of Mizuhopecten yessoensis (Fig. 3). All the bioinformatic software mentioned in this section were used with default parameters.

Table 2 Statistics of genome assembly.
Table 3 Statistics of genome annotation.
Table 4 Statistics of BUSCO assessment after Hi-C.
Table 5 Statistics of Repeats.
Table 6 Statistics of 17 chromosomes.
Fig. 1
figure 1

Characterization of the assembled Chaetoderma sp. genome. From inner to outer layers: photograph of Chaetoderma sp.; distribution of GC content in the genome; repeat sequence density across the genome; LTR density across the genome; gene density across the genome. All the tracks are drawn in 1 Mb sliding windows and chromosome-level scaffolds at scale by circos v0.69-965.

Fig. 2
figure 2

Hi-C chromosome interaction heatmap. Abscissa and ordinate represent order of each bin on corresponding chromosome group. Scale bar illuminates intensity of interaction from white (low) to red (high).

Fig. 3
figure 3

Genomic collinearity between the molluscs Chaetoderma sp. and Mizuhopecten yessoensis.

Annotation of Repetitive Elements

De novo repeat library prediction and homology comparison were applied for repeats annotation. We employed RepeatModeler2 v2.0.137 with default parameters to construct the de novo repeat library. LTR_FINDER v1.0738 and LTR_retriever v2.9.039 were used to identify long terminal repeat (LTR) sequence in the genome by using default parameters. The de novo repeat library and LTR library were combined and removed redundancy to generate the final repeat library. RepeatMasker v4.140 (-frag 100000 -gc 33.37 -lcambig -xsmall -gff) was applied to identify repeats with RepBase and de novo species-specific library in the genome of Chaetoderma sp. The proportion of Transposon elements (TEs) in Chaetoderma sp. genome is 55.81%. Among them, retroelement accounts for 38.43%, DNA transposon accounts for 17.38%. The most abundant transposon type is the LTR (Table 5).

Gene Prediction

The Caudofoveata gene prediction is challenging because of the high ratios of TEs and long introns, as gene prediction programmes may split a single gene into truncated partial-gene models. We employed three different approaches to predict protein-coding genes, homolog-based, transcriptome-based annotation, and ab initio gene prediction. Homolog-based annotation was performed by TBLASTN v2.13.041 (-evalue 1e-10) based on homology sequences from Acanthopleura granulate, Crassostrea gigas. Genewise v2.4.142 (-nosplice_gtag -pretty -pseudo -gff -cdna -trans) was used to predict genes based on homologous proteins. Second, we fully utilized and integrated transcriptome evidence in the gene prediction process, since this evidence can be helpful in the case of high ratios of TEs and long introns. Trinity v2.13.243 was used for transcriptomic level de novo assembly with default parameters. Hisat2 v2.2.144 (--skip 8 --qc-filter) was used to align transcriptome data to the genome, StringTie v2.2.145 was used to predict the structure of all transcribed reads. Subsequently, Program to Assemble Spliced Alignment (PASA) v2.5.246 was employed to integrate genome and transcriptome results. Third, ab initio gene prediction was carried out on the repeat-masked genome assembly by Braker2 v2.1.647 and Augustus v3.5.048 with default parameters. Finally, EvidenceMolder v1.1.149 was employed to integrate gene models from different prediction tools. We further used tRNAscan-SE v1.3.150 and barrnap v0.9 (http://lup.lub.lu.se/student-papers/record/8914064) to identify tRNA and rRNA by using default parameters. Finally, we predicted 23,675 protein-coding genes from the chromosome-level genome of Chaetoderma sp. by integrating de novo, homologous, and transcriptome annotation methods as well as modification of several genes’ structure such as Hox by comparing with homolog species one by one manually (Table 3). We used the BUSCO v551 to evaluate the quality of annotation results. The BUSCO completeness score is 94.44%, and the single copy score is 91.82% (Table 4). 23,503 (99.27%) of the protein-coding genes we predicted were annotated through blasting against SwissProt52 and interproscan53 against pfam54 database (Table 3).

16S rRNA sequencing and analysis

The total genome DNA of the intestinal contents of Chaetoderma sp. was extracted through SDS method. After monitoring the DNA concentration and purity based on 1% agarose gels, DNA was diluted to 1 ng/µL using sterile water. Specific primer (V4: 515F-806R) and barcodes were applied to amplify 16S rRNA genes. The library was sequenced on the Illumina NovaSeq platform to obtain 250 bp paired-end reads. We used FLASH v1.2.1155 to merge paired-end reads and used fastp v0.20.056 for data quality control with default parameters. QIIME2 v20200657 with default parameters was used to obtain ASVs (Amplicon Sequence Variants) and annotate species based on Silva Database. The result of 16S rRNA sequencing showed that SUP05 was the most abundant bacteria in the intestinal contents of Chaetoderma sp. and SUP05 is a group of Gammaproteobacteria with chemoautotrophic ability58 (Fig. 4).

Fig. 4
figure 4

The intestinal microbial composition of Chaetoderma sp. SUP05, a group of Gammaproteobacteria with chemoautotrophic ability, was the most abundant bacteria in the intestinal contents of Chaetoderma sp.

Data Records

The raw Illumina, PacBio, and Hi-C sequencing data are deposited in the NCBI under the accession number SRP45722559. The assembled genome sequence is deposited into NCBI under accession number GCA_034401795.160. The genome annotation file is available from the Figshare repository61. The SRA database of transcriptome data is SRR2694995462. The SRA database of raw Illumina 16S rRNA sequencing is under the accession number SRP45864763.

Technical Validation

Evaluating genome assembly and annotation completeness

The final assembly of Chaetoderma sp.’s genome is 2.45 Gb, consisting of 17 chromosomes with contig N50 of 1.06 Mb and scaffold N50 of 141.46 Mb (Fig. 1, Table 2). The genome size is similar with the result that was estimated by jellyfish. In order to evaluate the genome assembly and annotation, we adopted two methods including Illumina reads remapping using Bowtie2 v2.4.564 and BUSCO v551 assessment using database metazoan_odb10. The alignment rate of Illumina reads was 95% (Table 2). 854 (89.52%) of 954 BUSCOs were included in the assembly of Chaetoderma sp. and 901 (94.44%) of 954 BUSCOs were included in the gene models of Chaetoderma sp. (Table 4). We also compared our results with other molluscs’ assembly and annotation (Table 7). Overall, these data indicate the genome assembly and annotation of Chaetoderma sp. is complete and high-quality.

Table 7 The comparison of genome assembly and annotation between Chaetoderma sp. and other Molluscs.