Background & Summary

Sipuncula (peanut worms) are unsegmented coelomate worms with bilaterally symmetrical bodies that are divided into a trunk and a retractable introvert1. Belonging to Lophotrochozoa, they are believed to form a small phylum with approximately 150 described species2. Nevertheless, they are widely distributed in the world’s oceans at all depths, occupying most marine habitats, from intertidal zones to abyssal depths and polar to equatorial seas, including extreme environments. Over the past 520 million years, the typical features of extant Sipuncula have undergone only minor changes3. Therefore, Sipuncula is a valuable resource for studying environmental adaptation and evolution and a potential indicator of global climate change. In coastal environments, these species play a critical role in bioturbation, reshaping the physicochemical properties and biological characteristics of the sediment4. In marine wetlands and pond aquaculture systems, Sipuncula and other taxa increase organic matter transport and improve ecosystem services5. However, the gene and genome data for Sipuncula that are available in public databases are insufficient.

Despite the early recognition of the group, the phylogenetic relationships between Sipuncula and other taxa are unclear. Sipunculus nudus was first described by Linnaeus in 1767 and was later considered to be a derived group of annelids6,7,8. Morphological and developmental characteristics suggest that Sipuncula is the sister group of Mollusca9. However, phylogenetic analyses based on mitochondrial DNA sequences as well as traits related to nervous and muscle system development indicate that Sipuncula is more closely related to Annelida than to Mollusca10,11. Torsten et al. performed phylogenomic analyses using 47,953 amino acid positions to explore the relationships among 34 annelid taxa and found that Sipuncula belongs to Annelida12. Given these conflicting lines of evidence, the assignment of Sipuncula to Annelida remains controversial. Furthermore, the lack of segments in Sipuncula, which distinguishes it from other annelid taxa, provides a basis for understanding the mechanism underlying segment development. Genome sequence information is important for phylogenetic analyses. However, sequencing data for molluscs and annelids are limited. In Sipuncula, only one draft genome, that of Phascolosoma esculenta, has been published, by Zhong et al.13. Those genome data suggested that Sipuncula belongs to Annelida; however, the evolutionary relationships among Polychaeta, Oligochaeta, and Hirudinea in their reconstructed phylogenetic tree were inconsistent with previous results, making evolutionary inferences difficult. Therefore, additional genome data for Sipuncula, especially chromosome-level genome data, are needed to clarify the evolutionary relationships of lophotrochozoans and to provide genomic resources for “evo-devo” studies of body segmentation.

S. nudus is a cosmopolitan Sipuncula species that is distributed in temperate, subtropical, and tropical waters in all oceans (Fig. 1). In this study, we assembled the first high-quality genome of S. nudus, using PacBio HiFi reads for assembly and high-throughput chromosome conformation capture (Hi-C) data for chromosome anchoring. The final genome assembly is approximately 1,427 Mb, with a contig N50 of 29.46 Mb and a scaffold N50 of 80.87 Mb. Using Hi-C data, 97.91% of the assembled bases were anchored to 17 chromosomes. These high-quality genomic data are expected to improve the resolution of phylogenetic analyses of Sipuncula and to provide a reference for detailed analyses of their characteristics, adaptation to complex habitats, and ecological niches.

Fig. 1

Distribution of S. nudus worldwide. Red triangles represent collection locations reported in the literature.


Sample collection and DNA extraction

Male 2-year-old S. nudus samples were collected from the field in Suixi, Zhanjiang, Guangdong Province, China (21°35′N, 109°81′E), and were used for whole-genome sequencing. The body wall tissue was stored in liquid nitrogen, and total genomic DNA was isolated using the QIAGEN DNeasy Blood & Tissue Kit (QIAGEN, Shanghai, China) following the manufacturer’s instructions.

Library construction and sequencing

Three SMRTbell libraries for circular consensus sequencing (CCS) were constructed according to the standard PacBio protocol using 15–20 kb preparation solutions (Pacific Biosciences, Menlo Park, CA, USA). Five cells were sequenced on the PacBio Sequel II platform in CCS mode (Pacific Biosciences) to generate HiFi (high-fidelity) reads. Each HiFi read was produced by calling a consensus from the subreads generated by multiple passes of the enzyme around a circularized template, which yields reads that are both long and accurate. In total, 103.13 Gb of HiFi reads (72.63× coverage) was generated, with a read N50 of 14,008 bp (Table 1).
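The two summary statistics quoted above (read N50 and fold coverage) can be reproduced with a few lines of Python. This is a minimal sketch rather than the authors' pipeline: the function names are ours, the toy read lengths are invented, and the 1.42 Gb denominator is inferred from the reported 72.63× figure because the text does not state which genome size was used.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no lengths given")


def fold_coverage(total_bases, genome_size):
    """Sequencing depth = total sequenced bases / genome size."""
    return total_bases / genome_size


# Toy read lengths (hypothetical): total = 20, and 6 + 5 = 11 >= 10.
toy_reads = [2, 3, 4, 5, 6]
print(n50(toy_reads))

# 103.13 Gb of HiFi data over an assumed ~1.42 Gb genome.
print(round(fold_coverage(103.13e9, 1.42e9), 2))
```

The same `n50` function applies unchanged to contig or scaffold lengths.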

Table 1 HiFi sequencing data statistics.

Hi-C libraries were prepared as previously reported14. The body wall tissue cells were fixed by using formaldehyde to keep the 3D structure of DNA intact. Cells were digested with the HindIII restriction endonuclease. Biotin-labelled bases were used for end repair. The DNA fragments maintaining interaction relationships were captured to construct the Hi-C library. Finally, 289.30 Gb of high-quality Hi-C data (Q20 > 98% and Q30 > 94%) was obtained with the BGISEQ-500 sequencing platform (Table 2).

Table 2 Hi-C sequencing data statistics.

Genome survey and assembly

The size, heterozygosity, and repeat rate of the S. nudus genome were estimated using the k-mer frequency method. Jellyfish15 and GenomeScope16 (v1.0) were employed to calculate the k-mer frequency (k = 21) from the HiFi reads, and the genome size was estimated to be 1,305 Mb with a peak k-mer depth of 66×. The heterozygosity and repeat rate were 2.03% and 39.68%, respectively (Fig. 2). We first assembled the genome from the HiFi reads using hifiasm (v0.15.1)17 with default parameters. After the preliminary assembly, we used purge_haplotigs18 to remove haplotigs. The resulting haploid assembly was 1,426.68 Mb in size, with a contig N50 of 29.46 Mb (Fig. 3 and Table 3).
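The idea behind k-mer-based size estimation is that genome size is roughly the total number of k-mers divided by the depth of the main peak. The sketch below illustrates only that arithmetic on an invented histogram; it is not GenomeScope's model, which fits a mixture distribution and handles the heterozygous peak explicitly. The function name, the `min_depth` error cutoff, and the toy counts are all assumptions.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at
    that depth. Depths below min_depth are treated as sequencing
    errors and ignored (an assumed, simplistic cutoff).
    Returns (estimated_size, peak_depth).
    """
    usable = {d: c for d, c in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the most distinct k-mers is taken as the main peak.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum of depth * count.
    total_kmers = sum(d * c for d, c in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (hypothetical): an error tail at depths 1-2 and a
# single peak centred on depth 10.
toy = {1: 1000, 2: 300, 9: 10, 10: 50, 11: 10}
print(estimate_genome_size(toy))
```

In a highly heterozygous genome the tallest peak may be the heterozygous one at half the homozygous depth, in which case this naive estimate would be about twice too large; that is one reason a model-based tool is preferable.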

Fig. 2

Overview of the 21-mer frequency distribution in the S. nudus genome. The X-axis is the k-mer depth, and the Y-axis represents the k-mer frequency for a given depth.

Fig. 3

Length distribution of contigs in the preliminary genome assembly. The N50 value and number of contigs were 29,460,569 bp and 17, respectively.

Table 3 Genome assembly statistics using PacBio HiFi reads and Hi-C data.

The contigs were anchored to chromosomes using Hi-C data. Juicer (version 1.6)19 was used to align the paired-end Hi-C reads against the assembled genome and to evaluate the Hi-C library. The 3D-DNA pipeline20, run with default parameters and without breaking contigs, was used to generate the final chromosome-level scaffolds. Manual checking and refinement of the draft assembly were carried out with Juicebox Assembly Tools (v1.1108). A heatmap of the Hi-C interaction bins indicated that the quality of the genome assembly was excellent (Fig. 4). The length of the final assembled genome was 1,426,776,655 bp, with a contig N50 of 29,460,569 bp and a scaffold N50 of 80,869,746 bp (Table 3 and Fig. 3). Approximately 1,397 Mb (97.91%) of the contig sequences were anchored to 17 chromosomes (Table 4), which is consistent with the karyotype reported in our previously published manuscript21. After aligning the HiFi reads to the assembly with minimap2 (v2.17, parameters: -a -x map-pb)22, we used BamDeal to evaluate the mapping rate and coverage and obtained estimates of 99.95% and 99.73%, respectively. The CIRCOS tool23 was used to visualize the 17 chromosomes, GC content, read depth and mapping depth (Fig. 5). The average depth of each chromosome was calculated and is shown in Fig. 6. The seventeen chromosomes had comparable sequencing depths, and no chromosome showed half the read depth across its whole length, which suggests that no XY- or ZW-type sex chromosomes are present among the assembled chromosomes of S. nudus. When the GC content and average read depth were calculated in 20-kb nonoverlapping sliding windows along the chromosomes, a small cluster of windows (581 windows, 11.6 Mb in total) exhibited relatively high GC content (>48%) but a normal sequencing depth (Fig. 7). 
These high-GC blocks were extracted and aligned to the NT database (Nucleotide Sequence Database) using MegaBlast (parameter: -e 1e-5), and the alignments with identity >90% and coverage length >100 bp were retained. The matched reference species in the alignments were grouped into three categories: S. nudus, other invertebrates, and all other species. All 228 matched sequences corresponded to S. nudus or other invertebrate species (Fig. 8), which indicates that the sequence blocks with high GC content and normal depth originate from S. nudus rather than from contamination or cobionts.
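The window-based GC screen described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the function names are ours, a trailing partial window is simply dropped, and ambiguous bases such as N are excluded from the denominator; the source does not say how either case was handled.

```python
def gc_windows(sequence, window=20_000):
    """Yield (start, gc_fraction) for each full non-overlapping window.

    A trailing partial window is skipped, and bases other than
    A/C/G/T are excluded from the denominator (both are assumptions).
    """
    seq = sequence.upper()
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        called = gc + chunk.count("A") + chunk.count("T")
        if called == 0:
            continue  # window made up entirely of uncalled bases
        yield start, gc / called


def high_gc_windows(sequence, window=20_000, threshold=0.48):
    """Return the windows whose GC fraction exceeds the threshold."""
    return [(s, g) for s, g in gc_windows(sequence, window) if g > threshold]


# Toy sequence with a 10-bp window: an AT-only window, a GC-only
# window, and a 3-bp remainder that is ignored.
toy = "AT" * 5 + "GC" * 5 + "ATA"
print(high_gc_windows(toy, window=10))

# Sanity check on the reported totals: 581 windows of 20 kb each.
print(581 * 20_000 / 1e6, "Mb")
```

Pairing each window's GC fraction with its mean read depth gives the two axes of a GC-versus-depth scatter plot, where contaminant sequence typically appears as a cluster that is offset in both GC and depth.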

Fig. 4

Hi-C interaction heatmap. Chr01–Chr17 indicate the 17 chromosomes. The abscissa and ordinate represent the order of each bin on the corresponding chromosome group. The colour block demonstrates the intensity of the interaction from yellow (low) to red (high).

Table 4 Genome chromosome length statistics.
Fig. 5

Genomic landscape of the 17 assembled chromosomes of S. nudus. Sliding window: 1 Mb; A: Assembled chromosomes; B: Gene density (0–50); C: Repeat content (0–100%); D: GC content (30–45%); E: Mapping depth (30–100×).

Fig. 6

The read depth in each chromosome.
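The per-chromosome depth comparison can be expressed as a simple check for chromosomes whose mean depth sits near half of the typical value, which is the pattern expected for a hemizygous sex chromosome in the heterogametic sex. The sketch below is illustrative only: the function name, the tolerance, and the depth values are hypothetical and are not taken from the study.

```python
from statistics import median


def half_depth_chromosomes(mean_depth, tolerance=0.15):
    """Return chromosomes whose mean depth is close to half the median.

    mean_depth maps chromosome name -> mean read depth. A chromosome
    is flagged when its depth / median depth lies within `tolerance`
    of 0.5 (the tolerance is an arbitrary, assumed value).
    """
    reference = median(mean_depth.values())
    flagged = []
    for name, depth in mean_depth.items():
        ratio = depth / reference
        if abs(ratio - 0.5) <= tolerance:
            flagged.append(name)
    return flagged


# Hypothetical depths: Chr03 is sequenced at about half the depth.
print(half_depth_chromosomes({"Chr01": 70, "Chr02": 72, "Chr03": 36}))

# Hypothetical depths with no outlier: nothing is flagged.
print(half_depth_chromosomes({"Chr01": 70, "Chr02": 72, "Chr03": 69}))
```

An empty result is compatible with the absence of differentiated sex chromosomes, but it cannot rule out young or homomorphic sex chromosomes whose two copies still map to the same sequence.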

Fig. 7

GC Content and Sequencing Depth. The x-axis represents the GC content; the y-axis represents the average depth.

Fig. 8

The categories of reference species in the alignments from the NT database.

Repeat annotation

Prior to gene prediction, we identified the repetitive elements in the genome of S. nudus by using a combination of homology-based and ab initio methods. To identify tandem repeats, we used Tandem Repeats Finder24 (v4.09). For the homology-based method, transposable elements were identified by RepeatMasker v4.0.7 (-nolow -no_is -norna -engine ncbi -parallel 1) and RepeatProteinMask v4.0.7 (-engine ncbi -noLowSimple -pvalue 0.0001)25 against the TE protein databases and the RepBase26 library (v21.12). For the ab initio method, LTR_FINDER27 (v1.06) and RepeatModeler v1.0.8 with default parameters were used to build a de novo library, and RepeatMasker v4.0.7 was then used to classify the different categories of repetitive elements against this library. The final set of repetitive elements was obtained by integrating the results of these methods according to sequence overlap, revealing that nearly half of the genome consists of repetitive elements (Tables 5, 6; Fig. 5).

Table 5 Genome repetitive element statistics.
Table 6 TE type statistics.

Gene prediction

Gene annotation was performed by integrating homology-, de novo- and transcriptome-based information. We used the annotation data from three reference species (Caenorhabditis elegans, Capitella teleta, and Helobdella robusta) for homology prediction. The MAKER tool28 was used to integrate the annotation data from these three species and the transcriptome data from S. nudus. Based on the AED values from MAKER, 2,000 genes with complete structures were selected and used to train the de novo prediction tools Augustus29 and SNAP30. Finally, all data were integrated using MAKER28. The final comprehensive gene set contained 28,749 genes (Table 7).

Table 7 General statistics of predicted protein-coding genes.

Gene function annotation

Gene function annotation was performed based on sequence similarity and domain conservation. First, the protein-coding genes of S. nudus were aligned against the KEGG31, SwissProt32, TrEMBL33, GO34, KOG, and Nr databases by using BLASTP with an E-value threshold of 1e-5, and the best match from each alignment was used to predict gene function. Second, InterProScan (51.0–55.0)35 searches against the following databases were used to identify motifs and domains: PANTHER36, Pfam37, PRINTS38, ProDom39, SUPERFAMILY40, and SMART41. In total, 88.75% of the predicted genes were functionally annotated (Table 8).

Table 8 Functional annotation statistics.

Data Records

The National Center for Biotechnology Information (NCBI) BioProject accession number for the sequences reported in this paper is PRJNA901211. The raw HiFi and Hi-C sequencing data were submitted to the NCBI SRA (accession number SRP408321) and deposited in the CNGB Sequence Archive (CNSA) of the China National GeneBank DataBase (CNGBdb) (accession numbers CNR0640303-CNR0640323). The assembled genome sequence was deposited into NCBI44 under accession number JAPPUL000000000. The assembled genome, gene structure annotation, repeat predictions, gene function annotation, KEGG analysis of expanded genes and positively selected gene data were deposited in the China National GeneBank DataBase (CNGBdb) under the project with accession number CNP0003624.

Technical Validation

Genome assembly and gene prediction quality assessment

The BUSCO pipeline (v5)45 was used to evaluate the completeness of the genome assembly and gene set based on a benchmark of 255 conserved genes in eukaryota_odb10 (creation date: 2020-09-10, number of genomes: 70, number of BUSCOs: 255). In total, 97.7% of the 255 expected conserved genes were identified as complete, and 2% were identified as fragmented (Table 9). Furthermore, we used minimap2 (v2.17, parameters: -a -x map-pb)22 to align the HiFi reads to the assembly, and the mapping rate and coverage rate were estimated to be 99.95% and 99.73%, respectively. Together, the BUSCO and alignment results indicate high completeness and accuracy of the genome assembly.

Table 9 Evaluation of genome assembly completeness.

Comparative genomic analysis

The protein-coding genes of S. nudus and 15 additional species were used to identify orthologous gene groups. The reference protein sequences of the following 15 species were obtained: Caenorhabditis elegans (Ensembl Release 10), Danio rerio (Ensembl Release 10), Homo sapiens (Ensembl Release 10), Drosophila melanogaster (Ensembl Release 10), Capitella teleta (NCBI: GCA_000328365.1), Crassostrea gigas (NCBI: GCF_902806645.1), Dimorphilus gyrociliatus (NCBI: GCA_904063045.1), Eisenia andrei (PRJCA002327), Helobdella robusta (NCBI: GCF_000326865.1), Lamellibrachia satsuma (NCBI: GCA_022478865.1), Lottia gigantea (NCBI: GCF_000327385.1), Metaphire vulgaris (NCBI: GCA_018105865.1), Owenia fusiformis (NCBI: GCA_903813345.2), Phascolosoma esculenta (PRJNA819496), and Nematostella vectensis (NCBI: GCF_932526225.1) as the outgroup. For the gene family analysis, orthogroups of the 16 species were identified using OrthoFinder (v2.3.11) with default parameters46. In total, 416,469 genes from the 16 species were grouped into 30,677 gene families, of which 717 gene families comprising 4,217 genes were unique to S. nudus. The gene families and genome statistics of all the species are shown in Table 10. Among the orthologous genes in the 16 species, a total of 255 single-copy genes were identified. The single-copy orthologues were aligned using MUSCLE (v3.7)47 with default parameters, and the aligned protein sequences were then reverse translated into codon sequences. The alignments were concatenated to generate a superalignment matrix for phylogenetic reconstruction based on the maximum-likelihood (ML) method using IQ-TREE (v1.6.12)48, with the best-fit substitution model determined using ModelFinder49. 
Divergence times for each node in the phylogenetic tree were estimated using MCMCtree, which is implemented in the PAML package50 (v4.8a), under the following parameters: -nsample 100000, -rootage 800, and -burnin 500000. The calibration times were obtained from TimeTree51: 630.0–830.0 million years ago (Ma) for Caenorhabditis elegans and Homo sapiens, 424.2–440.0 Ma for Danio rerio and Homo sapiens, and 545.0–681.5 Ma for Capitella teleta and Crassostrea gigas. The phylogenetic tree representing the evolutionary relationships among Mollusca, Annelida and Sipuncula is shown in Fig. 9. Gene collinearity, which reflects the preservation of ancestral genome structure in modern genomes, is an important means of studying genome evolution. Thus, MCscan (Python version)52 was used for the genomic synteny analysis between S. nudus, O. fusiformis and P. esculenta. The collinearity figure was drawn with JCVI based on the homologous blocks with ≥4 collinear gene pairs between species (Fig. 10). Regarding intergenomic gene collinearity, 109 blocks containing 508 collinear gene pairs were found between S. nudus and O. fusiformis, and 622 blocks containing 3,248 collinear gene pairs were found between S. nudus and P. esculenta, indicating considerably stronger collinearity between the two Sipuncula species than between S. nudus and O. fusiformis.
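The reverse-translation step mentioned above, in which each aligned protein sequence is converted back into a codon alignment, amounts to threading the original coding sequence through the protein alignment three bases at a time. The following is a minimal sketch of that idea, not the tool the authors used (which the text does not name); the function name is ours, and the coding sequence is assumed to contain no terminal stop codon.

```python
def protein_to_codon_alignment(aligned_protein, cds, gap="-"):
    """Thread an unaligned CDS onto its aligned protein sequence.

    Each residue consumes the next codon of the CDS and each gap
    becomes three gap characters. The CDS must encode exactly the
    ungapped protein (no stop codon is expected; this is an assumption).
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if residues != len(codons):
        raise ValueError("CDS and protein lengths do not match")
    codon_iter = iter(codons)
    pieces = []
    for aa in aligned_protein:
        pieces.append(gap * 3 if aa == gap else next(codon_iter))
    return "".join(pieces)


# Toy example: protein "MK" aligned as "M-K".
print(protein_to_codon_alignment("M-K", "ATGAAA"))
```

Aligning at the protein level and then back-translating keeps codons intact, so the resulting nucleotide alignment never splits a codon with a gap.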

Table 10 The gene family statistics.
Fig. 9

Phylogenetic tree of S. nudus and other species. The red branch represents Annelida, and the green branch represents Mollusca.

Fig. 10

Genome synteny analysis between S. nudus and P. esculenta as well as between S. nudus and O. fusiformis. Twelve chromosomes of O. fusiformis, seventeen chromosomes of S. nudus and 283 contigs of P. esculenta are shown.

The time-calibrated phylogenetic tree was used to assess gene family expansions and contractions using CAFÉ53 (v4.2.1) under a random birth-and-death model with the parameter lambda. In total, 543 significantly expanded and 97 significantly contracted gene families were identified (P < 0.05). GO and KEGG enrichment analyses of the expanded gene families revealed that these families were mainly involved in pathways that are related to apoptosis, detoxification, the immune response, amino acid and fatty acid metabolism anion, oxidative stress, and energy metabolism.

Positively selected genes (PSGs) were predicted using branch-site likelihood ratio tests for single-copy gene families with a conservative 10% false discovery rate (FDR) criterion54. We used proteins from S. nudus, C. teleta, E. andrei, L. satsuma, O. fusiformis, and P. esculenta to extract 3,192 one-to-one orthologous genes using the OrthoFinder (v2.3.11) pipeline. Multiple sequence alignments of the one-to-one orthologous genes were then generated using PRANK (v. 121002)55. The dN/dS ratios of the codons were calculated using the branch-site model of Codeml in the PAML package50, in which S. nudus was set as the foreground branch and the other five taxa as background branches. Using a likelihood ratio test (LRT) P value of ≤0.05 and an FDR of ≤0.05 as thresholds, 326 PSGs were identified in the S. nudus genome. These PSGs were significantly enriched in the terms “Spliceosome,” “Base excision repair,” “DNA replication,” and “Cell cycle” in the KEGG pathway enrichment analysis.
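The statistical filter applied to each orthologue can be sketched as a likelihood ratio test followed by a multiple-testing correction. The code below is an illustration under assumptions the source does not confirm: the LRT statistic is compared against a chi-square distribution with one degree of freedom, and the FDR is controlled with the Benjamini-Hochberg procedure. The function names and the log-likelihood values are hypothetical.

```python
import math


def lrt_pvalue(lnl_null, lnl_alt):
    """P value of a likelihood ratio test, chi-square with 1 d.f.

    The statistic is 2 * (lnL_alt - lnL_null), floored at zero. For one
    degree of freedom the chi-square survival function equals
    erfc(sqrt(x / 2)), so no external library is needed.
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return math.erfc(math.sqrt(stat / 2.0))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical (null, alternative) log-likelihoods for three genes.
fits = [(-1000.0, -994.0), (-2500.0, -2499.9), (-800.0, -796.5)]
pvals = [lrt_pvalue(null, alt) for null, alt in fits]
qvals = benjamini_hochberg(pvals)
selected = [i for i, (p, q) in enumerate(zip(pvals, qvals))
            if p <= 0.05 and q <= 0.05]
print(selected)
```

Some studies instead compare the branch-site statistic against a 50:50 mixture of a point mass at zero and a one-degree-of-freedom chi-square, which halves the P value; the plain chi-square used here is the more conservative choice.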

In summary, we obtained a high-quality chromosome-level genome of S. nudus, which contributes to our understanding of the evolutionary status of Sipuncula and the evolutionary relationships among the subgroups of the phylum Annelida. The gene family expansion and contraction analyses and the genomic synteny analyses point to potential mechanisms by which Sipuncula adapt to different living environments.

Usage Notes

All analyses were run on Linux systems, and the optimal parameters are given in the Code availability section.