The first high-quality chromosome-level genome of the Sipuncula Sipunculus nudus using HiFi and Hi-C data

Zheng, Zhe; Lai, Zhuoxin; Wu, Bin; Song, Xinlin; Zhao, Wei; Zhong, Ruzhuo; Zhang, Jiawei; Liao, Yongshan; Yang, Chuangye; Deng, Yuewen; Mei, Junpu; Yue, Zhen; Jian, Jianbo; Wang, Qingheng

doi:10.1038/s41597-023-02235-7

Data Descriptor
Open access
Published: 25 May 2023

Zhe Zheng^1,2,na1
Zhuoxin Lai^1,na1
Bin Wu^3,na1 (ORCID: orcid.org/0000-0001-7507-9970)
Xinlin Song^1
Wei Zhao^3
Ruzhuo Zhong^1
Jiawei Zhang^1
Yongshan Liao^1,2
Chuangye Yang^1,2
Yuewen Deng^1,2
Junpu Mei^3,4
Zhen Yue^4
Jianbo Jian^3
Qingheng Wang^1,2 (ORCID: orcid.org/0000-0001-9148-6613)

Scientific Data volume 10, Article number: 317 (2023) Cite this article

Abstract

Sipuncula is a class of exocoelomic unsegmented animals whose evolutionary relationships are unresolved. The peanut worm Sipunculus nudus is a globally distributed, economically important species belonging to the class Sipuncula. Herein, we present the first high-quality chromosome-level assembly of S. nudus based on HiFi reads and high-resolution chromosome conformation capture (Hi-C) data. The assembled genome was 1,427 Mb, with a contig N50 length of 29.46 Mb and a scaffold N50 length of 80.87 Mb. Approximately 97.91% of the genome sequence was anchored to 17 chromosomes. A BUSCO assessment showed that 97.7% of the expected conserved genes were present in the genome assembly. The genome was composed of 47.91% repetitive sequences, and 28,749 protein-coding genes were predicted. A phylogenetic tree demonstrated that Sipuncula belongs to Annelida and diverged from the common ancestor of Polychaeta. The high-quality chromosome-level genome of S. nudus will serve as a valuable reference for studies of the genetic diversity and evolution of Lophotrochozoa.

Background & Summary

Sipuncula (peanut worms) are unsegmented coelomate worms with bilaterally symmetrical bodies that are divided into a trunk and a retractable introvert¹. Belonging to Lophotrochozoa, they are believed to form a small phylum with approximately 150 described species². Nevertheless, they are widely distributed in the world’s oceans at all depths, occupying most marine habitats, from intertidal zones to abyssal depths and polar to equatorial seas, including extreme environments. Over the past 520 million years, the typical features of extant Sipuncula have undergone only minor changes³. Sipuncula is therefore a valuable resource for studying environmental adaptation and evolution and a potential indicator of global climate change. In coastal environments, these species play a critical role in bioturbation, reshaping the physicochemical properties and biological characteristics of the sediment⁴. In marine wetlands and pond aquaculture systems, Sipuncula and other taxa increase organic matter transport and improve ecosystem services⁵. However, the gene and genome data for Sipuncula that are available in public databases are insufficient.

Despite the early recognition of the group, phylogenetic relationships between Sipuncula and other taxa are unclear. Sipunculus nudus was first described by Linnaeus in 1767 and was later considered to be a derived group of annelids^6,7,8. Morphological and developmental characteristics suggest that Sipuncula is the sister group of Mollusca⁹. However, phylogenetic analyses based on mitochondrial DNA sequences as well as traits related to nervous and muscle system development indicate that Sipuncula is more closely related to Annelida than to Mollusca^10,11. Struck et al. performed phylogenomic analyses using 47,953 amino acid positions to explore the relationships among 34 annelid taxa and found that Sipuncula belongs to Annelida¹². Given these conflicting lines of evidence, the assignment of Sipuncula to Annelida remains controversial. Furthermore, the lack of segments in Sipuncula, which distinguishes it from other annelid taxa, provides a basis for understanding the mechanism underlying segment development. Genome sequence information is important for phylogenetic analyses. However, sequencing data for molluscs and annelids are limited. In Sipuncula, only one draft genome, that of Phascolosoma esculenta, has been published (Zhong et al.¹³). The genome data suggested that Sipuncula belonged to Annelida; however, the evolutionary relationships among Polychaeta, Oligochaeta, and Hirudinea in their reconstructed phylogenetic tree were inconsistent with previous results, making evolutionary inferences difficult. Therefore, additional genome data for Sipuncula, especially chromosome-level genome data, are needed to clarify the evolutionary relationships of lophotrochozoans and to provide genomic resources for “evo-devo” studies of body segmentation.

S. nudus is a cosmopolitan Sipuncula species that is distributed in temperate, subtropical, and tropical waters in all oceans (Fig. 1). In this study, we assembled the first high-quality genome of S. nudus using PacBio HiFi sequencing and high-throughput chromosome conformation capture (Hi-C). We used HiFi reads for assembly and Hi-C technology for chromosome anchoring. We obtained a contig N50 of 29.46 Mb and a scaffold N50 of 80.87 Mb for the final genome assembly, which is approximately 1,427 Mb. Using Hi-C data, 97.91% of the assembled bases were anchored to 17 chromosomes. These high-quality genomic data are expected to improve the resolution of phylogenetic analyses of Sipuncula and to provide a reference for detailed analyses of their characteristics, adaptation to complex habitats, and ecological niches.

Methods

Sample collection and DNA extraction

Two-year-old male S. nudus specimens were collected in the field at Suixi, Zhanjiang, Guangdong Province, China (21°35′N, 109°81′E), and were used for whole-genome sequencing. The body wall tissue was stored in liquid nitrogen, and total genomic DNA was isolated using the QIAGEN DNeasy Blood & Tissue Kit (QIAGEN, Shanghai, China) following the manufacturer’s instructions.

Library construction and sequencing

Three SMRTbell libraries for circular consensus sequencing (CCS) were constructed according to the standard PacBio protocol using 15–20 kb preparation solutions (Pacific Biosciences, Menlo Park, CA, USA). Five cells were sequenced on the PacBio Sequel II platform in CCS mode (Pacific Biosciences) to generate HiFi (high-fidelity) reads. The reads were produced by calling a consensus from the subreads generated by multiple passes of the enzyme around a circularized template, which yields reads that are both long and accurate. In total, 103.13 Gb of HiFi reads with 72.63× coverage was generated, and the read N50 was 14,008 bp (Table 1).
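The two summary statistics quoted above, read N50 and sequencing coverage, can be illustrated with a minimal sketch. The function names and the toy read lengths below are hypothetical and are not taken from the study; the sketch only shows how such values are conventionally derived from a list of read lengths.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def coverage(total_bases, genome_size):
    """Sequencing depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


# Toy read lengths (hypothetical values, not data from the study).
toy_reads = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_reads)
toy_depth = coverage(sum(toy_reads), genome_size=10)
```

The same `n50` function applies unchanged to contig or scaffold lengths, which is how the assembly N50 values reported later are defined.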

Table 1 HiFi sequencing data statistics.

Hi-C libraries were prepared as previously reported¹⁴. The body wall tissue cells were fixed by using formaldehyde to keep the 3D structure of DNA intact. Cells were digested with the HindIII restriction endonuclease. Biotin-labelled bases were used for end repair. The DNA fragments maintaining interaction relationships were captured to construct the Hi-C library. Finally, 289.30 Gb of high-quality Hi-C data (Q20 > 98% and Q30 > 94%) was obtained with the BGISEQ-500 sequencing platform (Table 2).

Table 2 Hi-C sequencing data statistics.

Genome survey and assembly

The size, heterozygosity, and repeat rate of the S. nudus genome were estimated using the k-mer frequency method. Jellyfish¹⁵ and GenomeScope v.1.0¹⁶ were used to calculate the k-mer frequency distribution (k = 21) from the HiFi reads, and the genome size was estimated to be 1,305 Mb with a peak k-mer depth of 66×. The heterozygosity and repeat rate were 2.03% and 39.68%, respectively (Fig. 2). We first assembled the genome from the HiFi reads using hifiasm (v0.15.1)¹⁷ with default parameters. After preliminary assembly, we used purge_haplotigs¹⁸ to remove haplotigs. The resulting haploid assembly was 1426.68 Mb, with a contig N50 of 29.46 Mb (Fig. 3 and Table 3).
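The idea behind the k-mer frequency method can be sketched as follows: the genome size is approximated by the total number of k-mers divided by the depth of the main peak in the k-mer histogram. This is a simplified illustration under stated assumptions, not the model that GenomeScope actually fits (which also accounts for heterozygosity, repeats and sequencing errors). The function names, the `min_depth` cutoff and the toy reads are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(counts, min_depth=2):
    """Estimate genome size as (total k-mers) / (peak k-mer depth).

    k-mers seen fewer than min_depth times are treated as likely errors
    and ignored (an assumption of this sketch).
    """
    histogram = Counter(counts.values())  # depth -> number of distinct k-mers
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy example: one 10-bp "genome" sequenced three times without errors.
toy_reads = ["ACGTTGCAAT"] * 3
size, peak = estimate_genome_size(kmer_counts(toy_reads, k=3))
```

In the toy example the estimate equals the number of k-mer positions in the 10-bp sequence (10 - 3 + 1 = 8), which approaches the true genome length when the genome is much longer than k.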

Table 3 Genome assembly statistics using PacBio HiFi reads and Hi-C data.

The contigs were anchored to chromosomes using Hi-C data. Juicer (version 1.6)¹⁹ was used to align the paired-end Hi-C reads against the assembled genome and to evaluate the Hi-C library. The 3D-DNA pipeline²⁰, run under default parameters without breaking contigs, was used to generate the final chromosome-level scaffolds. Manual checking and refinement of the draft assembly were carried out via Juicebox Assembly Tools (https://github.com/aidenlab/Juicebox, v1.1108). A heatmap of the Hi-C interaction bins supported the quality of the chromosome-level assembly (Fig. 4). The length of the final assembled genome was 1,426,776,655 bp, with a contig N50 of 29,460,569 bp and scaffold N50 of 80,869,746 bp (Table 3 and Fig. 3). Approximately 1,397 Mb (97.91%) of the contig sequences were anchored to 17 chromosomes (Table 4), which is consistent with the karyotype reported in our previous publication²¹. We aligned the HiFi reads to the assembly with minimap2 (v2.17, parameters: -a -x map-pb)²² and used BamDeal (https://github.com/BGI-shenzhen/BamDeal) to evaluate the mapping rate and coverage, obtaining estimates of 99.95% and 99.73%, respectively. The CIRCOS tool²³ was used to visualize the 17 chromosomes, GC content, read depth and mapping depth (Fig. 5). The average depth of each chromosome was calculated and is shown in Fig. 6. All 17 chromosomes had comparable sequencing depth, and no whole chromosome showed half the read depth of the others. This suggests that the assembled chromosomes of S. nudus do not include differentiated XY- or ZW-type sex chromosomes. When GC content and average read depth were calculated in 20-kb nonoverlapping windows along the chromosomes, a small cluster of windows (581 windows, 11.6 Mb in total) exhibited relatively high GC content (>48%) but normal sequencing depth (Fig. 7).
These high-GC blocks were extracted and aligned to the NT database (Nucleotide Sequence Database) using MegaBlast (parameter: -e 1e-5), and alignments with identity >90% and alignment length >100 bp were retained. The matched reference species were grouped into three categories: S. nudus, other invertebrates, and all other species. All 228 matched sequences were assigned to S. nudus or other invertebrate species (Fig. 8), which indicates that the high-GC, normal-depth blocks originated from S. nudus rather than from contamination or cobionts.
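The window-based GC screen described above can be sketched as follows. This is a minimal illustration that assumes the chromosome is available as a plain string; the function names and the toy sequence are hypothetical, while the 20-kb default window and the 48% threshold follow the text. Read depth per window, which the study evaluated alongside GC content, is not modelled here.

```python
def gc_windows(sequence, window=20_000):
    """Yield (start, end, gc_fraction) for non-overlapping windows.

    GC content is computed over called bases (A, C, G, T) only;
    windows without any called base are skipped.
    """
    seq = sequence.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        called = sum(chunk.count(base) for base in "ACGT")
        if called == 0:
            continue
        gc = (chunk.count("G") + chunk.count("C")) / called
        yield start, start + len(chunk), gc


def high_gc_windows(sequence, window=20_000, threshold=0.48):
    """Return the windows whose GC fraction exceeds the threshold."""
    return [w for w in gc_windows(sequence, window) if w[2] > threshold]


# Toy sequence with 4-bp windows (hypothetical, for illustration only).
toy = "ATAT" + "GCGC" + "ATGC" + "NNNN"
hits = high_gc_windows(toy, window=4)
```

As a consistency check on the reported figures, 581 windows of 20 kb amount to 11.62 Mb, in line with the total of 11.6 Mb given above.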

Table 4 Genome chromosome length statistics.

Repeat annotation

Prior to gene prediction, we identified the repetitive elements in the genome of S. nudus by using a combination of homology-based and ab initio-based methods. To identify tandem repeats, we used Tandem Repeats Finder v4.09²⁴. For the homology-based method, transposable elements were identified by RepeatMasker v4.0.7 (-nolow -no_is -norna -engine ncbi -parallel 1) and RepeatProteinMask v4.0.7 (-engine ncbi -noLowSimple -pvalue 0.0001)²⁵ against the TE protein databases and RepBase library v21.12²⁶. For the ab initio-based method, LTR_FINDER v1.06²⁷ and RepeatModeler v1.0.8 (http://repeatmasker.org/RepeatModeler/) with default parameters were used to build the de novo library before RepeatMasker v4.0.7 was used to classify the different categories of repetitive elements against this library. The final set of repetitive elements was obtained by integrating the results of these methods according to sequence overlap, revealing that 47.91% of the genome consists of repetitive elements (Tables 5, 6; Fig. 5).
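Integrating annotations "according to sequence overlap" amounts to taking the union of the intervals reported by the individual tools, so that bases annotated by several tools are counted once. The sketch below illustrates this for a single chromosome with half-open (start, end) coordinates; the function names and toy intervals are hypothetical, and the actual integration procedure used in the study is not specified beyond the description above.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def repeat_fraction(intervals, genome_length):
    """Fraction of the sequence covered by at least one repeat annotation."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


# Toy annotations from several tools on a 100-bp sequence (hypothetical).
toy_annotations = [(0, 10), (5, 15), (20, 30)]
toy_fraction = repeat_fraction(toy_annotations, genome_length=100)
```

Summing the raw interval lengths instead (10 + 10 + 10 = 30 bp) would overstate the repeat content, which is why the union is taken first.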

Table 5 Genome repetitive element statistics.

Table 6 TE type statistics.

Gene prediction

Gene annotation was performed by integrating homology-, de novo- and transcriptome-based information. We used the annotation data from three related species (Caenorhabditis elegans, Capitella teleta, and Helobdella robusta) for homology prediction. The MAKER tool²⁸ was used to integrate the annotation data from the three related species and the transcriptome data from S. nudus. Based on the AED values from MAKER, 2,000 genes with complete structures were selected and used to train the de novo prediction tools Augustus²⁹ and Snap³⁰ to construct de novo models. Finally, all data were integrated using MAKER²⁸. The final comprehensive gene set contained 28,749 genes (Table 7).

Table 7 General statistics of predicted protein-coding genes.

Gene function annotation

Gene function annotation was performed based on sequence similarity and domain conservation. First, the protein-coding genes of S. nudus were aligned against the KEGG³¹, SwissProt³², TrEMBL³³, GO³⁴, KOG (ftp://ftp.ncbi.nih.gov/pub/COG/KOG/), and Nr (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz) databases by using BLASTP with an E-value threshold of 1e-5, and the best match from each alignment was used to predict gene function. Second, InterProScan (51.0–55.0)³⁵ searches against the following databases were used to identify motifs and domains: PANTHER³⁶, Pfam³⁷, PRINTS³⁸, ProDom³⁹, SUPERFAMILY⁴⁰, and SMART⁴¹. In total, 88.75% of the predicted genes were functionally annotated (Table 8).
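The "best match" selection described above can be sketched as follows, assuming the BLASTP results are in the standard 12-column tabular format (BLAST+ `-outfmt 6`, with the E-value in column 11 and the bit score in column 12). The output format and the use of bit score to rank hits are assumptions of this sketch, since the study does not state them; the function name and the toy hit lines are hypothetical.

```python
def best_hits(lines, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} with the highest-scoring
    hit per query among hits passing the E-value threshold."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy tabular hits (hypothetical identifiers and values).
toy_hits = [
    "gene1\tprotA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300",
    "gene1\tprotB\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-80\t410",
    "gene2\tprotC\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t25",
]
annotated = best_hits(toy_hits)
```

In the toy input, `gene2` has no hit below the threshold and therefore receives no annotation, which mirrors how the proportion of functionally annotated genes falls below 100%.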

Table 8 Functional annotation statistics.

Data Records

The National Center for Biotechnology Information (NCBI) BioProject accession number for the sequences reported in this paper is PRJNA901211. The raw data from HiFi and Hi-C sequencing were submitted to the NCBI SRA (accession number SRP408321; https://identifiers.org/ncbi/insdc.sra:SRP408321)⁴² and deposited in the CNGB Sequence Archive (CNSA) of the China National GeneBank DataBase (CNGBdb) (accession number CNR0640303-CNR0640323; https://db.cngb.org/search/project/CNP0003624/)⁴³. The assembled genome sequence was deposited in NCBI under accession number JAPPUL000000000⁴⁴. The assembled genome, gene structure annotation, repeat predictions, gene function annotation, KEGG analysis of expanded genes and positively selected gene data were deposited in the China National GeneBank DataBase (CNGBdb) under the project with accession number CNP0003624.

Technical Validation

Genome assembly and gene prediction quality assessment

The BUSCO (v5)⁴⁵ pipeline was used to evaluate the completeness of the genome assembly and gene set against the 255 conserved genes in eukaryota_odb10 (creation date: 2020-09-10, number of genomes: 70, number of BUSCOs: 255). In total, 97.7% of the 255 expected conserved genes were identified as complete, and 2% were identified as fragmented (Table 9). Furthermore, we used minimap2 (v2.17, parameters: -a -x map-pb)²² to align the HiFi reads to the assembly, and the mapping rate and coverage rate were estimated to be 99.95% and 99.73%, respectively. Together, the BUSCO and alignment results indicate high completeness and accuracy of the genome assembly.

Table 9 Evaluation of genome assembly completeness.

Comparative genomic analysis

The protein-coding genes of S. nudus and 15 additional species were used to identify orthologous gene groups. The reference protein sequences of the following 15 species were obtained: Caenorhabditis elegans (Ensembl Release 10), Danio rerio (Ensembl Release 10), Homo sapiens (Ensembl Release 10), Drosophila melanogaster (Ensembl Release 10), Capitella teleta (NCBI: GCA_000328365.1), Crassostrea gigas (NCBI: GCF_902806645.1), Dimorphilus gyrociliatus (NCBI: GCA_904063045.1), Eisenia andrei (ngdc.cncb.ac.cn: PRJCA002327), Helobdella robusta (NCBI: GCF_000326865.1), Lamellibrachia satsuma (NCBI: GCA_022478865.1), Lottia gigantea (NCBI: GCF_000327385.1), Metaphire vulgaris (NCBI: GCA_018105865.1), Owenia fusiformis (NCBI: GCA_903813345.2), Phascolosoma esculenta (https://figshare.com/: PRJNA819496), and Nematostella vectensis (NCBI: GCF_932526225.1) as the outgroup. To perform the gene family analysis, orthogroups of the 16 species were identified using OrthoFinder (v2.3.11) with default parameters⁴⁶. After analysis of the gene family, 416,469 genes from the 16 species were grouped into 30,677 gene families. The results revealed that 717 gene families that involved 4,217 genes were unique in S. nudus. The gene families and genome statistics of all the species are shown in Table 10. Among the orthologous genes in the 16 species, a total of 255 single-copy genes were identified. The single-copy orthologues were aligned using MUSCLE (v3.7)⁴⁷ with default parameters, and then the aligned protein sequences were reverse translated into codon sequences. The alignments were then concatenated to generate a superalignment matrix for phylogenetic reconstruction based on the maximum-likelihood (ML) method using IQ-TREE (v1.6.12)⁴⁸, with the best-fit evolutionary substitution model being determined using ModelFinder⁴⁹. 
Divergence times for each node in the phylogenetic tree were estimated using MCMCtree, which is implemented in the PAML package v4.8a⁵⁰, under the following parameters: -nsample 100000, -rootage 800, and -burnin 500000. The calibration times were obtained from TimeTree⁵¹: 630.0–830.0 million years ago (Ma) for Caenorhabditis elegans and Homo sapiens, 424.2–440.0 Ma for Danio rerio and Homo sapiens, and 545.0–681.5 Ma for Capitella teleta and Crassostrea gigas. The phylogenetic tree representing the evolutionary relationships among Mollusca, Annelida and Sipuncula is shown in Fig. 9. Gene collinearity, which reflects the preservation of ancestral genome structure in modern genomes, is an important means of studying genome evolution. Thus, MCscan (Python version)⁵² was used for the genomic synteny analysis among S. nudus, O. fusiformis and P. esculenta. The collinearity figure was drawn with JCVI (https://github.com/tanghaibao/jcvi) from homologous blocks containing ≥4 collinear gene pairs between species (Fig. 10). Regarding intergenomic gene collinearity, 109 blocks containing 508 collinear gene pairs were found between S. nudus and O. fusiformis, and 622 blocks containing 3,248 collinear gene pairs were found between S. nudus and P. esculenta, indicating that collinearity is better conserved between the two Sipuncula species.
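The concatenation of single-copy orthologue alignments into a superalignment matrix, as described for the phylogenetic reconstruction, can be sketched as follows. Each alignment is represented as a dictionary mapping a taxon name to its aligned sequence; the function name, the gap-padding of absent taxa and the toy data are hypothetical choices for illustration (with strict single-copy orthologues every taxon is present in every alignment).

```python
def concatenate_alignments(alignments, taxa):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: list of dicts {taxon: aligned_sequence}
    taxa: ordered list of taxon names to include
    Taxa missing from an alignment are padded with gaps ('-').
    """
    parts = {taxon: [] for taxon in taxa}
    for alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(alignment.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy codon alignments for two genes and two taxa (hypothetical).
toy_genes = [
    {"taxonA": "ATGAAA", "taxonB": "ATG---"},
    {"taxonA": "CCC", "taxonB": "CCG"},
]
supermatrix = concatenate_alignments(toy_genes, ["taxonA", "taxonB"])
```

The resulting dictionary can be written out in FASTA or PHYLIP format as input for a maximum-likelihood program.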

Table 10 The gene family statistics.

The time-calibrated phylogenetic tree was used to assess gene family expansions and contractions using CAFE 4.2.1⁵³ with a random birth-and-death model with lambda. In total, 543 and 97 significantly expanded and contracted gene families were identified, respectively (P < 0.05). GO and KEGG enrichment analyses of the expanded gene families revealed that these families were mainly involved in pathways that are related to apoptosis, detoxification, the immune response, amino acid and fatty acid metabolism anion, oxidative stress, and energy metabolism.

Positively selected genes (PSGs) were predicted using branch-site likelihood ratio tests for single-copy gene families with a conservative 10% false discovery rate (FDR) criterion⁵⁴. We used proteins from S. nudus, C. teleta, E. andrei, L. satsuma, O. fusiformis, and P. esculenta to extract 3,192 one-to-one orthologous genes using the OrthoFinder (v2.3.11) pipeline. The one-to-one orthologous genes were then used to generate multiple sequence alignments with PRANK (v. 121002)⁵⁵. The d_N/d_S ratios of the codons were calculated using the branch-site model of Codeml in the PAML package⁵⁰, in which S. nudus was set as the foreground branch and the other five taxa as background branches. Using a likelihood ratio test (LRT) P-value of ≤0.05 and an FDR of ≤0.05 as thresholds, 326 PSGs were identified in the S. nudus genome. These PSGs were significantly enriched in the terms “Spliceosome,” “Base excision repair,” “DNA replication,” and “Cell cycle” in the KEGG pathway enrichment analysis.
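The statistical step of the branch-site test can be sketched as follows. The sketch assumes that the test statistic 2Δℓ is compared with a chi-square distribution with one degree of freedom and that the FDR is controlled with the Benjamini-Hochberg procedure; both are common choices but are assumptions here, as the study does not specify them. The function names and the toy values are hypothetical, and the log-likelihoods would in practice come from the null and alternative Codeml runs.

```python
import math


def lrt_pvalue(lnl_null, lnl_alt):
    """P-value of a likelihood ratio test against chi-square with 1 d.f."""
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of chi-square (1 d.f.): erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy p-values for four genes (hypothetical).
toy_p = [0.01, 0.04, 0.03, 0.20]
toy_q = benjamini_hochberg(toy_p)
selected = [i for i, q in enumerate(toy_q) if toy_p[i] <= 0.05 and q <= 0.05]
```

With the toy values, three genes pass the raw P-value threshold but only the first also passes the FDR threshold, which shows why the two criteria are applied together.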

In summary, we obtained a high-quality chromosome-level genome of S. nudus, which contributes to our understanding of the evolutionary status of Sipuncula and the evolutionary relationships among the subgroups of the phylum Annelida. Gene family expansion and contraction and genomic synteny analyses revealed potential mechanisms by which Sipuncula adapts to different living environments.

Usage Notes

All analyses were run on Linux systems, and the optimal parameters are given in the Code availability section.

Code availability

No specific code or script was used in this work. Commands used for data processing were all executed according to the manuals and protocols of the corresponding software.

References

Cutler, E. B. The Sipuncula: Their Systematics, Biology, And Evolution (New York: Cornell University Press, https://doi.org/10.7591/9781501723643, 1994)
Nielsen, C. Some aspects of spiralian development. Acta Zool. 91, 20–28, https://doi.org/10.1111/j.1463-6395.2009.00421.x (2010).
Huang, D. Y., Chen, J. Y., Vannier, J. & Saiz Salinas, J. I. Early Cambrian sipunculan worms from southwest China. Proceedings of the Royal Society of London. Series B: Biological Sciences 271, 1671–1676, https://doi.org/10.1098/rspb.2004.2774 (2004).
Li, J. et al. Experimental study of bioturbation by Sipunculus nudus in a polyculture system. Aquaculture 437, 175–181, https://doi.org/10.1016/j.aquaculture.2014.12.002 (2015).
Li, J. et al. Bioturbation of peanut worms Sipunculus nudus on the composition of prokaryotic communities in a tidal flat as revealed by 16S rRNA gene sequences. MicrobiologyOpen 8, 1–12, https://doi.org/10.1002/mbo3.802 (2019).
Linnaeus, C. Systema Naturae Per Regna Tria Naturae, Secundum Classes, Ordines, Genera, Species, Cum Characteribus Differentiis, Synonymis, Locis (Editio duodecima, reformata. Holmiae, 1767).
Delle Chiaie, S. Memorie Sulla Storia Enotomia Degli Animali Senza Vertebre Del Regno Di Napoli Vol. 1–4 (Dalla stamperia de’Fratelli Fernandes, 1823).
Hyman, L. H. The Protostomatous Coelomates-Phylum Sipunculida. The Invertebrates. (New York: McGraw-Hill Press, 1959).
Scheltema, A. H. Aplacophora as progenetic aculiferans and the coelomate origin of mollusks as the sister taxon of Sipuncula. Biol Bull. 184, 57–78, https://doi.org/10.2307/1542380 (1993).
Shen, X., Ma, X., Ren, J. & Zhao, F. A close phylogenetic relationship between Sipuncula and Annelida evidenced from the complete mitochondrial genome sequence of Phascolosoma esculenta. BMC genomics 10, 1–11, https://doi.org/10.1186/1471-2164-10-136 (2009).
Wanninger, A., Koop, D., Bromham, L., Noonan, E. & Degnan, B. M. Nervous and muscle system development in Phascolion strombus (Sipuncula). Dev Genes Evol 215, 509–518, https://doi.org/10.1007/s00427-005-0012-0 (2005).
Struck, T. H. et al. Phylogenomic analyses unravel annelid evolution. Nature 471, 95–98, https://doi.org/10.1038/nature09864 (2011).
Zhong, S. et al. The Draft Genome of Chinese endemic species Phascolosoma esculenta (Sipuncula, Phascolosomatidae) reveals the phylogenetic position of Sipuncula. Front. Genet. 13, 1–7, https://doi.org/10.3389/fgene.2022.910344 (2022).
Belton, J. M. et al. Hi-C: A comprehensive technique to capture the conformation of genomes. Methods 58, 268–276, https://doi.org/10.1016/j.ymeth.2012.05.001 (2012).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770, https://doi.org/10.1093/bioinformatics/btr011 (2011).
Vurture, G. W. et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics 33, 2202–2204, https://doi.org/10.1093/bioinformatics/btx153 (2017).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175, https://doi.org/10.1038/s41592-020-01056-5 (2021).
Roach, M. J., Schmidt, S. A. & Borneman, A. R. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC bioinform. 19, 1–10, https://doi.org/10.1186/s12859-018-2485-7 (2018).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell syst. 3, 95–98, https://doi.org/10.1016/j.cels.2016.07.002 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95, https://doi.org/10.1126/science.aal3327 (2017).
Wang, Q. H., Du, Y. H., Lin, S. L., Du, X. & Deng, Y. Karyotype analysis of Sipunculus nudus. Chinese Journal of Zoology 43, 95–98, https://doi.org/10.13859/j.cjz.2008.05.012 (2008).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100, https://doi.org/10.1093/bioinformatics/bty191 (2018).
Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639–1645, http://www.genome.org/cgi/doi/10.1101/gr.092759.109 (2009).
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21, i351–i358, https://doi.org/10.1093/bioinformatics/bti1018 (2005).
Bergman, C. M. & Quesneville, H. Discovering and detecting transposable elements in genome sequences. Brief. bioinformatics 8, 382–392, https://doi.org/10.1093/bib/bbm048 (2007).
Jurka, J. et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110, 462–467, https://doi.org/10.1159/000084979 (2005).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268, https://doi.org/10.1093/nar/gkm286 (2007).
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinform. 12, 1–14, http://www.biomedcentral.com/1471-2105/12/491 (2011).
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, W435–W439, https://doi.org/10.1093/nar/gkl200 (2006).
Johnson, A. D. et al. SNAP: a web-based tool for identification and annotation of proxy SNPs using HapMap. Bioinformatics 24, 2938–2939, https://doi.org/10.1093/bioinformatics/btn564 (2008).
Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y. & Hattori, M. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32, D277–D280, https://doi.org/10.1093/nar/gkh063 (2004).
Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365–370, https://doi.org/10.1093/nar/gkg095 (2003).
Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28, 45–48, https://doi.org/10.1093/nar/28.1.45 (2000).
Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 25, 25–29, https://doi.org/10.1038/75556 (2000).
Jones, P. et al. InterProScan 5: genome scale protein function classification. Bioinformatics 30, 1236–1240, https://doi.org/10.1093/bioinformatics/btu031 (2014).
Mi, H. et al. PANTHER version 11: expanded annotation data from gene ontology and reactome pathways, and data analysis tool enhancements. Nucleic Acids Res. 45, D183–D189, https://doi.org/10.1093/nar/gkw1138 (2017).
Bateman, A. et al. The Pfam Protein Families Database. Nucleic Acids Res. 28, 263–266, https://doi.org/10.1093/nar/28.1.263 (2000).
Attwood, T. K. et al. PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res. 28, 225–227, https://doi.org/10.1093/nar/28.1.225 (2000).
Corpet, F., Gouzy, J. & Kahn, D. Recent improvements of the ProDom database of protein domain families. Nucleic Acids Res. 27, 263–267, https://doi.org/10.1093/nar/27.1.263 (1999).
Wilson, D. et al. SUPERFAMILY–sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res. 37, D380–D386, https://doi.org/10.1093/nar/gkn762 (2009).
Schultz, J., Copley, R. R., Doerks, T., Ponting, C. P. & Bork, P. SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res. 28, 231–234, https://doi.org/10.1093/nar/28.1.231 (2000).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP408321 (2022).
Zheng, Z. Genome of Sipunculus nudus. China National GeneBank DataBase https://doi.org/10.26036/CNP0003624 (2022).
Zheng, Z. The first high-quality chromosome-level genome for the Sipuncula Sipunculus nudus using HiFi and Hi-C data. GenBank https://identifiers.org/ncbi/insdc:JAPPUL000000000 (2022).
Simao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212, https://doi.org/10.1093/bioinformatics/btv351 (2015).
Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 1–14, https://doi.org/10.1186/s13059-019-1832-y (2019).
Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797, https://doi.org/10.1093/nar/gkh340 (2004).
Nguyen, L. T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274, https://doi.org/10.1093/molbev/msu300 (2015).
Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589, https://doi.org/10.1038/nmeth.4285 (2017).
Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591, https://doi.org/10.1093/molbev/msm088 (2007).
Kumar, S., Stecher, G., Suleski, M. & Hedges, S. B. TimeTree: A resource for timelines, timetrees, and divergence times. Mol. Biol. Evol. 34, 1812–1819, https://doi.org/10.1093/molbev/msx116 (2017).
Tang, H. et al. Synteny and collinearity in plant genomes. Science 320, 486–488, https://doi.org/10.1126/science.1153917 (2008).
Han, M. V., Thomas, G. W. C., Lugo-Martinez, J. & Hahn, M. W. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Mol. Biol. Evol. 30, 1987–1997, https://doi.org/10.1093/molbev/mst100 (2013).
Nielsen, R. et al. A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biol. 3, e170, https://doi.org/10.1371/journal.pbio.0030170 (2005).
Löytynoja, A. & Goldman, N. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320, 1632–1635, https://doi.org/10.1126/science.1158395 (2008).

Acknowledgements

This work was supported by the Science and Technology Department of Guangdong Province (Nos. 2021A05190 and 163-2019-XMZC-0009-02-0059).

Author information

These authors contributed equally: Zhe Zheng, Zhuoxin Lai, Bin Wu.

Authors and Affiliations

Fisheries College, Guangdong Ocean University, Zhanjiang, 524088, Guangdong, China
Zhe Zheng, Zhuoxin Lai, Xinlin Song, Ruzhuo Zhong, Jiawei Zhang, Yongshan Liao, Chuangye Yang, Yuewen Deng & Qingheng Wang
Guangdong Provincial Key Laboratory of Aquatic Animal Disease Control and Healthy Culture, Zhanjiang, 524088, Guangdong, China
Zhe Zheng, Yongshan Liao, Chuangye Yang, Yuewen Deng & Qingheng Wang
BGI-Shenzhen, Shenzhen, 518083, Guangdong, China
Bin Wu, Wei Zhao, Junpu Mei & Jianbo Jian
BGI-Sanya, BGI-Shenzhen, Sanya, 572025, Hainan, China
Junpu Mei & Zhen Yue

Authors

Zhe Zheng
Zhuoxin Lai
Bin Wu
Xinlin Song
Wei Zhao
Ruzhuo Zhong
Jiawei Zhang
Yongshan Liao
Chuangye Yang
Yuewen Deng
Junpu Mei
Zhen Yue
Jianbo Jian
Qingheng Wang

Contributions

Zheng Z., Wang Q.H. and Deng Y.W. designed the study; Zheng Z., Lai Z.X., Wu B., Song X.L. and Zhao W. performed genome sequencing, data processing, and genome analysis; Zhang J.W., Jian J.B., Yue Z. and Zhong R.Z. performed the assembly quality validation and improved gene annotation; Liao Y.S. and Yang C.Y. collected and prepared the samples; Zheng Z., Lai Z.X. and Wu B. wrote the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Qingheng Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zheng, Z., Lai, Z., Wu, B. et al. The first high-quality chromosome-level genome of the Sipuncula Sipunculus nudus using HiFi and Hi-C data. Sci Data 10, 317 (2023). https://doi.org/10.1038/s41597-023-02235-7

Received: 19 December 2022
Accepted: 15 May 2023
Published: 25 May 2023
DOI: https://doi.org/10.1038/s41597-023-02235-7