Whole genome sequencing of a novel sea anemone (Actinostola sp.) from a deep-sea hydrothermal vent

Liu, Chang; Bian, Chao; Gao, Qiang; Gao, Zijian; Huang, Yu; Wang, Lingling; Shi, Qiong; Song, Linsheng

doi:10.1038/s41597-024-02944-7

Data Descriptor
Open access
Published: 22 January 2024

Chang Liu^1,na1,
Chao Bian^2,na1 (ORCID: orcid.org/0000-0001-9904-721X),
Qiang Gao^1,3,
Zijian Gao^4,
Yu Huang^4 (ORCID: orcid.org/0000-0002-3875-7581),
Lingling Wang^1,3,
Qiong Shi^2,4 (ORCID: orcid.org/0000-0002-6358-976X) &
Linsheng Song^1,3

Scientific Data volume 11, Article number: 102 (2024)

Abstract

Deep-sea hydrothermal vents are usually considered extreme environments, with high pressure, high temperature, scarce food, and chemical toxicity, yet many local inhabitants have evolved special adaptive mechanisms for living in this representative ecosystem. In this study, we constructed a high-quality genome assembly for a novel deep-sea anemone species (Actinostola sp.) that was collected at a depth of 2,971 m in an Edmond vent along the central Indian Ocean ridge. The assembly has a total size of 424.3 Mb and a scaffold N50 of 383 kb, and contains 265 Mb of repetitive sequences and 20,812 protein-coding genes. Taken together, our reference genome provides a valuable genetic resource for exploring the evolution and adaptive clues of this deep-sea anemone.

Background & Summary

Deep-sea hydrothermal vents are a representative ecosystem in which hot, chemical-rich fluids exit the seafloor through black smoker chimneys¹. These vents are considered extremely harsh environments, with high pressure, high temperature, low oxygen, and high concentrations of methane (CH₄), heavy metals and hydrogen sulfide (H₂S)^2,3. Many species live within and around these hydrothermal vents, including various crabs, shrimps, fishes and octopuses, as well as diverse sessile creatures such as sea anemones, barnacles, and tube worms^4,5. These organisms have attracted considerable interest from developers of drugs, enzymes, cosmetics, biofuels, and other products. However, the genetic basis of evolution and adaptation in deep-sea hydrothermal vent animals remains poorly understood.

Sea anemones, a group of primitive cnidarians, are widely distributed across the full range of ocean depths⁶. Their unique adaptive strategies help them live in a variety of marine habitats, from shallow waters to deep-sea trenches. During a recent expedition, an anemone (Fig. 1a) was collected at a depth of 2,971 m from hydrothermal vents in the Indian Ocean (E60.5, N6.4). In this area, Actinostolidae anemones were reported in previous research to show the highest abundance⁷. Morphological and molecular analyses suggest that this deep-sea anemone belongs to the genus Actinostola. Here, whole genome sequencing was performed to construct a high-quality genome assembly for this newfound Actinostola sp., which will help to elucidate adaptive clues to deep-sea hydrothermal environments.

A total of 44.23 Gb of paired-end reads produced by an Illumina sequencing platform were used for a genome survey (Fig. 1b). The 17-mer depth with the highest frequency (the peak depth) was 40, and the total number of 17-mers was 19,503,242,454. The estimated genome size of Actinostola sp. was therefore about 487 Mb. Meanwhile, the heterozygosity rate of this genome was predicted to be 0.9% (see more details in Fig. 1b).

A 424.3-Mb draft genome was subsequently assembled from 112.37 Gb of long reads generated on a PacBio sequencing platform and 26.10 Gb of short reads generated on an Illumina HiSeq X Ten platform, with a contig N50 of 373 kb, a scaffold N50 of 383 kb and a GC content of 38.7% (Table 1). The BUSCO (Benchmarking Universal Single-Copy Orthologs) method was applied to evaluate the completeness of the assembled genome, using the eukaryota_odb9 database as the reference. In total, 252 (83.2%) of the BUSCO core genes were identified as complete.
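
The N50 values quoted above summarize assembly contiguity: N50 is the length of the sequence at which, going from longest to shortest, the running total first reaches half of the assembly size. The following is a minimal sketch of that definition, not the tool the authors used; the contig lengths are hypothetical toy values chosen only to make the calculation easy to follow.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running total to at least
        # half of the assembly size defines the N50.
        if running >= half_total:
            return length


# Hypothetical contig lengths (bp), not taken from the assembly.
toy_contigs = [500_000, 400_000, 300_000, 200_000, 100_000]
print(n50(toy_contigs))  # 400000
```

Applied to the real contig or scaffold lengths of an assembly, the same function would return the contig N50 or scaffold N50, respectively.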

Table 1 Summary of the genome assembly for the sequenced Actinostola sp.

In the repeat annotation, a total of 265 Mb, covering 62.4% of the assembled genome, was predicted to be repeat sequences. Of the genome, 25.5% (108.2 Mb) was DNA repeat elements, 8.4% (35.6 Mb) was long interspersed nuclear elements (LINE), 14.3% (60.6 Mb) was long terminal repeats (LTR), and 0.8% (3.6 Mb) was short interspersed nuclear elements (SINE). After masking these repetitive regions, we applied an integrated method of homologous sequence search and de novo gene prediction to annotate 20,812 protein-coding genes in the assembled genome. By searching four public databases including GO (Gene ontology)⁸, KEGG (Kyoto Encyclopedia of Genes and Genomes)⁹, SwissProt¹⁰ and TrEMBL¹¹, we found that 91.83% (19,111 in total) of these predicted genes were functionally annotated.
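
The repeat-class percentages above can be checked against the assembly size with simple arithmetic. The sketch below uses only the figures reported in the text (424.3 Mb assembly, per-class sizes in Mb); the function and variable names are illustrative and not part of any tool used in the study.

```python
ASSEMBLY_MB = 424.3  # total assembly size reported in the text

# Repeat class sizes (Mb) as reported in the text.
REPEAT_CLASSES_MB = {
    "DNA": 108.2,
    "LINE": 35.6,
    "LTR": 60.6,
    "SINE": 3.6,
}


def percent_of_assembly(size_mb, assembly_mb=ASSEMBLY_MB):
    """Express a size in Mb as a percentage of the assembly size."""
    return 100.0 * size_mb / assembly_mb


for name, size_mb in REPEAT_CLASSES_MB.items():
    print(f"{name}: {percent_of_assembly(size_mb):.1f}%")

# Total repeat content; 265 Mb is itself a rounded figure, so the result
# may differ from the reported percentage in the last digit.
print(f"all repeats: {percent_of_assembly(265.0):.2f}%")
```

The four listed classes do not sum to the 265-Mb total, which is expected if the total also includes repeat types that are not itemized in the text.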

The coding sequences (CDS) predicted from the assembled genomes of Actinostola sp. (this study) and seven other representative species (Fig. 1c) were used for clustering of gene families. The 20,812 protein-coding genes of Actinostola sp. were clustered into 10,327 gene families, of which 3,526 were single-copy orthologs. A phylogenetic tree (Fig. 1c) was constructed from these single-copy orthologous gene families with the maximum likelihood method, predicting that the divergence of our newfound Actinostola sp. from another sea anemone, Exaiptasia diaphana, occurred 305 million years ago (Mya). This high-quality reference genome for Actinostola sp. can also provide novel insights for enhancing wild resource conservation, discovering new functional genes, developing novel marine drugs, and elucidating special adaptive mechanisms.
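
A single-copy orthologous gene family is one in which every species in the comparison contributes exactly one gene. The sketch below illustrates that selection rule on toy data; it is not the clustering software used in the study, and the gene, family and species identifiers are hypothetical.

```python
from collections import defaultdict


def single_copy_families(gene_to_family, gene_to_species, all_species):
    """Return families with exactly one gene from every species."""
    copies = defaultdict(lambda: defaultdict(int))
    for gene, family in gene_to_family.items():
        copies[family][gene_to_species[gene]] += 1
    required = set(all_species)
    return sorted(
        family
        for family, per_species in copies.items()
        if set(per_species) == required
        and all(count == 1 for count in per_species.values())
    )


# Hypothetical genes from three placeholder species.
gene_to_species = {
    "a1": "sp_A", "a2": "sp_A", "a3": "sp_A",
    "b1": "sp_B", "b2": "sp_B",
    "c1": "sp_C", "c2": "sp_C",
}
gene_to_family = {
    "a1": "fam1", "b1": "fam1", "c1": "fam1",                # one gene per species
    "a2": "fam2", "a3": "fam2", "b2": "fam2", "c2": "fam2",  # two copies in sp_A
}
print(single_copy_families(gene_to_family, gene_to_species,
                           ["sp_A", "sp_B", "sp_C"]))  # ['fam1']
```

Only families that pass this filter are typically concatenated for phylogenetic inference, because paralogs would otherwise confound the species tree.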

Methods

Sample collection, library construction, and genome sequencing

A specimen of Actinostola sp. was collected from an Edmond vent along the central Indian Ocean ridge for whole genome sequencing. Genomic DNA (gDNA) was extracted using the QIAwave DNA Blood & Tissue Kit (Qiagen, Germantown, MD, USA). The genome was sequenced using a combination of techniques: paired-end sequencing of a library with a 500-bp insert size on an Illumina HiSeq X Ten platform (Illumina Inc., San Diego, CA, USA), and sequencing of a library with an insert size of 20 kb on a PacBio sequencing platform (Pacific Biosciences, Menlo Park, CA, USA).

Genome size estimation

The Illumina short reads were filtered with SOAPfilter v2.2¹². Clean reads were then used to estimate the Actinostola sp. genome size with a 17-mer frequency distribution analysis according to the following formula¹³: Genome Size = Kmer_num/peak_depth, where Kmer_num is the total number of 17-mers and peak_depth denotes the depth at the peak of the 17-mer frequency distribution.
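
The formula above can be sketched in a few lines. The k-mer total and the approximate genome size are the values reported in Background & Summary; the peak depth of 40 used in the example is an assumption, chosen because it reproduces the reported estimate of about 487 Mb from the reported k-mer total. The function names are illustrative only.

```python
def estimate_genome_size(kmer_num, peak_depth):
    """Genome size (bp) = total number of k-mers / peak k-mer depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return kmer_num / peak_depth


def implied_peak_depth(kmer_num, genome_size_bp):
    """Invert the formula: the peak depth implied by a given genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return kmer_num / genome_size_bp


KMER_NUM = 19_503_242_454  # total number of 17-mers reported in the text

# Assumed peak depth of 40 (see the note above).
print(f"{estimate_genome_size(KMER_NUM, 40) / 1e6:.1f} Mb")  # 487.6 Mb

# Peak depth implied by the reported estimate of about 487 Mb.
print(f"{implied_peak_depth(KMER_NUM, 487e6):.1f}")
```

Because the estimate scales inversely with the peak depth, a small shift in the chosen peak changes the estimated genome size noticeably, which is why the k-mer frequency plot (Fig. 1b) is worth inspecting directly.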

Genome assembly

Before assembly, the PacBio long reads were corrected using LoRDEC¹⁴ together with the clean Illumina short reads. After correction, DBG2OLC¹⁵ was applied to assemble the long reads into contigs with the assistance of the clean short reads. To further improve the accuracy of the assembly, two rounds of polishing were performed with different strategies. First, Racon v1.3.1¹⁶ was employed to polish the contigs based on the uncorrected PacBio long reads. Second, the clean short reads were used to polish the contigs with Pilon¹⁷. After redundancy from heterozygous regions was reduced with Redundans¹⁸, we obtained a polished genome assembly for the sequenced Actinostola sp. BUSCO¹⁹ v5.22 was used to quantify the completeness of this assembly, with the eukaryota_odb9 database as the reference.

Genome annotation

We predicted repeat elements by combining de novo and homology-based annotation. For de novo prediction, RepeatModeler²⁰ and LTR_FINDER²¹ were each employed to build a repeat library; the two libraries were then combined and aligned to the assembled genome with RepeatMasker²². For homology-based prediction, a known repeat library (Repbase²³) was used to identify repeats with RepeatMasker and RepeatProteinMask²². Tandem repeats were detected using Tandem Repeats Finder²⁴. Finally, by integrating the results of both methods, a nonredundant set of repeat elements was obtained.
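
One common way to obtain a nonredundant repeat set from several predictors is to merge overlapping intervals so that each base is counted once. The source does not describe how the integration was done, so the sketch below is only an illustration of that general idea, using hypothetical half-open coordinates on a single scaffold.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def covered_bases(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


# Hypothetical repeat calls on one scaffold from three sources.
de_novo = [(100, 500), (900, 1200)]
homology = [(400, 700), (1500, 1600)]
tandem = [(1150, 1250)]

calls = de_novo + homology + tandem
print(merge_intervals(calls))  # [(100, 700), (900, 1250), (1500, 1600)]
print(covered_bases(calls))    # 1050
```

Summing the merged lengths across all scaffolds and dividing by the assembly size would give a nonredundant repeat fraction of the kind reported in Background & Summary.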

To predict protein-coding genes, protein sequences from nine representative species, namely California sea hare (Aplysia californica), nematode (Caenorhabditis elegans), sacoglossan sea slug (Elysia chlorotica), limpet (Lottia gigantea), two-spot octopus (Octopus bimaculoides), invasive apple snail (Pomacea canaliculata), glass anemone (Exaiptasia pallida), starlet sea anemone (Nematostella vectensis), and human (Homo sapiens), were downloaded from Ensembl²⁵ and mapped to our assembled genome with TBLASTn²⁶. Subsequently, gene structures were predicted by GeneWise²⁷. Finally, we integrated all of these predictions using MAKER²⁸ to obtain a consistent gene set.

For functional annotation, BLASTp²⁹ was applied to align the predicted protein sequences against four public databases (including SwissProt¹⁰, TrEMBL¹⁰, KEGG³⁰ and InterPro⁸), and then these results were retrieved to obtain GO³¹ terms.

Data Records

Our final assembly and annotation data have been deposited at the NCBI under accession number JAUJYZ000000000³². Protein sequences and gene coding sequences have been uploaded to the figshare repository for public access³³. The raw PacBio and Illumina sequencing reads have also been deposited at the NCBI under accession numbers SRR25988563–SRR25988567³⁴.

Technical Validation

The genome assembly was 424.3 Mb with a scaffold N50 of 383 kb. For quantitative assessment of this genome assembly, we showed that 83.2% of the reference BUSCO genes (eukaryota_odb9) were successfully identified in the final genome assembly version, indicating that this Actinostola sp. genome assembly is largely complete.
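
The BUSCO completeness figure is the number of complete BUSCO genes divided by the number of genes in the reference set. The count of 252 complete genes is reported in Background & Summary; the reference set size of 303 is an assumption not stated in the source (it is the size of the eukaryota_odb9 lineage dataset as distributed with BUSCO). The function name is illustrative.

```python
def busco_complete_percent(complete, total):
    """Percentage of reference BUSCO genes recovered as complete."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= complete <= total:
        raise ValueError("complete must be between 0 and total")
    return 100.0 * complete / total


# 252 complete BUSCOs (reported); 303 reference genes (assumed, see above).
print(f"{busco_complete_percent(252, 303):.1f}%")  # 83.2%
```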

Code availability

No custom scripts or code were used in this study. All software and pipelines were executed according to the manuals and protocols of the corresponding published bioinformatic tools. Software versions and parameters are described in Methods.

References

1. Van Dover, C. L. & Trask, J. L. Diversity at deep-sea hydrothermal vent and intertidal mussel beds. Marine Ecology Progress Series 195, 169–178 (2000).
2. Little, C. T. S. & Vrijenhoek, R. C. Are hydrothermal vent animals living fossils? Trends in Ecology & Evolution 18, 582–588 (2003).
3. Sun, S. E., Sha, Z. & Xiao, N. The first two complete mitogenomes of the order Apodida from deep-sea chemoautotrophic environments: New insights into the gene rearrangement, origin and evolution of the deep-sea sea cucumbers. Comparative Biochemistry and Physiology Part D: Genomics and Proteomics 39, 100839 (2021).
4. Tunnicliffe, V., McArthur, A. G. & McHugh, D. in Advances in Marine Biology Vol. 34 353–442 (Elsevier, 1998).
5. Zierenberg, R. A., Adams, M. W. W. & Arp, A. J. Life in extreme environments: Hydrothermal vents. Proceedings of the National Academy of Sciences 97, 12961–12962 (2000).
6. Jamieson, A. The hadal zone: life in the deepest oceans. (Cambridge University Press, 2015).
7. Zhou, Y. et al. Characterization of vent fauna at three hydrothermal vent fields on the Southwest Indian Ridge: Implications for biogeography and interannual dynamics on ultraslow-spreading ridges. Deep Sea Research Part I: Oceanographic Research Papers 137, 1–12 (2018).
8. Hunter, S. et al. InterPro: the integrative protein signature database. Nucleic Acids Research 37, D211–D215 (2009).
9. Ogata, H. et al. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 27, 29–34 (1999).
10. Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research 31, 365–370 (2003).
11. Kulikova, T. et al. The EMBL nucleotide sequence database. Nucleic Acids Research 32, D27–D30 (2004).
12. Chen, Y. et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. GigaScience 7, gix120 (2018).
13. Hequan, S., Jia, D., Mathieu, P. & Korbinian, S. findGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics 34, 550–557 (2018).
14. Salmela, L. & Rivals, E. LoRDEC: accurate and efficient long read error correction. Bioinformatics 30, 3506–3514 (2014).
15. Ye, C., Hill, C. M., Wu, S., Ruan, J. & Ma, Z. DBG2OLC: efficient assembly of large genomes using long erroneous reads of the third generation sequencing technologies. Scientific Reports 6, 31900 (2016).
16. Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research 27, 737–746 (2017).
17. Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).
18. Pryszcz, L. P. & Gabaldón, T. Redundans: an assembly pipeline for highly heterozygous genomes. Nucleic Acids Research 44, e113 (2016).
19. Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
20. Smit, A., Hubley, R. & Green, P. RepeatModeler Open-1.0. 2008–2010. Access date Dec (2014).
21. Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Research 35, W265–W268 (2007).
22. Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Current Protocols in Bioinformatics 5, 4.10.1–4.10.14 (2004).
23. Jurka, J. et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and Genome Research 110, 462–467 (2005).
24. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research 27, 573–580 (1999).
25. Flicek, P. et al. Ensembl 2013. Nucleic Acids Research 41, D48–D55 (2012).
26. Gertz, E. M., Yu, Y.-K., Agarwala, R., Schäffer, A. A. & Altschul, S. F. Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biology 4, 1–14 (2006).
27. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Research 14, 988–995 (2004).
28. Cantarel, B. L. et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Research 18, 188–196 (2008).
29. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of Molecular Biology 215, 403–410 (1990).
30. Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28, 27–30 (2000).
31. Consortium, G. O. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research 32, D258–D261 (2004).
32. Bian, C. NCBI GenBank https://identifiers.org/ncbi/insdc:JAUJYZ000000000 (2023).
33. Bian, C. Actinostola_sp genome and annotation. figshare https://doi.org/10.6084/m9.figshare.23659923.v1 (2023).
34. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP459375 (2023).

Acknowledgements

The authors are grateful to the laboratory members at Dalian Ocean University for their technical assistance and helpful discussion. This research was supported by the National key R & D program of China (no. 2018YFC0310702), Outstanding Talents and Innovative Teams of Agricultural Scientific Research in the MOAA of China, the innovation team of Aquaculture Environment Safety from Liaoning Province (no. LT202009), and funds from the Research Foundation for Talented Scholars at Dalian Ocean University.

Author information

These authors contributed equally: Chang Liu, Chao Bian.

Authors and Affiliations

Liaoning Key Laboratory of Marine Animal Immunology, Dalian Ocean University, Dalian, 116023, China
Chang Liu, Qiang Gao, Lingling Wang & Linsheng Song
Laboratory of Aquatic Genomics, College of Life Sciences and Oceanography, Shenzhen University, Shenzhen, 518057, China
Chao Bian & Qiong Shi
Southern Laboratory of Ocean Science and Engineering, Zhuhai, 519000, China
Qiang Gao, Lingling Wang & Linsheng Song
Shenzhen Key Lab of Marine Genomics, BGI Academy of Marine Sciences, BGI Marine, Shenzhen, 518081, China
Zijian Gao, Yu Huang & Qiong Shi

Contributions

L.S. and L.W. designed the project. C.L., Q.G. and C.B. prepared the DNA for sequencing; Y.H. and Z.G. analyzed the genome size and annotated the genome; C.L., C.B., Z.G. and Y.H. performed comparative genomics analysis and gene family identification; C.L., C.B. and L.W. drafted the manuscript; L.S. and Q.S. improved and revised the manuscript.

Corresponding authors

Correspondence to Lingling Wang, Qiong Shi or Linsheng Song.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Liu, C., Bian, C., Gao, Q. et al. Whole genome sequencing of a novel sea anemone (Actinostola sp.) from a deep-sea hydrothermal vent. Sci Data 11, 102 (2024). https://doi.org/10.1038/s41597-024-02944-7

Received: 11 September 2023
Accepted: 10 January 2024
Published: 22 January 2024
DOI: https://doi.org/10.1038/s41597-024-02944-7