Background & Summary

Deep-sea hydrothermal vents are a representative ecosystem in which hot, chemical-rich fluids exit the seafloor through black smoker chimneys1. These vents are considered extremely harsh environments, with high pressure, high temperature, low oxygen, and high concentrations of methane (CH4), heavy metals and hydrogen sulfide (H2S)2,3. Many species live within and around these hydrothermal vents, including various crabs, shrimps, fishes and octopuses, as well as diverse sessile creatures such as sea anemones, barnacles, and tube worms4,5. These organisms have attracted considerable interest from developers of drugs, enzymes, cosmetics, biofuels, and other products. However, the genetic basis of the evolution and adaptation of deep-sea hydrothermal vent animals remains poorly understood.

Sea anemones, a group of primitive cnidarians, are widely distributed across the full range of ocean depths6. Their unique adaptive strategies help them live in a variety of marine habitats, from shallow waters to deep-sea trenches. During a recent expedition, an anemone (Fig. 1a) was collected at a depth of 2,971 m from hydrothermal vents in the Indian Ocean (E60.5, N6.4). Previous research reported that Actinostolidae anemones showed the highest abundance in this area7. Morphological and molecular analyses suggest that this deep-sea anemone belongs to the genus Actinostola. Here, whole genome sequencing was performed to construct a high-quality genome assembly for this newfound Actinostola sp., which will help to provide clues to adaptation to deep-sea hydrothermal environments.

Fig. 1 Sampling details and comparative analyses of the deep-sea anemone. (a) Image of the sequenced Actinostola sp. (b) Genome survey. (c) Gene family analysis and divergence time of seven representative Cnidaria species.

A total of 44.23 Gb of paired-end reads produced on an Illumina sequencing platform were used for a genome survey (Fig. 1b). The most frequent sequencing depth was 54, and the total number of 17-mers was 19,503,242,454. Therefore, the estimated genome size of Actinostola sp. was about 487 Mb. Meanwhile, the heterozygosity rate of this genome was predicted to be 0.9% (see more details in Fig. 1b).

A 424.3-Mb draft genome was subsequently assembled from 112.37 Gb of long reads generated on a PacBio sequencing platform and 26.10 Gb of short reads generated on an Illumina HiSeq X Ten platform, with a contig N50 of 373 kb, a scaffold N50 of 383 kb and a GC content of 38.7% (Table 1). The BUSCO (Benchmarking Universal Single-Copy Orthologs) method was applied to evaluate the completeness of the assembled genome, using the eukaryota_odb9 database as the reference. In total, 252 (83.2%) of the BUSCO core genes were identified as complete.

Table 1 Summary of the genome assembly for the sequenced Actinostola sp.

In the repeat annotation, a total of 265 Mb of sequence, covering 62.4% of the assembled genome, was predicted to be repetitive. Of this, DNA repeat elements accounted for 25.5% of the genome (108.2 Mb), long interspersed nuclear elements (LINEs) for 8.4% (35.6 Mb), long terminal repeats (LTRs) for 14.3% (60.6 Mb), and short interspersed nuclear elements (SINEs) for 0.8% (3.6 Mb). After masking these repetitive regions, we applied an integrated method of homologous sequence search and de novo gene prediction to annotate 20,812 protein-coding genes in the assembled genome. By searching four public databases, namely GO (Gene Ontology)8, KEGG (Kyoto Encyclopedia of Genes and Genomes)9, SwissProt10 and TrEMBL11, we found that 97.89% (19,111 in total) of these predicted genes were functionally annotated.

The coding sequences (CDS) predicted from the assembled genomes of Actinostola sp. (this study) and seven other representative species (Fig. 1c) were used for clustering of gene families. The 20,812 protein-coding genes of Actinostola sp. were clustered into 10,327 gene families, of which 3,526 were single-copy orthologous families. A phylogenetic tree (Fig. 1c) was constructed from these single-copy orthologous gene families with the maximum likelihood method, predicting that the divergence of our newfound Actinostola sp. from another sea anemone, Exaiptasia diaphana, occurred 305 million years ago (Mya). This high-quality reference genome for Actinostola sp. can also provide novel insights for enhancing wild resource conservation, discovering new functional genes, developing novel marine drugs, and elucidating special adaptive mechanisms.
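As an illustration of how single-copy orthologous families can be selected once genes have been clustered, the following minimal sketch keeps only those families that contain exactly one gene from every species. The data layout (a dictionary mapping a family ID to per-species gene lists), the function name and the toy identifiers are all hypothetical; the clustering tool actually used is not stated above.

```python
def single_copy_families(families, species):
    """Return IDs of families with exactly one gene in each listed species.

    families: dict mapping family ID -> dict mapping species name -> list of gene IDs
              (hypothetical layout, assumed for illustration only)
    species:  iterable of species names that must each contribute one gene
    """
    species = list(species)
    selected = []
    for family_id, members in families.items():
        # A family qualifies only if every species has exactly one member gene.
        if all(len(members.get(sp, [])) == 1 for sp in species):
            selected.append(family_id)
    return sorted(selected)


# Toy input with made-up family and gene identifiers.
toy_species = ["sp_A", "sp_B", "sp_C"]
toy_families = {
    "fam1": {"sp_A": ["a1"], "sp_B": ["b1"], "sp_C": ["c1"]},        # single-copy
    "fam2": {"sp_A": ["a2", "a3"], "sp_B": ["b2"], "sp_C": ["c2"]},  # duplicated in sp_A
    "fam3": {"sp_A": ["a4"], "sp_B": ["b3"]},                        # missing in sp_C
}
single_copy = single_copy_families(toy_families, toy_species)
print(single_copy)
```

Families with a duplication or a loss in any species are excluded, which is why the number of single-copy families is usually much smaller than the total number of families.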

Methods

Sample collection, library construction, and genome sequencing

A specimen of Actinostola sp. was collected from the Edmond vent field on the Central Indian Ridge for whole genome sequencing. Genomic DNA (gDNA) was extracted using the QIAwave DNA Blood & Tissue Kit (Qiagen, Germantown, MD, USA). The genome was sequenced using a combination of sequencing techniques, including paired-end sequencing of a library with a 500-bp insert size on an Illumina HiSeq X Ten platform (Illumina Inc., San Diego, CA, USA), and sequencing of a library with a 20-kb insert size on a PacBio sequencing platform (Pacific Biosciences, Menlo Park, CA, USA).

Genome size estimation

The Illumina short reads were filtered with SOAPfilter v2.212. The clean reads were then used to estimate the genome size of Actinostola sp. by 17-mer frequency distribution analysis, according to the following formula13: Genome Size = K-mer_num / Peak_depth, where K-mer_num is the total number of 17-mers and Peak_depth is the depth at the peak of the 17-mer frequency distribution.
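The formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the k-mer histogram is taken as a dictionary mapping depth to the number of distinct k-mers at that depth, depths below a hypothetical cutoff (min_depth) are ignored when locating the peak because they are usually dominated by sequencing errors, and the toy histogram is invented rather than taken from this study.

```python
def peak_depth_from_histogram(histogram, min_depth=3):
    """Return the depth with the most distinct k-mers, ignoring depths below min_depth.

    histogram: dict mapping depth -> number of distinct k-mers observed at that depth.
    min_depth: hypothetical cutoff used to skip the low-depth error peak.
    """
    candidates = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not candidates:
        raise ValueError("histogram has no depths at or above min_depth")
    return max(candidates, key=candidates.get)


def total_kmers(histogram):
    """Total number of k-mer occurrences: sum of depth * distinct k-mers at that depth."""
    return sum(depth * n for depth, n in histogram.items())


def estimate_genome_size(kmer_num, peak_depth):
    """Genome Size = K-mer_num / Peak_depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return kmer_num / peak_depth


# Invented toy histogram, for illustration only.
toy_histogram = {1: 1000, 2: 50, 10: 200, 11: 500, 12: 300}
peak = peak_depth_from_histogram(toy_histogram)
size = estimate_genome_size(total_kmers(toy_histogram), peak)
print(peak, round(size))
```

In practice the histogram would come from a dedicated k-mer counter; the arithmetic above is the only part taken from the formula in the text.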

Genome assembly

Before assembly, the PacBio long reads were corrected using LoRDEC14, along with the clean Illumina short reads. After correction, DBG2OLC15 was applied to assemble these long reads into contigs with the assistance of the clean short reads. To further improve the accuracy of the genome, two rounds of polishing were performed with different strategies. First, Racon v1.3.116 was employed to polish the contigs based on the uncorrected PacBio long reads. Second, the clean short reads were used to polish the contigs with Pilon17. After reducing heterozygous redundancy with Redundans18, we obtained a polished genome assembly for the sequenced Actinostola sp. BUSCO19 v5.22 provided quantitative measurements of the completeness of this assembly, with the eukaryota_odb9 database as the reference.
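Summary statistics such as N50 and GC content can be computed directly from the assembled sequences. The sketch below is a generic illustration, not the script used in this study: N50 is taken as the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly, and GC content is computed over A, C, G and T only. The function names and toy inputs are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence where the cumulative length reaches half the total.
        if running * 2 >= total:
            return length
    raise ValueError("lengths must contain at least one sequence")


def gc_content(sequence):
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / (gc + at)


# Toy inputs, for illustration only.
toy_lengths = [10, 9, 8, 7, 6, 5, 4, 3, 2]
toy_n50 = n50(toy_lengths)
toy_gc = gc_content("ATGCNNGC")
print(toy_n50, toy_gc)
```

For the toy lengths the total is 54, and the three longest sequences (10, 9 and 8) are the first to reach half of it, so the N50 is the length of the third one.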

Genome annotation

We predicted repeat elements by de novo and homology annotations. RepeatModeler20 and LTR-FINDER21 were employed for the de novo prediction to build repeat libraries. Then, the two libraries were combined and aligned to the assembled genome with RepeatMasker22. For the homology prediction, a known repeat library (Repbase23) was employed to identify repeats with RepeatMasker and RepeatProteinMask22. Tandem repeats were detected using Tandem Repeats Finder24. Finally, by integrating the data from both methods, a nonredundant set of repeat elements was obtained.
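One simple way to obtain a nonredundant repeat set from several predictions is to merge overlapping intervals on each sequence before summing their lengths, so that no base is counted twice. The sketch below assumes half-open (start, end) coordinates on a single sequence and is only an illustration of this idea; the text does not specify how the integration was actually carried out, and the function names and toy coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def repeat_fraction(intervals, sequence_length):
    """Fraction of a sequence covered by the nonredundant union of the intervals."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / sequence_length


# Toy predictions from two hypothetical methods on a 100-bp sequence.
de_novo_hits = [(0, 10), (30, 40)]
homology_hits = [(5, 20)]
nonredundant = merge_intervals(de_novo_hits + homology_hits)
fraction = repeat_fraction(de_novo_hits + homology_hits, 100)
print(nonredundant, fraction)
```

Merging before summing matters because the de novo and homology-based methods often annotate the same region, and adding their raw lengths would overestimate the repeat content.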

To predict protein-coding genes, protein sequences from nine representative species, including California sea hare (Aplysia californica), nematode (Caenorhabditis elegans), sacoglossan sea slug (Elysia chlorotica), limpet (Lottia gigantea), two-spot octopus (Octopus bimaculoides), invasive apple snail (Pomacea canaliculata), glass anemone (Exaiptasia pallida), starlet sea anemone (Nematostella vectensis), and human (Homo sapiens), were downloaded from Ensembl25 and mapped to our assembled genome with TBLASTn26. Subsequently, gene structures were predicted by GeneWise27. Finally, we integrated all these predicted results using MAKER28 to obtain a consistent gene set.

For functional annotation, BLASTp29 was applied to align the predicted protein sequences against four public databases (SwissProt10, TrEMBL10, KEGG30 and InterPro8), and GO31 terms were then retrieved from these results.

Data Records

Our final assembly and annotation data have been deposited at the NCBI with accession number JAUJYZ00000000032. Protein and gene coding sequences have been uploaded to the Figshare repository for public access33. The raw reads from PacBio and Illumina sequencing have also been deposited at the NCBI with accession numbers SRR25988563-SRR2598856734.

Technical Validation

The genome assembly was 424.3 Mb in size, with a scaffold N50 of 383 kb. For quantitative assessment of this genome assembly, we showed that 83.2% of the reference BUSCO genes (eukaryota_odb9) were successfully identified in the final version of the assembly, indicating good completeness of this Actinostola sp. genome assembly.
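The reported completeness can be checked with simple arithmetic. The sketch below assumes that the eukaryota_odb9 reference set contains 303 BUSCO genes; this total is not stated in the text and is used here as an assumption, while the count of 252 complete genes is taken from the Background & Summary section. The function name is hypothetical.

```python
def busco_completeness(complete, total):
    """Return the percentage of reference BUSCO genes identified as complete."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= complete <= total:
        raise ValueError("complete must lie between 0 and total")
    return 100.0 * complete / total


# 252 complete genes is from the text; 303 reference genes is an assumption.
percent_complete = busco_completeness(252, 303)
print(round(percent_complete, 1))
```

Under this assumption the ratio rounds to the 83.2% given above.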