Background & Summary

Batillariidae, also called batillariids or mud-creepers are widely distributed in the north-western Pacific region of Asia along the complex coastline formed in Japan, Korea, eastern China, and America1,2,3,4. Within the Batillaridae family, B. attramentaria (Sowerby, 1855) is characterized by its habitats being limited to narrow intertidal zones consisting of rocks or sandy mud along coastlines and limited dispersal capacity associated with direct larval development5,6. Due to such biological constraints, geographical movement distance is limited, and its population structure is also inferred to be influenced by geographical factors6,7,8. These characteristics hinder them from escaping from their originated habitats. However, in the early 20th century, B. attramentaria was introduced into the Bay of the Northeast Coast in the United States and Canada by commercial shipment of oyster (Crassostrea gigas) aquaculture from Japan9,10. In the new habitat, this invaded species not only flourished but also successfully competed with the native gastropod species such as Cerithidea californica11,12,13,14.

The mitochondrial lineage of B. attramentaria is primarily subdivided into two, and their geographical distribution matches the trajectories of two dominant regional seawater currents, Tsushima and Kuroshio, that flow separately north and south of the Japanese archipelago2. An analysis of the demographic history of B. attramentaria indicates that this species has sharply increased approximately since the last glacial maximum (LGM: 26,000–19,000 years ago), directly influenced by the sea level rise and range expansion of habitat in Asia following climate change1.

Benthic organisms living in the estuarine intertidal zone are subjected to the most dynamic environmental circumstances, with frequently altered salinity and temperature in their habitat due to tidal conditions. Thus, estuarine intertidal organisms are continuously exposed daily to the submerged saline seawater and cold temperature during high tide and to the dry, low salinity and high temperature during the low tide. Subsequently, continuous exposure to such highly variable environmental conditions has shaped intertidal communities’ behavioural and physiological adaptation and genetic variation15,16. Salt stress exposure study on B. attramentaria shows that variation in salinity affects their locomotion activity17, which seems to be a typical response observed in several intertidal gastropods18,19,20. Among several studies of molluscs, a survey on intertidal oyster Crassostrea gigas highlights the pathways and genes involved in responding to and adapting to typical tidal environmental conditions17. In comparison, a study on terrestrial giant African snails shows the expansion of mucus-related gene families to mitigate dry conditions on the land and the doubling of several genes, including haemocyanin (a copper-containing respiratory protein) that helps in transporting oxygen and phosphoenolpyruvate carboxykinase gene families during whole genome duplication21. Adaptation to such typical intertidal and terrestrial environmental conditions was achieved by regulating water balance, air-breathing, nitrogen excretion, neural–immune system interactions, and specific behaviours.

In this context, the genome sequence of B. attramentaria will be beneficial for a deepened understanding of its evolution and invasiveness. It could be a suitable model for studying the combined influence of climate change and palaeoceanographic change on marine gastropods and other coastal taxa in the Northeast Asian region. As well as this study will enrich our knowledge of the genetic features involved in the adaptation to typical intertidal environmental factors.

Here, we present a first draft of reference genome assembly for B. attramentaria constructed using long reads generated by the Pacific Biosciences (PacBio) DNA sequencing platform Sequel and short paired-end reads generated by Illumina. The genome was assembled into 863 scaffolds (N50 = 1.28 Mb), with a total size of 0.715 Gb, with 97.2% assembly completeness analysed by BUSCO. The genome completeness is on par with the mollusc genomes sequenced to date. Structural annotation of the genome yielded 40,596 genes. Of the total genes predicted, 15,755 genes were functionally annotated with InterProScan. Based on phylogenetic analysis of related gastropod species, B. attramentaria diverged from Conus consors during the Early Mesozoic era, i.e., about 230 MYA. We have detected genes responsible for adapting to intertidal environments22 (Supplementary Table 1) such as the Na+/H+ exchanger family, Na+/K+ ATPase (for ionic regulation), acyltransferase, proline dehydrogenase (for osmotic regulation), haemocyanin beta-sandwich, animal haem peroxidase, protein-tyrosine phosphatase (for improving terrestrial respiratory function), and galactosyltransferase, Ependymin, TNF(Tumour Necrosis Factor) family, C1q domain (for immune defense), as observed in terrestrial and marine gastropods in previous studies15,16,21,22.

Methods

Sample collection and purification of DNA

To construct a draft of the reference genome for the Korean batillariids, we collected samples from B. attramentaria (Sowerby, 1855) from Hajeon-ri, Cheollabuk-do, South Korea (on November 2018 at 35°32′N, 126°33′E). The samples were kept alive in seawater during the transportation to the laboratory. To obtain high quality and molecular weight of DNA, we dissected fresh tissues from the foot to muscle part of the alive samples and quickly froze them at −80 °C. We did not include the gut part to avoid the snail’s intestinal microbiome contaminant to the snail DNA. Genomic DNA was extracted using the Dneasy® Blood & Tissue kit (Qiagen, Hilden, Germany), and the integrity was checked using an agarose gel.

Short-read DNA sequencing and genome size estimation

We constructed a library with an insert size of 350 bp using a Truseq Nano DNA Library kit (Illumina, SD, USA) following random fragmentation and adaptor ligation to DNA sequences. Paired-end (PE) sequencing with 101 bp was carried out using the Hiseq. 4000 sequencing system (Illumina, CA, USA), which generated a total of 731,221,132 PE reads (73.9 Gbp) (Supplementary Table 2). The JELLYFISH tool23 was used to estimate the genome size of B. attramentaria, which resulted in approximately 0.64 Gbp based on k-mer distribution value (K = 61). The main peak at k-mer depth 34 was used for genome size estimation (Fig. 1).

Fig. 1
figure 1

B. attramentaria genome size estimation by k-mer distribution.

PacBio sequencing

The genomic DNA was sheared to generate ~20Kb fragments using the Covaris g-TUBE (Covaris) according to the manufacturer’s instructions. Small fragments were removed by the AMpureXP bead purification system (Beckman Coulter). A total of 5 μg DNA for each sample was used to prepare the library using SMRTbell® Express Template Prep Kit v2.0 (Pacific Biosciences, Menlo Park, CA, USA). Small fragments were removed from the library by BluePippin Size selection system for the large-insert library. Then the sequencing primer v4 was annealed to the SMRTbell template, followed by the binding of DNA polymerase to the complex (Sequel Binding kit 3.0). The excess primer and polymerase were removed from the complex using AMPure purification system before sequencing. Finally, the SMRT library was sequenced using the PacBio Sequel System with the Sequel Sequencing Kit v3.0 chemistry. A total of ~53.3 Gbp of subreads were obtained (Supplementary Table 3).

Genome assembly and polishing

Initially, cleaned PacBio long-read sequences were assembled using FALCON-Unzip assembler24, which generated a contiguous assembly of 844 Mbp (N50 = 1.08 Mbp). The larger size of the assembly than the estimated genome size suggested a high number of duplicate haplotypes25. The highly heterozygous genome assembly was curated by Purge Haplotigs26 to generate a de-duplicated haploid genome assembly. Further, the assembled genome was polished by Pilon 1.2.3 (with default parameters)27 by using aligned Illumina PE reads (57.5 Gb), resulting in a final assembly of 863 contigs with a total length of 715 Mb and an N50 length of 1.28 Mb (Table 1).

Table 1 Statistics of genome assembly of B. attramentaria.

The assembled genome is much smaller than the closest sequenced genome of Conus consors (3.025 Gb)28. Due to high heterozygosity levels and repetitiveness, the assembly processes of molluscs are found to be complicated. Such instances were observed in oysters and other invertebrates29. The repeat content was estimated to be (314 Mb) 43.87% of the genome assembled (Table 2). Most invertebrate genomes, including molluscs, exhibit high heterozygosity and repetitiveness, complicating the assembly process. Genome completeness estimated by using BUSCO (Benchmarking Universal Single-Copy Orthologs) v3.0.2 detected a total of 927 (97.2%) of the 954 genes in the metazoan gene set30 (Table 3). Genome completeness is par with other mollusc genome assembly available till date (Fig. 2b)

Table 2 Statistics of repeat elements of B. attramentaria.
Table 3 BUSCO assessment of B. attramentaria genome assembly (Metazoa).
Fig. 2
figure 2

(a) Divergence time tree among gastropods. Divergent times were estimated using RelTime methods with an ML phylogenomic tree, and the unit of time was scaled in millions of years. Bars around each node represent 95% confidence intervals. (b) Genome assembly completeness comparison estimated by BUSCO (metazoan_odb10). The genome assembly accession number is mentioned next to the plot. Assembly of A. purpuratus was analyzed from http://gigadb.org/dataset/100419. Star marks represent the chromosome-level assemblies.

Gene prediction and annotation

Before predicting genes, transposable elements (TEs) in the genome were identified using homology-based (RepeatMasker31, RepeatScout32, RepBase33, and RMBlast34) and by de novo using RepeatModeler35. Tandem Repeats Finder36 was used to predict consensus sequences and to gain classification information for each repeat. Annotation of repetitive elements resulted in 313,966,700 bp of repetitive DNA, amounting to 43.87% of the genome assembly (Table 3). The majority of the repetitive elements were unclassified (20.41%), followed by Simple repeats (7.38%), SINEs (5.55%), LINEs (4.83%), and DNA elements (3.81%). By using the SSRMMD tool37, we identified 1,518,868 simple sequence repeats (SSRs) distributed throughout the genome (Supplementary Table 4). A total of 3,304,085 SNPs has been detected in the B. attramentaria genome (Supplementary Table 5), after aligning sequence reads with the BWA tool38 and using bcftools39 to identify variants. Repetitive elements in the genome were masked before proceeding with the gene prediction. We used EvidenceModeller gene predicting tool for predicting protein-coding genes from the draft genome by combining evidence from ab initio gene predictions, transcripts, and protein homologues. We used Augustus40 for ab initio gene prediction. Additional supports for gene prediction came from two different data sets of transcripts generated by Trinity41 from our previous study by Ho et al.22 and homologous protein sequences of related species to B. attramentaria by PASA42 and Exonerate43. Finally, we used EVidenceModeller42 to merge and improve the ab initio predictions with the evidence of transcripts and protein sequences with weights of evidences. The predicted genes were annotated using InterProScan with Pfam44. A sensitive HMM scanning method on the known Pfam functional domains with an e-value of 0.05 was also used to classify the gene families. Kyoto Encyclopedia of Genes and Genomes (KEGG) annotation was performed using the KEGG Automatic Annotation Server (https://www.genome.jp/kegg/kaas/)45 with the bi-directional best hit (BBH) method. Homology-based and ab initio-based gene prediction resulted in the identification of 40,596 protein-coding genes (i.e., a total of 29.8% of the genome) with an average gene length of 5,248 bp from the B. attramentaria genomes (Table 4). Functional annotation of all predicted protein-coding genes by InterpRoscan resulted in 15,756 (38.8%) genes by Pfam and 17,922 (44.1%) genes by Gene Ontology46. A total of 11,074 (27.3%) genes were annotated by KEGG database46.

Table 4 Statistics of predicted protein-coding genes of B. attramentaria.

Phylogenomics

We performed an extensive comparison of orthologous genes among 19 gastropod genomes (Batillaria attramentaria, Conus consors28, Lanistes nyassanus47, Marisa cornuarietis47, Pomacea canaliculata47, Aplysia californica, Elysia chlorotica48, Plakobranchus ocellatus49, Biomphalaria glabrata50, Bulinus truncates51, Achatina immaculata21, Lottia gigantea52, Chrysomallon squamiferum53, Haliotis rubra54, Crassostrea gigas55, Agropecten purpuratus56, and Octopus bimaculoides57) using OrthoFinder v3.058. With all species present, 3,532 orthogroups were formed, with 36 of those consisting of one-copy genes. With the fasttree tool provided in OrthoFinder, we constructed a tree of rooted species using 573 orthogroups, where at least 81.8% of species had single-copy genes in any orthogroup with Octopus bimaculoides as the outgroup. Divergence time was calculated using the species tree generated by using RelTime methods in MEGA-X59 with the Jones–Taylor–Thornton model (Fig. 2a). The timetree was computed using two calibration constraints with confidence interval (CI) of Haliotis rubraChrysomallon squamiferum (414–596.9 MYA) and of Elysia chloroticaAplysia californica (58.3–278.9 MYA) that were taken from the TimeTree database60 for the calibration of time trees. The divergence time between B. attramentaria and C. consors was approximately 230 MYA, i.e., during the Early Mesozoic era.

Comparative genomic analysis

A comparison of orthologous gene groups shared among related gastropods of C. consors, L. nyassanus, M. cornuarietis, and P. canaliculata analysis by OrthoVenn261 showed a core set of 5,679 gene groups and a unique set of 1,724 gene groups was specific to B. attramentaria (Fig. 3). Gene ontology (GO) enrichment analysis of the gene groups unique to B. attramentaria showed the top five over-representation of GO terms mostly related to protein poly-ADP-ribosylation, GTP binding, and innate immune response (Supplementary Table 6).

Fig. 3
figure 3

Venn diagram showing the amount of common orthologous gene clusters shared among B. attramentaria and its closet relative mollusks including C. consors, L. nyassanus, M. cornuarietis, and P. canaliculata.

Data Records

All DNA and RNA raw reads have been deposited in the NCBI SRA. All short and long read DNA sequences are available under the NCBI SRA accession number SRP26999662, genome assembly with accession number GCA_018292915.163 and the whole genome shotgun sequencing project was deposited in GenBank accession JACVVK00000000064 under the BioProject no. PRJNA640962. Supplementary materials which include all supplementary tables, results of comparative genomics and phylogenomic analysed by OrthoFinder, SNPs and SSRs are deposited to Figshare repository46: https://doi.org/10.6084/m9.figshare.22309195.v4.

Technical Validation

Quality assessment of the DNA and purification

High-quality DNA with bands around and above 10 kb in the agarose gel was selected for sequencing. The quality of the genomic DNA was measured using Bioanalyzer 2100 (Agilent Technologies, CA, USA), and the quantity was measured by a NanoDrop-1000 microspectrophotometer.

Sequencing read quality validation

FastQC quality control (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/) was used to assess the quality of raw high-throughput DNA sequencing datasets. Low-quality sequence PE-reads (<Q20) were filtered out by v.0.3265 before assembly46.

Gene prediction and annotation validation

Final gene model prediction of the B. attramentaria genome assembly were considered by Evidence Modeler and assessed with the BUSCO (metazoa_odb10). The final predicted gene set consisted of 40,596 genes with (Table 4) with BUSCO value of 81.6%.