Background & Summary

The Meghimatium bilineatum (syn. Philomycus bilineatus Benson, 1842) is a member of the Philomycidae family and is a notorious quarantine pest land slug that can cause enormous damage to commercial crops, horticultural crops, grasslands, and forests in East Asia1,2,3,4,5. It has a strong ecological adaptation to terrestrial environments and has been widely distributed in various regions of China6. It does not only feed on stems, leaves, fruits, or juices of plants causing direct economic losses but also secretes mucus and excretes feces contaminating fruits and vegetables. This contamination results in a reduction in the market value of products and transmits diseases. Thus, it poses great harm to local agricultural productivity and ecological security, resulting in substantial economic and ecosystem losses7. However, from another perspective, M. bilineatum also exhibits medicinal properties. For example, its crude extracts are used in the treatment of bacterial-induced infectious diseases, the polysaccharides in slug cell are used as natural antioxidants to prevent cancer, and the antimicrobial peptide derived from the slug is utilized to combat skin infections caused by Candida albicans8,9,10. At present, some researchers have carried out in-depth studies on the pharmacological effects of slug extract, indicating that slugs can be used as a valuable medicinal resource with development and application value9,10. Thus, the study of slug species is very meaningful.

In addition to its ecological threat and medicinal value, M. bilineatum, as a member of 30000 described terrestrial gastropod mollusks with shell-less, has completed the transition from aquatic to terrestrial. Similar to other slug species, they have developed many various robust features, including a pulmonate for breathing air, a sophisticated neural-immune system, and the ability to produce mucus to adapt to the terrestrial environments11,12,13. However, compared with land snails, land slugs display unique life strategy for terrestrial environments, such as defense by secreting mucus including specific chemical compounds and better mobility under predation, because they have no protective shell1,14. Furthermore, shell-less land slugs do not expend energy ingesting large amounts of calcium, enabling them to grow faster. Although land slugs have strong adaptation mechanism, their evolutionary history remains unclear. In recent years, molecular phylogenetics analysis of land slugs of the genus Meghimatium based on the mitogenome and nuclear loci has offered new perspectives into the taxonomic revisions and evolution of these species15,16,17. However, these studies cannot fully explain the molecular mechanism of wide ecological adaptation information and the potential genetic basis of medicinal resource traits of this slug. Furthermore, the Philomycidae slug genomics have yet to be published. Therefore, assembling a genome of this slug species should be urgently assembled.

The study of genomes in certain terrestrial mollusks, has shown advancements, including the release of genomic data for two land snails, Achatina fulica and Pomacea canaliculata. However, thorough investigations into the evolutionary mechanisms associated with terrestrial adaptation remain scant18,19. Recently, one genome study of Achatina immaculata, namely giant African snail has verified that some genes related to respiratory system, dormancy system, and immune system have undergone great expansion to adapt to the terrestrial environments20. However, to date, high-quality genomic resources for land slugs are rarely reported. The land slugs and snails, as terrestrial gastropod mollusks with or without shell protection, have different biological processes related to their terrestrial lifestyle. Hence, assembling a genome of the land slug species would facilitate intensive study of this species’ adaptive evolution.

Herein, we assembled the genome of M. bilineatum by uniting the sequencing techniques of Illumina, PacBio, and Hi-C. Three methods, including ab initio gene prediction, homolog and RNA-Seq-based prediction, were used to perform genomic annotation. In addition, the comparative genomics analysis of M. bilineatum and 11 other distantly related species were performed. This study offers insights for the effective management and utilization of slug populations and provides valuable genome information into the evolutionary history and genetic mechanisms of this important gastropod group.

Methods

Land slug collecting and sequencing

Adult land slugs M. bilineatum were collected from a wild area in Zhoushan, Zhejiang, China (122.212 E, 29.979 N). Total DNA was extracted from whole body of the land slug M. bilineatum using the SDS-based extraction method. Then, the DNA samples were purified using QIAGEN® Genomic kit (QIAGEN, Germany) for genome sequencing. First, Illumina short-read library with insert sizes of 300–350 bp was generated, and was sequenced using the Illumina Novaseq. 6000 platform. Second, PacBio HiFi-read library with insert sizes of 10–40 kb was generated using SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, USA) and sequenced using the PacBio Sequel II platform. Finally, Hi-C short-read library was generated using the purified DNA from the whole body of M. bilineatum according to the previously performed protocol by Belton et al. with given adjustments; it was sequenced using the Illumina Novaseq. 6000 platform21. A total of 250.12 Gb of clean Illumina short-reads, 71.33 Gb HiFi CCS reads and 140.69 Gb clean Hi-C reads were obtained (Table 1).

Table 1 Statistics of sequencing read data.

Total RNA was isolated from whole body of the land slug using TRIzol reagent (Invitrogen, MA, USA) for transcriptome sequencing. The RNA-seq library was generated using NEBNext® Ultra RNA Library Prep Kit (NEB, USA) and sequenced using the Illumina Novaseq. 6000 platform. The RNA-seq reads were used for genome annotation. A total of 21.79 Gb of clean data was obtained (Table 1).

Genome size estimation

Based on 250.12 Gb clean Illumina short-reads, the genome size, heterozygosity and repetitive sequence content was determined using the k-mer analysis with GCE (1.0.0) following the default parameter22. A total of 223,346,880,670 k-mers with a depth of 144 was obtained (Fig. 1). In addition, the genome size of M. bilineatum was approximately 1.5 Gb, with a heterozygosity of 1.05% and proportion of repeat sequences at 43.69%.

Fig. 1
figure 1

K-mer (17-mer) distribution and estimation of genome size of M. bilineatum.

Fig. 2
figure 2

Chromosomal Hi-C heatmap of the M. bilineatum genome assembly.

Chromosomal-level genome assembly

In the initial genome assembly, HiFiasm (v0.16.0) method was used for ab initio to assemble the genome using the HiFi reads from PacBio23. This preliminary assembly yielded a genome size of 1.80 Gb (Table 2). Subsequently, the redundant sequences were filtered out using Purge_Haplotigs (v1.0.4) software with the parameter of cutoff ‘-a 70 -j 80 -d 200’24. Based on PacBio sequencing data, a 1.63 Gb contig-level genome assembly of M. bilineatum was obtained, and 2526 contigs displayed contig N50 and N90 sizes of 1.37 and 320.449 Mb, respectively (Table 2). The chromosome-level assembly of M. bilineatum was conducted using Hi-C technology. Initially, Bowtie2 (v2.3.4.3) following the default parameters was used to match the 140.69 Gb clean Hi-C reads to the contig-level genome to obtain unique mapped paired-end reads25. A total of 185.36 million paired-end reads were uniquely mapped (Table S1), of which 88.02% represented valid pairs (Table S2). Subsequently, contigs were assembled into the chromosome-level scaffolds using the 3D-DNA processes (v180922) (parameters: -r 0) with all valid pairs, and the JuiceBox (v1.11.08) was used to correct the errors in the genome assembly26,27. We anchored and obtained 25 pseudo-chromosomes with seven unanchored scaffolds. The 25 pseudo-chromosomes covering ~99.95% of the final genome with size ranging from 25.66 Mb to 135.71 Mb (Fig. 2; Table 3). Ultimately, we obtained a 1.61 Gb chromosomal-level genome assembly of M. bilineatum with contig N50 size and scaffold N50 size of 1.36 Mb and 68.08 Mb, respectively. Genome assembly results showed that the genome size of M. bilineatum is similar to that of the Spanish slug Arion vulgaris (1.54 Gb) in the previous study28.

Table 2 Number and length statistics for the M. bilineatum genome assembly.
Table 3 Chromosome sizes and assignment for Hi-C scaffolds.

Repeat-content identification and classification

Repetitive sequences, including tandem repeats and interspersed repeats, in M. bilineatum genome were determined using the de novo prediction and homolog-based methods. Based on homology comparison, RepeatMasker (open-4.0.9) (parameters: default) and RepeatProteinMask (parameters: default) software were utilized to find the interspersed repeats against the RepBase database (http://www.girinst.org/repbase)29. On the basis of de novo prediction, TRF (v4.09) software (parameters: default) was used to identify the tandem repeats30. In addition, a repetitive sequence library was constructed using the RepeatModeler (open-1.0.11) with default parameters and LTR-FINDER_parallel (v1.0.7) with default parameters31,32. Then, the RepeatMasker (open-4.0.9) with default parameters was used to identify the repeat element against this repeat library31. After combining the results from de novo prediction and homolog-based methods, we identified and classified 1.18 Gb of repetitive sequences, taking up 72.51% of the assembled genome, mainly including 7.99% DNA elements, 34.08% long interspersed elements (LINE), and 16.35% unknown sequences (Tables 4 & 5). The repeat-content in the M. bilineatum genome is similar to the Spanish slug A. vulgaris (75.09%), and is higher than other studied gastropod species28,33. These results further validate the accuracy of our genome assembly.

Table 4 Repetitive sequences statistics for the M. bilineatum genome.
Table 5 Transposable elements statistics for the M. bilineatum genome.

Identification and annotation of protein-coding genes

First, we used repeat-masked genome sequences to perform ab initio gene prediction, and then used AUGUSTUS (v3.3.2), Genscan (v1.0) and GlimmerHMM (v3.0.4) software to detect the protein-coding genes34,35,36. Second, to conduct homology-based prediction, protein sequences from Candidula unifasciata (GCA_905116865.2), Elysia chlorotica (GCA_003991915.1), Haliotis rubra (GCA_003918875.1), Haliotis rufescens (GCA_023055435.1), Lottia gigantea (GCA_000327385.1), Pakobranchus ocellatus (GCA_019648995.1), and Pomacea canaliculate (GCA_003073045.1) were compared with the M. bilineatum genome utilizing TBLASTN (v2.2.29) (e-value ≤ 1e-5) to determine candidate regions, and further used GenWise (v2.4.1) software to accurately map the screened proteins to the M. bilineatum genome to obtain splice sites37. Third, to perform transcriptome sequencing-based prediction, the RNA-seq reads from Illumina were mapped to the M. bilineatum genome by using the TopHat (v2.1.1) software following default arguments, and the transcripts were assembled using Cufflinks (v2.2.1) software with the “-e 100 -C” parameter38,39, and the protein-coding genes were determined using the PASA (v2.3.2)40. Fourth, using the MAKER2 (v2.31.10) and HiFAP software following default parameters, we combined the three predictions to construct a complete and nonredundant reference gene database41. Finally, in the M. bilineatum genome, 18631 identified protein-coding genes were found. The length of the average gene, including CDS, exon, and intron, is presented in Table 6. These predicted gene structures were also compared with the seven other homologous species (Fig. 3).

Table 6 Statistics on transposable elements in the M. bilineatum genome.
Fig. 3
figure 3

Comparison of protein-coding genes annotation quality. Eight species (M. bilineatum, Haliotis rufescens, Pakobranchus ocellatus, Lottia gigantea, Candidulaunifasciata, Elysia chlorotica, Haliotis rubra, and Pomacea canaliculate) were examined to compare the lengths of the gene, CDS, exon, and intron.

We annotated these protein-coding genes functions through the alignment of gene sequences to the InterPro, GO, KEGG, SwissProt, TrEMBL, TF, Pfam, NR, and KOG database by using BLAST + (2.11.0) software (e-value ≤ 1e-5)42,43,44,45,46,47. In addition, based on InterPro database and Pfam database, the conserved protein domain and motif associated with the function annotated was determined using the InterProScan tool (v5.61-93.0) with the “-seqtype p -formats TSV -goterms -pathways -dp” parameter48. Ultimately, a total of 15569 genes (83.57%) were successfully annotated (Table 7).

Table 7 Putative protein-coding gene functional annotations of the M. bilineatum genome.

Identification of non-coding genes

The tRNA, rRNA, miRNA, and snRNA non-coding RNAs are not translated into proteins. In the annotation process of non-coding RNAs, tRNAscan-SE (v1.3.1) software following the default parameters was used to find the tRNA sequences in the assembled genome according to the structural characteristics of tRNA49. BLASTN was applied to identify rRNA genes in the assembled genome according to the highly conserved characteristics of rRNA. In addition, according to the covariance model of Rfam database (v14.8), we used the INFERNAL program with default arguments to predict the miRNA and snRNA sequences50. Finally, 1424 rRNAs, 941 tRNAs, 588 snRNAs, and 49 miRNAs were annotated (Table 8).

Table 8 Statistics of the noncoding RNA in the M. bilineatum genome.

Comparative genomic analysis

The single-copy ortholog genes of M. bilineatum and 11 other molluscan species (Table S3), including Nautilus pompilius, Octopus minor, Bathymodiolus platifrons, Chrysomallon squamiferum, Elysia chlorotica, Biomphalaria glabrata, Candidula unifasciata, Pomacea canaliculate, Haliotis rubra, Gigantopelta aegis and Lottia gigantea, were determined using the “-l 1.5” parameter of hcluster_sq software from OrthoMCL (v2.0.9) to validate the phylogenetic relationships among the 12 molluscan species51. A total of 29157 gene families were determined, including 671 common orthologous gene families and 135 single-copy gene families, in the 12 molluscan species (Fig. 4; Table S4). The MAFFT (v7.487) software with default parameters was used to compare the single-copy genes52. All conserved sequences in the single-copy genes were extracted using Gblock (v0.91b) software with the “-t = c” parameter53. Subsequently, the ML phylogenetic tree was constructed using the “-f a -N 100 -m GTRGAMMA” parameter of RAxML (v8.2.12)54, with N. pompilius and O. minor as the outgroup. Moreover, the divergence time of the 12 mollusks were estimated using the MCMCtree (v4.4) program in software PAML (v4.9) with “clock = 3; model = 0” parameter according to the calibration times of N. pompilius-B. platifrons (619.1–527.6 MYA), B. platifrons-P. canaliculata (541.7–463.4 MYA), N. pompilius-O. minor (452.6–364.2 MYA), B. glabrata-P. canaliculata (496.0–310.0 MYA) and G. aegis-C. squamiferum (100.0–42.4 MYA) from the Timetree database55. The evolutionary tree showed that M. bilineatum and C. unifasciata were clustered together, and diverged ~231.4 MYA (Fig. 5). We also identified the expanded genes and contracted gene families in the 12 mollusks using CAFE (v5.0.0) with the “-p 0.05 -t 4 -r 10000” parameter56. The result showed that there were 879 expanded gene families and 1385 contracted gene families in the M. bilineatum (Fig. 5).

Fig. 4
figure 4

Distribution of genes in different species.

Fig. 5
figure 5

Phylogenetic analysis of M. bilineatum and 11 other mollusks. The green and red numbers on each branch represent the number of significantly expanded and contracted gene families, respectively. The blue numbers on each branch represent the divergence time (MYA) of these 12 mollusks.

Data Records

All sequencing data from three sequencing platforms have been uploaded to the NCBI SRA database (transcriptomic sequencing data: SRR2586702857, genomic Illumina sequencing data: SRR2590398958, genomic PacBio sequencing data: SRR2591904459 and SRR2591904360, Hi-C sequencing data: SRR2591915561 and SRR2591915462). The final chromosome-level assembled genome file has been uploaded to the GenBank database under the accession JAXGFX00000000063. Genome annotation files (including repeat-content annotation, gene structure annotation, gene functional annotation and non-coding genes annotation) have been uploaded to the Figshare database64.

Technical Validation

Evaluating quality of the DNA and RNA

Prior to the genome sequencing, we used the NanoDrop 2000 Spectrophotometer (Thermo Fisher Scientific, San Jose, CA, USA) and Qubit 3.0 Fluorometer (Thermo Fisher Scientific, San Jose, CA, USA) to determine the quality (OD260/280 and OD260/230) and concentration of the DNA and RNA samples to ensure the accuracy of sequencing data. We also used the agarose gel electrophoresis and Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, California, USA) to determine the integrity of the DNA and RNA samples.

Evaluating quality of the genome assembly

To evaluate the sequence consistency and assembly quality, the BWA (v0.7.17-r1188) and Minimap2 (v2.24_x64-linux) software were used to map the short reads from Illumina and HiFi reads from PacBio to the assembled genome, respectively65,66. After these processes, 99.35% of the short reads from Illumina and 99.62% of the HiFi reads from PacBio were aligned, covering 99.81% and 99.99% of the assembled genome, respectively (Table S5 & S6). Moreover, BUSCO (v5.4.3) analysis was conducted to evaluate the assembly quality based on the mollusca_odb10 database67. A total of 91.70% of the 5295 single-copy orthologs in the assembled genome were determined as complete, including 4015 single-copy (75.80%) and 842 duplicated (15.90%), 0.89% and 7.46% of the total single-copy orthologs were fragmented and missing, respectively (Table 9).

Table 9 Results of BUSCO analysis of the M. bilineatum genome.

Evaluating quality of the genome annotation

BUSCO (v5.4.3) analysis was conducted to evaluate the genome annotation quality based on the mollusca_odb10 database67. A total of 91.60% of the 5295 single-copy ortholog genes in the assembled genome were determined as complete, including 3912 single-copy genes (73.90%) and 939 duplicated genes (17.70%), 1.30% and 7.10% of the total genes were fragmented and missing, respectively (Table 9).