Background & Summary

Giant clams are tropical marine shellfish mainly distributed in the Indian Ocean, Western Pacific, and South China Sea. There are twelve species of giant clams, divided into two genera, with10 species in Tridacna and 2 in Hippopus1. They play a crucial role in coral reef ecosystems, contributing over 60% of the biomass of coral reef ecosystems2. Giant clams support coral reef biodiversity, offer habitats, breeding and feeding grounds to various marine organisms, and have extremely important ecological value3,4. Giant clams are hermaphrodites, initially functioning as males and later developing female gonads and functioning both as male and female5. To avoid the occurrence of self-fertilization, giant clams first release sperm, and then eggs6. Bivalves often form symbiotic associations with bacteria, algae, and other marine fauna7. There is a symbiotic relationship between giant clams and zooxanthellae. Unlike intracellular symbiosis in stony corals, the zooxanthellae in clams are intercellular and live within the mantle8. The symbionts supply nutrients to the host through photosynthesis. While also obtaining some essential nutrients from the host. Notably, symbionts are not transmitted vertically and must be acquired from the environment during the ontogeny of the second larval stage, veliger9. Additionally, some bivalves from deep sea engage in symbiosis with chemosynthetic bacteria, which are the primary producers of deep-sea cold seeps and vents10.

Among Tridacna species, T. crocea is the smallest, with a maximum shell length of no more than 20 cm, growing at a rate of about 4 cm per year, reaching sexual maturity in 1–2 years11. The shell is shallow, with two equal sides and the same shape and size. Despite its slow growth and small size, T. crocea is known for its vibrant colors and beautiful appearance, making it valuable in food markets, the aquarium trade markets and tropical coral reef ecosystems12. Moreover, its photoautotrophic characteristics contribute to oxygen production, benefiting marine organisms13. However, anthropogenic disturbances, such as global warming, habitat destruction and over-harvesting, have led to declining giant clam populations, resulting in giant clams been listed on the IUCN red list (IUCN, 2007).

Despite the ecological importance of giant clams, their genomic features have remained unclear. In fact, previous molecular studies of giant claims have focused on phylogeographical patterns14,15, as well as the expression and functional analysis of specific genes16,17. Limited transcriptome data are available18,19. Recently, a genomic survey and resources for T. crocea were conducted, which involved determining the genome size, predicting unique content, and providing partial annotations, and assemblies20. The lack of genomic information has been a hindrance to the study of the evolutionary and ecological characteristics of giant clam. Recently, the Pacific Biosciences (PacBio) high-fidelity reads (HiFi) have been successfully applied to various complex species and sex chromosomes, such as cultivated apple (high heterozygous)21, cultivated alfalfa (utotetraploid)22, and human X chromosome23. In the present study, the chromosome-level genome of T. crocea was analyzed for the first time using PacBio HiFi reads, Phase genomics Proximo Hi-C technologies, and Illumina short-read sequencing. In order to predict the relationship between T. crocea and other bivalves, gene prediction, functional annotation and phylogenetic analysis were performed. The genome sequence of the giant clam is an important resource for genetic and breeding studies.

Methods

Experimental samples collection and sequencing

T. crocea were sampled from a tropical marine biological research station in Sanya, Hainan province. The giant clams were immediately anaesthetized, and muscle was extracted for DNA isolation using the modified cetyltrimethylammonium bromide (CTAB) method. The quality and quantity of genomic DNA were assessed using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific) and a Qubit 2.0 fluorometer (Thermo Fisher Scientific). DNA integrity was confirmed using a 0.8% agarose gel.

Three distinct genome libraries were created and sequenced in accordance with the manufacturer’s instructions to produce a chromosome-scale assembly of the giant clam: (i) In accordance with the standard PacBio methodology, PCR-free SMRTbell DNA libraries were created utilizing the BluePippin size selection system. The PacBio Sequel system was used to produce long reads; (ii) Phase Genomic’s Hi-C chromosomal conformation captured reads were prepared with the Proximo Hi-C (Animal) Prep Kit and sequenced; (iii) Purified DNA was sheared using a focused ultrasonicator (Covaris) and then used for 350-bp paired-end library construction with the Next Ultra DNA library prep kit (NEB) for Illumina sequencing, the Illumina NovaSeq. 6000 platform was used to sequence short reads (150 bp in length). RNA was extracted from the giant clam mantle and sequenced on the Illumina NovaSeq platform in order to fully aid gene annotation. To construct a high-quality reference genome for the Tridacna crocea, the whole genome sequencing generated ~167 × Pacbio Sequel long reads (218.24 Gb) (Table 1), ~105 × Hi-C reads (136.70 Gb) and ~45 × Illumina paired-end reads (58.50 Gb) (Table 2).

Table 1 Statistic of Pacbio whole genome sequencing data.
Table 2 Statistic of illumina data.

Genome assembly with Pacbio data and Hi-C data

The Pacbio reads were firstly assembled with Falcon software packages (v2.0.5)24 to build the primary contigs and alternate haplotigs (alternative sequences for regions within the primary contigs where heterozygosity was detectable with the long reads). Tool arrow (v2.2.2) as implemented in SMRTlink6.0 (Pacific Biosciences of California, Inc) was used to polish the contigs. The FALCON-Phase software (v0.2.0-beta) was then used to perform a Hi-C-based contigs phasing, resulting in phased, diploid contigs. The chromosome-scale scaffolds were constructed from the phased contigs using Phase Genomics’ Proximo Hi-C genome scaffolding platform25. Subsequently, Juicebox (v1.8.8)26 was used for a round of polishing to fix minor mistakes in chromosome assignment, ordering, and orientation during chromosomal scaffolding. After a draft set of scaffolds was generated, FALCON-Phase was run again for Hi-C based scaffold phasing. The Illumina sequencing data were further used to improve the assembly by Pilon (v1.22) software27. Finally, the Pacbio reads were initially assembled with Falcon software packages, producing an initial contig assembly, then the assembly was integrated with Phase Genomics Hi-C data to orient and order contigs into chromosome-scale scaffolds. About 78.88% of the 1.30 Gb final Tridacna crocea assembly was assigned to 18 superscaffolds (Fig. 1), with a scaffold N50 and contig N50 of 56.38 Mb and 1.29 Mb, respectively (Table 3). The length distribution of pacbio long reads indicates the peak length is longer than 4 kb (Fig. S1). This result is consistent with the results of other aquatic animals28,29,30,31,32.

Fig. 1
figure 1

Hi-C contact heatmap and Circos plot illustrate the characterization of Tridacna crocea genome, (A) Genome-wide analysis of chromatin interactions in the Tridacna crocea genome. (B) I: Syntenic regions within the Tridacna crocea assembly base on homology searches carried out by conducting with MCscan (version 0.8) requiring at least 10 genes per block; II: GC content in non-overlapping 1 Mb windows; III: Percent coverage of TEs in non-overlapping 1 Mb windows; IV: Gene density calculated on the basis of the number of genes in non-overlapping 1 Mb windows; V: The length of scaffolds in the size of Mb.

Table 3 Features of Tridacna crocea genome.

Repeat annotation

There are a large number of repeat sequences in the Tridacna crocea genome, which can be divided into two categories according to the distribution pattern, namely tandem repeat sequences and interspersed repeat sequences. Tandem repetitive sequences were identified using GMATA33 and Tandem Repeats Finder (TRF, version 4.07b)34 with default parameters. The interspersed repeat contents of the Tridacna crocea genome were identified using two methods, de novo repeat identification and known repeat searching against existing databases. RepeatModeler (v1.0.11) and MITE-hunter35 were used to de novo predict repeat sequences in the genome, the homology-based approach involved applying RepeatMasker (version 1.331) (http://www.repeatmasker.org/) and Repbase database36 to identify TE repeats in the assembled genome. The results showed that 71.60% of the assembly consisted of repetitive sequences (Table 4, Fig. 2). The proportion of repeat elements was higher than that of close relatives of mollusks, such as Patinopecten yessoensis (39%)37, Crassostrea gigas (43%)38 and Sinonovacula constricta (40%)29, given that repetitive sequences are the main drivers of genome amplification, T.crocea presents a larger genome size compared to the three closely related species (Table 5). Among these repetitive sequences, transposable elements (TEs) accounted for 55.83% of the T. crocea genome size, with DNA transposons to be the most predominant type (37.68% of the genome size).

Table 4 Repeat content in the assembled Tridacna crocea genome.
Fig. 2
figure 2

Distribution of divergence rate of each type of TE. The divergence rate was calculated between the identified TE elements in the genome by homology-based method and the consensus sequence in the Repbase.

Table 5 Features of Mollusk assemblies.

Gene prediction and functional annotation

Gene prediction in a repeat-masked genome was performed using reference guided transcriptome assembly, homology search and ab initio prediction. By combining transcriptome alignment, homologous protein prediction and ab initio prediction. In detail, proteins of four mollusks (Crassostrea gigas, Crassostrea virginica, Mizuhopecten yessoensis, Octopus bimaculoides) were downloaded from NCBI DataBase for homolog prediction, GeMoMa39 was used to align the homologous peptides to the assembly and then got the gene structure information. For RNAseq-based gene prediction, filtered mRNA-seq reads were aligned to the reference genome using STAR40. The transcripts were then assembled using StringTie241 and open reading frames (ORFs) were predicted using PASA42. For the de novo prediction, RNA-seq reads were de novo assembled using stringtie and analyzed with PASA to produce a training set. Augustus43 with default parameters were then utilized for ab initio gene prediction with the training set. Finally, EVidenceModeler (EVM)44 was used to produce an integrated gene set of which gene with TE were removed using TransposonPSI package (http://transposonpsi.sourceforge.net/) and the miscoded genes were further filtered. Untranslated regions (UTRs) and alternative splicing regions were determined using PASA based on RNA-seq assemblies. We retained the longest transcripts for each locus, and regions outside of the ORFs were designated UTRs. We predicted 25,440 protein-coding genes with an average gene length of 25,946 bp and an average 8.43 exons per gene. Functional annotation based on public databases (including SwissProt, NR, KEGG, KOG and Gene Ontology) estimated that 23,017 (90.48%) genes could be classified by at least one of the databases (Fig. 3). In addition, we annotated four types of non-coding RNAs in the T. crocea assembly, including micro RNA (miRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), and small nuclear RNA (snRNA). The tRNA genes were predicted by an improved tool for tRNA detection, tRNAscan-SE (version 1.3.1)45 with default paramerters. The rRNA fragments were predicted by aligning to invertebrate template rRNA sequences using BlastN (version 2.2.24) at an E-value of 1e-5. The snRNAs as well as miRNAs were identified using INFERNAL (version 1.1.1)46 to search against the Rfam database (release 12.0). A total of 1,963 non-coding RNA (ncRNA) were determined in the Tridacna crocea genome, including 62 micro RNA (miRNA), 58 small nuclear RNA (snRNA), 83 ribosomal RNA (rRNA), and 1,760 transfer RNA (tRNA) (Table 6).

Fig. 3
figure 3

Intersections of gene families between eleven animals (Tridacna crocea, Aplysia californica, Crassostrea gigas, Crassostrea virginica, Helobdella robusta, Lottia gigantea, Mizuhopecten yessoensis, Octopus bimaculoides, Drosophila melanogaster, Homo sapiens and Nematostella vectensis). The figure was constructed by UpSetR, in which the rows represent the gene families and the columns represent their intersections. Black filled circle represents part of a given intersection; light gray circle represents not a part of the intersection. Bar chart placed on top of the matrix represents the size of the intersections. A second bar chart represents the size of the each set.

Table 6 Non-coding RNAs in the Tridacna crocea assembly.

Comparative genomic and phylogenetic analysis

We clustered the protein-coding genes into gene families for T. crocea Aplysia californica (GCF_000002075.1), Crassostrea gigas (GCF_902806645.1), Crassostrea virginica (GCF_002022765.2), Helobdella robusta (GCF_000326865.1), Lottia gigantean (GCF_000327385.1), Mizuhopecten yessoensis (GCF_002113885.1), Octopus bimaculoides (GCF_001194135.1), Drosophila melanogaster (GCF_000001215.4), Homo sapiens (GCF_000001405.39) and Nematostella vectensis (GCF_000209225.1) (Table 7). 27,422 gene families were identified, of which 3,109 were shared by all eleven species. Comparing with other ten species, there are 347 specific gene families in the T. crocea assembly (Fig. 3), among these T. crocea specific families, 953 genes are supported by evidence of gene functional annotation. These T. crocea specific genes were significantly (P < 0.05) enriched in zinc ion binding, extracellular ligand-gated ion channel activity, integral component of membrane, ion transport related gene ontology (GO) categories (Table 8).

Table 7 Statistic analysis of gene families.
Table 8 GO enrichment of positive selection genes in Tridacna crocea.

A phylogenetic tree was constructed using the eleven animal species (Fig. 4). Protein sequences were extracted from each family and concatenated to form one supergene for each species, and the maximum likelihood method47 was used to reconstruct the phylogenetic tree. The divergence time among the eleven animals was estimated using the MCMCtree program (version 4.4) as implemented in the Phylogenetic Analysis of Maximum Likelihood (PAML) package48, with a correlated rates clock and JC69 nucleotide substitution model. The divergence time between T. crocea and M. yessoensis was predicted to be about 505.7 million years ago (MYA). Compared with the common ancestor of T. crocea, M. yessoensis, C. gigas and C. virginica, Tridacna crocea shows 93 and 15 events of gene family expansion and gene family contraction, respectively. The expanded genes in T. crocea are related with “DNA replication” (GO:0006260), “DNA-directed DNA polymerase activity” (GO:0003887), “nucleotide binding” (GO:0000166), “methyltransferase activity” (GO:0008168), and so on. On the other side, the contracted genes in the T. crocea were significantly (P < 0.05) enriched in GO terms for “iron ion binding” (GO:0005506), “heme binding” (GO:0020037), “oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen” (GO:0016705), and “oxidation-reduction process” (GO:0055114).

Fig. 4
figure 4

Phylogenetic tree with history of orthologous gene families and divergence times. Numbers on the nodes represent divergence times; parentheses represent error range; MRCA represents most recent common ancestor.

Data Records

The raw Illumina, PacBio, Hi-C sequencing and full length transcriptome data are deposited in the NCBI SRA database under the accession numbers SRR1713764449, SRR1713764550, SRR1713764351, and SRR2565102152, respectively. The genome assembly and annotations are available from the Figshare53,54 and the assembly genome are also deposited at the NCBI with accession number GCA_032873355.155.

Technical Validation

Evaluation of the genome assembly

The Hi-C heatmap exhibits the accuracy of genome assembly, with relatively independent Hi-C signals observed between the 18 pseudo-chromosomes (Fig. 1B). To evaluate the quality of the genome assembly, the completeness of the genome assembly was assessed using the conserved metazoan gene set “metazoan_odb10” from the Benchmarking Universal Single-Copy Orthologs (BUSCO) v4.054. The genome assembly was found to have a high level of completeness (94.2%). 74.2% were complete and single-copy, 20% complete and duplicated, 0.6% fragmented, and 5.2% were missing (Table 9). This demonstrates the remarkable completeness and conservation of gene content in giant clam genome assembly, achieving one of the best BUSCO scores observed among reported mollusks. Therefore, these results suggested that the quality of this genome assembly is high.

Table 9 Statistic of the Tridacna crocea assembly gene-space with the 978 BUSCO metazoa gene set.

Genome annotation and phylogenetic analysis

By comparing with public databases including Gene Ontology, KOG, SwissProt, KEGG and NR, gene function information, motifs and domains of their proteins were assigned (Table 10). InterProScan program56 with default parameters was used to identify the GO terms and putative domains of genes. For other four databases, the EvidenceModeler-integrated protein sequences against the 4 public protein database were compared using BLASTp57 with an E value cutoff of 1e−05. Results from the five database searches were concatenated.

Table 10 Functional annotation of the predicted genes in the assembly of Tridacna crocea.

The maximum likelihood method was performed to reconstruct the phylogenetic tree according to47. The divergence time among the eleven animals were predicted by the MCMCtree program (version 4.4) of Phylogenetic Analysis of Maximum Likelihood (PAML) package48, with a correlated rates clock and JC69 nucleotide substitution model. The TimeTree database was used to predict the calibration times of divergence between Octopus bimaculoides and Crassostrea gigas (~554MYA)58.