Dissecting the chromosome-level genome of the Asian Clam (Corbicula fluminea)

The Asian Clam (Corbicula fluminea) is a valuable commercial and medicinal bivalve, which is widely distributed in East and Southeast Asia. As a natural nutrient source, the clam is rich in protein, amino acids, and microelements. The genome of C. fluminea has not yet been characterized; therefore, genome-assisted breeding and improvements cannot yet be implemented. In this work, we present a de novo chromosome-scale genome assembly of C. fluminea using PacBio and Hi-C sequencing technologies. The assembled genome comprised 4728 contigs, with a contig N50 of 521.06 Kb, and 1,215 scaffolds with a scaffold N50 of 70.62 Mb. More than 1.51 Gb (99.17%) of genomic sequences were anchored to 18 chromosomes, of which 1.40 Gb (92.81%) of genomic sequences were ordered and oriented. The genome contains 38,841 coding genes, 32,591 (83.91%) of which were annotated in at least one functional database. Compared with related species, C. fluminea had 851 expanded gene families and 191 contracted gene families. The phylogenetic tree showed that C. fluminea diverged from Ruditapes philippinarum, ~ 228.89 million years ago (Mya), and the genomes of C. fluminea and R. philippinarum shared 244 syntenic blocks. Additionally, we identified 2 MITF members and 99 NLRP members in C. fluminea genome. The high-quality and chromosomal Asian Clam genome will be a valuable resource for a range of development and breeding studies of C. fluminea in future research.

www.nature.com/scientificreports/ of high quality. Compared with the estimated genome size of R. philippinarum (~ 1.32 Gb), that of C. fluminea was larger (1.64 Gb). The genome assembly work for R. philippinarum eventually produced a genome size of 1.12 Gb, which covered 84.85% of the estimated genome. The C. fluminea genome assembled a total of 1.52 Gb of genomic sequences, which covered 92.68% of the estimated genome. The other comparisons, including gene mean length, BUSCO evaluation, and the number of coding genes, etc., showed the genomic characteristics for these two species (Table 1). There were 19 and 18 chromosomes in R. philippinarum and C. fluminea genomes, respectively. The longest chromosome for C. fluminea was the chromosome 01, with a length of 144. 27 Mb, whereas the longest chromosome 19 for R. philippinarum was only 62.15 Mb (Table S10). The longest chromosome 01 for C. fluminea also happened to be the maximum scaffold (144.27 Mb) we assembled. The syntenic analysis generated 244 syntenic blocks between two genomes (Fig. 2b, Table S10 Analysis of protein families. Gene family analysis identified a total of 71,331 gene families among five species of bivalves (Table S11), and we discovered 23,063 gene families clustered by 38,841 protein-coding genes in the Asian Clam genome. Compared with the genome of R. philippinarum, Crassostrea gigas, Crassostrea virginica, and Bathymodiolus platifrons, the C. fluminea genome had 16,170 specific gene families (Fig. 3a). Additionally, Single-copy orthologs, multiple copy orthologs, other orthologs, and unique genes were identified in the all-to-all BLASTP analysis of entries for the reference genomes. The five bivalve species shared 146 single-copy orthologs, and the Asian Clam genome contained 25,878 unique genes (Fig. 3b, Table S12). Combining the phylogenetic relationships, gene family evolution was calculated by comparing the differences between ancestors and C. fluminea. This analysis resulted in 851 gene families being significantly expanded (P < 0.05) and 191 gene families being significantly contracted (P < 0.05) in the Asian Clam genome (Fig. 3c, www.nature.com/scientificreports/ Table S13). The 851 expanded gene families were clustered by 9,967 functional genes (Table S14). The functional enrichment analysis on GO and KEGG of those expanded genes identified 325 significantly enriched (q-value < 0.01) GO terms (Table S15) and 19 significantly enriched (q-value < 0.01) KEGG pathways (Fig. S2, Table S16). Among the significantly enriched KEGG pathways, we found taurine and hypotaurine metabolisms were significantly enriched.

Phylogenetic and gene family expansion analysis.
MITF gene family analysis. The genic tree comprising all MITF family genes was successfully constructed using MUSCLE (Fig. 4a). Most species possessed one or two MITF members, while Lottia gigantea lost MITF members. Crassostrea virginica and L. anatine possessed five and seven MITF members, respectively (Table S17). This result coincides with the result of the above gene family evolution analysis, which showed the MITF gene family expanded in Crassostrea virginica and Lingula anatina, and contracted in Lottia gigantea (Table S18). The genic tree also showed that MITF members originated from the same species were clustered at the nearest genetic distance. MITF members from the same families (family Mytilidae represented by B. platifrons and www.nature.com/scientificreports/ Mytilus coruscus, family Ostreidae represented by Crassostrea gigas and Crassostrea virginica), were clustered more together. The clustering relationships of MITF gene family were similar to those shown by the phylogenetic tree of single-copy orthologs. This finding indirectly corroborates the reliability of the phylogenetic relationship analysis.
In this study, we detected two members from the Asian Clam genome, namely EVM0008002 and EVM0031201, which were identified as MITF genes. Both genes contained an N-terminal domain TFEB_C_3 and a highly conserved functional domain HLH. The EVM0008002 was located at 47.05-47.08 Mb on chromosome 10, with a length of 28,761 bp, and encoded 533 amino acids. The position of EVM0031201 was close to that of EVM0008002, and it was also located on chromosome 10. The EVM0031201 was located at 46.99-47.02 Mb, with a length of 29,767 bp, and it encoded 533 amino acids, too. Both EVM0008002 and EVM0031201 contained 8 exons that comprising 533 amino acids, and 7 introns. The domain TFEB_C_3 of them started with 130 amino acids and ended with 268 amino acids, and was accompanied by 3 exons. The domain HLH of them started with 352 amino acids and ended with 405 amino acids, and was accompanied by 3 exons, too (Fig. 4b).
Among 533 amino acids, the types and sequences of 530 amino acids for these two genes were consistent, only three amino acids showed the differences. The three differences of amino acids were located at position of 87, 89, and 90, respectively. Specifically, the amino acids of EVM0008002 at position of 87, 89, and 90, were Leucine (L), Histidine (H), and Alanine (A), respectively. The amino acids of EVM0031201 at position of 87, 89, and 90, were Histidine (H), Asparagine (N), and Threonine (T), respectively (Fig. 4b).
NLRP gene family analysis. NLRP (Nucleotide-binding oligomerization domain, Leucine rich Repeat and Pyrin domain containing Proteins) is well known for its roles in apoptosis and inflammation. Among all the species involved in the evolutionary analysis, the number of NLRP members in C. fluminea (99) was more than that of most species, except P. imbricata (150) and Capitella teleta (105) (Fig. 5a). Specifically, the number of NLRP members in C. fluminea was more than that shown in B. platifrons (12), Mytilus coruscus (16), Mizuhopecten yessoensis (22), Crassostrea gigas (28), Crassostrea virginica (47). Additionally, we analyzed the domain NACHT of C. fluminea (99) in the table of the expanded gene families in C. fluminea (Table S14), which was significantly expanding compared to its ancestors (10). Among the 99 NLRP members in C. fluminea genome, 45 members possessed domain DUF4559, 12 members possessed domain DUF4062, and 5  Table S19). Meanwhile, we found that all five members (EVM0034661, EVM0036165, EVM0036449, EVM0010937 and EVM0021611) contained 3 domains, and two members (EVM0032343 and EVM0035132) contained 4 domains. The NLRP members of C. fluminea grouped into five subfamilies (subfamily a-e) (Fig. 5c). Subfamily a owned 12 members clustered by the same or similar protein domains, as the same as subfamily b to e possessed 36, 3, 18, and 25 members, respectively (Table S20). Five of 99 members did not cluster into any subfamily.   Hi-C technologies. DNA was extracted from the whole body with the gut removed, and it was cross-linked in situ using formaldehyde with a final concentration of 2% and homogenized with tissue lysis by the restriction enzyme HindIII. The libraries for Hi-C with insert sizes of 300-700 bp were sequenced on an Illumina HiSe q X Ten platform (Illumina, SanDiego, CA, USA). Two Hi-C libraries generated data for chromosomal building.

Methods
Transcriptome sequencing. Using TRIzol (Thermo Fisher, USA), the RNA was extracted from the whole body with the gut removed, and the libraries were generated using a NEBNext Ultra RNA Library Prep Kit for Illumina (NEB, USA) following the instruction manual. The data was used to alignment to the assembled genome for prediction of coding genes.
Genome estimation. Illumina 31 , followed by k-mer analysis to estimate the genomic features.
In this study, we plotted the 21-mer depth distribution (k = 21) to estimate the genome size, heterozygosity, and repeats using Jellyfish (version 2) 32 . Genome size estimation was implemented by the formula G = N21-mer (total number of k-mers)/D 21-mer (k-mer depth of the main peak). The repetitive content was accumulated from where the depth of k-mer was more than two times of the main peak, and the heterozygosity were estimated at where the depth was half of the main peak.
Denovo assembly. Using the long single molecular reads from PacBio, the pipelines of workflow were as follows in the genome assemblies. Firstly, the clean data from PacBio were subjected to error correction using Canu (version 1.5) 33 with the parameter of error correct coverage = 60. Subsequently, the outputs were piped into the workflow of SMART denovo (version 1.0) 34 , and the genomic contigs were automatically generated with the parameters of J = 5000, A = 1000, and r = 0.95. Finally, the preliminary assembly was polished three times by Racon (version 1.32) 35 , resulting in the first correction being successfully realized. Illumina reads specifically for genome estimation were prepared for the second correction, and this round of correction could solve the high error rate of the third generation sequencing. The third round of correction was implemented by Pilon (version 1.22) 36 , and the error correction was run for three times.
Hi-C scaffolding. The contigs generated by the preliminary genome assembly required filling of gaps and anchoring on the putative chromosomes. The initial contigs were piped into the Hi-C assembly workflow, and the signals of chromatin interactions were captured to construct chromosomes. In brief, the putative Hi-C junctions were aligned by the unique mapped read pairs using BWA-MEM (version 0.7.10-r789) 37 . The paired reads uniquely mapped to the assembly were called the valid interaction pairs, and they were used for the Hi-C scaffolding. Other invalid reads included reads of self-ligation and non-ligation; dangling ends were filtered out using HiC-Pro (version 2.10.0) 38 . The Hi-C reassembly broke the contigs into 50 kb fragments, and the regions that were mismatched to the initial assembly or could not be restored were listed as candidate error areas. The genome was subjected to a final round of error correction, and the gaps were filled during this round. The reassembled and corrected contigs were divided into ordered, oriented, and anchored groups by LACHESIS 39 with the parameters CLUSTER_MIN_RE_SITES = 33; CLUSTER_MAX_LINK_DENSITY = 2; CLUSTER_ NONINFORMATIVE_RATIO = 2; ORDER_MIN_N_RES_IN_TRUN = 29, and ORDER_MIN_N_RES_IN_ SHREDS = 29, automatically resulting in putative chromosomes. The gaps generated during the Hi-C assembly were refilled using LR GapCloser (version 1.1) 40 .
Genome quality evaluation. The genome of C. fluminea was aligned to the Mollusca database (OrthoDB10) comprising 5,295 conservative core genes by BUSCO (version 3.0) 41 . The CEGMA Database comprising 458 conserved core genes of eukaryotes was searched in the same way using CEGMA (version 2.5) 42 . The Illumina short-read alignments mapped to the assembled genome of the Asian Clam using BWA-MEM (version 0.7.10-r789) 37 .

Repeats analysis.
There are two main types of repeats, retrotransposons (Class I in our analysis) and transposons (Class II in our analysis).We constructed a specific repeats database for repeat prediction using LTR-FINDER (version 1.05) 43 and RepeatScout (version 1.0.5) 44 , followed by the identification and classification for repeats by PASTEClassifer (version 1.0) 45 . The species-specific repeats library for the Asian Clam genome was successfully generated by aggregating our prediction and Repbase (19.06) 46 . LTR characteristics for the clam were processed by RepeatMasker (version 4.0.6) 47 .
Genome annotation. Gene annotation. We utilized de novo-, homology-, and transcriptome-based methods to predict protein-coding genes. Five tools employed were Genscan (verson3.1) 48  Non coding gene annotation. The other genome features, including pseudogenes and non-coding RNAs, were identified by referring to the miRbase database (version 21.0) 59  Comparative analysis of C. fluminea and R. philippinarum genomes. The genome data of R. philippinarum (https:// doi. org/ 10. 1016/j. isci. 2019. 08. 049, 2019) that is also belonging to the order Veneroida was used to conduct the comparative analysis with the C. fluminea genome. The process of the comparison included genome size, assembly index, evaluation and collinearity, which was helpful to better understand the genome of C. fluminea. For collinearity analysis, we compared the C. fluminea genome with the genome of R. philippinarum using MUMmer (http:// mummer. sourc eforge. net), with the parameter l = 10,000. The genomes of C. fluminea and R. philippinarum were subjected to a synteny analysis to show the connections and syntenic blocks using BLASTP (E < 1e−05) 30 , and the visual graphics were generated by MCScan [https:// github. com/ tangh aibao/ jcvi/ wiki/ MCscan-(Python-version)]. Each syntenic block comprised at least five sequential genes, which were all distributed in two genomes.

Gene family identification.
Protein data from C. fluminea and other representative species (all Bivalve species and some mollusks with assembly and annotation that could be found in NCBI or other databases), including Capitella teleta, Lingula anatina, Octopus vulgaris, Lottia gigantea, R. philippinarum, Crassostrea gigas, Crassostrea virginica, P. imbricata, Mizuhopecten yessoensis, Mytilus coruscus, and Bathymodiolus platifrons, were retrieved in the corresponding databases and aligned using BLAST (version 2.2.31, https:// ftp. ncbi. nlm. nih. gov/ blast/ execu tables/ blast+/ LATEST/) 30 with a maximum e-value of 1e−5. Proteins with sequence lengths > 100 amino acids were searched against the Pfam (https:// pfam. xfam. org) database by Pfam scan 70 . The domain of gene feature was made by the Gene Structure Display Server -GSDS (version2.0) 71 . Protein sequences were clustered using CD-HIT 72 , with a length difference cutoff of 0.7, and finally concatenated to a single fasta file. The ortholog groups for gene families were generally clustered using OrthoMCL (version 2.0.9) 73 . The R package (version 4.1.0, https:// mirro rs. bfsu. edu. cn/ CRAN/) was used to generate the column chart. Four selected bivalves (R. philippinarum, Crassostrea gigas, Crassostrea virginica, and B. platifrons) and C. fluminea were grouped together to conduct the analysis for gene family characteristics, and the venn was generated by the R package (version 4.1.0, https:// mirro rs. bfsu. edu. cn/ CRAN/).

Phylogenetic tree reconstruction and divergence time estimation.
The single-copy orthologs from all involved species were statistically analyzed using the longest transcripts for each gene. The single-copy orthologous genes shared by the above 12 species (including C. fluminea) were aligned using MUSCLE (version 3.8.31) 74 . The super-alignment of nucleotide sequences provided a reference tree topology using PhyML (version 3.3) 75 . The divergence times among species were roughly estimated by the MCMC Tree program of the PAML package (version 4.7a) 76 with the approximate likelihood calculation method. We utilized molecular clock data from the TimeTree (http:// www. timet ree. org/) 77 database as the calibration times. The phylogeny tree was optimized by iTOL (version 6 https:// itol. embl. de/). Gene family evolutionary analysis. According to divergence times and phylogenetic relationships, CAFÉ (version 4.2) 78 was used to analyze gene family evolution. The gene family expansion and contraction were analyzed by comparing the differences between the ancestor and involved species. The expanded family genes for C. fluminea were extracted and aligned to the functional enrichment on GO and KEGG to detect their functions.
Prediction of specific protein domains. Pfam database provided protein domains, and the specific proteins with sequence lengths > 100 amino acids, were searched against it for specific gene families analysis. GSDS (version2.0) and R package (version 4.1.0) was used to generate the visual gene feature and the column chart, respectively. The MITF gene family consisted of three domains, namely TFEB, TFEC, and TFE3 79  www.nature.com/scientificreports/ we utilized protein-coding sequences from the representative species to analyze the members of MITF gene family, especially the structure and amino acid composition of members in the C. fluminea genome. The core domain of NALP family was NACHT 80 , which was used to analyze the structure and distribution of NALP family members in C. fluminea genome.

Discussion
In this study, we assembled a chromosome-level Asian Clam genome using a combination of PacBio and Hi-C technology. Generally, a complex genome is defined as a heterozygosity ratio greater than 0.8% and a repeat ratio greater than 60%. The high repeats (69.66%) and heterozygosity rate (2.41%) of C. fluminea genome bring great difficulties to assembly, we still assembled and obtained a high-quality and chromosomal genome. The 1.52 Gb of genome data distributed across 18 chromosomes, with a contig N50 of 521.06 Kb and a scaffold N50 of 70.62 Mb. The scaffolding process for the Asian Clam genome showed a high level of efficiency (more than 99% genomic sequences and more than 97% contigs were located on chromosomes). The 18 chromosomes of C. fluminea covered 92.68% of the whole genome, and the longest chromosome 01 was 144.27 Mb. These data results are strong evidence of our ultra-high quality genome. In present study, the phylogenetic relationship suggested that the ancestors of C. fluminea and its closest relative R. philippinarum diverged from the common ancestors of other six bivalves, ~ 492.00 million years ago. It is consistent with the origin time of Heterodonta from the Paleozoic 81,82 . The genetic distance between the two species and other marine bivalves is relatively far. However, despite C. fluminea and R. philippinarum share over 240 syntenic genome blocks, there are still great habitat and adaptation differences between them. The majority of C. fluminea is living in typical freshwater ecosystem, while the brackish water species R. philippinarum is mainly distributed in the coastal area 83 . The phylogenetic relationship showed that C. fluminea and R. philippinarum diverged at an early stage of ~ 228.89 million years, coinciding with the divergency event of Veneroida occurring in the Mesozoic and Cenozoic eras 84 . This evidence suggests that as a freshwater bivalve, C. fluminea had been diverged from other bivalves million years ago. A long-term divergency and evolutional process resulted in the unique survival mechanism or environmental adaptation of the Asia Clam. Thus, the ancestors of C. fluminea might have invaded and migrated to freshwater from the ocean since millions of years ago, and they have evolved to fill various freshwater habitat.
On account of short sexual maturity time, rapid growth, short life cycle and planktonic veliger stage, C. fluminea has strong diffusion ability 85 , and it is considered as an alien species in America and Europe [11][12][13] . The strong reproductive capacity and a powerful immune system might be bound to play an important role. In this study, we identified two gene families, MITF and NLRP, which were respectively related to the immune and reproductive adaptability of C. fluminea. It has been reported that microphthalmia-associated transcription factor (MITF) plays an important role in immune defense and shell color formation in molluscs 86,87 . We identified two MITF genes (EVM0008002 and EVM0031201) in the Asian Clam genome. They both encoded 533 amino acids, only three of which were different. These two genes were located on chromosome 10, and their physical distance was very close. Specifically, EVM0031201 was located at 46.99-47.02 Mb, and EVM0008002 was located at 47.05-47.08 Mb. The EVM0031201 and EVM0008002 were so close to each other, which may be a duplication of the genome region, and this duplication may include one or more genes. Except for functions in apoptosis and inflammation, several NLRPs have been indicated as being involved in reproduction as well 88 . The 99 members of NLRP family in C. fluminea genome were significantly more than that of most of the candidate species, and the NLRP gene family was significantly expanded comparing to its ancestors, with 10 NLRP members. We infer the expansion of NLRP family may be related to the strong reproductive function of C. fluminea. The genomic information presented in our analysis will help to better understand, develop, and improve C. fluminea as well as establish a strong foundation for genome-assisted breeding programs in the future.

Data availability
Raw sequencing reads for PacBio and Illumina are available at GenBank as BioProject PRJNA657911. Raw sequencing data (Illumina, PacBio, and Hi-C data) have been deposited in the SRA (Sequence Read Archive) database as SUB7507164. The data including assembly and annotation that supported the findings of this study have been deposited in the in the FigShare database, (https:// doi. org/ 10. 6084/ m9. figsh are. 12805 886. v1).