Background & Summary

The application of genomics approaches to study non-model organisms is deemed a key approach to assess biodiversity and guide conservation1,2,3,4. Whole genome assemblies provide access to a species’ “entire genetic code”, thus representing the most comprehensive framework to efficiently decipher a species’ biology5,6. Genomic resources allow accurate definition of conservation units, identification of genetic elements with conservation relevance, inference of adaptive potential, assessment of population health, as well as provide predictive assessments of the impact of human-mediated threats and climate change3,5,7,8. Consequently, assembled genomes and other genomic tools are key resources to study and guide conservative actions and management planning.

Bivalves of the Order Unionida (known as freshwater mussels) are commonly found throughout most of the world’s freshwater ecosystems, where they play key ecological roles (e.g., nutrient and energy cycling and retention)9,10,11 and provide important services (e.g., water clearance, sediment mixing, pearls, and other raw materials)9,10,12. Despite their indisputable importance for freshwater ecosystems, freshwater mussels are among the most threatened taxa, with many populations worldwide having well-documented records of continuous declines over the last decades, as well as of many local and global extinctions13,14,15. Threatened species with limited distributions, such as the dolphin freshwater mussel U. delphinus Spengler, 1793 (Unionida: Unionidae) only found in the western Iberian Peninsula region (Fig. 1), represent particularly urgent but challenging targets for conservation16. The dolphin freshwater mussel, only recently recognised as a valid species17, has been strongly affected by a series of human-mediated actions over the last decades, including habitat destruction, dams or barrier construction, pollution, poor river management, water depletion, the introduction of invasive species, among others16,18. All these pressures are further augmented by the effects of climate change, especially the increasing interannual variability of water cycles over the last decades, which is particularly evident in the Mediterranean region19,20. As a consequence, the area of occurrence of the dolphin freshwater mussel has been reduced by almost one-third from its historical distribution18. This concerning trend has triggered an unprecedented effort to understand the threats and promote and implement conservation policies. These are critically dependent on the understanding the multiple aspects of the species’ biology, such as its life history, reproductive demands, ecological requirements, and its abiotic and biotic interactions13,16,18,21.

Fig. 1
figure 1

Top left: The Unio delphinus specimen used for the whole genome assembly. Top Right: The map of the potential distribution of Unio delphinus generated by overlapping points of recent presence records (obtained from13) with the Hydrobasins level 5 polygons72. Bottom Left: An Unio delphinus individual in its natural habitat. Bottom Right: A population of Unio delphinus in its natural habitat (Photos by Manuel Lopes-Lima).

Recent efforts have focused on providing a thorough characterization of the species’ genetic diversity, population structure, and evolutionary history21,22,23. Despite the undeniable achievements of these early molecular studies, the availability of large-scale and more biologically informative genomic resources is almost inexistent, not only for U. delphinus but also for all freshwater mussels. In fact, for approximately 1000 known species, only four whole genome assemblies24,25,26,27 and less than 20 transcriptomes are currently available28,29,30,31,32,33,34,35,36,37,38,39,40,41. Recently, the first transcriptome assemblies of five threatened European freshwater mussel species have been published, including the gill transcriptome of the dolphin freshwater mussel41. This transcriptome was a fundamental tool to begin studying the evolutionary and adaptive traits of the species. However, single tissue RNA-seq approaches only capture a small fraction of the genetic information. Conversely, whole genome sequence assemblies represent a highly informative and fruitful resource to investigate and decipher multiple aspects of the species’ biology.

Here, we provide the first whole genome assembly of the dolphin freshwater mussel, U. delphinus. This is the most contiguous freshwater mussel genome assembly available, and the first Unionidae genome assembly from a European species. This genome provides a unique tool for an in-depth exploration of the many molecular mechanisms governing the biology of this species, which will ultimately guide conservation genomic studies to protect the critically declining population trend.

Methods

Animal sampling

One individual of Unio delphinus was collected in the Rabaçal River in Portugal (Table 1) and transported alive to the laboratory, where tissues were separated, flash-frozen, and stored at −80 °C. The whole shell and preserved tissues of the individual are deposited at CIIMAR tissue and mussels’ collection, under the voucher code: BIV7592.

Table 1 MixS descriptors for the Unio delphinus specimen used for whole genome sequencing.

DNA extraction, library construction, and sequencing

For PacBio HiFi sequencing, mantle tissue was sent to Brigham Young University (BYU), where high-molecular-weight DNA extraction and PacBio HiFi library construction and sequencing were performed, following the manufacturer’s recommendations (https://www.pacb.com/wp-content/uploads/Procedure-Checklist-Preparing-HiFi-SMRTbell-Libraries-using-SMRTbell-Express-Template-Prep-Kit-2.0.pdf). Size-selection was conducted on the SageELF system. Sequencing was performed on four single-molecule, real-time (SMRT) cells using Sequel II system v.9.0, with a run time of 30 h, and 2.9 h pre-extension. The circular consensus analysis was performed in SMRT® Link v9.0 (https://www.pacb.com/wp-content/uploads/SMRT_Link_Installation_v90.pdf) under default settings (Table 2).

Table 2 General statistics of raw sequencing reads used for the Unio delphinus genome assembly.

For short read Illumina sequencing, extracted genomic DNA was sent to Macrogen Inc. where a standard Illumina Truseq Nano DNA library preparation and whole genome sequencing of 150 bp paired-end (PE) reads was achieved using an Illumina HiSeq X machine (Table 2).

Pre-assembly processing

Illumina PE short read quality was evaluated using FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and after, reads were quality trimmed with Trimmomatic v.0.3842, specifying the parameters “LEADING: 5 TRAILING: 5 SLIDINGWINDOW: 5:20 MINLEN: 36”. The quality of the clean reads was re-validated in FastQC. The clean reads were used to estimate genome size, heterozygosity and repetitive content using Jellyfish v.2.2. and GenomeScope v2.043 specifying a k-mer length of 25.

Mitochondrial genome assembly

PacBio HiFi reads were used to retrieve a whole mitochondrial genome (mtDNA) assembly by applying a pipeline recently developed by our group44. Briefly, all Unionida mtDNA assemblies available on NCBI were retrieved (Fasta format; Supplementary_File1) and used as a reference mitogenome database. All the raw PacBio HiFi reads were mapped to the mitogenome database using Minimap2 v.2.1745, specifying parameters (-ax asm20). The output sam file was converted to bam and sorted using Samtools v.1.946, with options “view” and “sort”, respectively. Samtools “view” was also used to retrieve only the mapped reads with parameter (-F 0 × 04) and after the bam file was converted to fastq format using the option “bam2fq”. The resulting PacBio HiFi mtDNA reads were corrected using Hifiasm v.0.13-r30847,48 with parameters (–write-ec). The corrected reads were assembled using Unicycler v.0.4.849, a software package optimised for circular assemblies, with default parameters. Mitogenome annotation was produced using MitoZ v.3.450 with parameters (--genetic_code 5--clade Mollusca), using the PE reads for coverage plotting.

Genome assembly

The overall pipeline used to obtain the genome assembly and annotation is provided in Fig. 2.

Fig. 2
figure 2

Bioinformatic pipeline applied for the whole genome assembly and annotation. Representative figures created with BioRender.com.

Firstly, PacBio HiFi reads were assembled using multiple software optimized for PacBio HiFi reads, i.e., Hifiasm 0.16.1-r37547,48 with default parameters, Flye v.2.8.351 with parameters (–pacbio-hifi), NextDenovo v.2.4.0 (https://github.com/Nextomics/NextDenovo) with parameters (read_type = hifi) and Peregrine-2021 v0.4.352 with default parameters. After, the overall quality of each assembly was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO) v.5.2.253 with Eukaryota and Metazoa databases and Quality Assessment Tool for Genome Assemblies (QUAST) v.5.0.254 (Fig. 2). Hifiasm 0.16.1-r375 produced the best results of the tested assemblies and thus was selected for further analyses. Since the genome size was larger than predicted by the GenomeScope report, several new assemblies were produced with this Hifiasm 0.16.1-r375, testing a range of parameters (l = 3; s = 0.50, 0.45, 0.35), following the authors’ recommendations (https://hifiasm.readthedocs.io/en/latest/faq.html#p-large). Given that reducing the similarity threshold for duplicate haplotigs (i.e., parameter -l and -s) had little impact on the overall statistic, the assembly with default parameters was chosen for further analysis. To separate poorly resolved pseudo-haplotypes, purge_dups v.1.2.555 was applied, first with default parameters and after by manually adjusting the transition between haploid and diploid cut-off (i.e., parameter -m of option calcuts) to 30, 32 and 25 in three independent runs. In all the runs the lower and upper bound for read depth were always maintained, i.e., 5 and 87, respectively. All the cutoff values were determined by inspection of the k-mer plot produced by the K-mer Analysis Toolkit (KAT) tool56. The influence of purge_dups v.1.2.5 was evaluated using BUSCO v.5.2.2 with Eukaryota and Metazoa databases and QUAST v.5.0.2. Since purge_dups v.1.2.5 did not remove any duplicates (neither with the default nor adjusted cutoffs) the Hifiasm 0.16.1-r375 default assembly was selected as the final assembly. To evaluate the quality of the final assembly, several metrics and software were used. Besides BUSCO v.5.2.2 and QUAST v.5.0.2 metrics, completeness, heterozygosity, and collapsing of repetitive regions were evaluated using a k-mer distribution with KAT56. Moreover, read-back mapping was performed for the PE using with Burrows-Wheeler Aligner (BWA) v.0.7.17-r119857, for long reads with Minimap2 v.2.17 and for RNA-seq (SRR1926176441) with Hisat2 v.2.2.058. To inspect the genome for possible contamination, we used BlobTools v.1.1.159. Briefly, a blast search of the final genome assembly was conducted against the RefSeq60 database, using the BLASTX function from DIAMOND v.2.0.11.14961, following authors’ instructions59. The blast output, as well as the alignment of PE short reads against the genome performed with BWA v.0.7.17, were used as input to run BlobTools, with contamination screening at Phylum level.

Masking of repetitive elements, gene models predictions and annotation

Before masking repetitive elements, a de novo library of repeats was created for the final genome assembly, with RepeatModeler v.2.0.13362. Subsequently, the genome was soft masked combining the de novo library with the ‘Bivalvia’ libraries from Dfam_consensus-20170127 and RepBase-20181026, using RepeatMasker v.4.0.73463.

The masked assembly was used for gene prediction, performed using BRAKER2 pipeline v2.1.664. First, RNA-seq data from U. delphinus was retrieved from GenBank (SRR1926176441) (the same individual used for the genome assembly), quality trimmed with Trimmomatic v.0.3839 (parameters described above) and aligned to the masked genome, using Hisat2 v.2.2.0 with the default parameters. Moreover, the complete proteome of 14 mollusc species and three reference Metazoa genomes (Homo sapiens, Ciona intestinalis, Strongylocentrotus purpuratus), were used as supplementary evidence for gene prediction, downloaded from public databases (Table 3). BRAKER2 pipeline was applied, specifying parameters “–etpmode; –softmasking;”. The resulting predictions file (braker.gtf) was filtered to retain only predictions with RNA-Seq and/or protein evidence (using auxiliary scripts selectSupportedSubsets.py) and after converted to.gff3 using the Augustus auxiliary script gtf2gff.pl. Gene predictions were processed using a series of auxiliary scripts from Another Gtf/Gff Analysis Toolkit (AGAT) v.0.8.06365. Briefly, gene predictions were clean with agat_convert_sp_gxf2gxf.pl, renamed with agat_sp_manage_functional_annotation.pl, overlapping prediction corrected with agat_sp_fix_overlaping_genes.pl and coding sequence regions (CDS) with <100 amino acid and incomplete gene predictions (i.e., without start and/or stop codons) were corrected and/or removed with agat_sp_add_start_and_stop.pl and agat_sp_filter_incomplete_gene_coding_models.pl, respectively. Finally, the overall statistics of the processed predictions were retrieved using agat_sp_statistics.pl and the predicted genes, protein, CDS and exon sequences were retrieved using agat_sp_extract_sequences.pl. The protein sequences were next used for functional annotation, using InterProScan v.5.44.8066, as well as BLASTP searches against the RefSeq database60. BLASTP homology searches were obtained using DIAMOND v.2.0.11.14961, specifying the parameters “-k 1, -b 20, -e 1e-5,–sensitive,–outfmt 6”. To validate the set of proteins obtained, the BUSCO scores were estimated based on the protein set, using the Eukaryota and Metazoa databases, as described previously.

Table 3 List of proteomes used for BRAKER2 gene prediction pipeline.

Data Records

The raw reads sequencing outputs were deposited at the NCBI Sequence Read Archive with the accession’s numbers: SRR23060683, SRR23060685, SRR23060678 and SRR23060675 for PacBio CCS HiFi; SRR23060686 for Illumina PE67. The Genome assembly is available under accession number JAQISU00000000068. BioSample accession number is SAMN32554582 and BioProject PRJNA91785569. The remaining information was uploaded to figshare. In detail, the files uploaded to figshare70 include the final unmasked and masked genome assemblies (Ude_BIV7592_haploid.fa and Ude_BIV7592_haploid_SM.fa), the two pseudohaplotypes genome assemblies generated by Hifiasm assembler (Ude_BIV7592_pseudohaplotype_1.fas.gz and Ude_BIV7592_pseudohaplotype_2.fas.gz), the annotation file (Ude_BIV7592_annotation_v4.gff3), predicted genes (Ude_BIV7592_genes_v4.fasta), predicted messenger RNA (Ude_BIV7592_mrna_v4.fasta), predicted open reading frames (Ude_BIV7592_cds_v4.fasta), predicted proteins (Ude_BIV7592_proteins_v4.fasta), as well as full table reports for Braker gene predictions and InterProScan functional annotations (Ude_BIV7592_annotation_v4_InterPro_report.txt) and RepeatMasker predictions (Ude_BIV7592_annotation_v4_RepeatMasker.tbl).

Technical Validation

Raw datasets and pre-assembly processing quality control

Raw sequencing outputs general statistics are provided in Table 2. GenomeScope2 estimated genome size was ~2.31 Gb and heterozygosity levels of ~0.64% (Fig. 3a), both within the values observed for other Unionidae genomes available24,25,26,27.

Fig. 3
figure 3

Left: GenomeScope2 k-mer (21) distribution displaying estimation of genome size (len), homozygosity (aa), heterozygosity (ab), mean k-mer coverage for heterozygous bases (kcov), read error rate (err), the average rate of read duplications (dup), k-mer size used on the run (k:), and ploidy (p:). Right: KAT spectra-cn plot for the Unio delphinus genome assembly, to compare the PacBio HiFi k-mer content within the genome assembly. Different colours represent the read k-mer frequency in the assembly.

Genome assembly metrics

Hifiasm produced the overall most contiguous and complete (accessed under BUSCO scores) genome assembly of all the tested assemblers (Table 4). Both Flye and Peregrine-2021 were very inefficient in collapsing haplotypes, resulting in unexpectedly large assemblies with high levels of duplicated BUSCO scores (Table 4). Conversely, Hifiasm and NextDenovo efficiently resolve duplicates while ensuring high complete BUSCO scores (Table 4). Additionally, Hifiasm produced a much more contiguous genome assembly, with an almost 5-fold increased N50 length (Table 4). Although the BUSCO scores of the Hifiasm assembly had residual percentages of duplicated sequences, considering the increased genome size compared with GenomeScope estimation, as well as the genome sizes of other Unionidae assemblies (Table 5), we tested several similarity thresholds for duplicates in Hifiasm. The impact of the resulting assemblies on the overall statistics was limited, i.e., -s 0.50-0.35, or had no impact at all, i.e., -l 3 (Table 4). Although two of the assemblies, i.e., -s 0.50 and -s 0.45, show a slight increase in the N50 length (Table 4), given the overall little impact in the final genome size, we opted to use the Hifiasm default assembly as the final assembly. Moreover, purg-dups software did not remove any additional sequences from the Hifiasm default assembly, suggesting that reducing the similarity threshold for duplicate haplotigs (option -s) might be over-purging the assembly.

Table 4 Unio delphinus genome assemblies tests’ general statistics.
Table 5 General statistics of the Unio delphinus final genome assembly (p_ctg); Unio delphinus alternative haplotypes genome assemblies (hap1 and hap2); other published freshwater mussels genome assemblies.

The final genome assembly has a total length of ~2.5 Gbp, which is relatively larger than the GenomeScope size estimation, i.e., ~2.31 Gbp (Table 5, Fig. 3a). Although unexpected, the fact is that from all the primary assemblies here produced (from different software and Hifiasm parameters), none had a total length close to those estimated from GenomeScope (Tables 45). The alternative haplotypes assemblies produced by Hifiasm have a total length similar to the GenomeScope estimations, however, the complete BUSCO scores were reduced for these assemblies with no significate impact on duplicates (Table 5). On the other hand, purge-dups did not report any duplicated sequences in the assembly, which further support that Hifiasm efficiently resolved the haplotype variants. Moreover, the few genome assemblies available for freshwater mussels, show considerable distinct genome sizes (up to 696Mbp difference in size), even within the family Unionidae (Table 5). Consequently, the discrepancies between GenomeScope and the final genome size are likely a consequence of short read-based k-mer frequency spectrum analyses inaccurate estimation of the genome size.

The assembly here presented also shows, the most contiguous freshwater mussel genome assembly available to date, with a contig N50 length of ~ 10 Mbp, which represents a ~5-fold increase in N50 length regarding the only other PacBio-based genome assembly, i.e., from P. streckersoni25 (Table 5). The levels of completeness reported by BUSCOs scores are also within those observed for other freshwater mussel genome assemblies, with nearly no fragmented nor missing hits for both the eukaryotic and metazoan curated lists of near-universal single-copy orthologous (Table 5). The KAT k-mer analyses revealed a low level of k-mer duplication (blue, green, purple, and orange in Fig. 3b), with a high level of haplotype uniqueness (red in Fig. 3b) and a similar k-mer distribution to GenomeScope2 (performed with Illumina PE reads Fig. 3a,b). Both short-read, RNA-seq and long-read back-mapping percentages resulted in an almost complete mapping (Table 5). Finally, BlobTools Read Coverage Plots (ReadCovPlot) shows a dominance of hits with Mollusca (41.68%), followed by two groups with a similar hit percentage, i.e., undefined (27.41%) and Arthropoda (22.81%) (Fig. S1). The high values of undefined hits are expected given the overwhelming low number of closely related species with annotated genomes available on NCBI. Only 16 bivalves’ genomes have annotations available of NCBI, none of which belong to freshwater mussels or Palaeoheterodonta. In fact, annotations are only available for two higher-level bivalve clades, the vast majority for Pteriomorphia (n = 12) and the remaining for Imparidentia (n = 4). Moreover, this low and biased representation of annotated references most likely also explains the apparent contamination with Arthropoda (Fig. S1), as unspecific hits with unrelated taxa have been observed in other recent freshwater mussel genome assemblies24. Nevertheless, to deeply scrutinize for possible contaminations, the percentage of phyla representation was also quantified from the U. delphinus predicted proteins, using the RefSeq BLASTP search outputs (Fig. S2, Supplementary File 2). The results show the dominance of hits with Mollusca, with other taxa having residual representation and low percentages of identity, thus unlikely to represent contaminations (Fig. S2, Supplementary File 2).

Overall, these general statistics validate the high completeness, low redundancy, and quality of the final genome assembly.

Repeat masking, gene models prediction, and annotation

RepeatModeler/RepeatMasker masked 52.83% of the genome, a value within those observed for other Unionida genome assemblies and close to the GenomeScope estimation (Table 6, Fig. 3a). Unlike the results observed in previous freshwater mussel’s genome assemblies24,25, most repeats are classified as DNA elements (21.92%, ~ 549 Mgp), rather than unclassified (16.32%, ~ 408 Mgp), with the remaining categories having similar percentages (Table 6). These results might be a consequence of PacBio HiFi reads efficiency in resolving repetitive regions thus facilitating their classification. BRAKER2 gene prediction identified 44,382 CDS, which is close to the predictions of the other freshwater mussel assemblies (Table 5). BUSCO scores for protein predictions showed almost no missing hits for either of the near-universal single-copy orthologous databases used (Table 5). The number of functionally annotated genes was 32,089, which is similar to the number of annotated genes for the Margaritifera margaritifera genome assembly (Table 5)24. Overall, the numbers of both predicted and annotated genes are within the expected range for bivalves (reviewed in71), as well as within the records of other freshwater mussel assemblies (Table 5)24,25,26,27.

Table 6 RepeatMasker report of the content of repetitive elements in the Unio delphinus genome assembly.