Background & Summary

Gammarid amphipods are animals that typically measure a few millimetres long and present in a wide range of aquatic habitats1. In freshwater ecosystems, they are often the most dominant macro-invertebrates, representing a significant proportion of the total biomass, and they also play a central role within food webs. Indeed, they are a prey for many species, but are also predators for many invertebrate species. Amphipods are also scavengers and shredders, and detritivores involved in leaf litter breakdown, playing a central role in the decomposition of organic matter in general. Thus, they modulate the composition of freshwater communities of invertebrates2. Thanks to these essential roles, they have been the subject of many recent studies investigating their sensitivity to pollutants3,4,5,6,7.

Marine and freshwater resources are of the utmost importance for Life. Human-made chemical contaminants released into aquatic environments compromise the quality of water bodies, threatening the resident biodiversity, and the utility of such ecosystems. The quality of these environments should be evaluated not only by measuring the concentrations of pollutants present, but also by monitoring how Life is affected by the bioavailable pollutants present and their synergistic/antagonist effects8. To do this, biomonitoring with caged representative sentinel species has proved to be a valuable tool for efficient ecotoxicological studies9,10,11,12,13. Specific traits such as moult delay, growth impairment, or reproductive defects can be monitored on sensitive animals exposed to toxic environments. These data can be then integrated into a quantitative water quality index that can be used by stakeholders in charge of the aquatic ecosystem and water resource management14. Because of their abundance and central ecological roles, invertebrates are commonly employed as test organisms in marine and ecotoxicological assessments. Specifically, gammarids have been successfully used as sentinel species for freshwater ecosystems following investigations of their physiological responses to toxicants15,16,17,18,19,20,21,22,23 and biomonitoring in caging systems9,12. Specific biomarkers have been proposed and can be monitored by innovative methods such as tandem mass spectrometry19,24,25,26. Next-generation proteomics contributed to improving our knowledge of the molecular responses of gammarids to toxicants, and led to the proposal of a broad panel of appropriate biomarkers27,28,29,30. This approach was successful after developing a protein sequence database from an RNA-Seq transcriptome translated in all the possible reading frames. This proteogenomics concept was used to establish an extensive catalogue of protein sequences comprising 1873 mass-spectrometry-certified proteins, thus representing a significant amphipod proteomic resource29.

Despite this progress, molecular resources relating to gammarids remain scarce31. No gammarid whole genome sequence was available until very recently, when a first-draft genome of Gammarus lacustris was released comprising 443,304 scaffolds32. The genomes of two related amphipods, Parhyale hawaiensis33 and Hyalella azteca34, have also been sequenced. RNA-Seq datasets are now available for P. hawaiensis35,36,37, Echinogammarus marinus38, Eogammarus possjeticus39, Gammarus fossarum29, Gammarus chevreuxi40, Gammarus pulex41, and Gammarus minus42. However, these datasets are not of equal quality in terms of mRNA sequence coverage, which is a crucial parameter for proteogenomics interpretation43. They are assembled from mRNAs extracted from a pool of several animals or from specific tissues, and in some cases are no longer accessible as it is the case for E. marinus because the repository used no longer exists44.

The data presented in this article consist of assembled transcriptome sequences for 14 different gammarids, seven males and seven females, namely Gammarus fossarum A (Müller type A), G. fossarum B (Müller type B), G. fossarum C (Müller type C), Gammarus wautieri, Gammarus pulex, Echinogammarus berilloni, and Echinogammarus marinus. These transcriptomes were assembled and translated using the same pipelines (full length whole-organism mRNAs), and thus are of equivalent sequencing depth and quality across the different taxa studied. Starting material was extracted from single animals to avoid sequence heterogeneity. The transcriptomes have been annotated to serve as reference protein sequence databases for proteogenomics studies involving these sentinel animals that will be soon conducted to gain more basic knowledge and thus improve how aquatic environmental risks are assessed. For these future studies, an interesting strategy could be to interpret MS/MS shotgun data first on the most appropriate specific single-organism database, and then perform a follow-up search on a multi-organism database. The transcriptomes presented here will also serve in comparative analyses to better define the molecular diversity amongst gammarids and will be a valuable sequence resource for future ecotoxicological studies.

Methods

Experimental design

Freshwater gammarids were collected in four geographically-distant French rivers (Table 1). One population of Gammarus fossarum was sampled in north-eastern France (Seebach river), which was previously shown to harbour the cryptic type A subspecies according to the three types defined in Müller et al.45, Westram et al.46, and Weiss et al.47. The second river (Pollon River) situated in the mid-eastern area of France, corresponding to a sympatric situation, supplied organisms belonging to Gammarus fossarum type B, type C, and Gammarus pulex species. Gammarus wautieri were collected in the Galaveyson river in the Dauphiné region, and Echinogammarus berilloni organisms from a fourth river in south-western France (Saucats river). These freshwater gammarids were all collected using a hand net following kick-sampling, and subsequently transported to the laboratory. After maintaining them for 1 week in the laboratory – at 12 °C with a constant aeration, under a 16/8-h light/dark photoperiod in buckets containing water sampled from their respective rivers of origin, and with conditioned alder leaves as food source – couples in amplexus were isolated for species determination before RNA extraction. Pairs where the females had well-developed ovaries were selected. Embryos were removed from the marsupial pouch of females for five of these couples. Based on the description of the reproductive cycle in Gammarus fossarum48, for RNA extraction, we were able to select one couple per species in the last stage of the reproductive cycle (pre-moulting stage for the female) by retaining pairs where the females were carrying embryos at the end of their embryonic development stage (stage 4 or 5). For the marine species, E. marinus were collected from beneath seaweed in the intertidal zone in Portsmouth, southern England. These species correspond to the same population as used in a previous study38. After maintaining them for 1 month in the laboratory – at 10 °C under a 12 h light/12 h dark photoperiod in buckets with filtered natural seawater and fed with fucoid seaweed - organisms were transported live in damp seaweed from United Kingdom to France (one-day travel). They were subsequently maintained for a few hours in aquaria containing reconstituted seawater (salinity 30‰) before organism selection. For this species, it was not possible to recover couples in amplexus. One free-swimming male and one free-swimming female were isolated from the batch of organisms available. Stage 1 embryos were recovered from the female marsupium, indicating that this female was in a post-moulting stage.

Table 1 Sampling information and number of reads for each sample before and after filtering by mean quality for the 14 transcriptomes.

Species were first determined based on morphological criteria49. To distinguish between the three cryptic lineages, A, B, C, within the G. fossarum species, a molecular species assignment was carried out by amplifying the 5’ part of the mtDNA cytochrome c oxidase subunit I (COI) using universal primers (LCO1490 [GGT CAA ATC ATA AAG ATA TTG G] and HCO2198 [TAA ACT TCA GGG TGA CCA AAA AAT CA])50. Briefly, DNA was extracted from one or two pereopods (depending on individual size) cut from organisms before conditioning for RNA extraction. DNA was extracted using the Nucleospin tissue XS kit (Macherey-Nagel), and 10 ng of DNA for each organism was amplified. The PCR conditions consisted in 45 cycles of denaturation at 95 °C for 30 sec, annealing at 50 °C for 30 sec, and elongation at 72 °C for 1 min. PCR products were purified by ultrafiltration using the Nucleofast kit (Macherey-Nagel). Purified amplicons were prepared for sequencing using the BigDye Terminator v3.1 kit (ThermoFisher), and then sequenced on a DNA analyser ABI 3730XL (ThermoFisher). Sequencing data were analysed using the Sequencher 5.4.6 program (Genecodes). COI sequences (freely available from figshare, YC02_COI sequences and phylogenetic tree51) were aligned to build a phylogenetic tree including reference sequences from Weiss et al.47 and Lagrue et al.52. Using this phylogenetic tree (freely available from figshare, YC02_COI sequences and phylogenetic tree51) it is possible to position the COI sequences of the Gammarus organisms selected for RNA sequencing in relation to the published reference sequences (SeaView software53; BioNJ method based on J-C distance). The robustness of the different groupings was evaluated by a bootstrapping procedure (100 iterations). COI sequences were obtained for all Gammarus individuals, except for the female G. fossarum C as this individual was in precopulatory amplexus with the male COI-genotyped as G. fossarum C. However, in the same location (Pollon River), we also obtained the COI genotypes for 15 additional pairs, all of which were found to be non-heterospecific (4 G. fossarum B, 3 G. fossarum C, 8 G. pulex). Westraam et al.46 reported similar findings in the Glovelier river which harbours G. fossarum A and B, with only one heterospecific pair for a total of 64 genotyped pairs. Lagrue et al.52 also observed that mixed pairs are rare in the field for Gammarus lineages with a COI distance greater than 4%. Considering that the divergence between the COI-genotyped G. fossarum B and C specimens is about 17% in the Pollon River, it is very unlikely that this female does not belong to the G. fossarum C species.

Dataset generation

Gammarids were placed in RNAlater (Sigma) and stored at 4 °C overnight. The RNAlater was then removed, and the organisms were snap frozen in liquid nitrogen and stored at −80 °C until RNA was extracted. Organisms were first homogenized in lysis buffer using a bead homogenizer and then RNAs were extracted using the Qiagen fibrous tissue kit (Qiagen). RNA quantity, quality and integrity were assessed by Nanodrop (Thermo Fisher) and Bioanalyzer (Agilent) analysis. RNA-Seq libraries were generated using the TruSeq stranded mRNA Sample Prep kit (Illumina). mRNA was purified using poly-(T) beads from 2 µg of each total RNA sample, then cleaved in segments of 155 bp on average (120–210 bp range). Subsequently, cleaved RNA fragments were primed with random hexamers and reverse-transcribed into first-strand cDNA. A second strand of cDNA was consecutively synthetized, and double-stranded cDNA was purified on beads. The 3′ ends of the blunt fragments obtained were then adenylated. Indexed adapters were ligated to the PCR-enriched cDNA fragments (11 cycles). Libraries were purified and quality-assessed using a Fragment Analyzer (Advanced Analytical Technologies). The 16 libraries were quantified by qPCR using the Kapa Library Quantification Kit (Roche). Their concentrations were normalized, multiplexed in a single pool. Libraries were then sequenced on two lanes of Hiseq3000 (Illumina) using a paired-end read length of 2 × 150 bp with the HiSeq Reagent Kits (Illumina). The two HiSeq lanes produced an average of 40.0 ± 8 million read pairs per library. Quality control of reads was performed by FastQC version V0.11.2 (Babraham Bioinformatics). Detailed results are freely available from figshare (YC02_QC data51). The data records are stored in 14 folders, each containing four folders per transcriptome.

De novo assembly

For each sample, the forward or reverse reads were merged from two separate lanes. Data were filtered based on the mean Qphred score, with a threshold set at 16.99, and any remaining unpaired reads were removed using a homemade script. The numbers of reads for each sample before and after this filtering step are presented in Table 1. Trinity v2.454 was used to assemble reads for each sample considering pair-end and strand orientation (-SS_lib type RF); all other Trinity parameters were set to their default values, with k set to 25, and minimum contig length to 200 bp.

Assessing assembly quality

Transcriptome quality was assessed using Transrate v1.0.155, which generates standard metrics and remapping statistics. No reference protein sequences were used for the assessment with Transrate. The main metrics are shown in Table 2. To validate the quality of all the assemblies, BUSCO v2.056 was used. The database used for BUSCO analyses was Arthropoda_odb9 which contains 1066 orthologous genes at the nearest taxon level (i.e., Arthropods) available for Gammarus.

Table 2 Assembly quality metrics.

Annotation

For each sample, the transcripts were annotated using the Trinotate v3.1.1 annotation pipeline54. The Swissprot database was used as the main database, and amphipod proteins referenced on Uniref were used as a custom database. Similarity searches were performed with Blastx and Blastp, with an e-value cutoff set at 1e-2. Results from these searches were then used to generate the annotation report with the same e-value cutoff.

Data Records

Reads

Read sequences for each sample were deposited in the NCBI Sequence Reads Archive under accession Numbers SRR808972057, SRR8089722–SRR808972558,59,60,61, and SRR8089727–SRR808973562,63,64,65,66,67,68,69,70, as indicated in Table 3 alongside the corresponding Bioproject and Biosample codes. The FastQC results for the 14 samples are freely available from figshare (YC02 _QC data)51. The data records are stored as 14 folders, each of which contain four folders per transcriptome.

Table 3 Accessions for the 14 transcriptomes.

Transcriptomes

Transcriptome assemblies were deposited in the NCBI Transcriptome Shotgun Assembly Sequence Database. These data have been deposited in GenBank under identifiers GHCN0100000071, GHCP01000000-GHCR0100000072,73,74, GHCT01000000-GHCZ0100000075,76,77,78,79,80,81, GHDA01000000-GHDC0100000082,83,84, as indicated in Table 3 alongside the corresponding Bioproject and Biosample codes.

Proteogenomics databases

Translations of coding sequence regions were produced for each transcriptome from stop to stop codons by Transdecoder v3.0.154, analysing only the top strand. The 500 longest ORFs were used for training, retaining 600-bp ORFs and only proteins with a minimum length of 50 amino acids. The 14 translations are freely available for download as FASTA files from figshare (YC02_Transcriptome translated ORFs51).

Annotation

Annotations of each assembly are freely available for download as Excel files from figshare (YC02_Transcript annotations51). The folder contains 14 Excel files.

Technical Validation

Transrate

Transrate analyses showed good remapping of results, with more than 80% of reads remapped and most assemblies with more than 70% were classed as well mapped. Raw results from Transrate are freely available through figshare (YC02_ Transrate results)51.

BUSCO

A high level of single-copy ortholog retrieval was noted for the 14 assemblies, with at least a 75% ratio, as shown in Fig. 1. Furthermore, fewer than 8% of orthologs were missing in the worst case, and fewer than 5% were missing in 11 transcriptomes.

Fig. 1
figure 1

BUSCO assessment results for the 14 assembled transcriptomes.