De novo transcriptomes of 14 gammarid individuals for proteogenomic analysis of seven taxonomic groups

Gammarids are amphipods found worldwide distributed in fresh and marine waters. They play an important role in aquatic ecosystems and are well established sentinel species in ecotoxicology. In this study, we sequenced the transcriptomes of a male individual and a female individual for seven different taxonomic groups belonging to the two genera Gammarus and Echinogammarus: Gammarus fossarum A, G. fossarum B, G. fossarum C, Gammarus wautieri, Gammarus pulex, Echinogammarus berilloni, and Echinogammarus marinus. These taxa were chosen to explore the molecular diversity of transcribed genes of genotyped individuals from these groups. Transcriptomes were de novo assembled and annotated. High-quality assembly was confirmed by BUSCO comparison against the Arthropod dataset. The 14 RNA-Seq-derived protein sequence databases proposed here will be a significant resource for proteogenomics studies of these ecotoxicologically relevant non-model organisms. These transcriptomes represent reliable reference sequences for whole-transcriptome and proteome studies on other gammarids, for primer design to clone specific genes or monitor their specific expression, and for analyses of molecular differences between gammarid species.


Background & Summary
Gammarid amphipods are animals that typically measure a few millimetres long and present in a wide range of aquatic habitats 1 . In freshwater ecosystems, they are often the most dominant macro-invertebrates, representing a significant proportion of the total biomass, and they also play a central role within food webs. Indeed, they are a prey for many species, but are also predators for many invertebrate species. Amphipods are also scavengers and shredders, and detritivores involved in leaf litter breakdown, playing a central role in the decomposition of organic matter in general. Thus, they modulate the composition of freshwater communities of invertebrates 2 . Thanks to these essential roles, they have been the subject of many recent studies investigating their sensitivity to pollutants [3][4][5][6][7] .
Marine and freshwater resources are of the utmost importance for Life. Human-made chemical contaminants released into aquatic environments compromise the quality of water bodies, threatening the resident biodiversity, and the utility of such ecosystems. The quality of these environments should be evaluated not only by measuring the concentrations of pollutants present, but also by monitoring how Life is affected by the bioavailable pollutants present and their synergistic/antagonist effects 8 . To do this, biomonitoring with caged representative sentinel species has proved to be a valuable tool for efficient ecotoxicological studies [9][10][11][12][13] . Specific traits such as moult delay, growth impairment, or reproductive defects can be monitored on sensitive animals exposed to toxic environments. These data can be then integrated into a quantitative water quality index that can be used by 1  stakeholders in charge of the aquatic ecosystem and water resource management 14 . Because of their abundance and central ecological roles, invertebrates are commonly employed as test organisms in marine and ecotoxicological assessments. Specifically, gammarids have been successfully used as sentinel species for freshwater ecosystems following investigations of their physiological responses to toxicants [15][16][17][18][19][20][21][22][23] and biomonitoring in caging systems 9,12 . Specific biomarkers have been proposed and can be monitored by innovative methods such as tandem mass spectrometry 19,[24][25][26] . Next-generation proteomics contributed to improving our knowledge of the molecular responses of gammarids to toxicants, and led to the proposal of a broad panel of appropriate biomarkers [27][28][29][30] . This approach was successful after developing a protein sequence database from an RNA-Seq transcriptome translated in all the possible reading frames. This proteogenomics concept was used to establish an extensive catalogue of protein sequences comprising 1873 mass-spectrometry-certified proteins, thus representing a significant amphipod proteomic resource 29 .
Despite this progress, molecular resources relating to gammarids remain scarce 31 . No gammarid whole genome sequence was available until very recently, when a first-draft genome of Gammarus lacustris was released comprising 443,304 scaffolds 32 . The genomes of two related amphipods, Parhyale hawaiensis 33 and Hyalella azteca 34 , have also been sequenced. RNA-Seq datasets are now available for P. hawaiensis [35][36][37] , Echinogammarus marinus 38 , Eogammarus possjeticus 39 , Gammarus fossarum 29 , Gammarus chevreuxi 40 , Gammarus pulex 41 , and Gammarus minus 42 . However, these datasets are not of equal quality in terms of mRNA sequence coverage, which is a crucial parameter for proteogenomics interpretation 43 . They are assembled from mRNAs extracted from a pool of several animals or from specific tissues, and in some cases are no longer accessible as it is the case for E. marinus because the repository used no longer exists 44 .
The data presented in this article consist of assembled transcriptome sequences for 14 different gammarids, seven males and seven females, namely Gammarus fossarum A (Müller type A), G. fossarum B (Müller type B), G. fossarum C (Müller type C), Gammarus wautieri, Gammarus pulex, Echinogammarus berilloni, and Echinogammarus marinus. These transcriptomes were assembled and translated using the same pipelines (full length whole-organism mRNAs), and thus are of equivalent sequencing depth and quality across the different taxa studied. Starting material was extracted from single animals to avoid sequence heterogeneity. The transcriptomes have been annotated to serve as reference protein sequence databases for proteogenomics studies involving these sentinel animals that will be soon conducted to gain more basic knowledge and thus improve how aquatic environmental risks are assessed. For these future studies, an interesting strategy could be to interpret MS/MS shotgun data first on the most appropriate specific single-organism database, and then perform a follow-up search on a multi-organism database. The transcriptomes presented here will also serve in comparative analyses to better define the molecular diversity amongst gammarids and will be a valuable sequence resource for future ecotoxicological studies.

Methods
Experimental design. Freshwater gammarids were collected in four geographically-distant French rivers (Table 1). One population of Gammarus fossarum was sampled in north-eastern France (Seebach river), which was previously shown to harbour the cryptic type A subspecies according to the three types defined in Müller et al. 45 , Westram et al. 46 , and Weiss et al. 47 . The second river (Pollon River) situated in the mid-eastern area of France, corresponding to a sympatric situation, supplied organisms belonging to Gammarus fossarum type B, type C, and Gammarus pulex species. Gammarus wautieri were collected in the Galaveyson river in the Dauphiné region, and Echinogammarus berilloni organisms from a fourth river in south-western France (Saucats river). These freshwater gammarids were all collected using a hand net following kick-sampling, and subsequently transported to the laboratory. After maintaining them for 1 week in the laboratory -at 12 °C with a constant aeration, www.nature.com/scientificdata www.nature.com/scientificdata/ under a 16/8-h light/dark photoperiod in buckets containing water sampled from their respective rivers of origin, and with conditioned alder leaves as food source -couples in amplexus were isolated for species determination before RNA extraction. Pairs where the females had well-developed ovaries were selected. Embryos were removed from the marsupial pouch of females for five of these couples. Based on the description of the reproductive cycle in Gammarus fossarum 48 , for RNA extraction, we were able to select one couple per species in the last stage of the reproductive cycle (pre-moulting stage for the female) by retaining pairs where the females were carrying embryos at the end of their embryonic development stage (stage 4 or 5). For the marine species, E. marinus were collected from beneath seaweed in the intertidal zone in Portsmouth, southern England. These species correspond to the same population as used in a previous study 38 . After maintaining them for 1 month in the laboratory -at 10 °C under a 12 h light/12 h dark photoperiod in buckets with filtered natural seawater and fed with fucoid seaweed -organisms were transported live in damp seaweed from United Kingdom to France (one-day travel). They were subsequently maintained for a few hours in aquaria containing reconstituted seawater (salinity 30‰) before organism selection. For this species, it was not possible to recover couples in amplexus. One free-swimming male and one free-swimming female were isolated from the batch of organisms available. Stage 1 embryos were recovered from the female marsupium, indicating that this female was in a post-moulting stage.
Species were first determined based on morphological criteria 49 . To distinguish between the three cryptic lineages, A, B, C, within the G. fossarum species, a molecular species assignment was carried out by amplifying the 5' part of the mtDNA cytochrome c oxidase subunit I (COI) using universal primers (LCO1490 [GGT CAA ATC ATA AAG ATA TTG G] and HCO2198 [TAA ACT TCA GGG TGA CCA AAA AAT CA]) 50 . Briefly, DNA was extracted from one or two pereopods (depending on individual size) cut from organisms before conditioning for RNA extraction. DNA was extracted using the Nucleospin tissue XS kit (Macherey-Nagel), and 10 ng of DNA for each organism was amplified. The PCR conditions consisted in 45 cycles of denaturation at 95 °C for 30 sec, annealing at 50 °C for 30 sec, and elongation at 72 °C for 1 min. PCR products were purified by ultrafiltration using the Nucleofast kit (Macherey-Nagel). Purified amplicons were prepared for sequencing using the BigDye Terminator v3.1 kit (ThermoFisher), and then sequenced on a DNA analyser ABI 3730XL (ThermoFisher). Sequencing data were analysed using the Sequencher 5.4.6 program (Genecodes). COI sequences (freely available from figshare, YC02_COI sequences and phylogenetic tree 51 ) were aligned to build a phylogenetic tree including reference sequences from Weiss et al. 47 and Lagrue et al. 52 . Using this phylogenetic tree (freely available from figshare, YC02_COI sequences and phylogenetic tree 51 ) it is possible to position the COI sequences of the Gammarus organisms selected for RNA sequencing in relation to the published reference sequences (SeaView software 53 ; BioNJ method based on J-C distance). The robustness of the different groupings was evaluated by a bootstrapping procedure (100 iterations). COI sequences were obtained for all Gammarus individuals, except for the female G. fossarum C as this individual was in precopulatory amplexus with the male COI-genotyped as G. fossarum C. However, in the same location (Pollon River), we also obtained the COI genotypes for 15 additional pairs, all of which were found to be non-heterospecific (4 G. fossarum B, 3 G. fossarum C, 8 G. pulex). Westraam et al. 46 reported similar findings in the Glovelier river which harbours G. fossarum A and B, with only one heterospecific pair for a total of 64 genotyped pairs. Lagrue et al. 52 also observed that mixed pairs are rare in the field for Gammarus lineages with a COI distance greater than 4%. Considering that the divergence between the COI-genotyped G. fossarum B and C specimens is about 17% in the Pollon River, it is very unlikely that this female does not belong to the G. fossarum C species.

Dataset generation. Gammarids were placed in RNAlater (Sigma) and stored at 4 °C overnight. The
RNAlater was then removed, and the organisms were snap frozen in liquid nitrogen and stored at −80 °C until RNA was extracted. Organisms were first homogenized in lysis buffer using a bead homogenizer and then RNAs were extracted using the Qiagen fibrous tissue kit (Qiagen). RNA quantity, quality and integrity were assessed by Nanodrop (Thermo Fisher) and Bioanalyzer (Agilent) analysis. RNA-Seq libraries were generated using the TruSeq stranded mRNA Sample Prep kit (Illumina). mRNA was purified using poly-(T) beads from 2 µg of each total RNA sample, then cleaved in segments of 155 bp on average (120-210 bp range). Subsequently, cleaved RNA fragments were primed with random hexamers and reverse-transcribed into first-strand cDNA. A second strand of cDNA was consecutively synthetized, and double-stranded cDNA was purified on beads. The 3′ ends of the blunt fragments obtained were then adenylated. Indexed adapters were ligated to the PCR-enriched cDNA fragments (11 cycles). Libraries were purified and quality-assessed using a Fragment Analyzer (Advanced Analytical Technologies). The 16 libraries were quantified by qPCR using the Kapa Library Quantification Kit (Roche). Their concentrations were normalized, multiplexed in a single pool. Libraries were then sequenced on two lanes of Hiseq3000 (Illumina) using a paired-end read length of 2 × 150 bp with the HiSeq Reagent Kits (Illumina). The two HiSeq lanes produced an average of 40.0 ± 8 million read pairs per library. Quality control of reads was performed by FastQC version V0.11.2 (Babraham Bioinformatics). Detailed results are freely available from figshare (YC02_QC data 51 ). The data records are stored in 14 folders, each containing four folders per transcriptome.
De novo assembly. For each sample, the forward or reverse reads were merged from two separate lanes.
Data were filtered based on the mean Qphred score, with a threshold set at 16.99, and any remaining unpaired reads were removed using a homemade script. The numbers of reads for each sample before and after this filtering step are presented in Table 1. Trinity v2.4 54 was used to assemble reads for each sample considering pair-end and strand orientation (-SS_lib type RF); all other Trinity parameters were set to their default values, with k set to 25, and minimum contig length to 200 bp.
Assessing assembly quality. Transcriptome quality was assessed using Transrate v1.0.1 55 , which generates standard metrics and remapping statistics. No reference protein sequences were used for the assessment with Transrate. The main metrics are shown in Table 2. To validate the quality of all the assemblies, BUSCO v2.0 56 was www.nature.com/scientificdata www.nature.com/scientificdata/ used. The database used for BUSCO analyses was Arthropoda_odb9 which contains 1066 orthologous genes at the nearest taxon level (i.e., Arthropods) available for Gammarus.
Annotation. For each sample, the transcripts were annotated using the Trinotate v3.1.1 annotation pipeline 54 .
The Swissprot database was used as the main database, and amphipod proteins referenced on Uniref were used as a custom database. Similarity searches were performed with Blastx and Blastp, with an e-value cutoff set at 1e-2. Results from these searches were then used to generate the annotation report with the same e-value cutoff.

Data Records
Reads. Read sequences for each sample were deposited in the NCBI Sequence Reads Archive under accession Numbers SRR8089720 57 , SRR8089722-SRR8089725 58-61 , and SRR8089727-SRR8089735 62-70 , as indicated in Table 3 alongside the corresponding Bioproject and Biosample codes. The FastQC results for the 14 samples are freely available from figshare (YC02 _QC data) 51 . The data records are stored as 14 folders, each of which contain four folders per transcriptome.

Technical Validation
Transrate. Transrate analyses showed good remapping of results, with more than 80% of reads remapped and most assemblies with more than 70% were classed as well mapped. Raw results from Transrate are freely available through figshare (YC02_ Transrate results) 51 .

BUSCO.
A high level of single-copy ortholog retrieval was noted for the 14 assemblies, with at least a 75% ratio, as shown in Fig. 1. Furthermore, fewer than 8% of orthologs were missing in the worst case, and fewer than 5% were missing in 11 transcriptomes.

Code Availability
Filtering before assembly was performed with an in-house Pythonv2.7 script, which is freely available (https:// github.com/YannickCogne/Qfiltering). The script was automated with a bash script for each sample.