Daily transcriptomes of the copepod Calanus finmarchicus during the summer solstice at high Arctic latitudes

The zooplankter Calanus finmarchicus is a member of the so-called “Calanus Complex”, a group of copepods that constitutes a key element of the Arctic polar marine ecosystem, providing a crucial link between primary production and higher trophic levels. Climate change induces the shift of C. finmarchicus to higher latitudes with currently unknown impacts on its endogenous timing. Here we generated a daily transcriptome of C. finmarchicus at two high Arctic stations, during the more extreme time of Midnight Sun, the summer solstice. While the southern station (74.5 °N) was sea ice-free, the northern one (82.5 °N) was sea ice-covered. The mRNAs of the 42 samples have been sequenced with an average of 126 ± 5 million reads (mean ± SE) per sample, and aligned to the reference transcriptome. We detail the quality assessment of the datasets and the complete annotation procedure, providing the possibility to investigate daily gene expression of this ecologically important species at high Arctic latitudes, and to compare gene expression according to latitude and sea ice-coverage.

Since the identification of circadian clock genes in C. finmarchicus 12 , studies have shown that this species possesses a functional clock that might be involved in the timing of both diel and seasonal events, such as the ecologically and biogeochemically important diel vertical migration (DVM) 13 or diapause 14 . However, the Arctic environment is characterized by dramatic seasonality resulting in permanent illumination during Midnight Sun and permanent darkness during Polar Night 15 . As the circadian clock is entrained and synchronized by daily light/dark cycles, the persistence of daily biological processes in Arctic organisms during the absence of those remains uncertain 16,17 , as well as the consequences for newcomer species due to global warming 8,9 . Moreover, the Arctic is characterized by strong fluctuations in sea ice-cover, reflecting on biotic and abiotic factors, such as species communities and interactions or light penetration 18,19 .
Copepods are among the important non-model invertebrates for which genomic resources are still limited, one barrier being that many species, including C. finmarchicus, have large genomes 20,21 . The de novo transcriptome of C. finmarchicus 22 represents a useful resource for assessing the impact of global warming in this species of high ecological interest. In addition to differential gene expression analyzes, RNA sequencing has increased the ability to study the expression of rhythmically expressed mRNAs [23][24][25] . Indeed, at the molecular level, the endogenous clock machinery drives the rhythmic expression of downstream genes whose rhythmic translation and function ultimately underlie daily oscillations at cellular and organismal levels 25 . Note that in the field, environmental cycles also directly generate rhythms independently from the clock. Thus, temporal transcriptomic studies allow a major breakthrough in the understanding of daily dynamics of biological processes in the field.
In this study, we performed RNA sequencing on temporally collected in situ samples to generate a daily transcriptome of C. finmarchicus in the high Arctic during summer solstice period when the sun remains high above the horizon with minimal altitude variation. Sampling of C. finmarchicus stage V copepodites was performed at 4 h intervals within a 24 h cycle at two ocean stations along a latitudinal gradient. The northern station (82.5 °N, Nansen Basin) was characterized by sea ice-coverage, while the southern one (74.5 °N, Barents Sea) was sea ice-free. In addition to providing the raw data, we describe its quality assessment and the alignment to the reference transcriptome to verify reliability and determine transcript quantification. Finally a complete annotation is performed and two normalized datasets are provided for further transcriptomic data exploration of this species.

Methods
Sampling design. The sampling strategy was specifically designed for the detection of rhythmic transcripts 25,26 although it does not exclude classic differential expression analysis 27 . Sampling design and analysis strategy are presented in Fig. 1 2)). Sampling at "North" station, JR85, started on 18th June (3 days before the summer solstice) at 10-11 h and ended on 19th June at 10-11 h. Sampling at "South" station, B13, started on 30th June (9 days after the summer solstice) at 14-15 h and ended on 1st July at 14-15 h. At each timepoint the water column was sampled from 200 m to the surface with vertical hauls of a WP2 plankton net (opening ∅: 57 cm, net length: 236 cm, mesh size: 200 µm) with a meshed bucket cod end (mesh size: 200 µm) at a speed of 0.5 m*s −1 . Transferring the animals from the net into the stabilization solution was done within less than 12 minutes for all samplings. A 12 h period of incubation at 2-4 °C was allowed to soak the samples thoroughly with the RNAlater stabilization solution (Ambion, UK) before they were transferred to −80 °C for further transport and storage.
Sites description. Sampling has been conducted during Cruise JR17006 of the RRS James Clark Ross in summer 2018 at two stations along a latitudinal gradient. The station "North" was sea ice-covered and located in the Nansen Basin (JR85; 82.56°N, 30.85°E). The station "South" was sea ice-free and located in the southern Barents Sea (B13; 74.5°N, 30°E). Water depth at "North" was 3700 m and at "South" was 360 m. The sun's altitude was always above the horizon but still showed diel oscillations of altitude above the horizon from 16 ° at midnight to 30.9 ° at midday at "North", and from 7.7 ° at midnight to 38.6 ° at midday at "South" at the times of sampling (all times noted in local time (UTC + 2)). Sites were exposed to semidiurnal tide regimes, i.e., 2 tides per day, with a maximum amplitude of ± 0.47 m at JR85 and ± 0.36 m at B13 at times of sampling. Maps with the location of the sea ice edge at the time of sampling at "North" are available from the meereisportal 28 (https://data.meereisportal. de/gallery/index_new.php?active-tab1=method&ice-type=satellite&satellite=A&region=n&resolution=dai-ly&minYear=2018&minMonth=6&minDay=18&maxYear=2018&maxMonth=6&maxDay=19&show-Maps=y&dateRepeat=n&submit2=display&lang=en_US&active-tab2=satellite). Modeled data of sun altitude were obtained from the United States Naval Observatory (https://aa.usno.navy.mil/data/docs/AltAz.php, USNO, USA). Information on the tidal dynamics have been drawn from the TPX08 model 29 by using the OTPS package (Tidal Prediction Software, http://www-po.coas.oregonstate.edu/~poa/www-po/research/po/research/tide/ index.html), via the mbotps program 30 (MB-System). Solar altitude, tidal height and sea-ice cover during the sampling campaign at both latitudes are detailed in Supplementary Table 2. Temperature, pressure (depth), conductivity (salinity), oxygen saturation (SBE 43, Sea-Bird Electronics) and Chlorophyll a (Chl a) fluorescence (Aquatracka III fluorometer, Chelsea Technologies Group, UK) were measured from the surface to 200 m depth and are available in Hueppe et al. 31 .
Copepod sorting. Copepods were sorted at 2 °C under a stereo microscope for species (C. finmarchicus) and stage (CV). To distinguish C. finmarchicus from its closely related congener C. glacialis, morphological indicators were used, in particular the redness of the antenna, which has been shown to be a good indicator in the regions of sampling 32 ; see also the molecular validation of morphological identification, below. For each timepoint and station, 3 replicates of 15 C. finmachicus CV were sorted. The choice to pool 15 individuals was made to (1) get the sufficient amount of RNA required for RNA sequencing and quantitative real-time PCR analyses and (2) www.nature.com/scientificdata www.nature.com/scientificdata/ increase the number of individuals (315 copepods per station in total) thereby decreasing the effect of individual variability.
RNA extraction. Each replicate was distributed to a 2 ml Precellys ® homogenization tube (Bertin Instruments, France), containing a mix of 1.4 mm and 2.8 mm ceramic beads and homogenized in 600 µl of TRIzol ® reagent (ThermoFisher Scientific, USA) with a Precellys ® 24 Tissue Homogenizer (Bertin Instruments, France), using two times 15 sec. of homogenization at 5000 rpm with a 10 sec. break between. For RNA extraction, a Phenol/Chloroform based single-step extraction in combination with a spin column based solid phase extraction (Direct-zol ™ RNA MiniPrep Kit, Zymo Research, USA) was used. Genomic DNA was removed by DNase I digestion on column as part of the RNA extraction kit and total RNA was eluted in ultra-pure water. A portion of the RNA of each of the samples was used to investigate relative expression of 8 candidate genes with SYBRGreen Protocol 1 Sampling at 4h intervals in the water column from 200 m to the surface, to cover a 24h cycle, at two high Arctic stations.

Protocol 3
RNA extraction of 42 samples.

Protocol 2
Sorting of 3 replicates of 15 C. finmarchicus stage CV per time and per station. www.nature.com/scientificdata www.nature.com/scientificdata/ based quantitative real-time PCR (qPCR) on candidate genes, using the 2 −∆Ct method 32 and the geometric mean of elongation factor 1α and 16 s rRNA as reference, as described by Hueppe et al. 31 . Another portion of each samples was send to GeT-PlaGe core facility in dried-ice for RNA sequencing. Reads alignment and quantification. 42 RNA sequencing libraries were obtained (Fig. 1, Table 1, and   Supplementary Table 1). The number of paired reads per library was between 74 million and 276 million with an average of 126 ± 5 million (mean ± SE) reads. The RNA sequencing libraries reads quality were evaluated using FastQC 33 . Contamination was checked by aligning reads against E. coli, Yeast and PhiX genomes.
The Calanus finmarchicus de novo transcriptome 22 , based on different life stages and deposited to Bioproject PRJNA236528, was used as the reference transcriptome. It is composed of 206,012 contigs and presents good results of quality assessment, with a nearly complete BUSCO set 22,34 . Reads were aligned to the de novo transcriptome with BWA-MEM (http://bio-bwa.sourceforge.net/bwa.shtml). Quantification was performed with SAMtools 35 idxStats to generate the quantification matrix. The matrix was filtered with edgeR 36 and only contigs with more than 1 CPM (Count Per Million) in at least one sample were kept, providing a matrix of 76,550 contigs. Information on the datasets resulting from this study is available in Table 2.
Annotation. We provided different annotations for all further analysis. Contigs were aligned with DIAMOND 37 on NR (2019-09-29), Swissprot and Trembl (2018-12) to retrieve corresponding best annotations. An annotation matrix was then generated by selecting the best hit for each database if: i) the percent of the query length covered by the alignment was higher than 60%; ii) the percent of the subject length covered by the alignment was higher than 40%; iii) the percent of identity of the alignment was higher than 40%. Contigs were also processed with InterProScan 38 to scan InterProScan signatures. A GO was assigned to each contig with an InterProScan hit containing a GO annotation. Information on the datasets resulting from this study is available in Table 2. Note that a previous annotation of Calanus finmarchicus reference transcriptome 22 against Non-redundant (NR) protein database is also available at https://doi.org/10.5061/dryad.11978.

Normalization. Two normalizations are proposed (down-sampling normalization and RLE normalization)
but the choice of normalization depends on the analysis required downstream. For a rhythmic analysis, we suggest down-sampling the mapped reads to the lowest number among the 42 samples (down-sampling normalization), i.e. to 70.4 million properly mapped reads per sample for all samples (after filtering), in order to adjust for differences in sequencing depth among samples 23,25,39 . This was performed with StreamSampler.jar (https:// github.com/shenkers/sampling). EdgeR 36 was used to perform RLE normalization, since it is more appropriate for differential expression analysis. Information on the datasets resulting from this study is available in Table 2.

Data Records
Raw reads were gathered in the NCBI BioProject PRJNA628886 40 which includes all BioSamples used for the study ( Table 2, Supplementary table 1). We also provide the following in figshare collection 5127704 41 (Table 2): the quantification matrix for the 206,012 contigs; the list of identifiers corresponding to the 76,550 contigs after filtering; the two suggested normalization matrices (down-sampling and RLE) and; the datasets annotations (DIAMOND annotation matrix, InterProScan annotation, GO association).

technical Validation
Molecular validation of morphological identification. Since C. finmarchicus' Arctic congener C. glacialis also occurs in the region of sampling and differences between the species can be very subtle 42 , morphological identification was validated by molecular species identification on a subset of samples from the same stations 21,43 . DNA was extracted from individual copepods using the HotShot method 44 , and the species-specific nuclear insertion/deletion (InDel) marker G-150 was amplified using a modified protocol from Smolina et al. 45 . Identification was done by accessing the size of the resulting amplicon via electrophoresis on a 2% agarose gel. Results have shown that 99% of the individuals identified as C. finmarchicus by the morphological identification method were also clearly identified as C. finmarchicus by the molecular identification method, while 0.1% were not clearly identified and 0.7% were identified as the Arctic congener Calanus glacialis (n = 305 individuals).
Extraction and RNA integrity. RNA extraction procedures were performed with randomization of samples to ensure reliable and unbiased data production. RNA purity was assessed by OD measurements with a NanoDrop 8000 spectrophotometer (ThermoFisher Scientific, USA), and all 260/280 and 260/230 OD ratio was superior to 1.9. RNA integrity was evaluated with a Fragment Analyzer (Advanced Analytical Technologies, Inc., Iowa, USA; RNA Kit (15nt) Standard Sensitivity, Agilent). Due to a non-conventional 28 S/18 S ribosomal ratio in this species, sample quality was evaluated on the electropherogram 46 . No degradation in the inter region was observed. Total RNA samples were stored at −80 °C.
Raw reads assessment and quantification overview. All samples passed the FastQC 33 "base quality control". No relevant contamination hit was found after the alignment against E. coli, Yeast and PhiX. The mapping rate against the reference transcriptome 22 of 206,012 contigs was higher than 72.4% for properly paired reads and higher than 93.6% considering both paired and single mate reads, validating the raw reads quality (Fig. 2,  Supplementary Table 3). Furthermore, over the 42 samples, the maximal percentage of multi-mapped alignment is of 3.31% (Fig. 2, Supplementary Table 3).
For an overview of the quantification matrix, a principal component analysis (PCA) was generated on the raw pseudo-count (log2 (count + 1)) non-normalized matrix (Fig. 3). Results showed a clear separation between samples from "North" and "South" stations, indicating environmental variations that might be due to latitude and/or sea ice-coverage.
Filtering. Of the 206,012 transcripts, 37% (76,550) were expressed above the threshold of 1 CPM. This result corroborates previously observed results on the C. finmarchicus transcriptome 22 . Thus a large proportion of the whole contigs (63%) exhibited an extremely low level of expression, representing only 1.32 ± 0.04% of total aligned reads at "North", and 1.27 ± 0.05% at "South" (Table 3, Supplementary Table 4).

Contigs annotation.
By selecting the best hit for each database, the annotation matrix generated with Diamond 37 has led to 36,274 and 22,527 contigs with an annotation in at least one database out of the 206,012 and 76,550 contigs respectively (Table 4). Moreover, the number of unique hits for each database is always lower than the number of contigs annotated by the respective database, highlighting the contigs' functional redundancies.
The InterProScan annotation provided annotations from many protein signature databases. The main results are presented in Table 5 and Supplementary Table 5. A GO was attributed to 65,924 contigs over the whole transcriptome (206,012 contigs), while 33,057 contigs out of the 76,550 contigs with an expression level higher than 1 CPM in at least one sample had a GO annotation (Table 5, Supplementary Table 5). www.nature.com/scientificdata www.nature.com/scientificdata/ Quantitative real-time pCR data for normalization verification. The relative expression of six core circadian clock genes (clock, cycle, period1, timeless, cryptochrome2, vrille) and 2 circadian clock-related genes (cryptochrome1 and doubletime2) was investigated by quantitative real-time PCR and are available in Supplementary Table 6, allowing the verification of RNA sequencing normalization for further investigations. Regarding the two normalizations, the down-sampling normalization was selected for a rhythmic analysis based on concordant temporal expression profiles with qPCR data (using RAIN algorithm 47 ), while the RLE   www.nature.com/scientificdata www.nature.com/scientificdata/ normalization has been validated for differential expression analysis of the mean level of expression between stations, using the 21 samples of each stations as replicates.

Usage Notes
We present here the first in situ daily transcriptomes from the high Arctic, where molecular investigations of biological rhythms are exceptionally limited 15,16 . The samplings have been realized during drastic Polar photic conditions, i.e. the summer solstice, when daily oscillations of the Sun are minimal, high in the sky and always above the horizon 15 . The proposed datasets are thus novel and of interest due to the unique geographical location and time of year, the ecological importance of C. finmarchicus, and the rigorous temporal sampling strategy. Another strength of this dataset is the high depth of the RNA sequencing, with an average of 126 ± 5 million of reads (mean ± SE) per sample, which optimizes the detection of rhythmic transcripts 25 in a species with a large genome 20,21 . Finally, the elaborate annotation of the large transcriptome is now publicly available and is thus accessible for further research.
The sampling strategy is optimized for rhythmic analysis, and particularly adapted for RAIN algorithm analysis 23,25,47 . Moreover, dataset allows powerful differential gene expression analysis using the 21 samples per station as replicates providing time-integrated detection of differentially expressed genes in C. finmarchicus with latitude/ sea ice-cover. With climate driven environmental changes, this dataset ultimately constitutes new insights into transcriptomic regulation in the northward migrating copepod C. finmarchicus.

Code availability
Parameters to software tools involved are described in the following paragraph.