## Background & Summary

For historic reasons, certain organisms, such as Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Danio rerio or Mus musculus have gained the status of model organisms, due to their biological attributes and to technical advantages1. The study of these organisms has benefited from the concerted effort of a large research community allocating vast resources, resulting in a significantly larger body of genetic and genomic information concerning these organisms than any other. However, many other organisms have unique traits that make them valuable models for the study of specific biological processes. Until recently, genetic and genomic information for this latter group was scarce due to limitations of cost and labour intensity.

Next Generation Sequencing (NGS) tools have made the production of genetic and genomic information more accessible. NGS has made such research feasible for the study of all organisms, including those for whom no reference genetic data is available, the so-called non-model organisms1. Moreover, the innovative approaches to the analysis of these new sequencing data turned previously perceived obstacles into potentially surmountable challenges.

In the study presented here, we chose to focus on the transcriptome of the medicinal leech, Hirudo medicinalis CNS, which is a non-model organism, but serves as a well-studied model in neurobiology, specifically in neuronal development, regeneration and repair2,3. The leech CNS is composed of 6 fused ganglia at the head, 21 highly similar body ganglia and 7 fused tail ganglia4. Each ganglion contains approximately 200 pairs of neurons and is linked to its neighbours by thousands of axons5. This valuable model offers an interesting platform for use of molecular and cellular scientific methods for the evaluation of the involvement of specific cells in the regenerative processes6. Previous work has yielded characterization of specific genes in leech CNS7,8, and genes of interest were studied using the candidate gene approach912. Recently, an expression sequence tag (EST) database was constructed and is now available to the scientific community13. Yet, functional genomic studies in the Hirudo medicinalis are in their infancy13,14.

While several previous studies have shed some light on particular genes and gene expression patterns1315, the full transcriptomic data of the Hirudo medicinalis CNS is still limited. In our related work at Bioinformatics we provided an in-depth spatial regulation analysis of the CNS transcriptome data and showed the potential of combining expression distribution patterns to produce a spatio-transcripto map along the ganglia chain15. As illustrated in Fig. 1, we collected RNA from three distinct locations along the leech CNS (ganglia 2, 10 and 19). To achieve the single ganglion resolution, we collected the RNA content of these organs and had to amplify the RNAs, using the NuGEN amplification kit which known to be beneficial in low amounts of RNA16, before sequencing. In total, we sequenced 221.1 million high-quality short reads 50 bp single-end from the leech CNS. Then, using the de-novo assembly program, Trinity, we reconstructed these reads to produce the first model of the leech CNS transcriptome. By combining those three distinct points along the leech CNS we assumed that our data reveals most of the transcripts that are expressed in steady state of neuronal cells in the leech CNS.

In this Data Descriptor, we provide the full assembly and annotation datasets, aimed at making our data accessible to others for use in their research and for expanding the community understanding of this data. This study complements previous approaches to address similar questions in the leech CNS9,10,17. Sequencing the transcriptome is a prerequisite to the expansion of our knowledge on the nervous systems (physiological and pathological conditions). Utilizing these assembly data through an annotation dataset for these new transcripts may help in the accessibility and understanding of this data. The use of a simple model, the leech CNS, together with a novel assembly and analysis approaches, combine the transcriptome with a spatial configuration, thus producing a novel transcript database of the leech CNS. Furthermore, these leech databases can be used to define the underlying conserved genetic modules controlling the equivalent patterning processes along the CNS as well as serving to cross-validate each other. Similarity, these data will offer insights into the molecular mechanism that underpin the fundamental patterning differences between leech and related organisms.

## Methods

These methods have been presented in an abbreviated form in the journal Bioinformatics15.

### Animal conditions

The experiments were performed on the Hirudo medicinalis leech. All leeches were obtained from an adult Hirudo medicinalis colony grown in France at Ricarimpex Farm. Further to the transportation from the farm, leeches were maintained in our animal facility in tanks populated with about 20 leeches in a controlled environment, at 16 °C and 12 h/12 h day/night cycle.

### Experimental design

Twelve samples with a focus on the Hirudo medicinalis CNS were taken from six different leeches for this experiment. Before use, leeches were placed on ice for 30 min and then dissected dorsally. Three ganglia (2, 10 and 19) were harvested from three leeches (Fig. 1). For technical replicas, ganglion number 10 was harvested from three additional leeches, pooled together for RNA isolation and separated into three samples for RNA-seq.

### CNS/Ganglia collection

Total RNA was extracted from each ganglion using RNeasy Lipid Tissue (Qiagen). The quality and quantity of each RNA sample was assessed by Agilent’s 100 Bioanalyzer pico chip (Fig. 2a,b).

### RNA amplification

The initial RNA yield was low, requiring amplification of RNAs using a specific kit prior to the use of the mRNA-TruSeq preparation kit. RNA was amplified using Ovation Kit v2.0 (NuGEN) (Fig. 2c,d). Before amplification, all samples were lyophilized using a SpeedVac instrument and then suspended in 5 μl of nuclease-free water. This was the starting volume of NuGEN kit. Then, (as suggested by NuGEN), the 2 μg (in 100 μl) of cDNAs were fragmented by a Bioruptor instrument 3 cycles of 10 s of sonication and 90 s of pause. Library preparation proceeded with an ‘END-REPAIR’ reaction of NEB kit, then with TruSeq DNA/RNA libraries preparation according to Illumuna protocol. The 12 stock libraries were loaded on a High Sensitivity Chip and quantified on a QuBIT instrument, in order to prepare the two 6-plex pools were separated into two pools. The two pools were quantified (molarity) on Bioanalyzer HighSensitivity as stocks. In order to better balance the single libraries inside the pools, we decide to quantify libraries with qPCR, following the NuGEN suggestion for pool preparation.

### Illumina sequencing

The cDNA libraries were generated using messenger RNA-seq (mRNA-seq) assay for transcriptome sequencing on Illumina Hiseq2000 (Data Citation 1 and Table 1). Three cDNA libraries were generated from the total RNA of ganglion number 19 and three cDNA libraries were generated from the pooled total RNA of ganglion 10 in equal amounts, and sequencing was performed in one lane to generate 50 bp single end (SE) reads. A similar procedure was carried out for ganglia numbers 2 and 10. Library construction and sequencing was performed by a commercial service provider (IGA, Applied Genomics Institute).

### De novo assembly

Due to the fact that the genome/transcriptome of the Hirudo medicinalis is not available yet, we used de-novo tools for the reconstruction the transcriptome. The tools we used are Trinity18 (version trinityrnaseq_r2012-03-17) and Trans-ABySS19 (version 1.3.2). Here we report only the procedure and the results from Trinity, the full process and considerations can be seen in our analysis paper15. Trinity has been developed for assembly of short reads using de Bruijn graph algorithm by single k-mer. Trinity was executed in the inchworm method and the minimum contig length set to 200 nucleotides. The other parameters we used are default for Trinity single-end assembly (Trinity.pl—seqType fq—kmer_method inchworm—single seq.input—output seq.output—min_contig_length 200) (Data Citation 2 and Table 2).

### Annotation dataset

In general, the creation of an annotation dataset is derived from a draft genome or transcriptome20,21. For the Hirudo medicinalis none of these are available. Therefore, we used BLASTX (version 2.2.23) to identify sequence conservation and to create an annotated dataset for the 73,493 contigs, generated by Trinity, against Swiss-Prot database downloaded from the National Center for Biotechnology Information (NCBI). We used blast commands setting the minimum e-value to be e−10 for maximal confidence of the contigs (blastall -p blastx -i Trinity.fasta -d Swiss-prot e−10 −o Trinity.fasta.out). Next, to improve the readability of blast output we kept only the first result for each contig by using Linux common command (grep -E ‘Query\=|No\ hits\ found|^gi\|’ Trinity.fasta.blastx.out|grep -A1 ‘Query\=\contig_name’| sed ‘/^--/d’| sed ‘s/path.*//’ | paste - - -d’;’Trinity.fasta.blasx.out.best-hit). Applying the described pipeline has led to a set of reliable annotated contigs (Data Citation 3).

## Data Records

In this study we deposited four datasets. The first dataset is the RNA-Seq raw reads (Data Citation 1 and Table 1). This dataset contains 12 samples in total. Three biological replicates from each ganglia (numbers 2, 10 and 19), and three more technical replicates from ganglion number 10. The second dataset is the expression value of each of the Trinity assembly contigs (Data Citation 2 and Table 2). The third dataset contains the actual contigs (Data Citation 2 and Table 3). The fourth dataset is the annotation file (Data Citation 3). The annotation file is a comma separated value (CSV) format file with all of the annotated contigs generated by Trinity. The annotation file deposited in figshare depository. In general, there are 7 columns. Column #1 is the name of the contig, #2 is the length, #3 is the gi accession number, #4 is the sp accession number, #5 is the entry name, #6 is the BLASTX score and #7 is the e-value. The first two datasets described above (Data Citation 1 and Table 1, and Data Citation 2 and Table 2) were previously published in our related work in the journal Bioinformatics15, and the third dataset in Data Citation 2 (Table 3) and the forth dataset in Data Citation 3 are the core of this work and have not been published before.

## Technical Validation

### CNS collection quality control

Prior to processing, after samples were harvested, we assessed the quality and quantity of each sample using Agilent’s 100 Bioanalyzer pico chip. Only RNA samples which contained a concentration of 45–170 pg/μl were used for this experiment (Fig. 2a,b). Since the harvesting was only from single ganglions (~400 cells), we expected low concentrations of total RNA.

### RNA amplification quality control

Following amplification, cDNAs were quantified using Nanodrop and Bioanalyzer DNA 1000 Chip (Fig. 2c,d). After balancing the two pools for a similar total pool volume, we proceeded with the procedure. The first pool volume was 47.2 μl and the second pool was 45.9 μl.

### Sequencing quality control

We used multiple steps for testing sequencing quality. The first step included a count of total reads and total bases for each of the samples to ensure that the amounts are approximately of the same order of magnitude. These amounts were 10–26 million reads. As a second step, we tested samples to pass FastQC22 for basic statistics as quality estimation, per base sequence quality, per sequence quality score, length distribution and raw reads quality control (Table 4). As a third step, we estimated the sequencing depth of coverage by estimating the size of the transcriptome. We used the know information on a related organism, Helobdella robusta, which its transcriptome size is estimated to be 29032248 bp (http://metazoa.ensembl.org/Helobdella_robusta/Info/Annotation/#assembly). Thus, we estimated the sequencing depth of coverage to be $\frac{11000000000\mathit{bases}}{29032248}\mathbf{=}378$ (the number of sequenced bases divided by the predicted transcriptome size).

### Assembly quality control

Following the use of Trinity, to make sure that the produced contigs are correct, we compared our model of transcriptome also to Helobdella robusta published transcriptome. We start with comparing basic statistics such as average length of contigs (Table 4). We calculated the average contig length for the Hirudo medicinalis and we found it to be 1124 (Table 5), which is very similar to the average transcript size for the Helobdella robusta which is 1239 according to the Joint Genome Institute (http://genome.jgi-psf.org/Helro1/Helro1.home.html). Next, we used Bowtie23 to map reads back to the contigs to test the mapping rate for each sample (Fig. 3a). Finally, we correlated the technical replicates between themselves, which showed a strong correlation (Fig. 3b) (Table 3). Moreover, for examining the completeness of the data we tested full-length over the innexin genes family to see if the transcripts are really full without any bias to particular side, and we found that 11 out of the 21 innexin reconstructed perfectly, and the other 10 almost in fully15. Moreover, for strengthening the completeness evaluation of our data we estimated the transcriptome size according to the EnsemblMetazoa known information of the Helobdella robusta (see above), and found that the number of coding genes for the Helobdella robusta is 23432 which is very similar to the number of contigs with high e-value that when aligning to the Swiss-Prot (Fig. 3c).

### Annotation quality control

To ensure the quality of our annotation we set the blast e-value threshold to e−10. We found that the blast scores were high (Fig. 3c). By counting the number of transcripts that passed the threshold, we saw that the approximate predicted size of the Hirudo medicinalis CNS transcriptome is 22,604. This result is very similar to size of the Helobdella robusta and the Capitella teleta transcriptomes (19,487 and 37,908, respectively).

## Usage Notes

The data provided in this experimental set can be used for several purposes. First, it is possible to use the raw reads for executing a new experiment, with different analysis approaches. Second, each analysis step can be performed differently as all the technical experimental information is publicly available.

### De novo assembly

By using our non-redundant set of 73,493 transcripts we generated using Trinity, the search in the dataset for genes of interest can be easily done by homologs using blast, or text based search if using the annotation table. Moreover, the raw reads provided here make it possible to use alternative methods for de novo assembly that may assemble the reads into a different set of transcripts than the ones we constructed. The de novo assembly can be performed using SOAPdenovo-Trans24 or Velvet/Oases25,26.

### Differential expression

The combination of raw reads and transcripts allows the re-calculation of expression values for each transcript for each sample. There are different approaches for discovering differentially expressed genes. This can be further refined into new groups of genes expressed in each individual ganglion, or combinations of ganglia. This can be performed by using, for example, free Bioconductor packages such as DESeq27 or edgeR28 (http://www.bioconductor.org/).

### Downstream analysis

The datasets provided here represented by 3 biological replicates for each condition and technical replicates for one condition. Comparing the ganglia sequenced here to other ganglia to determine the set of genes statistically significantly differentially expressed along the whole CNS, and not only three points.