An improved de novo assembling and polishing of Solea senegalensis transcriptome shed light on retinoic acid signalling in larvae

Senegalese sole is an economically important flatfish species in aquaculture and an attractive model to decipher the molecular mechanisms governing the severe transformations occurring during metamorphosis, where retinoic acid seems to play a key role in tissue remodeling. In this study, a robust sole transcriptome was envisaged by reducing the number of assembled libraries (27 out of 111 available), fine-tuning a new automated and reproducible set of workflows for de novo assembling based on several assemblers, and removing low confidence transcripts after mapping onto a sole female genome draft. From a total of 96 resulting assemblies, two “raw” transcriptomes, one containing only Illumina reads and another with Illumina and GS-FLX reads, were selected to provide SOLSEv5.0, the most informative transcriptome with low redundancy and devoid of most single-exon transcripts. It included both Illumina and GS-FLX reads and consisted of 51,348 transcripts of which 22,684 code for 17,429 different proteins described in databases, where 9527 were predicted as complete proteins. SOLSEv5.0 was used as reference for the study of retinoic acid (RA) signalling in sole larvae using drug treatments (DEAB, a RA synthesis blocker, and TTNPB, a RA-receptor agonist) for 24 and 48 h. Differential expression and functional interpretation were facilitated by an updated version of DEGenes Hunter. Acute exposure of both drugs triggered an intense, specific and transient response at 24 h but with hardly observable differences after 48 h at least in the DEAB treatments. Activation of RA signalling by TTNPB specifically increased the expression of genes in pathways related to RA degradation, retinol storage, carotenoid metabolism, homeostatic response and visual cycle, and also modified the expression of transcripts related to morphogenesis and collagen fibril organisation. In contrast, DEAB mainly decreased genes related to retinal production, impairing phototransduction signalling in the retina. A total of 755 transcripts mainly related to lipid metabolism, lipid transport and lipid homeostasis were altered in response to both treatments, indicating non-specific drug responses associated with intestinal absorption. These results indicate that a new assembling and transcript sieving were both necessary to provide a reliable transcriptome to identify the many aspects of RA action during sole development that are of relevance for sole aquaculture.


Scientific Reports
| (2020) 10:20654 | https://doi.org/10.1038/s41598-020-77201-z www.nature.com/scientificreports/ resources including genomes and transcriptomes were reported and have been successfully used to understand flatfish asymmetry in Japanese flounder 1 , metamorphosis regulatory pathways in Atlantic halibut 2 , sex chromosome evolution and sex determination in half-smooth tongue sole 3 and gene mapping in turbot 4 . It fact, fish genomics has been called to support genetic improvement of production traits and breeding programmes in aquaculture 5 . Solea senegalensis (Senegalese sole) is the most important flatfish cultivated in Southern Europe by market price. Although its culture has faced specific challenges and some bottlenecks still persist, such as the lack of reproduction success for males cultured in captivity, the high dispersion of sizes or the high incidence of malformations. Nevertheless, more than 1600 tonnes are currently produced in Europe under advanced recirculation systems 6 . Key advances in hatchery and nursery protocols occurred in parallel with the development of important genomic resources including transcriptomes 7 and genome drafts 8,9 . All this information enabled the design of useful mid-and high-density arrays and successful high-throughput transcript sequencing (RNA-seq) studies providing relevant information about lipid physiology in larvae 10 , olfactory communication between breeders 11 , immune response associated to experimental diets 12 , osmoregulation 7 or pigmentation disorders in post-larvae 13 . However, the complex regulatory pathways that regulate metamorphosis in sole are not yet fully understood. The impressive plasticity of the individuals during this period is evidenced by their ability to imprint new properties as a function of the environmental factors. In fact, behavioral and metabolic changes lead to the acquisition of a functional digestive tract, competent immunity responses, mature neuro-muscular skeletal system, skin and pigmentation development, visual performance, and gonad development during this period 10,[14][15][16] . Metamorphosis is triggered by a surge of thyroid hormones (TH) 17 that govern a TH-responsive asymmetric centre in the anterior head region for eye migration and head remodeling 18 . Even so, these signalling cascades are highly modulated by environmental factors such as vitamin A that can interact with TH levels to modify skeletal morphogenesis leading to skeletal deformities 19 . In the Japanese flounder, retinoic acid (RA), the active metabolite of vitamin A, also appears to be critical in establishing asymmetric pigmentation and in modulating eye migration 1 . In sole, a few key genes involved in RA metabolism have been described 20 , and further research is required to understand the action of retinoids during metamorphosis in this species.
High-quality assembled and annotated transcriptomes are essential to achieve a comprehensive picture of cell or tissue physiology 21 , which is especially critical when genome sequence is not established. Transcriptomes are highly dynamic and show important differences depending on the cell genetic background, regulatory programming during organism development and cell differentiation, and post-transcriptional modifications 22 . As a result, assembling a de novo transcriptome, an essential approach for high-throughput gene expression studies, is prone to overestimate the number of transcripts due to immature mRNAs, intermediary spliced forms, sequencing errors, fragmented transcripts, insufficient sequencing depth and biological variability 23 . Previous assembled transcriptomes in sole reported a high number of transcripts 7,8,24,25 that clearly exceeded the expected number of predicted genes as reported in closely related flatfish (about 21,000 protein-coding genes 1,3,26 ). Accurate transcript quantification for gene expression studies is hindered by over-represented transcriptomes 27 , resulting in biased transcript discovery, over-estimation of family-collapsed contigs, and under-estimation of redundant contigs 11,27 . Bioinformatic strategies to reduce artefacts and redundancy have been implemented in sole 7 , but the result was still far from being optimal, and further polishing to increase tissue representativity, transcript completeness and annotations is required.
The release of better de novo transcriptomes is challenging that evolves quite fast 22,28,29 . The current study describes a new and improved version of the S. senegalensis transcriptome named SOLSEv5.0 using only a representative selection of sequencing libraries as input to a fine-tuned de novo assembling workflow based on TransFlow 28 that combines multiple assemblers. The best assembly was then polished to remove low confidence transcripts. Finally, its suitability for RNA-seq studies was demonstrated using an experimental study that investigates the role of RA on metamorphic larvae by using a specific RA-receptor agonist or a RA synthesis blocker. The bioinformatic analysis was performed using an improved version of DEGenes Hunter 30 .

Results
De novo assembling for new "raw" transcriptomes. It is well established that more input reads do not produce better assemblies 31,32 . According to this observation, 30 Illumina and 1 GS-FLX libraries were adopted as the most informative ones from the original dataset of 111 libraries 7 following criteria detailed in "Methods" section. These libraries accounted for 454 million Illumina pair-end reads and 5.66 millions GS-FLX single-end reads that after pre-processing rendered 361 and 3.1 millions of useful reads, respectively (Supplementary File 1). To validate species-specific reads, useful reads were then mapped onto a draft genome of S. senegalensis 8,9,33 . Supplementary File 1 shows that GS-FLX reads presented the lowest mapping rate (10.59 %, maybe due to their long length or the presence of too many polymorphisms), while Illumina libraries H634, H638 and H639 have distinctly higher mapping rates ( > 70 % maybe due to an overrepresentation of immature transcripts containing introns). Since libraries H634, H638 and H639, together with H632, belonged to the same experimental batch, they were discarded, resulting in a subset of 26 Illumina (275,501,704 paired-end reads) and 1 GS-FLX libraries for assembling the new transcriptome.
"Raw" transcriptomes were assembled using a fine-tuned TransFlow workflow detailed in "Methods" section. Since transcriptome evaluation was independent of the TT sequence but based on 15 quality parameters calculated by TransFlow, zebrafish transcriptome was selected as reference because it is the best known fish transcriptome. The 15 assemblies with the lowest PCA distance to the zebrafish reference transcriptome are presented in Table 1. scOases_k45 (Table 1), where only Illumina reads were assembled using Oases with k-mer = 45, was the first choice, and its name was simplified for convenience as "Oases_k45" from now on. Since GS-FLX libraries were prepared using a different set of tissues 7 , aaMin2/scOases_cat_cd/454Cap3 in Table 1  www.nature.com/scientificreports/ "Min2_Oases_Cap3" from now on for convenience) was also considered. It was built using the following steps: (i) assembling of GS-FLX reads using Mira3 and Euler-SR, and the resulting contigs were combined with CAP3; (ii) assembling Illumina reads with Oases using k-mers of 45 and 55, and redundancy decreasing using CDHIT-EST; and (iii) reconciliation of non-redundant contigs from step (ii) with CAP3 contigs from step (i) using Minimus2. Interestingly, the Oases_k45 was produced in step (ii), which makes it a part of Min2_Oases_Cap3. Both Oases_k45 and Min2_Oases_Cap3 were functionally and structurally annotated with full-length proteins from Actinopterigii taxon and then completed with full-length proteins from the vertebrate division, and were then compared to the published transcriptome (referred as v4.0) 7 . Table 2 shows that new "raw" transcriptomes have less tentative transcripts (TTs), smaller overall transcriptome length (but similar between Oases_k45 and Min2_Oases_Cap3), a two-to four-fold increase of N50 and N90, and less artefacts than v4.0 obtained with the same reads. Therefore, the proposed fine-tuned assembling workflow generated more accurate transcriptomes. Table 1. The 15 assembling approaches with the lowest PCA distance to the reference transcriptome of D. rerio (zebrafish) generated after fine-tunning TransFlow are presented. The 8 top-ranked ones were based only on Illumina libraries and the remaining were combinations of Illumina and GS-FLX reads. "Raw" transcriptomes selected for further investigation are marked with "*". a Names are derived from the resulting TransFlow approach as described in 28 .   Table 2 displays the functional annotation and coding statuses of the three transcriptomes. It was striking that the new "raw" transcriptomes had less TTs 'Without orthologue' and without predictable coding region ('Unknown nature' in Table 2) than v4.0. Furthermore, Min2_Oases_Cap3 had 7 353 additional orthologues ('With orthologue' in Table 2), 8574 additional unique othologues and 3546 additional unique complete transcripts compared to Oases_k45. Therefore, Min2_Oases_Cap3 appeared as a more comprehensive "raw" transcriptome than Oases_k45 despite having a high rate of unknown TTs, which indicated that further polishing was advisable.
Transcriptome polishing. The first cleansing step was conducted by mapping the TTs from the three "raw" transcriptomes (see Table 2) on the Senegalese sole genome, where v4.0 was included as a baseline control. As a result, Min2_Oases_Cap3 and Oases_k45 transcriptomes contained less unmapping, atypical and low quality TTs than v4.0 (see Table 3 for values and definitions). Hence, the three kinds of low confidence TTs were removed from their respective "raw" transcriptome to produce "definitive" transcriptomes that were then evaluated for intron-exon patterns. A higher proportion of introns per TT were found in Min2_Oases_Cap3 and Oases_k45, particularly for TTs with > 3 exons (Table 3). Most removed TTs were very short, or the consequence of chimaeric or misassembled contigs (Supplementary File 2). Therefore, polishing steps improved "definitive" transcriptomes, even though it hardly removes low confidence TTs in v4.0 due to a suboptimal assembling approach.
To confirm that polishing did not remove informative TTs, structural statuses of "raw" and "definitive" transcriptomes, as well as the set of "raw" quality" TTs, were compared ( Fig. 1). 'Unknown' status was always the most abundant and accounted for huge TT numbers in "raw" transcriptomes and was markedly decreased after polishing (Table 3). Artefacts-abundant in "raw" v4.0 (125,606)-were nearly absent in Min2_Oases_Cap3 and Oases_k45 (841 and 90, respectively), and most of them (99.17%: 25,939 not mapping onto the genome draft Table 3. Low confidence TTs identified in "raw" transcriptomes that were removed to constitute "definitive" transcriptomes whose TTs were counted by the number of exons identified in their sequence. a 'Unmapping' refers to TTs that were not mapped onto the genome draft of Senegalesse sole, suggesting that they come from artefactual assembling. b ' Atypical' refers to TTs that were not gapped by introns after mapping onto the genome draft, suggesting that they are closer to a genomic sequence than to a transcript. c 'Low quality' refers to TTs that mapped < 70 % of their length with < 70 % identity onto the genome draft, suggesting that their sequences are chimaeric or contain too many errors.   Table 2. "Low quality" results from removal of unmapped and atypical TTs in Table 3. "Definitive" are also devoid of low quality TTs in Table 3. Structural statuses for TTs are ' Artifacts' (not supported by mapped paired-end reads), 'Unknown' (without any sequence similarity in databases), 'Complete' (with a complete CDS (coding sequence) region), 'C-terminal' (with a C-terminal part of a CDS), 'N-terminal' (with a N-terminal part of a CDS), 'Internal' (internal CDS identified, lacking both N-and C-terminus), 'New coding' (can code for an unidentified protein), and 'ncRNA' (identical to ncRNA precursor in databases). Intriguingly, the decrease in the other structural statuses after polishing was moderated (Fig. 1). This was explained by the TT redundancy (Fig. 2), where v4.0 was the most redundant compared to Oases_k45 and Min2_Oases_Cap3. TT decreases from "low quality" to "definitive" transcriptomes may be explained by redundancy too. Overall, these data demonstrate that "definitive" Min2_Oases_Cap3 was less redundant and contained more different orthologue IDs than the other transcriptomes and hence it should be selected as SOLSEv5.0. The SOLSEv5.0 transcriptome. Usefulness of SOLSEv5.0 transcriptome for gene expression analyses was demonstrated by mapping 18 Illumina libraries corresponding to a study of RA signalling in Senegalese sole larvae (Supplementary File 3) onto the three "definitive" transcriptomes and the "raw" v4.0 as a baseline control (Fig. 3). Columns "Mapping reads" and "Mapping TTs/[v4.0]" indicated that the amount of reads mapping on a transcriptome decreased with respect to the number of TTs in the transcriptome. The ratio of mapping reads in "definitive" Oases_k45 was lower (50 %) than in Min2_Oases_Cap3 (75 %), that could be due to the tissuespecific TTs provided by the GS-FLX reads, suggesting that they should be retained in SOLSEv5.0. The fact that most of the TTs ( > 80 %) in "definitive" Min2_Oases_Cap3 and Oases_k45 were supported by at least one pair of reads ("Mapped TTs" columns in Fig. 3) reinforced again that polishing was not associated with information loss but a redundancy reduction, as suggested above (Fig. 2). Finally, "definitive" Min2_Oases_Cap3 showed the highest amount of "Reads per TT" (Fig. 3), strongly suggesting that it should be the Senegalese sole transcriptome SOLSEv5.0 34 . When SOLSEv5.0 34 and "definitive" Oases_k45 (Table 4) were compared with their "raw" transcriptome versions ( Table 2, Fig. 1), it is noticeable the strong reduction of TTs without orthologue and, more interestingly, that most of them were now predicted as coding sequences by Transdecoder (which is included in our Full-LengtherNext annotation tool) instead of unknown nature. It is also observed an increase in N50 and N90 of each assembly, and a decrease in artefacts, in spite of the decrease in the number of orthologues and unique complete transcripts due to polishing (even if this reduction is mainly due to redundancy, as observed in Fig. 2). The main features of SOLSEv5.0 when compared to "definitive" Oases_k45 (Table 4) revealed that (i) the aggregate length of SOLSEv5.0 was shorter in spite of having more TTs; (ii) artefactual TTs (Supplementary File 2, Fig. 1) were absent; (iii) SOLSEv5.0 had more orthologues, more unique IDs and more complete and unique complete TTs in spite of having shorter N50 and N90; and (iv) although SOLSEv5.0 had 6088 more TTs without orthologue, a higher proportion (5098, 64.2 %) was predicted to code for a protein that can be the result of new genes, genes with highly divergent sequences, or new splicing isoforms that should be taken into account. Note that the longest TT observed in Min2_Oases_Cap3 in Table 2 disappeared in Table 4, and both "definitive" transcriptomes have the same longest transcript. Since it is reported that the overall quality of assemblies must rely on more than one statistics 35,36 , results about mapping of Fig. 3 as well as the increase of N50/N90, the contrast in number of orthologues, unique orthologues, complete transcripts, and even the ratio of coding sequences among the TTs without orthologue (Table 4) are also reinforcing the choice of the SOLSEv5.0 transcriptome.   30 . It was designed for dealing with de novo assembled transcriptomes with improved functionalities for differential expression and new capabilities for functional interpretation. A highly conservative approach applying default expression thresholds based on "prevalent" DETs (differentially expressed TTs) was followed. Untreated control (CTRL) larvae were compared to those conditioned with DEAB (a blocker of RA synthesis that inhibits the enzyme retinaldehyde dehydrogenase) and TTNPB (a specific RA-   Figure 3. Mapping ratios of the 18 libraries described in Supplementary File 3 when "definitive" transcriptomes of v4.0, Oases_k45 and Min2_Oases_Cap3, as well as the "raw" v4.0 transcriptome as a baseline control, were used as reference. Ratios were calculated as the amount of sequences that align to a transcriptome with respect to the number of TTs in the transcriptome and are given as mean and standard error for the 18 samples. "Mapping reads" refers to reads aligned to the corresponding reference transcriptome; "Mapped TTs" refers to TTs that contain at least one read-pair aligned on them; "Mapped TT/[v4.0]" is as "Mapped TTs" but divided by the amount of TTs in the "raw" v4.0; and "Mapped reads per TT" refers to the amount of mapped reads on the transcriptome per total number of TTs.  (Fig. 4A). Interestingly, both drug treatments modified the expression of a common set of 755 transcripts (mainly at 24 h, 650 DETs). The PCA (principal component analysis) using the 4 730 "prevalent" DETs across the four comparisons displayed a transient effect more intense at 24 h than at 48 h. The three experimental groups (CTRL, DEAB and TTNPB) showed separation at 24 h (Fig. 4B, circles). However, at 48 h (Fig. 4B, squares) the DEAB-treated larvae appeared closer and overlapping with the CTRL group although TTNPB still remained slightly separated, suggesting a more sustained response with TTNPB than with DEAB.

Mapped reads per TT
To assess the robustness of SOLSEv5.0 regarding differential expression, the same analysis was performed using "raw" v4.0 as transcriptome reference and the results are given Supplementary File 5. In agreement with Fig. 3, the average number of normalized counts per TT for SOLSEv5.0, ranging from 40 % (48 h) to 90 % (24 h), is higher than for "raw" v4.0 (Supplementary File 6A). As expected, more DETs were identified in "raw" v4.0 (Supplementary File 6B) due to redundancy (Fig. 2) and the total number of TTs (697 125 for "raw" v4.0 ( Table 2) and 51 348 for SOLSEv5.0 [ Table 4]). The mapping of 48.4 % of TTs in "raw" v4.0 produce more common DETs than using SOLSEv5.0 (compare Supplementary File 6B with Fig. 4A). However, sample clustering by PCA in Supplementary File 6C was similar to Fig. 4B. Taken together, it can be stated that the highly polished SOLSEv5.0 was less redundant, less confounding and at least as informative than the original "raw" v4.0 transcriptome.

Functional analysis of DEAB and TTNPB effects on RA signalling. The new version of DEGenes
Hunter extended and facilitated the functional data interpretation based on zebrafish orthologues (given as functional_report.html in every CTRL_vs folder in Supplementary File 4). Overall, main GOs of biological pathways modified by drug treatments are depicted in Fig. 5. DEAB and TTNPB treatments shared eight over-represented pathways related to "lipid metabolism", "transport and homeostasis" and "carboxylic acid metabolism". Pathways more represented in DEAB-treated larvae were "regulation of cholesterol esterification" and those associated with immune system ("antigen processing and presentation", "antigen processing and presentation of peptide antigen via MHC class I", "antigen processing" and "presentation of peptide antigen", the latter was observed both at 24 and 48 h). In TTNPB-treated larvae, specific over-represented pathways were mainly related to morphogenesis at 24 h and those related to retinoids and other metabolic pathways at 48 h.
To gain more insights about RA-signalling, the set of shared DETs and those specific for DEAB and TTNPB treatments were separately analyzed for enrichment of GO categories. In shared DETs, the "terpenoid metabolic process" pathway that is directly involved in RA metabolism was significantly over-represented and mainly up-regulated. Moreover, genes in categories previously identified and associated with "lipids", "immune system", "carbohydrate homeostasis" and "proteolysis" appeared mainly down-regulated (Fig. 6, 'Shared DETs'). For DEAB-specific DETs (1220), main over-represented pathways contain down-regulated genes at 24 h (1023 DETs). Some of them were previously observed in the set of shared DETs (those related to "peptidase activity", "immune system" and "lipids") and only "mitotic cell cycle" and "organelle fission" pathways were specific (Fig. 6, 'DEAB-specific 24h'). When the set of TTNPB-specific transcripts was analyzed at 24 h (2496 specific DETs, 83.5% down-regulated) main over-represented GO categories were related to "morphogenesis", "phosphorus and organonitrogen metabolism", "vesicle-mediated transport", "dephosphorylation", "enzyme activator activity" and "synaptic vesicle cycle" (Fig. 6, 'TTNPB-specific 24h'). At 48 h (327 DETs, 84.7% up-regulated) there was a  www.nature.com/scientificreports/ shift in significantly enriched categories with the "regulation of carbohydrate catabolism", "extracellular matrix organization", "isoprenoid catabolic process", "intracellular receptor signalling pathway" and "retinol metabolic process", among others, pathways (Fig. 6, 'TTNPB-specific 48 h'). The same functional interpretation was carried out using the results Supplementary File 5 obtained with "raw" v4.0 transcriptome as reference. In spite of the different number and DETs and zebrafish orthologues (Supplementary File 7, Venn diagrams), with about 60.4 % matching in zebrafish IDs in both sets of DETs), the GO analysis clearly demonstrated that categories specially related with RA metabolism of TTNPB at 48 h were missed (Supplementary File 8). In fact, "raw" v4.0 transcriptome was biased toward metabolic pathways with a high redundancy through the different comparisons. These categories were also observable using SOLSEv5.0 (Fig. 6), supporting that SOLSEv5.0 can provide a clearer and more comprehensive functional interpretation, avoiding the negative consequences of high redundancy.
A detailed analysis of DETs from Supplementary File 4 identified several genes related to RA and TH metabolism that are indicated in Table 5. RA receptors, such as rargb, raraa and rxrba, that mediated RA actions were differentially expressed only in the TTNPB-treated larvae. The rbp4 (retinol binding protein) was down-regulated and crpb1a (cellular RA binding protein) was up-regulated in both DEAB and TTNPB treatments, whereas crabp2b was up-regulated only in TTNPB treatments. When enzymes implicated in RA biosynthesis and degradation were monitored, cytochromes P450 26A1 (cyp26a1) and 26B1 (cyp26b1) as well as lecithin retinol acyltransferases (lrata and lratb.1) were strongly up-regulated in TTNPB treatments (cytochromes were downregulated in DEAB treatments but not significantly). Also, dehydrogenase/reductase 3 (dhrs3b) was significantly up-regulated in TTNPB and down-regulated in DEAB treatments. Enzymes aldehyde dehydrogenase 1 family member A2 (aldh1a2) and trans-retinol 13,14-reductase (retsat) appeared only significantly down-regulated in TTNPB group. Enzymes all-retinol dehydrogenase 7 (rdh7) and β, β-carotene 9' ,10'-oxygenase-like (bco2l) were down-regulated in both treatments and the β, β-carotene 15,15'-dioxygenase-like (bco1) and β, β-carotene 9' ,10'-oxygenase (bco2a) were up-regulated in DEAB treatments. In addition to these genes, Table 5 also presents a wide set of DETs related to thyroid axis, phototransduction signalling in retina, pigmentation-related genes, and matrix-related genes. Moreover, several transcription factors were also differentially expressed with DEAB and TTNPB treatments (Supplementary File 4).

Discussion
The wide use of RNA-seq and the increasing number of studies focused on transcript-discovery or expression profiles are paving the way to obtain better transcriptomes and new assembling pipelines combining different bioinformatics tools, such as CAFE 22 , TransFlow 28 or RefShannon 29 . These new pipelines have to deal with errors due to the high number of reads, weakly expressed or 'uncommon' transcripts, circular RNAs, gene fusions, and isoform discrimination. It was well established that assembly quality improves with longer paired-end reads 28 but not increasing the input sequencing dataset 31 due to, for example, the increase in low abundance transcripts and a plethora of unannotated transcripts, most of them containing intronic regions 32 . Hence, small sequencing datasets are desirable to decrease computational requirements and produce robust transcriptomes, since manual curation is excluded, even though it has been proved to render the best results 37 . Unfortunately, there is no consensus on the most efficient RNA-seq analysis protocols for characterizing and quantifying transcripts, which can result in a high number of TTs, such as the case of the Senegalese sole transcriptome 7 , with nearly 700,000 TTs and mapping rates higher than 90% 7,11 . These figures indicate an over-estimation of the real number of transcripts provoked by the huge amounts of reads in datasets used as input in the pipeline 31,32 and the suboptimal assembling approach ( Table 2). A partial solution to the challenging de novo assembling arrived from using only 27 high-quality sole libraries (Supplementary File 1) instead of all (111) replicates 7 , that represent about 23 % of the total number of reads, reduction that will not have any significant impact on the representativity of transcript sequences, even though a small overall decrease in biological variability might occur. Then, de novo assembling reduced RAM demands and time requirements, and reduced to 96 the former 180 assemblies produced by TransFlow workflow 28 after considering only scaffolded sequences with a minimal coverage of 10 paired-end reads and increasing the k-mer sizes. The resulting transcriptomes had lower amounts of TTs with a lower proportion of "Unknown nature" ( Table 2) in agreement with a recently published study where depth effects on de novo RNA-seq assembling was demonstrated 32 . The resulting high-quality, annotated transcriptome named SOLSEv5.0 34 contained Illumina and GS-FLX libraries from different tissues ( Table 1) and was consistent with other species where the presence of GS-FLX reads improved the final transcriptome 28 , as can be seen in Fig. 3 and Table 4. Therefore, the new assembling pipelines with less input datasets contributed to obtain more accurate "raw" transcriptomes than the original v4.0 approach ( Table 2) 7,22,28 .
The numbers shown in Table 2 indicated that additional polishing was required besides published criteria 7 of sequence similarity, gene orthology or coding prediction that improved the original v4.0 10,11 . In the present study, removal of low confidence TTs based on structural annotation and their mapping patterns (Table 3) diminished the huge amount of artefactual single-exon and uninformative TTs (Fig. 1), as well as TT redundancy (Fig. 2), which resulted in more robust mapping when using "definitive" Min2_Oases_Cap3 (Fig. 3). The new transcriptome SOLSEv5.0 (Table 4) contained 22,684 TTs coding for 17,429 different proteins, a reasonable number compared with phylogenetically near species such as the 21,516 different proteins described on Cynoglossus semilaevis genome 3 or the 21,787 in Japanese flounder 1 . Based on SOLSEv5.0 features, it is proposed that any de novo transcriptome should rely on selecting a reasonable number of paired-end reads to be assembled with an optimised workflow, and a final removal of low confidence TTs (Fig. 3). In fact, a suboptimal assembly presents many low confidence TTs that are difficult to remove (Fig. 1), and a good assembling approach without polishing still retains many redundant TTs (Fig. 2  www.nature.com/scientificreports/ But, even if de novo assembled transcriptomes are acceptable, bioinformatic tools for RNA-seq are not specifically designed for them due to the low annotation rates and biases in FDR corrections than are exacerbated by redundant transcripts 27 . This is why improved R scripts of DEGenes Hunter were developed in this study to analyse differential expression patterns using SOLSEv5.0 34 as reference. The analysis relied on "prevalent" DETs (those TTs predicted as differentially expressed in all algorithms applied) that were then used for subsequent functional analyses (Supplementary File 4) RA synthesis, degradation and cellular transport are highly conserved in vertebrates 38 . A coordinated action of these three pathways have revealed as essential to fine-tune RA levels gradients that in turn control morphogenesis and many other developmental processes. A previous study in sole using DEAB an TTNPB drugs demonstrated that RA levels modify the expression of some enzymes involved in retinol-retinal-RA conversion and RA degradation by establishing a negative feedback mechanism 39 . The present study demonstrates that acute exposure of larvae to these two drugs triggered an intense, specific and transient response at 24 h but with hardly any differences after 48 h. Although these treatments modified the expression of a common set of transcripts, the overall number of DETs were representative of drug-specific regulatory activities as demonstrated by PCA analysis (Fig. 4). Remarkably, both DEAB an TTNPB treatments activated a homeostatic response related to disturbance of retinoid metabolism (Figs. 5, 6). Key enzymes such as dhrs3b (Table 5), involved in the biosynthetic pathway and controlling the reduction of retinaldehyde back to retinol, appeared regulated through a negative feedback, promoting the synthesis of retinal in DEAB and retinol storage in TTNPB treatments. Moreover, the RA-receptor activation by TTNPB modified several pathways related to RA biosynthesis and degradation, retinol storage, carotenoid metabolism and visual cycle ( Table 5, Fig. 7). Cytochromes cyp26a1 and cyp26b1 mRNAs were highly up-regulated at 24 and 48 h, concomitant with the induction of cellular RA binding proteins 1 and 2 (genes crabp1 and crabp2a in Table 5, Fig. 7) to prevent excessive amounts of cellular all-trans retinoic acids (atRAs) [39][40][41] . Moreover, down-regulation of aldh1a2 expression (Table 5, Fig. 7) indicated a switch-off of the second step of RA biosynthesis. Simultaneously with controlling the balance between RA production and degradation, other closely related pathways were modified to promote the retinol storage by increasing lrat, reducing adipogenesis by retsat (Table 5, Fig. 7) 42 , shifting β-carotene transformation toward apocarotenals instead of retinal production, and promoting 11-cis retinoids synthesis 43,44 . Unlike TTNPB, DEAB treatments modified specifically only those pathways related to retinal supply (carotenoids) and transformation (visual cycle). On the whole, TTNPB and DEAB treatments triggered a wide homeostatic response beyond biosynthesis and degradation, as a feedback mechanism to control atRA-mediated actions. www.nature.com/scientificreports/ Concerning morphological aberrations, skeletogenesis disorders and malpigmentation, they were systemically associated with an excess of dietary vitamin A levels and external treatments with atRA. The present study indicates that increased RA signalling activated some pathways related to morphogenesis and collagen fibril organisation (Table 5). This is consistent with the fact that several transcription factors increased (hoxb1, hoxb5, Hox-B6b, Hox-C5a, hoxc6, intestine-specific homeobox and six2) or reduced (hoxa13, lhx2, hnf1b, hmbox1, cux1, zfhx3 and meis2) their expression levels (Supplementary File 4). Going into detail, RA-mediated dysregulation of hox genes in zebrafish was teratogenic due to a modification of the anteroposterior developmental patterning 45,46 . High levels of RA that induced overexpression of hox5b provoked severe cardiac abnormalities 46,47 . Moreover, a zebrafish defective in aldh1a2 that phenotypically lacks forelimbs (pectoral fins) and posterior branchial arches reduced the expression of hoxb5a, hoxb6a and hoxb6b along the entire length of its spinal cord expression domain, decreased hoxb4a expression in the hindbrain and failed to express dlx2, an early marker of apical ectodermal ridge activity in the fin bud 48 . In addition to transcription factors, TTNPB also modified the expression of 11 collagen-encoding genes ( Table 5) and osteocalcin-2 (bglap). Although our experimental design was intended to evaluate the effects under a short-time drug exposure, this wide response of regulatory transcriptional factors and collagens does not preclude major structural alterations and agrees with the morphological aberrations found in larvae fed high levels of vitamin A 19,49 .
RA has also been associated with malpigmentation in flatfish during metamorphosis, such as in Japanese flounder where high doses of 9-cis-RA promoted development of adult chromatophores when supplied at the beginning of metamorphosis 50 . More recent studies demonstrated that an RA gradient is required for the generation of asymmetric pigmentation in flatfish with higher concentrations of atRA and 9-cis-RA in the ocular side than blind side 1 . In sole, previous studies failed to demonstrate an association between dietary levels of vitamin A and pseudoalbinism 49,51 , more related to dietary arachidonic acid levels 52 . In the current study, TTNPB results in up-regulation of GTP cyclohydrolase I (gch1; Table 5), the first and rate-limiting enzyme of pteridine synthesis and specifically expressed in xanthoblasts and to some extent in melanoblasts 53,54 . Moreover, TTNPB also modified enzymes modulating hypoxanthine biosynthesis (the main pigment in iridophores) such as nt5c2 and pnp5a, suggesting disrupted pigmentation patterns in sole by modifying xanthophores and iridophores physiology 13,55 . High expression of gch1 and disruption of iridophores, contrary to the pattern observed in pseudoalbino 13 , could be behind of the high rates of ambicolouration disorders currently found in the sole industry. Further experiments with longer evaluation periods will be necessary to demonstrate this hypothesis.
RA and TH establish a cross talk to regulate morphogenesis and cell development and differentiation in vertebrates mediated by the competition of the thyroid receptor (TR) and RA receptor (RAR) to form a heterodimer with the retinoid X receptor (RXR) for binding to gene promoters 1,56 . This interaction is crucial in flatfish metamorphosis governed mostly by THs 17,18,25 . Recent molecular findings in Japanese flounder have demonstrated that disorders in eye migration during metamorphosis are mediated by the RA-TH interaction: atRA reduced the proliferation of cells in the suborbital area of the blind side eye by up-regulating rara and down-regulating thraa and thrb1 1 . In sole, dietary vitamin A levels also modified their thyroid follicle number and size, T3 and T4 immunoreactive staining, skeletogenesis and mineralisation 19,49 . Moreover, thrb and thrab expression was reduced in metamorphic larvae fed with high levels of vitamin A 19 , and hormonal treatments demonstrated that atRA and TTNPB increased thrb mRNA levels and reduced thrab transcript levels in post-metamorphic larvae but not in pre-metamorphosis 39 . In the present study, metamorphic larvae exposed to TTNPB down-regulated the expression of tg and thrab (Table 5), but did not modify the expression of dio2 and thrb, two genes that play a pivotal role in the asymmetric remodelling of sole head 18 . These data indicate a time-sensitivity window to hormonal treatments that might explain the minor effects on settlement and eye migration in sole, and the differences in gene expression patterns 19,49 .
DEAB and TTNPB modified the expression of a common set of 755 TTs (Fig. 4A), most of them at 24 h after drug treatments. These TTs were mainly related to "lipid metabolic process", "lipid transport" and "lipid homeostasis", processes that play a key role in the absorption and distribution of vitamin A. It is already known that high levels of retinal or RA inhibit adipogenesis when administered at early stages of adipocyte development 57 . External supply of retinal or Raldh1 −/− mice mutants that accumulate retinal down-regulate the expression of adipogenic target genes and the transport protein rbp4 (also observed in the present study) through inhibiting the PPAR-γ:RXR dimer as a defensive mechanism to limit a RA excess in tissues 57 . Similarly, liganded RA-receptor or agonists inhibit adipogenesis by blocking C/EBPβ-mediated induction of downstream genes 58 . Involvement of PPAR-γ and C/EBP in adipogenesis regulation seems to be a conserved setting in fish 59,60 . The regulation of lipid transport, including several apolipoproteins involved in chylomicrons and very-low-density lipoprotein formation and lipases, is a key regulatory mechanism in the intestine and liver of pelagic and metamorphosing sole larvae to fit nutrient levels 10,[61][62][63][64] . Therefore, our data suggest that DEAB and TTNPB treatments provided sensing signals to activate mechanisms that limit the availability of this fat-soluble vitamin, high levels of retinal in DEAB and enhanced RA-signalling in TTNPB treatments, which is consistent with the capacity of sole larvae to mobilize and store dietary vitamin A surplus 20 . They also support a negative feedback for vitamin A intestinal absorption and storage as previously suggested 44 and highlight non-specific RA effects associated with lipid metabolism in DEAB and TTNPB treatments that should be taken into consideration when these drugs are used in fish trials.
In conclusion, even if new sequencing data using long read technologies (Oxford Nanopore or PacBio Sequel) would benefit any transcriptome assembling, re-analysis of already existing data 65 is essential to provide optimised genomic tools for gene expression estudies and to exploit previous research investments. Hence this study demonstrates the usefulness of in silico strategy to generate the new SOLSEv5.0 as reference transcriptome for RNA-seq studies in sole. Also, new DEGenes Hunter capabilities have provided full relevant information for gene expression analyses. Both of them facilitated investigation into RA signalling and metabolism in larvae, revealing that DEAB and TTNPB drugs trigger a wide, coordinated and specific homeostatic response to Scientific Reports | (2020) 10:20654 | https://doi.org/10.1038/s41598-020-77201-z www.nature.com/scientificreports/ maintain physiological RA levels. Moreover, expression changes of transcripts involved in morphogenesis and pigmentation also support a RA role in tissue remodeling that could be behind the morphological abnomalities associated with the supply of vitamin A in sole larvae. The identification of DETs shared by both drug treatments demonstrate non-specific drug effects related to lipid metabolism and gut absorption that it is relevant for the design of future studies using these drugs.

Methods
Computational infrastructure. Picasso

De novo assembling and annotations. Pre-processed (useful) reads from libraries in Supplementary
File 1 were assembled using the automated and modular framework TransFlow 28 customised as follows: (i) increased k-mer length to 45 and 55 for Illumina reads (TransFlow variable $kmers = [45;55]), and to 29 for GS-FLX reads; (ii) removal of Illumina scaffolds with low coverage ( < 10× , new TransFlow variable $NT_COVERAGE_IN_CONTIG=10); (iii) primary, intermediary assemblies are not considered, retaining only scaffolded assemblies; and (iv) additional CD-HIT-EST 71 execution at the end of module 3 was included using options -c 0.95 -s 0.7 to reduce contig/scaffold redundancy at 95 % identity and 70 % overlap. The customised workflow is available at GitHub (https ://githu b.com/JoseC orCab /Trans Flow) and resulted in up to 96 assemblies. The Danio rerio GRCz11 build was used as reference transcriptome to select the best Senegalese sole "raw" transcriptome based on PCA distances. Sequencing reads ERR216329 from the SRA repository were used for quality assessment of the D. rerio transcriptome within TransFlow.
Removal of low confidence transcripts. The rationale of the following processes is to sieve abnormal TTs. Firstly, a "raw" transcriptome was mapped against the preliminary S. senegalensis genome draft 33 developed in our laboratory 8,9 using Minimap2 72 in splice mode, including option -uf for finding canonical splicing GT-AG sites, and option -C5 for a penalty of 5 for non-canonicals. The resulting SAM (Sequence Alignment Map) file was processed using SAMtools view -h -F 2052 to retain the best mapped position of multimapping transcripts without unmapped ones. According to Patterson et al. 32 , unspliced, fragmented, or missasembled, transcripts, as well as contaminant DNA, were then removed 32 , as well as TTs with ungapped mapping with > 90 % identity.
The last polishing step was intended to retain those TTs having > 70 % identity covering > 70 % of its length to avoid fragmented or chimeric TTs containing too many intronic sequences. Since coverage and identity are not given in SAM files, SAM format was converted to PAF (Pairwise mApping Format; https ://githu b.com/lh3/minia sm/blob/maste r/PAF.md) using sam2paf option of paftools.js (a script provided by Minimap2 authors in https ://githu b.com/lh3/minim ap2/tree/maste r/misc). CIGARs (Concise Idiosyncratic Gapped Alignment Report) were also transfered using an in-house script written in Ruby (v2.4.1) merge_paf_cigar.rb (this and other postprocessing scripts are available at GitHub https ://githu b.com/JoseC orCab /Trans Flow_postp roces sing). Minimap2 artefacts produced when aligning transcripts to genomes were trimmed using paf_report. rb, which also calculates coverage, identity and exon number for each transcript.

Fish trial and drug treatments.
To investigate the effects of retinoic acid (RA) on sole larvae, fertilized eggs were obtained from naturally spawning Senegalese sole broodstock (IFAPA Centro El Toruño). The eggs were collected in the morning (8.00 am) and separated by buoyancy to get the fecundated fraction. Eggs were incubated in 15 L cylindro-conical tanks using a flow-through sea water circuit with gentle aeration at an initial density of 3000 embryos L −1 and full exchange of sea water every hour. Newly hatched larvae (1 day post-hatch Scientific Reports | (2020) 10:20654 | https://doi.org/10.1038/s41598-020-77201-z www.nature.com/scientificreports/ (dph)) were transferred to a 400 L tanks at an initial density of 45 to 50 larvae L −1 and maintained in the dark until the onset of external feeding (3 dph). Detailed protocol for larval rearing was as previous described 73 . Shortly, larvae were supplied rotifers (Brachionus plicatilis) enriched for 3 h with Tisochrysis lutea (T-iso strain) from 3 till 9 dph. Moreover, Artemia metanauplii enriched for 24 h with T. lutea were supplied from 7 dph until the end of the experiment. Live microalgae Nannochloropsis gaditana and T. lutea were also added directly to water to enriched the live preys in the tank. During the trial, a 16 h:8 h light:dark photoperiod was used (light intensity 500 lux) and the mean water temperature and salinity was 21.1 ± 0.3 • C and 34.2 ± 0.2 ppt, respectively. When larvae had just started the metamorphosis (12 dph), they were distributed into nine 15 L cylindroconical tanks at the same initial larval density. Four days later (16 dph), when larvae were at the metamorphosis climax (metamorphic stages S2-S3) 17  , an inhibitor of aldehyde dehydrogenase (ALDH) enzymes that prevents RA synthesis. Doses were selected according to previous data in sole 39,74 . Larvae were sampled at 24 and 48 h after the drug treatments, euthanized using MS-222, washed using diethylpyrocarbonated water, frozen in liquid nitrogen, and stored at − 80 • C until analysis. The study has been carried out in accordance with EC Directive 86/ 609/EEC for animal experiments and Spanish regulations on animal welfare. All procedures were approved by the Animal Ethics Committee of IFAPA.
RNA preparation. Pools of larvae (n = 5) from each tank (n = 3) were homogenised using the Fast-prep FG120 instrument (Bio101) and Lysing Matrix D (Q-Bio-Gene) for 40 s at speed setting 6. The numbers of embryos/larvae in the pools were always similar between conditions and replicates. Total RNA was isolated using the RNeasy Mini Kit (Qiagen) and treated twice with DNase I using an RNase-Free DNase kit (Qiagen) for 30 min to avoid genomic DNA contamination. Illumina libraries were constructed using mRNA-Seq sample preparation kit and sequenced using TruSeq SBS Kit v3-HS, in paired end mode, 2 × 76 bp in a fraction of a lane (1/6) of a HiSeq2000 sequencing system (Illumina, Inc) following the manufacturer's protocol as previously reported 7 .
Gene expresion analyses. Useful reads from Supplementary File 3 were mapped onto a reference transcriptome using Bowtie2 with default parameters and -no-unal -no-mixed options to only retain appropriately mapped paired-end reads in the SAM file. The SAM file was sorted and converted to BAM by SAMtools sort and reads mapped per transcript were counted by sam2counts (https ://githu b.com/vsbuff alo/sam2c ounts ). Then, differential expression analysis was performed using an improved version of DEGenes Hunter 30 available at https ://githu b.com/seoan ezonj ic/DEgen esHun ter. The first script degenes_Hunter.R is for differential expression using edgeR, DEseq2, limma and NOISeq, and generates files with quality control, expression data and a final report (DEG_report.html) where analysis details and plots are given. A DET is qualified as "prevalent" when it is differentially expressed in all algorithms used, and "possible" when not all algorithms qualify it as DET. By default, thresholds for a DET are absolute fold-change (FC) > 2 and false discovery rate (FDR) < 0.05 in every algorithm. The output includes a mean logFC and a combined FDR. More detail is given at its GitHub page.
Functional interpretation is launched with script functional_Hunter.R based mainly on topGO and clusterProfiler 75 and then a final report (functional_report.html) is generated. Gene Ontology terms, and Reactome and KEGG databases for pathways are inspected to graphically display KEGG over-representation and clustering, GO over-representation analysis for the three hierarchies, and Reactome over-representation. SOLSEv5.0 DETs were interpreted using their zebrafish orthologues. Supplementary File 4 contains the most relevant results produced by DEGenes Hunter for the RA signalling study.
Additional biological interpretation of functional information based on Gene Ontology terms and KEGG/ BioCarta pathways of D. rerio orthologues were carried out using the ClueGO plug-in 76 for Cytoscape 76 to visualise functionally grouped terms. PCA analyses with normalised, prevalent DETs in at least one of the four comparisons were performed using ClustVis 77 . Venn diagrams were performed with InteractiveVenn at http:// www.inter activ enn.net.

Code availability
RNA-seq datasets are available at accession numbers SRR4897845, SRR1030352, SRR1282039, SRR2072478, DRR003148, SRR954861, SRR1282039, SRR100067, SRR2005826 as well as BioProjects 392999, PRJNA287107, 392587 and PRJNA392589. The genome draft of a female Senegalese sole is available at FigShare 33 ; it includes five files: Sosen1_genome_scaffolds.fasta containing every contig and scaffold identifier and sequence in fasta format; Sosen1_genome_annotation.gff3 corresponding to a tentative annotation of genome contigs and scalffolds using MAKER2 and transcript sequences in SOLSEv5.0 34 ; Sosen1_maker. transcripts.fasta containing the deduced transcripts from the gff3 annotation file; Sosen1_maker. proteins.fasta containing the deduced amino acid sequence for all deduced transcripts; and Sosen1_ maker.proteins_annotation.tsv containing more annotations for deduced transcripts and proteins, such as transcript and protein lengths, best UniProtKB orthologue with identity % and E-value, structural status, open reading frame location in the transcript, description, GOs, KEGG codes, InterPro IDs, Pfam, EC and Unipathway, as tab-separated values (tsv format). Full-LengtherNext can be executed at http://www.scbi.uma. es/fulll ength ernex t; instructions and off-line installation can be obtained in http://www.scbi.uma.es/site/scbi/ downl oads/313-full-lengt herne xt. TransFlow can be downloaded from https ://githu b.com/seoan ezonj ic/Trans