De novo assembly and characterization of the liver transcriptome of Mugil incilis (lisa) using next generation sequencing

Mugil incilis (lisa) is an important commercial fish species in many countries, living along the coasts of the western Atlantic Ocean. It has been used as a model organism for environmental monitoring and ecotoxicological investigations. Nevertheless, available genomic and transcriptomic information for this organism is extremely deficient. The aim of this study was to characterize M. incilis hepatic transcriptome using Illumina paired-end sequencing. A total of 32,082,124 RNA-Seq read pairs were generated utilizing the HiSeq platform and subsequently cleaned and assembled into 93,912 contigs (N50 = 2,019 bp). The analysis of species distribution revealed that M. incilis contigs had the highest number of hits to Stegastes partitus (13.4%). Using a sequence similarity search against the public databases GO and KEGG, a total of 7,301 and 16,967 contigs were annotated, respectively. KEGG database showed genes related to environmental information, metabolism and organismal system pathways were highly annotated. Complete or partial coding DNA sequences for several candidate genes associated with stress responses/detoxification of xenobiotics, as well as housekeeping genes, were employed to design primers that were successfully tested and validated by RT-qPCR. This study presents the first transcriptome resources for Mugil incilis and provides basic information for the development of genomic tools, such as the identification of RNA markers, useful to analyze environmental impacts on this fish Caribbean species.

Industrialization and urbanization are the main causes of aquatic ecosystem degradation, discharging a wide variety of pollutants that promote deleterious effects on organisms, with typical responses such as inhibition of growth, alterations in sexual maturation (reproductive capacity) and immunodeficiency [1][2][3] . The problem is magnified when aquatic organisms like fish, accumulate pollutants directly from water or sediment, and indirectly through the food chain 4 . Therefore, fish can serve as useful indicators of aquatic pollution as they play multiple roles across the trophic web, bioaccumulate chemical substances, and respond to low concentrations of toxicants 5,6 .
Mugilidae (mullets) is a fish family widely distributed in tropical and subtropical waters around the globe, particularly in coastal and estuarine areas where they play an important ecological role, and provide biomass to support fisheries [7][8][9] . Over the past years, Mullets have been proposed as pollution bioindicators for environmental degradation 10 . Mugilids, in particular species of genus Mugil, such as Mugil cephalus and Mugil incilis, have been extensively employed on environmental monitoring programs, as well as in toxicological studies in coastal zones impacted by human activities 1,[10][11][12][13][14][15][16] . The use of these mugilids as sentinel species in these coastal systems arises from their wide geographical distribution; great abundance, salinity tolerance and bioaccumulation of landbased pollution, a feature largely enhanced by their consumption of benthic sediments along with their food 1,17 .
Mugil incilis, also known as mullet, lisa and lisa rayada, is one of the most abundant fish in the Caribbean. It is found in the western Atlantic Ocean, from Panama and Haiti to southeastern Brazil 18 . It can be found both in brackish estuaries and in marine and hyper-saline waters 19 . Juvenile fish (< 25 mm) are primarily planktonic or carnivorous feeders, whereas larger specimens switch their diet to detritus and benthic microalgae, ingesting large www.nature.com/scientificreports/ the software MEGAN, contigs were filtered according to its classification and only those belonging to Chordata were kept for further analysis. This filtered contig dataset that includes 40,296 contigs was considered from Mugil incilis (Table S1, Fig. S1). The completeness of the M. incilis transcriptome assembly was quantified by comparing against a dataset of 3,640 actinopterygii single-copy orthologs (actinopterygii_odb10) using the BUSCO gene mapping method (BUSCO, RRID:SCR 015,008) for transcriptome assembly validation. This assessment reveals 61.0% completeness with the BUSCO Vertebrata gene set (48.9% complete single-copy BUSCOs, and 12.1% complete duplicated BUSCOs), 8.8% fragmented BUSCOs, and 30.2% of the 3,640 single-copy orthologs were classified as missing from our assembly (Table S2).
Go annotation of the lisa liver transcriptome. GO terms were assigned to assemble contigs and provided defined ontologies to express gene product properties. Among the 40,296 contigs detected in the liver transcriptome, a total of 7,301 sequences had GO term assigned, which were subsequently categorized into 93 functional groups (Fig. S3). GO analysis assigned 4,695 (40%) contigs to biological process, 5,069 (44%) to cellular component, and 1869 (16%) to molecular function. Briefly, for biological processes, genes involved in cellular process (GO:0009987) and multicellular organismal process (GO:0032501) were the most dominant categories; for molecular functions, binding (GO:0005488) were the most represented GO term, followed by catalytic activity (GO:0003824); while the principal categories observed in cellular component were cells (GO:0005623) and organelles (GO:0043226). We also identified genes involved in other major functions, such as developmental and metabolic processes, biological regulation, response to stimulus, and localization. The top 10 groups in the three main categories are shown in Fig. 1.
Validation of primers for real-time qpcR assays. The de novo sequencing of the hepatic transcriptome of M. incilis generated sequence knowledge for a species with virtually no available genomic information. PCR primers of a set of known target genes related to responses to xenobiotic exposure were generated from the transcriptome sequences. Twenty primers pairs were analyzed, including gene markers of heavy metal exposure, xenobiotic metabolism, nuclear receptor modulation, oxidative stress, DNA damage, inflammation, and lipid metabolism. Accession numbers and primer sequences for the target genes are listed in Table 2.
The C T (Threshold cycle) is the cycle number at which the fluorescence generated within a reaction crosses the threshold line. When C T values were used to generate a log-linear regression plot, the standard curve for the target genes showed a strong relationship (R 2 ranging from 0.95 to 1; PCR efficiency ranging from 89 to 116%). A correlation coefficient greater than 0.99 shows good primer efficiency, and indicates a successful real-time PCR experiment. It is considered that a slope of the regression curve between − 3.9 and − 2.9 corresponds to PCR efficiencies ranging from 80 to 120% 40,41 . For all the genes evaluated in this work we found efficiency percentages ranging from 90 to 110%. A melting curve analysis showed no indication of primer-dimerization for the tested candidate oligonucleotides. PCR efficiencies for each primer pair are shown in Table 3. Furthermore, the primers specificity for target genes was confirmed by gel electrophoresis of the PCR products. All the primer sets amplified one single band of the expected fragment, and most of them showed specificity for targets from M. incilis. Some of the primers, including housekeeping genes, amplified fragments from related fish species such as P. magdalenae and H. malabaricus. The electrophoresis of PCR products for the evaluated genes is presented in Fig. 3.

Discussion
The accelerated development of high-throughput sequencing technology has revolutionized our way of studying genetic diversity and gene expression. In this road, the RNA-Seq technology has become the method of choice for the study and characterization of dynamic transcriptomes, and the de novo transcriptome assembly provides a powerful research tool for differential transcriptome analysis in fish species 42 . In the past several years, transcriptomic studies in teleost fish have increased, leading to a tremendous pool of genetic knowledge (e.g. see database FISHIT [https ://www.fish-it.org/hcmr] for 20 transcriptomes). However, although Colombia is considered one of the most biodiverse countries in the world, there is little knowledge on fish species commonly found in its tropical ecosystems. The assembly of the transcriptome of Mugil incilis opens a large window of opportunities to develop toxicological studies with this organism, which can be found along the Caribbean. www.nature.com/scientificreports/ Some reports on transcriptome data for Mugilids have been focused on population genetic structure 43 and innate defense of the host against pathogens 44,45 . In this study, the transcriptome of hepatic tissue from M. incilis was generated using Illumina HiSeq 2000 sequencing platform. A total of 93,912 contigs were assembled with an average length of 974 bp, a greater value than those obtained in previous studies using other de novo assembly methods for teleosts, such as viviparous blenny (Zoarces viviparus; 395 pb) 46 , Guppy (Poecilia reticulata; 464 pb) 47 , European eel (Anguilla; 531 pb) 48 and common carp (Cyprinus carpio; 888 pb) 49 . All this transcriptome sequencing data may be sufficient and necessary to discover new genes and create new tools for the analysis of gene function and expression in M. incilis. www.nature.com/scientificreports/ To evaluate the quality and completeness of our transcriptome assembly we used the N50 reference-free metric and the reference-based BUSCO metric, which give us an idea of how good our assembly is in terms of continuity and coverage of the contigs respectively. The N50 measure showed that 50% of total sequence length is contained in sequences equal or greater than 2,019 bp, which is not a short length for the transcripts in our transcriptome assembly 47,49,50 . On the other hand, the BUSCO analysis showed that most of the Actinopterygian single-copy orthologs (61%) were found completely in our assembly and 30.2% of the 3,640 orthologs were classified as missing from our assembly. This seems to be a relatively good level of completeness, when comparing with other assemblies reported for this class of organisms, especially if we consider that BUSCO recovery tends to be higher when multiple tissues and different stages are used to generate the assemblies, and our assembly was made from a single tissue of a single specimen 50,51 . Furthermore, only 8.8% of the orthologs were recovered partially, indicating a low level of incomplete transcripts. We also ran a BUSCO analysis using the Vertebrata and Eukaryota datasets (see Table S2 in supplementary material), which showed a greater percentage of complete BUSCOs recovery (68% and 92.1% respectively) which may reflect evolutionary innovations in the Mugil genus.
Homology search for the clustered sequences was carried out using BLASTX against the NCBI non-redundant (Nr) database. The distribution of top hit species suggested that most of the annotated sequences corresponded to known nucleotide sequences of a small group of fish species, S. partitus, A. ocellaris and L. crocea (Fig. S2). It was not surprising that M. incilis contigs shared the highest percentage with S. partitus, as both of them belong to the order Perciformes, although such finding may be due to the availability of the complete genome of S. partitus, providing sufficient gene sequences and annotations for comparison analyses. However, the transcriptomic similarities between M. incilis and S. partitus only reached 13.4%, suggesting these two fish species might be distantly related in subfamilies of Perciformes, Mugilidae and Pomacentridae for M. incilis and S. partitus, respectively.
The GO functional annotation indicated genes involved in cellular process and multicellular organismal were the most represented categories for biological processes. For cellular components, the major category represented was cell. For molecular function, binding was the most strongly represented GO term, followed by catalytic activity, results that can be useful for gene functional studies. Furthermore, 16,967 contigs were annotated to 6 categories in KEGG, including cellular processes, genetic information processing, environmental information processing, metabolism, human diseases and organismal systems. Signal transduction had the most number of genes (2,007) in environmental information. For metabolism, global and overview maps had the most number of genes (1,572), whereas for organismal systems, the immune system comprises the most number of genes (1,045) (Table S4). In the present study, PI3K-Akt, MAPK, Ras and Rap1 signaling pathways were found to be the most highly enriched. These pathways are ubiquitous and crosstalk in all eukaryotic cells. MAPK signaling, for example, regulates a wide variety of cellular processes, including proliferation, motility, differentiation, stress responses, and apoptosis 52 ; as well as cellular processes related to responses against diverse environmental stressors, such as heat, cold, UV radiation, reactive oxygen species, desiccation and pathogen attack 53 . The transcription of genes related to these signaling pathways suggested M. incilis is a sensitive model for physiological studies.
The 1,045 genes in the immune system were further grouped into 21 pathways (Fig. S4). NOD-like receptor signaling pathway and chemokine signaling pathway were the top two pathways with the greatest number www.nature.com/scientificreports/ of expressed genes. Nucleotide-binding oligomerization domain (NOD)-like receptors (NLRs) are a specialized group of cytoplasmic pattern recognition receptors (PRRs) that play an important role in the detection of various pathogens and regulation of innate immune responses. In fish, three distinct subfamilies of NLRs have been identified and characterized: NLR-A, NLR-B, and fish-specific NLR-C 54 . Expression of NLR-C subfamily (NLRCs) has been reported in many different teleost fish species such as zebrafish 55 , Japanese flounder 56 and turbot 57 . These receptors are activated by a variety of bacterial pathogens or microbial ligands, including lipopolysaccharides (LPS), peptidoglycans (PGN) and polyinosinic-polycytidylic acid (Poly (I:C)), suggesting their participation in the fish innate immune response 54 . The other signaling pathway with greater annotation was related to chemokines, a superfamily of small size chemoattractant peptides that provide directional cues for cell migration and positioning, fundamental for protective host response. The chemokine-mediated immune response to pathogens involves not only leukocyte migration, but also promotes differentiation of recruited cells to trigger the first steps of the innate and acquired immune response 58 . Other important pathways include Fc gamma R-mediated phagocytosis, natural killer cell mediated cytotoxicity, leukocyte transendothelial migration, and T cell receptors, also critical in immune response. As M. incilis has been found parasitized with Contracaecum sp. and Ascocotyle (Phagicola) longa in the Colombian Caribbean 21,26,27 , the characterization of these pathways will be an essential tool to study the interaction fish-parasites in this species, especially considering the high selectivity obtained for designed primers (Fig. 3). The transcriptome assembly of M. incilis generated in this work, as well as the PCR validation for several signaling pathways, will allow further molecular toxicology research on this species, as this fish has been used to monitor different chemical and biological pollutants in the Caribbean. The highly parasitized status of the fish, together with the different pollutant levels it may receive, creates a unique opportunity to study host-parasite interactions, and how environmental contaminants modulate this process.  MH716024 TGG AGA TAA CAC GAA CGG GTG  ATT GGG GCC AGC GTG ATT C  76   Catalase  CAT  MH716025 AAA GAG TGG TGC ATG CCA AG  TGC CGA CAT GTT CAA ACA CC  102   Proteína de shock térmico de 70 KDa  HSP70  MH716026 CTG GAT GTG ACC CCC TTG TC  GTT GGG ATG GTG GTG TTC CT  86   Factor Inducible por Hipoxia-1, Subunidad Alfa  HIF1A  MH716027 TAG TTG AGA GCA GCC AAC CC  CAG CTC TGC ACA CTT GTC CT  80   Exposure to xenobiotics   Cytochrome P4501A  CYP1A1  MH716028 TGT TCT TGG CCA TCC TCG TC  CAG TGG TCA TCC CTC ACT CG  142   Cytochrome P450,  Sequence data processing and de novo assembly. Read quality was controlled using trimming threshold of Q30 bases with the program PRINSEQ 59 . After the quality filter, reads with less than 50 bases were rejected. The assembly step was carried out with the program Trinity (https ://trini tyrna seq.sourc eforg e.net) employing default settings 60 . After the assembly, only the contigs longer than 200 bases were kept.
Functional contigs annotation and classification. Sequencing data were aligned using BLASTx alignment (E-value cut-off of 10 −5 ) with protein databases, including NCBI non-redundant (Nr) database and KEGG database. BLASTx comparisons and the software MEGAN were used to filter contigs that belong to Mugil incilis.
Since Mugil sequences are scarce in the database, all sequences classified as Chordata were considered as of Mugil origin. Fish (Chordata) filtered contigs were loaded into the eggNOG database (https ://eggno g.embl.de) in order to obtain general gene ontology (GO) annotations. The contigs/transcripts annotation was done assigning the GO terms according to three main categories: molecular function; cell component; and biological process 61 , and the gene ontology annotation results were plotted using the WEGO software at GO level 2 with default parameters (https ://wego.genom ics.org.cn). The assignment of KEGG pathways to the transcripts was done through the online KEGG Automatic Annotation Sever (KAAS; https ://www.genom e.jp/kegg/kaas) using the bi-directional best-hit method. Several contigs/transcripts of the stress responses/detoxification and immune system www.nature.com/scientificreports/ were selected and annotated using the BLASTx results and the KEGG annotation. CDS prediction was carried out manually using Artemis visualizer (Sanger) 62 .   (2)(3)(4)(5)(6)(7)(8)(9), P. magdalenae (11)(12), H. malabaricus (13)(14) and M. musculus (15). All the gels were run under the same experimental conditions and are presented using cropped images. The entire gel photos of all genes evaluated can be found in Figure S5. www.nature.com/scientificreports/ PPARA and PPARG). Primers for each target gene were designed with "primer-BLAST" tool from Genbank (https ://www.ncbi.nlm.nih.gov/tools /prime r-blast ) using the transcript sequences obtained from the transcriptome sequencing. The specificity of primers PCR was checked at the end of each amplification using a DNA melt curve analysis and by the observation of single amplification products of the expected size on agarose gels. Amplification efficiency was determined using a five-point 1:10 dilution series starting with cDNA corresponding to 100 ng of input total RNA 63 . Samples of cDNA from other fish species (Prochilodus magdalenae and Hoplias malabaricus) different from the mullets were included, as well as a sample of cDNA from Mus musculus, in order to verify the specificity of the primers. The specific primers used for RT-qPCR are listed in Table 2.

Design of primers and validation
The total RNA was isolated from liver tissue using RNeasy®Mini Kit (Qiagen, California, USA) as described by the manufacturer. The RNA was quantified by spectrophotometry (A260) using a NanoDrop 2000 Spectrophotometer (Thermo Scientific), and the A260:A280 ratio (1.9-2.0) was calculated to verify the RNA purity. Furthermore, the integrity of RNA was assessed by visual inspection of 28S and 18S ribosomal RNA after electrophoresis on an agarose gel. One microgram of total RNA was reverse transcribed, for each sample, using QuantiTect® Reverse Transcription Kit (QiagenInc, Valencia, CA, USA). The complementary DNA was used as template in a 20 μL PCR reaction containing 10 pmol each of forward and reverse gene-specific primers and 2 × SYBR® Green qPCR Master Mix (Thermo Scientific). Real time-PCR reactions were run in a StepOnePlus Real-Time PCR System (Applied Biosystems, Foster City, CA), under the following conditions: Initial denaturation and enzyme activation for 10 min at 95 °C, followed by 40 cycles of 95 °C for 15 s (denaturation) and 60 °C for 1 min (annealing/extension). A melt curve analysis was performed after the 40th cycle (10 s at 95 °C, then 65-95 °C in 0.5 °C increments every 5 s). A negative control (without DNA template) was run in duplicate on each plate to verify the absence of cDNA contamination 64 . ethical statement. Protocols for the use of animals in this study were reviewed and approved by the Ethics Committee of the University of Cartagena (Act No. 123; 30-07-19). The field study did not involve endangered or protected species, and it was granted with permission to access genetic resources from the Ministry of Environment, Housing and Territorial Development from Colombia (Contract No. 148; 09- [11][12][13][14][15][16][17]. Conducted experiments were carried out in accordance with the International Guiding Principles for Biomedical Research Involving Animals (https ://www.cioms .ch/frame 1985t extso fguid eline s.html).
conclusion This is the first report on the sequencing, de novo assembly, and functional annotation of the Mugil incilis transcriptome. The identification of the CDS sequences for several of these candidate genes will help developing primers for qPCR, which would be used to assess gene expression levels in mugilid species.

Data availability
The raw reads produced in this study were deposited in the NCBI database Sequence Read Archive (https ://www. ncbi.nlm.nih.gov/Trace s/sra) under the accession number SRP157095. This Transcriptome Shotgun Assembly project has been deposited at DDBJ/ENA/GenBank under the accession GHNU00000000.