De novo assembly, characterization, functional annotation and expression patterns of the black tiger shrimp (Penaeus monodon) transcriptome

The black tiger shrimp (Penaeus monodon) remains the second most widely cultured shrimp species globally; however, issues with disease and domestication have seen production levels stagnate over the past two decades. To help identify innovative solutions needed to resolve bottlenecks hampering the culture of this species, it is important to generate genetic and genomic resources. Towards this aim, we have produced the most complete publicly available P. monodon transcriptome database to date based on nine adult tissues and eight early life-history stages (BUSCO - Complete: 98.2% [Duplicated: 51.3%], Fragmented: 0.8%, Missing: 1.0%). The assembly resulted in 236,388 contigs, which were then further segregated into 99,203 adult tissue specific and 58,678 early life-history stage specific clusters. While annotation rates were low (approximately 30%), as is typical for a non-model organisms, annotated transcript clusters were successfully mapped to several hundred functional KEGG pathways. Transcripts were clustered into groups within tissues and early life-history stages, providing initial evidence for their roles in specific tissue functions, or developmental transitions. We expect the transcriptome to provide an essential resource to investigate the molecular basis of commercially relevant-significant traits in P. monodon and other shrimp species.

variance in transcript clusters expressed across the different discrete larval stages, which appears to be strongly associated with larval development leading from embryo to post-larval stages.
The top 2,000 most variably expressed transcript clusters across all nine tissue types clustered into nine distinct groups using Pearson's correlation (Fig. 4). These groups aligned broadly with expression patterns identified to be unique to each tissues type. For example, group two comprised 208 clusters highly expressed in female gonad, which were mostly successfully annotated (81.8%) using the nrA database. Annotated transcripts included farnesoic acid O-methyltransferase (FAmET), phosphoenolpyruvate carboxykinase (PEPCK), glutathione peroxidase (GPx) and nasrat. Transcripts in each cluster and their annotation are detailed in Supplementary Table S2. Group four consisted of clusters expressed mainly in male gonad that were annotated relatively poorly (38.7%) with many (35.5%) not expressed in the early life-history stages (Table 3). Group nine was the largest and comprised 591 clusters that were mostly annotated (86.0%) and expressed predominantly in muscle tissue. Group seven consisted of 533 clusters that were also mostly annotated (85.7%) and expressed predominantly in hepatopancreatic tissue. Except for male gonad, most clusters expressed in adult tissue types were also expressed in the early life-history stages.
The same top 500 most variably expressed transcript clusters in the different larval and post-larval stages used for the PCA broadly clustered into nine distinct groups based on Pearson's correlation (Fig. 5). Irrespective of the annotation success, the analysis identified transcript clusters that shared similar expression patterns across developmental stages. Embryos and nauplii expressed a set of genes that were not expressed during any other developmental stage (groups 7 and 8). Of the 140 genes expressed exclusively within the embryo and nauplii stages (group 8), only 24.3% and 37.1%, respectively, were annotated successfully using the SWISS-PROT or nrA databases (Table 4). Of the transcript clusters that were annotated, 13 encoded orthologs of the neurotrophic factor spaetzle and another 13 encoded orthologs of cuticular proteins. Transcripts in each cluster and their annotation are detailed in Supplementary Table S3. Two large clusters of genes were expressed from zoea throughout each subsequent stage (group 1), or from mysis throughout each subsequent stage (group 4). A high percentage (61.2% and 83.1%) of transcripts in these two clusters was annotated. Since each larval stage was sequenced as a pool of individuals, differential gene expression (DGE) analysis could not be performed.
Identification of long non-coding RNAs. We used the set of 1,047 complete USCOs as the training set for classification of coding and non-coding transcripts. It was determined that a coding potential of 0.2642 was the appropriate threshold to balance classification specificity and sensitivity. In total 79,656 transcripts were classified as lncRNAs and the remaining 154,893 transcripts were classified as mRNAs.
Comparing the lncRNA annotation with the BLASTx annotation, out of the 236,388 contigs 67,960 were uniquely identified as lncRNA, while 13,535 contigs were annotated both as mRNA and lncRNA. At a cluster level, 12,079 out of 58,768 larval clusters (22.6%) and 23,645 out of the 99,203 tissue clusters (23.8%) were uniquely annotated as lncRNA. Detailed results of the lncRNA analysis can be found in Supplementary Table S4. KEGG pathway analysis. Annotated contigs were overlaid onto their respective biological pathways using the Kyoto Encyclopaedia of Genes and Genomes (KEGG) pathways. Genes involved in general eukaryotic cellular processes such as RNA replication (Fig. 6) and basal transcription factor sequences ( Fig. 7) were well represented in the P. monodon transcriptome. As expected, assignments to KEGG pathways in prokaryotes were rare, as were ribosomal RNA assignments. The various biological processes, metabolism and signalling cascades comprising all 235 KEGG pathways to which transcripts were assigned are detailed in Supplementary Table S5. Virus discovery. Interrogating the P. monodon transcriptome against the viral subsection of the non-redundant database using BLASTx assigned viral annotations to 12,744 contigs. Detailed information on the viral blast can be found in Supplementary Table S6. Closer inspection of the data identified the vast majority (>99.8%) of these to represent short motifs conserved between eukaryote cell proteins and related homologs viruses with generally large and complex DNA genomes such as giant viruses, poxviruses, herpes viruses and was present in two out of 35 samples; however, further investigation confirmed that this was caused by a highly localized cross-contamination from other samples with a high content of a specific strain of DWV during sequencing. Lastly, over 1200 contigs with homology to phages were detected, some of which related to phage tail protein and tetracycline resistance.

Discussion
Here we report a comprehensive black tiger shrimp (Penaeus monodon) transcriptome assembled from nine tissues, four larval stages and four post-larval stages. The transcriptome was generated to expand the genetic resources available for this species to help investigate the genetic basis behind larval developmental stage transitions and tissue functioning, as well as traits with potential to be exploited commercially for the aquaculture of this and other shrimp species. The aim was therefore to generate a highly complete P. monodon transcriptome at the risk of it containing higher levels of transcript redundancy. This was confirmed by BUSCO results which demonstrated the transcriptome to be highly complete (C:98.2%) with low fragmentation (F:0.8%) or missing (M: 1.0%) genes but high levels of duplication (D:51.3%). These assembly statistics are comparable to those obtained by a transcriptome assembly from L. vannamei 15 (C:98.0%, F:0.7%, M:1.3%, D:25.5%), but greatly exceeded those of another P. monodon assembly focussing on gonadial tissue recently made available publicly 10 (C:33.7%, F:44.9%, M:21.4%, D:6.8%). As other recent NGS analyses of P. monodon have focussed on only one or two tissue types without including any larval stages or biological replicates, generated fewer total reads, or experienced data loss due to quality trimming of low quality reads or low mapping efficiencies [8][9][10][11] , these are likely to have missed many transcripts. In contrast, the sequencing and assembly strategy used here covered more tissue types at greater read depth and employed multiple de novo assembly tools to reduce assembler bias. Using the nrA database, 30.0% of transcript clusters found in the nine tissue types and 38.1% of transcript clusters found in the eight larval/post-larval stages analysed were successfully annotated. These annotation levels were comparable to those reported to date in similar studies on different crustaceans 8,15,24,31 . While transcript cluster annotation levels were lower using the SWISS-PROT database compared to the nrA database, the percentage of successful GO-term assignments was substantially higher. In addition to the annotations, analyses were undertaken to identify transcript clusters expressed differentially across tissue types or early life-history stages, irrespective of successful annotation. The identification was done to help provide initial evidence for transcript roles in specific tissue functions or developmental transitions. Despite all efforts made here to improve transcript annotation levels for P. monodon, our data reaffirms the need for dedicated functional studies to assign or confirm gene functions of both annotated and unannotated transcript clusters of non-model (crustacean) species.
To our best knowledge, to date only two Penaeid shrimp transcriptome assemblies have been made publicly available 10,15 , restricting comparative analyses of these transcriptomes. A reciprocal MegaBLAST identified 96.8% of the most recent P. monodon assembly 10 within the transcriptome described here, but only 40.0% of our assembly was found in the earlier assembly. These comparisons confirm that our transcriptome assembly contains many high quality P. monodon transcripts not discovered previously. When compared across species, a reciprocal MegaBLAST showed that the transcriptomes of P. monodon (present) and L. vannamei 15 shared approximately 48% of contigs. Since the assembly metrics of the L. vannamei transcriptome were similar to those of our P. monodon transcriptome, the low number of shared contigs could stem from considerable differences in transcript type or sequence composition between the two shrimp species. As comprehensive comparisons across crustacean species is currently impractical due to restrictions on publicly-available transcriptome assemblies, the potential value of this warrants effort to consolidate transcriptomic data and to establish both centralized and species-specific databases.
Read count data identified independent clusters of transcripts expressed uniquely within different tissues and clusters that formed distinct groups based on their tissue-specific expression patterns. An important consideration for this type of analysis is the normalized read count cutoff value for each cluster to be considered "unique", which was arbitrarily set at above 10 in a specific tissue and <10 in all others. At >100 normalized read counts,   only approximately half of the assigned unique clusters were retained, indicating that the expression levels of many of these potentially tissue-specific clusters was relatively low. Among the annotated transcript clusters most highly expressed in female gonad tissue were FAMeT, PEPCK, GPx and nasrat. Functional roles these proteins may play range from the shrimp moult cycle and reproduction 32 , the primary step of gluconeogenesis 33 , preventing oxidative stress 33 , to specifying terminal regions of the embryo 34 . Among the annotated genes expressed most highly in eyestalk tissue was hyperglycaemic hormone (CHH), a key neuropeptide hormone that regulates blood sugar, moulting and reproduction 35 . A subset of transcript clusters highly expressed in lymphoid organ tissue was also highly expressed in gill tissue, most likely due to high concentrations of haemocytes within both tissue types. The majority of genes expressed most highly in hepatopancreas were annotated, potentially reflecting the shared metabolic functions of this organ with those of other animals. Also of much interest were the non-annotated transcripts expressed uniquely in specific tissue types. For example, transcript clusters expressed highly in male gonad    were poorly annotated by both databases and included a large proportion of clusters, annotated or not, expressed exclusively in adult tissue types, indicating that male reproductive organs utilize many genes that remain poorly characterized. The grouping of genes with similar expression patterns broadly categorized these transcript clusters into potential functional groups within each tissue type, thereby guiding the selection for more targeted molecular function analyses. Based solely on gene expression patterns, the transcriptome data identified unique groups of transcripts involved in transitions between P. monodon early life-history stages. There was a major disparity between the annotation success of transcript groups upregulated in early or late stage embryogenesis, highlighting how poorly early developmental pathways have been characterized in crustaceans. Also of significance was the presence of orthologs of the Spaetzle gene, known in Drosophila flies to establish the dorso-ventral patterning of the early embryo 36 among transcript clusters detected consistently across later larval and post-larval stages. Since each larval and post-larval stage sequenced comprised a pool of several hundred individuals, quantitative and/or spatial transcript expression patterns would be required to draw further functional conclusions. Nevertheless, the data reported here will benefit from similar data on other shrimp and crustacean species, particularly for transcript clusters expressed exclusively in embryo with no significant homology to currently known genes.
Long non-coding RNAs (lncRNA) are a type of transcript that have many common features with traditional coding mRNA, including 5′ capping, splicing and 3′ polyadenylation [37][38][39] . The nature of lncRNAs is still poorly understood, and it is likely that lncRNAs are in fact a heterogeneous group of transcripts with regulatory functions that are not actively translated into proteins 40 . Thus, their main characteristics are the lack of open reading frames (ORFs) or the presence of non-canonical ORFs in the mature transcript. The biological roles of lncRNAs range from regulation of gene expression, and control of translation, to imprinting. As such, they have been linked to X chromosome inactivation in humans 41 , genomic imprinting 42 and cancer 43,44 .
Due to the lack of a known lncRNA database in shrimp that can be used for their identification, we used FEELnc which scores each transcript according to its coding potential and then selects a threshold score to classify the transcripts into coding or non-coding 45 . This software is particularly useful for non-model species because in the absence of an lncRNA training set, it generates a simulated training set using debris from high confidence coding transcripts. In fly data, this approach showed an MCC value of 0.754 with an accuracy of 0.868 45 .
In this study, 79,656 transcripts were classified as lncRNAs, of which 67,960 (85.3%) could not be aligned to any protein database. As expected, the use of a non-model organism and the lack of a set with known lncRNA for training led to the ambiguous classification of 13,535 transcripts with low protein-coding potential but clear alignments to known proteins in curated databases. Classification of these transcripts is the first step towards understanding their roles in the development and regulation of gene expression in Penaeus monodon.
Annotated transcript clusters mapped into 235 KEGG pathways (Supplementary Table S3), which have been broadly classified into functional groupings such as general metabolism (e.g. TCA cycle, xenobiotic metabolism, immunity, reproduction), nutritional metabolism (e.g. proteins, lipids, carbohydrates, vitamins), cellular processes (e.g. DNA replication, protein trafficking, apoptosis), biological processes (e.g. circadian rhythm, olfaction and taste, digestion and absorption) and signalling pathways (e.g. PI3K-Akt, MAPK, axis formation, TGF-beta). In general, core pathways such as citrate cycle, oxidative phosphorylation, ribosome biogenesis and RNA/DNA polymerases were better represented than more specific pathways such as the pentose and glucuronate interconversion pathway, or the ascorbate and aldarate metabolism pathway. Furthermore, arthropod specific pathways were generally better represented. For example, the general circadian rhythm pathway was missing several homologs, while the fly specific circadian rhythm pathway was complete. This could be explained by transcripts not sharing sufficient homology with the known genes used for the KEGG analysis and therefore failing to be annotated.
Particularly for those pathways highly-conserved among other eukaryotes, the existence of unique transcripts suggests that Penaeid shrimp and possibly crustaceans in general might use metabolic mechanisms differing from eukaryote species studied to date. Their existence also highlights the need for high-quality genome assemblies for shrimp and other crustacean species, overlaid with isoform, tissue-specific and developmental stage transcript expression data, to either help predict gene functions or direct gene knockdown studies, using RNA interference processes as an example, to empirically ascribe functions to novel genes.
Several RNA transcripts and/or genome sequences likely to be from viruses were discovered in the P. monodon transcriptome. This was not unexpected considering that it was generated from multiple individuals, tissue types and larval/post-larval stages, as shrimp are co-infected commonly with multiple viruses and as there are several viruses known to be endemic in P. monodon populations indigenous to different regions of Australia [46][47][48][49] . The presence of near full-length ssRNA genome sequences for viruses such as gill-associated virus (GAV, 26,235 nt) and and two sequences (deposited on NCBI: OM219076 and OM219077, cumulative length of 10,542 nt ) with high similarity to Whenzhou shrimp virus-2 L and M segments (When-2, KM817720.2 and KM817687.1) provided additional validation of the methods used to synthesize and assemble the transcriptome, and to its completeness as demonstrated by various metrics measuring the nature and number of endogenous gene transcripts. The detection of a ssDNA virus, hepandenovirus, within the transcriptome, presumably detected in a replicative phase, indicates the application of this technique as a tool to also detect the presence of viruses with DNA genomes.
In addition to known endemic viruses, the transcriptome contained full-length or near full-length RNA transcripts related closely to the recently-described shrimp viruses When-2 and When-8 50,51 unknown until now to occur in Australian P. monodon. A couple of long transcripts of suspected viral origin and expressed across multiple tissue types were also identified. One of these possessed significant BLASTx homology to the reverse transcriptase (RT)-like component of hypothetical protein 1 of Beihai picorna-like virus 116 discovered recently in blue swimmer crabs (Portunus pelagicus) 51 . The other possessed substantial homology to the RT component of the Mo-MuLV Pr180 polyprotein and was expressed across all tissue types except the lymphoid organ, suggesting it to be from a mobile element such as a poly(A)-type retrotransposon or retrovirus 52 . However, determining whether these transcripts containing RT sequences are viral in origin, or represent the products of endogenous retrotransposons like others now being reported in shrimp 53 will require further investigation, as will the nature of the strains, host and distribution ranges, prevalence and potential pathogenicity of the new viruses discovered in the transcriptome.
In conclusion, this study describes the assembly of a comprehensive and high quality transcriptome from nine different tissue types, and eight larval and post-larval early life-history stages of the black tiger shrimp, Penaeus monodon. It also summarizes the number and nature of specific transcript clusters differentially expressed in different tissue types and larval stages, and the Clusters were functionally annotated and mapped to 235 KEGG pathways. Unique transcript clusters and cluster groups were defined across distinct tissues and early life-history stages, providing initial evidence for their roles in specific tissue functions or developmental transitions. The current transcriptome provides a valuable resource for further investigation of directing gene-function studies to increase basic functional biology knowledge in shrimp and for investigating molecular basis of traits of relevance to the aquaculture of shrimp. While the current transcriptome already provides an improved resource for P. monodon, further effort is required using long-read sequencing data, such as provided by PacBio, to better resolve genes at isoform level. Lastly, this high-quality de novo assembly and data set are publically available and will hopefully support research projects that underpin transformational advances in how we culture shrimp globally.

Material and Methods
Sample taking and RNA extraction. Tissues of P. monodon broodstock were collected from multiple intermolt individuals, immediately snap frozen on dry ice, and stored at −80 °C until extraction ( Table 1). All tissues except lymphoid organs were collected from wild broodstock caught off coastal waters near the border between the Northern Territory and Western Australia provided, which were provided by a commercial hatchery at Flying Fish Point, North Queensland, Australia. The prawns were kept at a salinity of 27-35 ppt, pH 7.8-8.0, 28.5-29.5 °C and 5 to 7 ppm dissolved oxygen. Lymphoid organ tissue was collected from wild prawns caught off the East Coast of Queensland. Larval and post-larval stages were collected from the same hatchery in pools of approximately 400 individuals per life stage, after four hours of starvation, and preserved in RNAlater (Thermo Fisher Scientific). All tissues and early life-history stages were sub-sampled in an RNase-free laboratory and total RNA was extracted using an RNeasy Universal extraction kit (QIAGEN) following manufacturer's instructions. RNA quantity and quality was estimated using a Nanodrop UV spectrophotometer (Thermo Fisher Scientific), and purity was further assessed using an Agilent Bioanalyzer (Agilent Technologies). RNA was selected from individual sample replicates based on Nanodrop spectra, RNA concentration, and Agilent Bioanalyzer traces, in preference to using comparative tissues from the same individuals.
Illumina library preparation and sequencing. Library preparation and sequencing was carried out at the Australian Genome Research Facility (AGRF). Upon arrival at the sequencing facility, the quality of the samples was checked using a Bioanalyzer RNA 6000 nano reagent kit (Agilent) and libraries were prepared using the TruSeq Stranded mRNA Library Preparation Kit (Illumina) according to established protocols. Final libraries were again checked using Tapestation DNA 1000 TapeScreen Assay (Agilent). Cluster generation was performed on a cBot with HiSeq PE Cluster Kit v4 -cBot and sequencing was done on a HiSeq 2500 using a HiSeq SBS Kit. The Hiseq 2500 was operating with HiSeq Control Software v2.2.68 and base-calling was performed with RTA v1.18.66.3. Samples in the second sequencing run were pooled and split across two lanes to reduce sequencing bias (Table 1).
Sequence quality control, assembly and annotation. Raw sequence data was quality checked using FastQC 54 v0.11.5, and assembled loosely following the Oyster River Protocol for Transcriptome Assembly 55 . In brief, all sequences were collectively error-corrected using RCorrector 56 V3. Samples were then assembled in Trinity 57 V2.3.2; grouped by individual shrimps, i.e. all tissues from a specific shrimp were assembled together. Reads were trimmed harshly for adapters and softly for Phred score <2 using Trimmomatic 58 V0.32; and then normalized in silico within Trinity. The normalized forward and reverse reads produced by Trinity were then used in BinPacker 59 V1.0, IDBA-Tran 60 V 1.1.1 using K20, K30, K40, K50 and K60; and Bridger 61 version 2014-12-01. All resulting transcriptomes were concatenated and merged using Evidential Gene 62 , followed by clustering using Transfuse V0.5.0 (https://github.com/cboursnell/transfuse) using a similarity value of 0.98. Lastly, contigs <300 bp were removed to produce the final transcriptome. The quality of the final assembly was assessed using TransRate 63 V1.0.1, and BUSCO 64 V2 using the arthropoda_odb9 database 65 . Sequences were annotated in Blast2Go 66 using the SWISS-PROT database 67 (accessed 17/03/2017), and separately using the arthropod and viral subsections of the GenBank nr database (accessed 06/06/2017).
Identification of long non-coding RNAs. FEELnc 45 was used for the identification of long non-coding RNAs. The coding transcripts training set was constructed from the 1,047 complete universal single copy orthologous genes found with BUSCO v2.0 (database arthropoda_odb9 65 ). The mode "shuffle" was used to generate a training set of lncRNA from the debris of the known coding RNA transcripts.
Mapping and differential gene expression analysis. Before mapping, error-corrected raw sequence reads were trimmed using the same parameters as before, but without palindrome trimming used by Trinity. Sequence reads were mapped using Bowtie2 68 V2.2.8, and read counts were calculated using Corset 69 V1.0.6. Differential gene expression was analyzed using DESeq2 70 V1. 16 To reduce the number of sequences for KEGG analysis [73][74][75] , the longest contig per cluster was chosen from the combined tissue type and early life-history stage data. The KEGG Automatic Annotation Server (KAAS, http://www.genome.jp/tools/kaas/) was used to generate KEGG pathway maps for each contig using BLAST with the single-directional best hit (SBH) method. All scripts can be found on GitHub at https://github. com/R-Huerlimann/Pmono_multitissue_transcriptome.

Statistical analyses.
For data analysis, the top 2,000 variably expressed genes across the nine tissue types and the top 500 variably expressed genes across the four larval and four post-larval stages were visualized in a principal component analysis and heatmap using variance-stabilizing transformed read-count data from DESeq. 2. The gene level dendrograms in the heatmap were created using Pearson's correlation for both the tissue type larval/post-larval stages. Euclidean distance was used to cluster tissue types. All statistical analyses were performed in RStudio. More detailed information on the analyses can be found on GitHub.
Ethical approval. This study has been carried out abiding by all necessary Queensland Government legislation and James Cook University policies.