The black tiger shrimp Penaeus monodon belongs to the family Penaeidae and is the second most widely farmed shrimp species globally1. However, disease and limited progress in domestication and selective breeding of P. monodon continue to hamper further expansion of the industry2. Modern genomic technologies have significant potential to advance selective breeding programs; however, they require complete, well annotated tissue-specific transcriptomic and genomic datasets. In addition to assisting in genome assembly and creating linkage maps3, a complete transcriptome provides a potential resource for focussed differential gene-expression studies4, genome annotation5, single nucleotide polymorphism discovery6 and genome scaffolding7.

While genomic resources for Penaeid shrimp are increasing, they remain limited for many species, including P. monodon. Previous research has focussed on hepatopancreas, ovary, heart, muscle and eyestalk tissues8,9, in male and female gonads10, and in response to infection with Vibrio bacterial species capable of inducing acute hepatopancreatic necrosis disease11. In addition to such differential gene-expression studies, genomic data from next generation sequencing (NGS) methods has expanded greatly in recent years, particularly in the study of Pacific white shrimp (Litopenaeus vannamei)3,6,12,13,14,15,16,17,18,19,20,21,22,23. Moreover, a transcriptome based on eight tissues was assembled for the less well studied banana shrimp Fenneropenaeus merguiensis24, and genes involved in early embryonic specification have been studied in Marsupenaeus japonicus25. Transcriptomics has also been applied to Penaeus merguiensis26,27,28 and the Chinese white shrimp Fenneropenaeus chinensis29,30 to investigate aspects of tissue-specific expression, stress tolerance and viral infection. Despite these advances, a comprehensive transcriptome from diverse tissue types and early life-history stages of P. monodon remains unavailable.

In order to address this deficiency, we report a highly complete transcriptome for P. monodon that can be used as a broad basis for future genomics research. To this effect, we sequenced three replicates each from nine different tissues types (eyestalk, stomach, female gonad, male gonad, gill, haemolymph, hepatopancreas, lymphoid organ and tail muscle) and one pooled replicate each from four larval stages (embryo, nauplii, zoea, and mysis) and four post-larval stages ranging from days 1, 4, 10 and 15. Additionally, transcript expression profiles unique to each type and stage were determined, as well as identifying putative long non-coding RNA and transcripts originating from viruses.


Sequence read data and code availability

In total, nine tissues were sequenced in biological triplicates, as well as pools of eight early life-history stages, resulting in an average of 19.9 M ± 1.6 M (mean ± SD) read pairs per sample and 697 M reads in total (Table 1). After quality trimming, 99.5% ± 0.6% (mean ± SD) of reads were retained, indicating a high quality data set (>90% reads with ≥Q30). All read data are available on GenBank through the project ID PRJNA421400.

Table 1 List of shrimp tissue types and early life-history stages used for transcriptome sequencing.

Transcriptome assembly and quality control

The initial combined outputs of all four assemblers comprised of 6,113,055 contigs, which were reduced to 462,772 contigs after filtering with Evidential Gene and combining both “okay” and “alternative” contigs. After clustering with Transfuse, the final assembly consisted of 236,388 transcripts with an assembly size of 226 Mb. These, together with transcript annotations, are available on GenBank. The final transcriptome had a high TransRate score of 0.37, with 88% of all reads successfully mapping back to the transcriptome, and only 3.2% of bases being uncovered. Based on BUSCO, the transcriptome was highly complete with 98% of arthropod ortholog genes being present, and few fragmented or missing genes; however, 51% of the contigs were duplicated/redundant (C:98.2%[S:46.9%, D:51.3%], F:0.8%, M:1.0%, n:1066).

Annotation and gene ontology mapping

Annotation against the SwissProt database using BLASTx resulted in 47,871 successfully annotated contigs. Of these, 46,977 were successfully GO mapped, of which 41,069 were completely annotated. The top-hit species distribution was dominated by Homo sapiens with over 10,000 hits, followed by Drosophila melanogaster with just over 8,000 hits; no shrimp species made it into the list (Fig. 1A). GO terms for biological processes, molecular function and cellular components were all highly represented in annotated genes (Fig. 2).

Figure 1
figure 1

Species distribution of successfully annotated sequences across the top 29 species using the SwissProt (A) and arthropod subsection of the non-redundant (B) database.

Figure 2
figure 2

Distribution of sequence annotations based on the SWISS-PROT database across the top 20 GO terms at level 2. Determined across the entire dataset for Biological Process (BP, green), Molecular Function (MF, blue), and Cellular Component (CC, yellow).

The annotation against the non-redundant Arthropod (nrA) database using BLASTx resulted in 62,679 successfully annotated contigs, of which 48,456 had a successful GO mapping, and of which 25,201 were completely annotated. The top-hit species distribution was dominated by the freshwater amphipod Hyalella azteca with over 20,000 hits, followed by P. monodon with just over 2,500 hits (Fig. 1B). Other penaeid shrimp species included Litopenaeus vannamei, Marsupenaeus japonicus and Fenneropenaeus chinensis, which were the sixth, seventh and twelfth most highly represented species respectively.

Detailed information on the annotations can be found in Supplementary Table S1.

Sequence read mapping and differential gene expression analysis

Using Bowtie2, 67.4% ± 4.8% (mean ± SD) of the paired reads successfully mapped to the transcriptome. Using corset for read counting and additional clustering, the initial 236,388 contigs were placed into 99,203 transcript clusters for the nine tissue types and 58,678 transcript clusters for the eight early life-history stages (larval and post-larval stage). A total of 176,966 contigs were used in the clustering of tissues and larvae, with 113,435 shared contigs, 8,188 contigs unique to larvae and 55,343 contigs unique to adult tissues.

Different tissue types expressed between 9,939 and 12,255 transcript clusters (defined as >50 normalized read counts per cluster), and between 17 and 316 unique sets of transcript clusters (defined as a cluster with >10 normalized read counts and <10 normalized read counts in all other tissue types) (Table 2). The ability to annotate transcript clusters varied across tissue types (63.0% to 85.9%). In terms of unique tissue specific transcript clusters, hepatopancreas contained the largest number (316), followed by female gonad (161) and gill (153). Annotation rates of these unique tissue-specific clusters were markedly lower (12.5% to 66.8%) than with clusters shared across all tissue types (82.5% and 85.9%)

Table 2 Numbers of transcript clusters and cluster annotation rates across transcriptomes determined for the nine adult P. monodon tissue types analysed.

A principal component analysis (PCA) of the top 1,000 differentially expressed transcripts across the nine adult tissue types showed strong clustering for most tissue replicates, with the exception of stomach and eyestalk (Fig. 3A). Haemolymph, female gonad and muscle formed distinct clusters separated from other tissues, while eyestalk, gill, haemolymph, lymphoid organ, male gonad and stomach tissues were much more closely associated and showed less distinct clustering (Fig. 3A). A PCA of the top 500 differentially expressed transcripts across the eight early life-history stages showed a strong separation within PC1, with embryo and nauplii segregating substantially from the other early life-history larval stages (Fig. 3B). PC1 explained an extraordinary 77% of the variance in transcript clusters expressed across the different discrete larval stages, which appears to be strongly associated with larval development leading from embryo to post-larval stages.

Figure 3
figure 3

Principal component analysis showing the top most highly differentially expressed transcripts of (A) nine tissue types (top 1,000) and (B) eight early life-history stages (top 500). PC = principal component, PL = post-larvae.

The top 2,000 most variably expressed transcript clusters across all nine tissue types clustered into nine distinct groups using Pearson’s correlation (Fig. 4). These groups aligned broadly with expression patterns identified to be unique to each tissues type. For example, group two comprised 208 clusters highly expressed in female gonad, which were mostly successfully annotated (81.8%) using the nrA database. Annotated transcripts included farnesoic acid O-methyltransferase (FAmET), phosphoenolpyruvate carboxykinase (PEPCK), glutathione peroxidase (GPx) and nasrat. Transcripts in each cluster and their annotation are detailed in Supplementary Table S2. Group four consisted of clusters expressed mainly in male gonad that were annotated relatively poorly (38.7%) with many (35.5%) not expressed in the early life-history stages (Table 3). Group nine was the largest and comprised 591 clusters that were mostly annotated (86.0%) and expressed predominantly in muscle tissue. Group seven consisted of 533 clusters that were also mostly annotated (85.7%) and expressed predominantly in hepatopancreatic tissue. Except for male gonad, most clusters expressed in adult tissue types were also expressed in the early life-history stages.

Figure 4
figure 4

Heatmap and hierarchical grouping of the top 2,000 differentially expressed genes in the nine different tissue types. A darker colour indicates a higher expression level. Gene expression patterns (rows) were grouped into nine expression groups based on Pearson’s correlation and the three replicates of each tissue type (columns) into nine tissue groups based on Euclidean distance. EY – eyestalk; FG – female gonad; GI – gill; HL – hemolymph; HP – hepatopancreas; LO – lymphoid organ; MG – male gonad; MU – muscle; ST – stomach.

Table 3 Groupings of the top 2,000 highly variably expressed transcript clusters among all nine adult tissue types based on Pearson’s correlation.

The same top 500 most variably expressed transcript clusters in the different larval and post-larval stages used for the PCA broadly clustered into nine distinct groups based on Pearson’s correlation (Fig. 5). Irrespective of the annotation success, the analysis identified transcript clusters that shared similar expression patterns across developmental stages. Embryos and nauplii expressed a set of genes that were not expressed during any other developmental stage (groups 7 and 8). Of the 140 genes expressed exclusively within the embryo and nauplii stages (group 8), only 24.3% and 37.1%, respectively, were annotated successfully using the SWISS-PROT or nrA databases (Table 4). Of the transcript clusters that were annotated, 13 encoded orthologs of the neurotrophic factor spaetzle and another 13 encoded orthologs of cuticular proteins. Transcripts in each cluster and their annotation are detailed in Supplementary Table S3. Two large clusters of genes were expressed from zoea throughout each subsequent stage (group 1), or from mysis throughout each subsequent stage (group 4). A high percentage (61.2% and 83.1%) of transcripts in these two clusters was annotated. Since each larval stage was sequenced as a pool of individuals, differential gene expression (DGE) analysis could not be performed.

Figure 5
figure 5

Heatmap and hierarchical grouping of the top 500 differentially expressed genes in the eight larval and post-larval stages examined. A darker colour indicates a higher expression level. Gene expression patterns in each larval/post-larval stage (row) were grouped into nine expression groups based on Pearson’s correlation.

Table 4 Groupings of the top 500 highly variably expressed transcript clusters among the four larval and four post-larval stages based on Pearson’s correlation.

Identification of long non-coding RNAs

We used the set of 1,047 complete USCOs as the training set for classification of coding and non-coding transcripts. It was determined that a coding potential of 0.2642 was the appropriate threshold to balance classification specificity and sensitivity. In total 79,656 transcripts were classified as lncRNAs and the remaining 154,893 transcripts were classified as mRNAs.

Comparing the lncRNA annotation with the BLASTx annotation, out of the 236,388 contigs 67,960 were uniquely identified as lncRNA, while 13,535 contigs were annotated both as mRNA and lncRNA. At a cluster level, 12,079 out of 58,768 larval clusters (22.6%) and 23,645 out of the 99,203 tissue clusters (23.8%) were uniquely annotated as lncRNA. Detailed results of the lncRNA analysis can be found in Supplementary Table S4.

KEGG pathway analysis

Annotated contigs were overlaid onto their respective biological pathways using the Kyoto Encyclopaedia of Genes and Genomes (KEGG) pathways. Genes involved in general eukaryotic cellular processes such as RNA replication (Fig. 6) and basal transcription factor sequences (Fig. 7) were well represented in the P. monodon transcriptome. As expected, assignments to KEGG pathways in prokaryotes were rare, as were ribosomal RNA assignments. The various biological processes, metabolism and signalling cascades comprising all 235 KEGG pathways to which transcripts were assigned are detailed in Supplementary Table S5.

Figure 6
figure 6

Presence of mRNA contigs that encode for RNA polymerase subunits based on KEGG pathway analysis73,74,75. Green shading highlights the presence of gene orthologs in the P. monodon transcriptome.

Figure 7
figure 7

Presence of eukaryotic basal transcription factor sequences based on KEGG pathway analysis73,74,75. Green shading highlights the presence of gene in the P. monodon transcriptome.

Virus discovery

Interrogating the P. monodon transcriptome against the viral subsection of the non-redundant database using BLASTx assigned viral annotations to 12,744 contigs. Detailed information on the viral blast can be found in Supplementary Table S6. Closer inspection of the data identified the vast majority (>99.8%) of these to represent short motifs conserved between eukaryote cell proteins and related homologs viruses with generally large and complex DNA genomes such as giant viruses, poxviruses, herpes viruses and baculoviruses. Additional BLASTx searches of the GenBank nr database using representative contigs confirmed them to be or likely be endogenous shrimp gene transcripts. The remaining 21 contigs had Top Hit E-value scores identifying them to be related most closely to strains of Gill-associated virus (GAV; 4 contigs, longest 26,235 nt), Penaeus chinensis hepandenovirus (PchiHDV; 4 contigs, longest 1,884 nt), Whenzhou shrimp virus 2 (When-2; RdRp, hypothetical protein and G protein contigs, longest 6,891 nt), Whenzhou shrimp virus 8 (When-8; 6 contigs, longest 4,579 nt), Beihai picorna-like virus 2 (5,277 nt), Wenzhou picorna-like virus 23 (551 nt) and Moloney murine leukaemia virus Pr180 sequence (Mo-MuLV; 2,431 nt). Additionally, Deformed wing virus (DWV; 10,133 nt) was present in two out of 35 samples; however, further investigation confirmed that this was caused by a highly localized cross-contamination from other samples with a high content of a specific strain of DWV during sequencing. Lastly, over 1200 contigs with homology to phages were detected, some of which related to phage tail protein and tetracycline resistance.


Here we report a comprehensive black tiger shrimp (Penaeus monodon) transcriptome assembled from nine tissues, four larval stages and four post-larval stages. The transcriptome was generated to expand the genetic resources available for this species to help investigate the genetic basis behind larval developmental stage transitions and tissue functioning, as well as traits with potential to be exploited commercially for the aquaculture of this and other shrimp species. The aim was therefore to generate a highly complete P. monodon transcriptome at the risk of it containing higher levels of transcript redundancy. This was confirmed by BUSCO results which demonstrated the transcriptome to be highly complete (C:98.2%) with low fragmentation (F:0.8%) or missing (M: 1.0%) genes but high levels of duplication (D:51.3%). These assembly statistics are comparable to those obtained by a transcriptome assembly from L. vannamei15 (C:98.0%, F:0.7%, M:1.3%, D:25.5%), but greatly exceeded those of another P. monodon assembly focussing on gonadial tissue recently made available publicly10 (C:33.7%, F:44.9%, M:21.4%, D:6.8%). As other recent NGS analyses of P. monodon have focussed on only one or two tissue types without including any larval stages or biological replicates, generated fewer total reads, or experienced data loss due to quality trimming of low quality reads or low mapping efficiencies8,9,10,11, these are likely to have missed many transcripts. In contrast, the sequencing and assembly strategy used here covered more tissue types at greater read depth and employed multiple de novo assembly tools to reduce assembler bias.

Using the nrA database, 30.0% of transcript clusters found in the nine tissue types and 38.1% of transcript clusters found in the eight larval/post-larval stages analysed were successfully annotated. These annotation levels were comparable to those reported to date in similar studies on different crustaceans8,15,24,31. While transcript cluster annotation levels were lower using the SWISS-PROT database compared to the nrA database, the percentage of successful GO-term assignments was substantially higher. In addition to the annotations, analyses were undertaken to identify transcript clusters expressed differentially across tissue types or early life-history stages, irrespective of successful annotation. The identification was done to help provide initial evidence for transcript roles in specific tissue functions or developmental transitions. Despite all efforts made here to improve transcript annotation levels for P. monodon, our data reaffirms the need for dedicated functional studies to assign or confirm gene functions of both annotated and unannotated transcript clusters of non-model (crustacean) species.

To our best knowledge, to date only two Penaeid shrimp transcriptome assemblies have been made publicly available10,15, restricting comparative analyses of these transcriptomes. A reciprocal MegaBLAST identified 96.8% of the most recent P. monodon assembly10 within the transcriptome described here, but only 40.0% of our assembly was found in the earlier assembly. These comparisons confirm that our transcriptome assembly contains many high quality P. monodon transcripts not discovered previously.

When compared across species, a reciprocal MegaBLAST showed that the transcriptomes of P. monodon (present) and L. vannamei15 shared approximately 48% of contigs. Since the assembly metrics of the L. vannamei transcriptome were similar to those of our P. monodon transcriptome, the low number of shared contigs could stem from considerable differences in transcript type or sequence composition between the two shrimp species. As comprehensive comparisons across crustacean species is currently impractical due to restrictions on publicly-available transcriptome assemblies, the potential value of this warrants effort to consolidate transcriptomic data and to establish both centralized and species-specific databases.

Read count data identified independent clusters of transcripts expressed uniquely within different tissues and clusters that formed distinct groups based on their tissue-specific expression patterns. An important consideration for this type of analysis is the normalized read count cutoff value for each cluster to be considered “unique”, which was arbitrarily set at above 10 in a specific tissue and <10 in all others. At >100 normalized read counts, only approximately half of the assigned unique clusters were retained, indicating that the expression levels of many of these potentially tissue-specific clusters was relatively low. Among the annotated transcript clusters most highly expressed in female gonad tissue were FAMeT, PEPCK, GPx and nasrat. Functional roles these proteins may play range from the shrimp moult cycle and reproduction32, the primary step of gluconeogenesis33, preventing oxidative stress33, to specifying terminal regions of the embryo34. Among the annotated genes expressed most highly in eyestalk tissue was hyperglycaemic hormone (CHH), a key neuropeptide hormone that regulates blood sugar, moulting and reproduction35. A subset of transcript clusters highly expressed in lymphoid organ tissue was also highly expressed in gill tissue, most likely due to high concentrations of haemocytes within both tissue types. The majority of genes expressed most highly in hepatopancreas were annotated, potentially reflecting the shared metabolic functions of this organ with those of other animals. Also of much interest were the non-annotated transcripts expressed uniquely in specific tissue types. For example, transcript clusters expressed highly in male gonad were poorly annotated by both databases and included a large proportion of clusters, annotated or not, expressed exclusively in adult tissue types, indicating that male reproductive organs utilize many genes that remain poorly characterized. The grouping of genes with similar expression patterns broadly categorized these transcript clusters into potential functional groups within each tissue type, thereby guiding the selection for more targeted molecular function analyses.

Based solely on gene expression patterns, the transcriptome data identified unique groups of transcripts involved in transitions between P. monodon early life-history stages. There was a major disparity between the annotation success of transcript groups upregulated in early or late stage embryogenesis, highlighting how poorly early developmental pathways have been characterized in crustaceans. Also of significance was the presence of orthologs of the Spaetzle gene, known in Drosophila flies to establish the dorso-ventral patterning of the early embryo36 among transcript clusters detected consistently across later larval and post-larval stages. Since each larval and post-larval stage sequenced comprised a pool of several hundred individuals, quantitative and/or spatial transcript expression patterns would be required to draw further functional conclusions. Nevertheless, the data reported here will benefit from similar data on other shrimp and crustacean species, particularly for transcript clusters expressed exclusively in embryo with no significant homology to currently known genes.

Long non-coding RNAs (lncRNA) are a type of transcript that have many common features with traditional coding mRNA, including 5′ capping, splicing and 3′ polyadenylation37,38,39. The nature of lncRNAs is still poorly understood, and it is likely that lncRNAs are in fact a heterogeneous group of transcripts with regulatory functions that are not actively translated into proteins40. Thus, their main characteristics are the lack of open reading frames (ORFs) or the presence of non-canonical ORFs in the mature transcript. The biological roles of lncRNAs range from regulation of gene expression, and control of translation, to imprinting. As such, they have been linked to X chromosome inactivation in humans41, genomic imprinting42 and cancer43,44.

Due to the lack of a known lncRNA database in shrimp that can be used for their identification, we used FEELnc which scores each transcript according to its coding potential and then selects a threshold score to classify the transcripts into coding or non-coding45. This software is particularly useful for non-model species because in the absence of an lncRNA training set, it generates a simulated training set using debris from high confidence coding transcripts. In fly data, this approach showed an MCC value of 0.754 with an accuracy of 0.86845.

In this study, 79,656 transcripts were classified as lncRNAs, of which 67,960 (85.3%) could not be aligned to any protein database. As expected, the use of a non-model organism and the lack of a set with known lncRNA for training led to the ambiguous classification of 13,535 transcripts with low protein-coding potential but clear alignments to known proteins in curated databases. Classification of these transcripts is the first step towards understanding their roles in the development and regulation of gene expression in Penaeus monodon.

Annotated transcript clusters mapped into 235 KEGG pathways (Supplementary Table S3), which have been broadly classified into functional groupings such as general metabolism (e.g. TCA cycle, xenobiotic metabolism, immunity, reproduction), nutritional metabolism (e.g. proteins, lipids, carbohydrates, vitamins), cellular processes (e.g. DNA replication, protein trafficking, apoptosis), biological processes (e.g. circadian rhythm, olfaction and taste, digestion and absorption) and signalling pathways (e.g. PI3K-Akt, MAPK, axis formation, TGF-beta). In general, core pathways such as citrate cycle, oxidative phosphorylation, ribosome biogenesis and RNA/DNA polymerases were better represented than more specific pathways such as the pentose and glucuronate interconversion pathway, or the ascorbate and aldarate metabolism pathway. Furthermore, arthropod specific pathways were generally better represented. For example, the general circadian rhythm pathway was missing several homologs, while the fly specific circadian rhythm pathway was complete. This could be explained by transcripts not sharing sufficient homology with the known genes used for the KEGG analysis and therefore failing to be annotated.

Particularly for those pathways highly-conserved among other eukaryotes, the existence of unique transcripts suggests that Penaeid shrimp and possibly crustaceans in general might use metabolic mechanisms differing from eukaryote species studied to date. Their existence also highlights the need for high-quality genome assemblies for shrimp and other crustacean species, overlaid with isoform, tissue-specific and developmental stage transcript expression data, to either help predict gene functions or direct gene knockdown studies, using RNA interference processes as an example, to empirically ascribe functions to novel genes.

Several RNA transcripts and/or genome sequences likely to be from viruses were discovered in the P. monodon transcriptome. This was not unexpected considering that it was generated from multiple individuals, tissue types and larval/post-larval stages, as shrimp are co-infected commonly with multiple viruses and as there are several viruses known to be endemic in P. monodon populations indigenous to different regions of Australia46,47,48,49. The presence of near full-length ssRNA genome sequences for viruses such as gill-associated virus (GAV, 26,235 nt) and and two sequences (deposited on NCBI: OM219076 and OM219077, cumulative length of 10,542 nt ) with high similarity to Whenzhou shrimp virus-2 L and M segments (When-2, KM817720.2 and KM817687.1) provided additional validation of the methods used to synthesize and assemble the transcriptome, and to its completeness as demonstrated by various metrics measuring the nature and number of endogenous gene transcripts. The detection of a ssDNA virus, hepandenovirus, within the transcriptome, presumably detected in a replicative phase, indicates the application of this technique as a tool to also detect the presence of viruses with DNA genomes.

In addition to known endemic viruses, the transcriptome contained full-length or near full-length RNA transcripts related closely to the recently-described shrimp viruses When-2 and When-850,51 unknown until now to occur in Australian P. monodon. A couple of long transcripts of suspected viral origin and expressed across multiple tissue types were also identified. One of these possessed significant BLASTx homology to the reverse transcriptase (RT)-like component of hypothetical protein 1 of Beihai picorna-like virus 116 discovered recently in blue swimmer crabs (Portunus pelagicus)51. The other possessed substantial homology to the RT component of the Mo-MuLV Pr180 polyprotein and was expressed across all tissue types except the lymphoid organ, suggesting it to be from a mobile element such as a poly(A)-type retrotransposon or retrovirus52. However, determining whether these transcripts containing RT sequences are viral in origin, or represent the products of endogenous retrotransposons like others now being reported in shrimp53 will require further investigation, as will the nature of the strains, host and distribution ranges, prevalence and potential pathogenicity of the new viruses discovered in the transcriptome.

In conclusion, this study describes the assembly of a comprehensive and high quality transcriptome from nine different tissue types, and eight larval and post-larval early life-history stages of the black tiger shrimp, Penaeus monodon. It also summarizes the number and nature of specific transcript clusters differentially expressed in different tissue types and larval stages, and the Clusters were functionally annotated and mapped to 235 KEGG pathways. Unique transcript clusters and cluster groups were defined across distinct tissues and early life-history stages, providing initial evidence for their roles in specific tissue functions or developmental transitions. The current transcriptome provides a valuable resource for further investigation of directing gene-function studies to increase basic functional biology knowledge in shrimp and for investigating molecular basis of traits of relevance to the aquaculture of shrimp. While the current transcriptome already provides an improved resource for P. monodon, further effort is required using long-read sequencing data, such as provided by PacBio, to better resolve genes at isoform level. Lastly, this high-quality de novo assembly and data set are publically available and will hopefully support research projects that underpin transformational advances in how we culture shrimp globally.

Material and Methods

Sample taking and RNA extraction

Tissues of P. monodon broodstock were collected from multiple intermolt individuals, immediately snap frozen on dry ice, and stored at −80 °C until extraction (Table 1). All tissues except lymphoid organs were collected from wild broodstock caught off coastal waters near the border between the Northern Territory and Western Australia provided, which were provided by a commercial hatchery at Flying Fish Point, North Queensland, Australia. The prawns were kept at a salinity of 27–35 ppt, pH 7.8–8.0, 28.5–29.5 °C and 5 to 7 ppm dissolved oxygen. Lymphoid organ tissue was collected from wild prawns caught off the East Coast of Queensland. Larval and post-larval stages were collected from the same hatchery in pools of approximately 400 individuals per life stage, after four hours of starvation, and preserved in RNAlater (Thermo Fisher Scientific). All tissues and early life-history stages were sub-sampled in an RNase-free laboratory and total RNA was extracted using an RNeasy Universal extraction kit (QIAGEN) following manufacturer’s instructions. RNA quantity and quality was estimated using a Nanodrop UV spectrophotometer (Thermo Fisher Scientific), and purity was further assessed using an Agilent Bioanalyzer (Agilent Technologies). RNA was selected from individual sample replicates based on Nanodrop spectra, RNA concentration, and Agilent Bioanalyzer traces, in preference to using comparative tissues from the same individuals.

Illumina library preparation and sequencing

Library preparation and sequencing was carried out at the Australian Genome Research Facility (AGRF). Upon arrival at the sequencing facility, the quality of the samples was checked using a Bioanalyzer RNA 6000 nano reagent kit (Agilent) and libraries were prepared using the TruSeq Stranded mRNA Library Preparation Kit (Illumina) according to established protocols. Final libraries were again checked using Tapestation DNA 1000 TapeScreen Assay (Agilent). Cluster generation was performed on a cBot with HiSeq PE Cluster Kit v4 - cBot and sequencing was done on a HiSeq 2500 using a HiSeq SBS Kit. The Hiseq 2500 was operating with HiSeq Control Software v2.2.68 and base-calling was performed with RTA v1.18.66.3. Samples in the second sequencing run were pooled and split across two lanes to reduce sequencing bias (Table 1).

Sequence quality control, assembly and annotation

Raw sequence data was quality checked using FastQC54 v0.11.5, and assembled loosely following the Oyster River Protocol for Transcriptome Assembly55. In brief, all sequences were collectively error-corrected using RCorrector56 V3. Samples were then assembled in Trinity57 V2.3.2; grouped by individual shrimps, i.e. all tissues from a specific shrimp were assembled together. Reads were trimmed harshly for adapters and softly for Phred score <2 using Trimmomatic58 V0.32; and then normalized in silico within Trinity. The normalized forward and reverse reads produced by Trinity were then used in BinPacker59 V1.0, IDBA-Tran60 V 1.1.1 using K20, K30, K40, K50 and K60; and Bridger61 version 2014-12-01. All resulting transcriptomes were concatenated and merged using Evidential Gene62, followed by clustering using Transfuse V0.5.0 ( using a similarity value of 0.98. Lastly, contigs <300 bp were removed to produce the final transcriptome. The quality of the final assembly was assessed using TransRate63 V1.0.1, and BUSCO64 V2 using the arthropoda_odb9 database65. Sequences were annotated in Blast2Go66 using the SWISS-PROT database67 (accessed 17/03/2017), and separately using the arthropod and viral subsections of the GenBank nr database (accessed 06/06/2017).

Identification of long non-coding RNAs

FEELnc45 was used for the identification of long non-coding RNAs. The coding transcripts training set was constructed from the 1,047 complete universal single copy orthologous genes found with BUSCO v2.0 (database arthropoda_odb965). The mode “shuffle” was used to generate a training set of lncRNA from the debris of the known coding RNA transcripts.

Mapping and differential gene expression analysis

Before mapping, error-corrected raw sequence reads were trimmed using the same parameters as before, but without palindrome trimming used by Trinity. Sequence reads were mapped using Bowtie268 V2.2.8, and read counts were calculated using Corset69 V1.0.6. Differential gene expression was analyzed using DESeq270 V1.16.1 in RStudio71 V1.0.143 running R72 V3.4.1.

To reduce the number of sequences for KEGG analysis73,74,75, the longest contig per cluster was chosen from the combined tissue type and early life-history stage data. The KEGG Automatic Annotation Server (KAAS, was used to generate KEGG pathway maps for each contig using BLAST with the single-directional best hit (SBH) method. All scripts can be found on GitHub at

Statistical analyses

For data analysis, the top 2,000 variably expressed genes across the nine tissue types and the top 500 variably expressed genes across the four larval and four post-larval stages were visualized in a principal component analysis and heatmap using variance-stabilizing transformed read-count data from DESeq. 2. The gene level dendrograms in the heatmap were created using Pearson’s correlation for both the tissue type larval/post-larval stages. Euclidean distance was used to cluster tissue types. All statistical analyses were performed in RStudio. More detailed information on the analyses can be found on GitHub.

Ethical approval

This study has been carried out abiding by all necessary Queensland Government legislation and James Cook University policies.