The black tiger shrimp (Penaeus monodon) remains the second most widely cultured shrimp species globally; however, issues with disease and domestication have seen production levels stagnate over the past two decades. To help identify innovative solutions needed to resolve bottlenecks hampering the culture of this species, it is important to generate genetic and genomic resources. Towards this aim, we have produced the most complete publicly available P. monodon transcriptome database to date based on nine adult tissues and eight early life-history stages (BUSCO - Complete: 98.2% [Duplicated: 51.3%], Fragmented: 0.8%, Missing: 1.0%). The assembly resulted in 236,388 contigs, which were then further segregated into 99,203 adult tissue specific and 58,678 early life-history stage specific clusters. While annotation rates were low (approximately 30%), as is typical for a non-model organisms, annotated transcript clusters were successfully mapped to several hundred functional KEGG pathways. Transcripts were clustered into groups within tissues and early life-history stages, providing initial evidence for their roles in specific tissue functions, or developmental transitions. We expect the transcriptome to provide an essential resource to investigate the molecular basis of commercially relevant-significant traits in P. monodon and other shrimp species.
The black tiger shrimp Penaeus monodon belongs to the family Penaeidae and is the second most widely farmed shrimp species globally1. However, disease and limited progress in domestication and selective breeding of P. monodon continue to hamper further expansion of the industry2. Modern genomic technologies have significant potential to advance selective breeding programs; however, they require complete, well annotated tissue-specific transcriptomic and genomic datasets. In addition to assisting in genome assembly and creating linkage maps3, a complete transcriptome provides a potential resource for focussed differential gene-expression studies4, genome annotation5, single nucleotide polymorphism discovery6 and genome scaffolding7.
While genomic resources for Penaeid shrimp are increasing, they remain limited for many species, including P. monodon. Previous research has focussed on hepatopancreas, ovary, heart, muscle and eyestalk tissues8,9, in male and female gonads10, and in response to infection with Vibrio bacterial species capable of inducing acute hepatopancreatic necrosis disease11. In addition to such differential gene-expression studies, genomic data from next generation sequencing (NGS) methods has expanded greatly in recent years, particularly in the study of Pacific white shrimp (Litopenaeus vannamei)3,6,12,13,14,15,16,17,18,19,20,21,22,23. Moreover, a transcriptome based on eight tissues was assembled for the less well studied banana shrimp Fenneropenaeus merguiensis24, and genes involved in early embryonic specification have been studied in Marsupenaeus japonicus25. Transcriptomics has also been applied to Penaeus merguiensis26,27,28 and the Chinese white shrimp Fenneropenaeus chinensis29,30 to investigate aspects of tissue-specific expression, stress tolerance and viral infection. Despite these advances, a comprehensive transcriptome from diverse tissue types and early life-history stages of P. monodon remains unavailable.
In order to address this deficiency, we report a highly complete transcriptome for P. monodon that can be used as a broad basis for future genomics research. To this effect, we sequenced three replicates each from nine different tissues types (eyestalk, stomach, female gonad, male gonad, gill, haemolymph, hepatopancreas, lymphoid organ and tail muscle) and one pooled replicate each from four larval stages (embryo, nauplii, zoea, and mysis) and four post-larval stages ranging from days 1, 4, 10 and 15. Additionally, transcript expression profiles unique to each type and stage were determined, as well as identifying putative long non-coding RNA and transcripts originating from viruses.
Sequence read data and code availability
In total, nine tissues were sequenced in biological triplicates, as well as pools of eight early life-history stages, resulting in an average of 19.9 M ± 1.6 M (mean ± SD) read pairs per sample and 697 M reads in total (Table 1). After quality trimming, 99.5% ± 0.6% (mean ± SD) of reads were retained, indicating a high quality data set (>90% reads with ≥Q30). All read data are available on GenBank through the project ID PRJNA421400.
Transcriptome assembly and quality control
The initial combined outputs of all four assemblers comprised of 6,113,055 contigs, which were reduced to 462,772 contigs after filtering with Evidential Gene and combining both “okay” and “alternative” contigs. After clustering with Transfuse, the final assembly consisted of 236,388 transcripts with an assembly size of 226 Mb. These, together with transcript annotations, are available on GenBank. The final transcriptome had a high TransRate score of 0.37, with 88% of all reads successfully mapping back to the transcriptome, and only 3.2% of bases being uncovered. Based on BUSCO, the transcriptome was highly complete with 98% of arthropod ortholog genes being present, and few fragmented or missing genes; however, 51% of the contigs were duplicated/redundant (C:98.2%[S:46.9%, D:51.3%], F:0.8%, M:1.0%, n:1066).
Annotation and gene ontology mapping
Annotation against the SwissProt database using BLASTx resulted in 47,871 successfully annotated contigs. Of these, 46,977 were successfully GO mapped, of which 41,069 were completely annotated. The top-hit species distribution was dominated by Homo sapiens with over 10,000 hits, followed by Drosophila melanogaster with just over 8,000 hits; no shrimp species made it into the list (Fig. 1A). GO terms for biological processes, molecular function and cellular components were all highly represented in annotated genes (Fig. 2).
The annotation against the non-redundant Arthropod (nrA) database using BLASTx resulted in 62,679 successfully annotated contigs, of which 48,456 had a successful GO mapping, and of which 25,201 were completely annotated. The top-hit species distribution was dominated by the freshwater amphipod Hyalella azteca with over 20,000 hits, followed by P. monodon with just over 2,500 hits (Fig. 1B). Other penaeid shrimp species included Litopenaeus vannamei, Marsupenaeus japonicus and Fenneropenaeus chinensis, which were the sixth, seventh and twelfth most highly represented species respectively.
Detailed information on the annotations can be found in Supplementary Table S1.
Sequence read mapping and differential gene expression analysis
Using Bowtie2, 67.4% ± 4.8% (mean ± SD) of the paired reads successfully mapped to the transcriptome. Using corset for read counting and additional clustering, the initial 236,388 contigs were placed into 99,203 transcript clusters for the nine tissue types and 58,678 transcript clusters for the eight early life-history stages (larval and post-larval stage). A total of 176,966 contigs were used in the clustering of tissues and larvae, with 113,435 shared contigs, 8,188 contigs unique to larvae and 55,343 contigs unique to adult tissues.
Different tissue types expressed between 9,939 and 12,255 transcript clusters (defined as >50 normalized read counts per cluster), and between 17 and 316 unique sets of transcript clusters (defined as a cluster with >10 normalized read counts and <10 normalized read counts in all other tissue types) (Table 2). The ability to annotate transcript clusters varied across tissue types (63.0% to 85.9%). In terms of unique tissue specific transcript clusters, hepatopancreas contained the largest number (316), followed by female gonad (161) and gill (153). Annotation rates of these unique tissue-specific clusters were markedly lower (12.5% to 66.8%) than with clusters shared across all tissue types (82.5% and 85.9%)
A principal component analysis (PCA) of the top 1,000 differentially expressed transcripts across the nine adult tissue types showed strong clustering for most tissue replicates, with the exception of stomach and eyestalk (Fig. 3A). Haemolymph, female gonad and muscle formed distinct clusters separated from other tissues, while eyestalk, gill, haemolymph, lymphoid organ, male gonad and stomach tissues were much more closely associated and showed less distinct clustering (Fig. 3A). A PCA of the top 500 differentially expressed transcripts across the eight early life-history stages showed a strong separation within PC1, with embryo and nauplii segregating substantially from the other early life-history larval stages (Fig. 3B). PC1 explained an extraordinary 77% of the variance in transcript clusters expressed across the different discrete larval stages, which appears to be strongly associated with larval development leading from embryo to post-larval stages.
The top 2,000 most variably expressed transcript clusters across all nine tissue types clustered into nine distinct groups using Pearson’s correlation (Fig. 4). These groups aligned broadly with expression patterns identified to be unique to each tissues type. For example, group two comprised 208 clusters highly expressed in female gonad, which were mostly successfully annotated (81.8%) using the nrA database. Annotated transcripts included farnesoic acid O-methyltransferase (FAmET), phosphoenolpyruvate carboxykinase (PEPCK), glutathione peroxidase (GPx) and nasrat. Transcripts in each cluster and their annotation are detailed in Supplementary Table S2. Group four consisted of clusters expressed mainly in male gonad that were annotated relatively poorly (38.7%) with many (35.5%) not expressed in the early life-history stages (Table 3). Group nine was the largest and comprised 591 clusters that were mostly annotated (86.0%) and expressed predominantly in muscle tissue. Group seven consisted of 533 clusters that were also mostly annotated (85.7%) and expressed predominantly in hepatopancreatic tissue. Except for male gonad, most clusters expressed in adult tissue types were also expressed in the early life-history stages.
The same top 500 most variably expressed transcript clusters in the different larval and post-larval stages used for the PCA broadly clustered into nine distinct groups based on Pearson’s correlation (Fig. 5). Irrespective of the annotation success, the analysis identified transcript clusters that shared similar expression patterns across developmental stages. Embryos and nauplii expressed a set of genes that were not expressed during any other developmental stage (groups 7 and 8). Of the 140 genes expressed exclusively within the embryo and nauplii stages (group 8), only 24.3% and 37.1%, respectively, were annotated successfully using the SWISS-PROT or nrA databases (Table 4). Of the transcript clusters that were annotated, 13 encoded orthologs of the neurotrophic factor spaetzle and another 13 encoded orthologs of cuticular proteins. Transcripts in each cluster and their annotation are detailed in Supplementary Table S3. Two large clusters of genes were expressed from zoea throughout each subsequent stage (group 1), or from mysis throughout each subsequent stage (group 4). A high percentage (61.2% and 83.1%) of transcripts in these two clusters was annotated. Since each larval stage was sequenced as a pool of individuals, differential gene expression (DGE) analysis could not be performed.
Identification of long non-coding RNAs
We used the set of 1,047 complete USCOs as the training set for classification of coding and non-coding transcripts. It was determined that a coding potential of 0.2642 was the appropriate threshold to balance classification specificity and sensitivity. In total 79,656 transcripts were classified as lncRNAs and the remaining 154,893 transcripts were classified as mRNAs.
Comparing the lncRNA annotation with the BLASTx annotation, out of the 236,388 contigs 67,960 were uniquely identified as lncRNA, while 13,535 contigs were annotated both as mRNA and lncRNA. At a cluster level, 12,079 out of 58,768 larval clusters (22.6%) and 23,645 out of the 99,203 tissue clusters (23.8%) were uniquely annotated as lncRNA. Detailed results of the lncRNA analysis can be found in Supplementary Table S4.
KEGG pathway analysis
Annotated contigs were overlaid onto their respective biological pathways using the Kyoto Encyclopaedia of Genes and Genomes (KEGG) pathways. Genes involved in general eukaryotic cellular processes such as RNA replication (Fig. 6) and basal transcription factor sequences (Fig. 7) were well represented in the P. monodon transcriptome. As expected, assignments to KEGG pathways in prokaryotes were rare, as were ribosomal RNA assignments. The various biological processes, metabolism and signalling cascades comprising all 235 KEGG pathways to which transcripts were assigned are detailed in Supplementary Table S5.
Interrogating the P. monodon transcriptome against the viral subsection of the non-redundant database using BLASTx assigned viral annotations to 12,744 contigs. Detailed information on the viral blast can be found in Supplementary Table S6. Closer inspection of the data identified the vast majority (>99.8%) of these to represent short motifs conserved between eukaryote cell proteins and related homologs viruses with generally large and complex DNA genomes such as giant viruses, poxviruses, herpes viruses and baculoviruses. Additional BLASTx searches of the GenBank nr database using representative contigs confirmed them to be or likely be endogenous shrimp gene transcripts. The remaining 21 contigs had Top Hit E-value scores identifying them to be related most closely to strains of Gill-associated virus (GAV; 4 contigs, longest 26,235 nt), Penaeus chinensis hepandenovirus (PchiHDV; 4 contigs, longest 1,884 nt), Whenzhou shrimp virus 2 (When-2; RdRp, hypothetical protein and G protein contigs, longest 6,891 nt), Whenzhou shrimp virus 8 (When-8; 6 contigs, longest 4,579 nt), Beihai picorna-like virus 2 (5,277 nt), Wenzhou picorna-like virus 23 (551 nt) and Moloney murine leukaemia virus Pr180 sequence (Mo-MuLV; 2,431 nt). Additionally, Deformed wing virus (DWV; 10,133 nt) was present in two out of 35 samples; however, further investigation confirmed that this was caused by a highly localized cross-contamination from other samples with a high content of a specific strain of DWV during sequencing. Lastly, over 1200 contigs with homology to phages were detected, some of which related to phage tail protein and tetracycline resistance.
Here we report a comprehensive black tiger shrimp (Penaeus monodon) transcriptome assembled from nine tissues, four larval stages and four post-larval stages. The transcriptome was generated to expand the genetic resources available for this species to help investigate the genetic basis behind larval developmental stage transitions and tissue functioning, as well as traits with potential to be exploited commercially for the aquaculture of this and other shrimp species. The aim was therefore to generate a highly complete P. monodon transcriptome at the risk of it containing higher levels of transcript redundancy. This was confirmed by BUSCO results which demonstrated the transcriptome to be highly complete (C:98.2%) with low fragmentation (F:0.8%) or missing (M: 1.0%) genes but high levels of duplication (D:51.3%). These assembly statistics are comparable to those obtained by a transcriptome assembly from L. vannamei15 (C:98.0%, F:0.7%, M:1.3%, D:25.5%), but greatly exceeded those of another P. monodon assembly focussing on gonadial tissue recently made available publicly10 (C:33.7%, F:44.9%, M:21.4%, D:6.8%). As other recent NGS analyses of P. monodon have focussed on only one or two tissue types without including any larval stages or biological replicates, generated fewer total reads, or experienced data loss due to quality trimming of low quality reads or low mapping efficiencies8,9,10,11, these are likely to have missed many transcripts. In contrast, the sequencing and assembly strategy used here covered more tissue types at greater read depth and employed multiple de novo assembly tools to reduce assembler bias.
Using the nrA database, 30.0% of transcript clusters found in the nine tissue types and 38.1% of transcript clusters found in the eight larval/post-larval stages analysed were successfully annotated. These annotation levels were comparable to those reported to date in similar studies on different crustaceans8,15,24,31. While transcript cluster annotation levels were lower using the SWISS-PROT database compared to the nrA database, the percentage of successful GO-term assignments was substantially higher. In addition to the annotations, analyses were undertaken to identify transcript clusters expressed differentially across tissue types or early life-history stages, irrespective of successful annotation. The identification was done to help provide initial evidence for transcript roles in specific tissue functions or developmental transitions. Despite all efforts made here to improve transcript annotation levels for P. monodon, our data reaffirms the need for dedicated functional studies to assign or confirm gene functions of both annotated and unannotated transcript clusters of non-model (crustacean) species.
To our best knowledge, to date only two Penaeid shrimp transcriptome assemblies have been made publicly available10,15, restricting comparative analyses of these transcriptomes. A reciprocal MegaBLAST identified 96.8% of the most recent P. monodon assembly10 within the transcriptome described here, but only 40.0% of our assembly was found in the earlier assembly. These comparisons confirm that our transcriptome assembly contains many high quality P. monodon transcripts not discovered previously.
When compared across species, a reciprocal MegaBLAST showed that the transcriptomes of P. monodon (present) and L. vannamei15 shared approximately 48% of contigs. Since the assembly metrics of the L. vannamei transcriptome were similar to those of our P. monodon transcriptome, the low number of shared contigs could stem from considerable differences in transcript type or sequence composition between the two shrimp species. As comprehensive comparisons across crustacean species is currently impractical due to restrictions on publicly-available transcriptome assemblies, the potential value of this warrants effort to consolidate transcriptomic data and to establish both centralized and species-specific databases.
Read count data identified independent clusters of transcripts expressed uniquely within different tissues and clusters that formed distinct groups based on their tissue-specific expression patterns. An important consideration for this type of analysis is the normalized read count cutoff value for each cluster to be considered “unique”, which was arbitrarily set at above 10 in a specific tissue and <10 in all others. At >100 normalized read counts, only approximately half of the assigned unique clusters were retained, indicating that the expression levels of many of these potentially tissue-specific clusters was relatively low. Among the annotated transcript clusters most highly expressed in female gonad tissue were FAMeT, PEPCK, GPx and nasrat. Functional roles these proteins may play range from the shrimp moult cycle and reproduction32, the primary step of gluconeogenesis33, preventing oxidative stress33, to specifying terminal regions of the embryo34. Among the annotated genes expressed most highly in eyestalk tissue was hyperglycaemic hormone (CHH), a key neuropeptide hormone that regulates blood sugar, moulting and reproduction35. A subset of transcript clusters highly expressed in lymphoid organ tissue was also highly expressed in gill tissue, most likely due to high concentrations of haemocytes within both tissue types. The majority of genes expressed most highly in hepatopancreas were annotated, potentially reflecting the shared metabolic functions of this organ with those of other animals. Also of much interest were the non-annotated transcripts expressed uniquely in specific tissue types. For example, transcript clusters expressed highly in male gonad were poorly annotated by both databases and included a large proportion of clusters, annotated or not, expressed exclusively in adult tissue types, indicating that male reproductive organs utilize many genes that remain poorly characterized. The grouping of genes with similar expression patterns broadly categorized these transcript clusters into potential functional groups within each tissue type, thereby guiding the selection for more targeted molecular function analyses.
Based solely on gene expression patterns, the transcriptome data identified unique groups of transcripts involved in transitions between P. monodon early life-history stages. There was a major disparity between the annotation success of transcript groups upregulated in early or late stage embryogenesis, highlighting how poorly early developmental pathways have been characterized in crustaceans. Also of significance was the presence of orthologs of the Spaetzle gene, known in Drosophila flies to establish the dorso-ventral patterning of the early embryo36 among transcript clusters detected consistently across later larval and post-larval stages. Since each larval and post-larval stage sequenced comprised a pool of several hundred individuals, quantitative and/or spatial transcript expression patterns would be required to draw further functional conclusions. Nevertheless, the data reported here will benefit from similar data on other shrimp and crustacean species, particularly for transcript clusters expressed exclusively in embryo with no significant homology to currently known genes.
Long non-coding RNAs (lncRNA) are a type of transcript that have many common features with traditional coding mRNA, including 5′ capping, splicing and 3′ polyadenylation37,38,39. The nature of lncRNAs is still poorly understood, and it is likely that lncRNAs are in fact a heterogeneous group of transcripts with regulatory functions that are not actively translated into proteins40. Thus, their main characteristics are the lack of open reading frames (ORFs) or the presence of non-canonical ORFs in the mature transcript. The biological roles of lncRNAs range from regulation of gene expression, and control of translation, to imprinting. As such, they have been linked to X chromosome inactivation in humans41, genomic imprinting42 and cancer43,44.
Due to the lack of a known lncRNA database in shrimp that can be used for their identification, we used FEELnc which scores each transcript according to its coding potential and then selects a threshold score to classify the transcripts into coding or non-coding45. This software is particularly useful for non-model species because in the absence of an lncRNA training set, it generates a simulated training set using debris from high confidence coding transcripts. In fly data, this approach showed an MCC value of 0.754 with an accuracy of 0.86845.
In this study, 79,656 transcripts were classified as lncRNAs, of which 67,960 (85.3%) could not be aligned to any protein database. As expected, the use of a non-model organism and the lack of a set with known lncRNA for training led to the ambiguous classification of 13,535 transcripts with low protein-coding potential but clear alignments to known proteins in curated databases. Classification of these transcripts is the first step towards understanding their roles in the development and regulation of gene expression in Penaeus monodon.
Annotated transcript clusters mapped into 235 KEGG pathways (Supplementary Table S3), which have been broadly classified into functional groupings such as general metabolism (e.g. TCA cycle, xenobiotic metabolism, immunity, reproduction), nutritional metabolism (e.g. proteins, lipids, carbohydrates, vitamins), cellular processes (e.g. DNA replication, protein trafficking, apoptosis), biological processes (e.g. circadian rhythm, olfaction and taste, digestion and absorption) and signalling pathways (e.g. PI3K-Akt, MAPK, axis formation, TGF-beta). In general, core pathways such as citrate cycle, oxidative phosphorylation, ribosome biogenesis and RNA/DNA polymerases were better represented than more specific pathways such as the pentose and glucuronate interconversion pathway, or the ascorbate and aldarate metabolism pathway. Furthermore, arthropod specific pathways were generally better represented. For example, the general circadian rhythm pathway was missing several homologs, while the fly specific circadian rhythm pathway was complete. This could be explained by transcripts not sharing sufficient homology with the known genes used for the KEGG analysis and therefore failing to be annotated.
Particularly for those pathways highly-conserved among other eukaryotes, the existence of unique transcripts suggests that Penaeid shrimp and possibly crustaceans in general might use metabolic mechanisms differing from eukaryote species studied to date. Their existence also highlights the need for high-quality genome assemblies for shrimp and other crustacean species, overlaid with isoform, tissue-specific and developmental stage transcript expression data, to either help predict gene functions or direct gene knockdown studies, using RNA interference processes as an example, to empirically ascribe functions to novel genes.
Several RNA transcripts and/or genome sequences likely to be from viruses were discovered in the P. monodon transcriptome. This was not unexpected considering that it was generated from multiple individuals, tissue types and larval/post-larval stages, as shrimp are co-infected commonly with multiple viruses and as there are several viruses known to be endemic in P. monodon populations indigenous to different regions of Australia46,47,48,49. The presence of near full-length ssRNA genome sequences for viruses such as gill-associated virus (GAV, 26,235 nt) and and two sequences (deposited on NCBI: OM219076 and OM219077, cumulative length of 10,542 nt ) with high similarity to Whenzhou shrimp virus-2 L and M segments (When-2, KM817720.2 and KM817687.1) provided additional validation of the methods used to synthesize and assemble the transcriptome, and to its completeness as demonstrated by various metrics measuring the nature and number of endogenous gene transcripts. The detection of a ssDNA virus, hepandenovirus, within the transcriptome, presumably detected in a replicative phase, indicates the application of this technique as a tool to also detect the presence of viruses with DNA genomes.
In addition to known endemic viruses, the transcriptome contained full-length or near full-length RNA transcripts related closely to the recently-described shrimp viruses When-2 and When-850,51 unknown until now to occur in Australian P. monodon. A couple of long transcripts of suspected viral origin and expressed across multiple tissue types were also identified. One of these possessed significant BLASTx homology to the reverse transcriptase (RT)-like component of hypothetical protein 1 of Beihai picorna-like virus 116 discovered recently in blue swimmer crabs (Portunus pelagicus)51. The other possessed substantial homology to the RT component of the Mo-MuLV Pr180 polyprotein and was expressed across all tissue types except the lymphoid organ, suggesting it to be from a mobile element such as a poly(A)-type retrotransposon or retrovirus52. However, determining whether these transcripts containing RT sequences are viral in origin, or represent the products of endogenous retrotransposons like others now being reported in shrimp53 will require further investigation, as will the nature of the strains, host and distribution ranges, prevalence and potential pathogenicity of the new viruses discovered in the transcriptome.
In conclusion, this study describes the assembly of a comprehensive and high quality transcriptome from nine different tissue types, and eight larval and post-larval early life-history stages of the black tiger shrimp, Penaeus monodon. It also summarizes the number and nature of specific transcript clusters differentially expressed in different tissue types and larval stages, and the Clusters were functionally annotated and mapped to 235 KEGG pathways. Unique transcript clusters and cluster groups were defined across distinct tissues and early life-history stages, providing initial evidence for their roles in specific tissue functions or developmental transitions. The current transcriptome provides a valuable resource for further investigation of directing gene-function studies to increase basic functional biology knowledge in shrimp and for investigating molecular basis of traits of relevance to the aquaculture of shrimp. While the current transcriptome already provides an improved resource for P. monodon, further effort is required using long-read sequencing data, such as provided by PacBio, to better resolve genes at isoform level. Lastly, this high-quality de novo assembly and data set are publically available and will hopefully support research projects that underpin transformational advances in how we culture shrimp globally.
Material and Methods
Sample taking and RNA extraction
Tissues of P. monodon broodstock were collected from multiple intermolt individuals, immediately snap frozen on dry ice, and stored at −80 °C until extraction (Table 1). All tissues except lymphoid organs were collected from wild broodstock caught off coastal waters near the border between the Northern Territory and Western Australia provided, which were provided by a commercial hatchery at Flying Fish Point, North Queensland, Australia. The prawns were kept at a salinity of 27–35 ppt, pH 7.8–8.0, 28.5–29.5 °C and 5 to 7 ppm dissolved oxygen. Lymphoid organ tissue was collected from wild prawns caught off the East Coast of Queensland. Larval and post-larval stages were collected from the same hatchery in pools of approximately 400 individuals per life stage, after four hours of starvation, and preserved in RNAlater (Thermo Fisher Scientific). All tissues and early life-history stages were sub-sampled in an RNase-free laboratory and total RNA was extracted using an RNeasy Universal extraction kit (QIAGEN) following manufacturer’s instructions. RNA quantity and quality was estimated using a Nanodrop UV spectrophotometer (Thermo Fisher Scientific), and purity was further assessed using an Agilent Bioanalyzer (Agilent Technologies). RNA was selected from individual sample replicates based on Nanodrop spectra, RNA concentration, and Agilent Bioanalyzer traces, in preference to using comparative tissues from the same individuals.
Illumina library preparation and sequencing
Library preparation and sequencing was carried out at the Australian Genome Research Facility (AGRF). Upon arrival at the sequencing facility, the quality of the samples was checked using a Bioanalyzer RNA 6000 nano reagent kit (Agilent) and libraries were prepared using the TruSeq Stranded mRNA Library Preparation Kit (Illumina) according to established protocols. Final libraries were again checked using Tapestation DNA 1000 TapeScreen Assay (Agilent). Cluster generation was performed on a cBot with HiSeq PE Cluster Kit v4 - cBot and sequencing was done on a HiSeq 2500 using a HiSeq SBS Kit. The Hiseq 2500 was operating with HiSeq Control Software v2.2.68 and base-calling was performed with RTA v22.214.171.124. Samples in the second sequencing run were pooled and split across two lanes to reduce sequencing bias (Table 1).
Sequence quality control, assembly and annotation
Raw sequence data was quality checked using FastQC54 v0.11.5, and assembled loosely following the Oyster River Protocol for Transcriptome Assembly55. In brief, all sequences were collectively error-corrected using RCorrector56 V3. Samples were then assembled in Trinity57 V2.3.2; grouped by individual shrimps, i.e. all tissues from a specific shrimp were assembled together. Reads were trimmed harshly for adapters and softly for Phred score <2 using Trimmomatic58 V0.32; and then normalized in silico within Trinity. The normalized forward and reverse reads produced by Trinity were then used in BinPacker59 V1.0, IDBA-Tran60 V 1.1.1 using K20, K30, K40, K50 and K60; and Bridger61 version 2014-12-01. All resulting transcriptomes were concatenated and merged using Evidential Gene62, followed by clustering using Transfuse V0.5.0 (https://github.com/cboursnell/transfuse) using a similarity value of 0.98. Lastly, contigs <300 bp were removed to produce the final transcriptome. The quality of the final assembly was assessed using TransRate63 V1.0.1, and BUSCO64 V2 using the arthropoda_odb9 database65. Sequences were annotated in Blast2Go66 using the SWISS-PROT database67 (accessed 17/03/2017), and separately using the arthropod and viral subsections of the GenBank nr database (accessed 06/06/2017).
Identification of long non-coding RNAs
FEELnc45 was used for the identification of long non-coding RNAs. The coding transcripts training set was constructed from the 1,047 complete universal single copy orthologous genes found with BUSCO v2.0 (database arthropoda_odb965). The mode “shuffle” was used to generate a training set of lncRNA from the debris of the known coding RNA transcripts.
Mapping and differential gene expression analysis
Before mapping, error-corrected raw sequence reads were trimmed using the same parameters as before, but without palindrome trimming used by Trinity. Sequence reads were mapped using Bowtie268 V2.2.8, and read counts were calculated using Corset69 V1.0.6. Differential gene expression was analyzed using DESeq270 V1.16.1 in RStudio71 V1.0.143 running R72 V3.4.1.
To reduce the number of sequences for KEGG analysis73,74,75, the longest contig per cluster was chosen from the combined tissue type and early life-history stage data. The KEGG Automatic Annotation Server (KAAS, http://www.genome.jp/tools/kaas/) was used to generate KEGG pathway maps for each contig using BLAST with the single-directional best hit (SBH) method. All scripts can be found on GitHub at https://github.com/R-Huerlimann/Pmono_multitissue_transcriptome.
For data analysis, the top 2,000 variably expressed genes across the nine tissue types and the top 500 variably expressed genes across the four larval and four post-larval stages were visualized in a principal component analysis and heatmap using variance-stabilizing transformed read-count data from DESeq. 2. The gene level dendrograms in the heatmap were created using Pearson’s correlation for both the tissue type larval/post-larval stages. Euclidean distance was used to cluster tissue types. All statistical analyses were performed in RStudio. More detailed information on the analyses can be found on GitHub.
This study has been carried out abiding by all necessary Queensland Government legislation and James Cook University policies.
Availability of Data and Material
Raw short read data and transcriptome assembly are available on NCBI under the following accession numbers: BioProject: PRJNA421400, BioSamples: SAMN08741487-SAMN08741521, SRA: SRP127068 (RR6868116-SRR6868172), TRA: GGLH00000000. Bioinformatics scripts are available on GitHub at https://github.com/R-Huerlimann/Pmono_multitissue_transcriptome.
FAO. Fisheries and Aquaculture topics. The State of World Fisheries and Aquaculture (SOFIA) (Food and Agriculture Organization United Nations, 2016).
Gjedrem, T., Robinson, N. & Rye, M. The importance of selective breeding in aquaculture to meet future demands for animal protein: a review. Aquaculture350, 117–129 (2012).
Jones, D. B. et al. A comparative integrated gene-based linkage and locus ordering by linkage disequilibrium map for the Pacific white shrimp, Litopenaeus vannamei. Scientific Reports7 (2017).
Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews genetics10, 57–63 (2009).
Saha, S. et al. Using the transcriptome to annotate the genome. Nature biotechnology20, 508–512 (2002).
Yu, Y. et al. SNP discovery in the transcriptome of white Pacific shrimp Litopenaeus vannamei by next generation sequencing. PLoS One9, e87218 (2014).
Song, L., Shankar, D. S. & Florea, L. Rascaf: Improving Genome Assembly with RNA Sequencing Data. The Plant Genome, https://doi.org/10.3835/plantgenome2016.03.0027 (2016).
Nguyen, C. et al. De novo assembly and transcriptome characterization of major growth-related genes in various tissues of Penaeus monodon. Aquaculture464, 545–553 (2016).
Rotllant, G. et al. Identification of genes involved in reproduction and lipid pathway metabolism in wild and domesticated shrimps. Marine genomics22, 55–61 (2015).
Uengwetwanit, T. et al. Transcriptome-based discovery of pathways and genes related to reproduction of the black tiger shrimp (Penaeus monodon). Marine Genomics (2017).
Soonthornchai, W. et al. Differentially expressed transcripts in stomach of Penaeus monodon in response to AHPND infection. Developmental & Comparative Immunology65, 53–63 (2016).
Chen, K. et al. Transcriptome and molecular pathway analysis of the hepatopancreas in the Pacific White Shrimp Litopenaeus vannamei under chronic low-salinity stress. PLoS One10, e0131503 (2015).
Li, C. et al. Analysis of Litopenaeus vannamei transcriptome using the next-generation DNA sequencing technique. PloS one7, e47442 (2012).
Chen, X. et al. Transcriptome analysis of Litopenaeus vannamei in response to white spot syndrome virus infection. PLoS One8, e73218 (2013).
Ghaffari, N. et al. Novel transcriptome assembly and improved annotation of the whiteleg shrimp (Litopenaeus vannamei), a dominant crustacean in global seafood mariculture. Scientific reports4, 7081 (2014).
Guo, H. et al. Trascriptome analysis of the Pacific white shrimp Litopenaeus vannamei exposed to nitrite by RNA-seq. Fish & shellfish immunology35, 2008–2016 (2013).
Hu, D., Pan, L., Zhao, Q. & Ren, Q. Transcriptomic response to low salinity stress in gills of the Pacific white shrimp, Litopenaeus vannamei. Marine genomics24, 297–304 (2015).
Lu, X. et al. Transcriptome analysis of the hepatopancreas in the Pacific white shrimp (Litopenaeus vannamei) under acute ammonia stress. PloS one11, e0164396 (2016).
Sookruksawong, S., Sun, F., Liu, Z. & Tassanakajon, A. RNA-Seq analysis reveals genes associated with resistance to Taura syndrome virus (TSV) in the Pacific white shrimp Litopenaeus vannamei. Developmental & Comparative Immunology41, 523–533 (2013).
Wei, J., Zhang, X., Yu, Y., Li, F. & Xiang, J. RNA-Seq reveals the dynamic and diverse features of digestive enzymes during early development of Pacific white shrimp Litopenaeus vannamei. Comparative Biochemistry and Physiology Part D: Genomics and Proteomics11, 37–44 (2014).
Xue, S. et al. Sequencing and de novo analysis of the hemocytes transcriptome in Litopenaeus vannamei response to white spot syndrome virus infection. PLoS One8, e76718 (2013).
Zeng, D. et al. Transcriptome analysis of Pacific white shrimp (Litopenaeus vannamei) hepatopancreas in response to Taura syndrome Virus (TSV) experimental infection. PloS one8, e57515 (2013).
Zhang, D., Wang, F., Dong, S. & Lu, Y. De novo assembly and transcriptome analysis of osmoregulation in Litopenaeus vannamei under three cultivated conditions with different salinities. Gene578, 185–193 (2016).
Powell, D., Knibb, W., Remilton, C. & Elizur, A. De-novo transcriptome analysis of the banana shrimp (Fenneropenaeus merguiensis) and identification of genes associated with reproduction and development. Marine genomics22, 71–78 (2015).
Sellars, M. J., Trewin, C., McWilliam, S. M., Glaves, R. & Hertzler, P. L. Transcriptome profiles of Penaeus (Marsupenaeus) japonicus animal and vegetal half-embryos: identification of sex determination, germ line, mesoderm, and other developmental genes. Marine Biotechnology17, 252–265 (2015).
Powell, D., Knibb, W. & Elizur, A. In Proceedings of the 24th Plant and Animal Genome Conference. (Plant and Animal Genome (PAG) Conference).
Powell, D., Knibb, W., Nguyen, N. H. & Elizur, A. Transcriptional profiling of banana shrimp Fenneropenaeus merguiensis with differing levels of viral load. Integrative and comparative biology56, 1131–1143 (2016).
Wang, W. et al. Gill transcriptomes reveal involvement of cytoskeleton remodeling and immune defense in ammonia stress response in the banana shrimp Fenneropenaeus merguiensis. Fish & shellfish immunology71, 319–328 (2017).
Li, S., Zhang, X., Sun, Z., Li, F. & Xiang, J. Transcriptome analysis on Chinese shrimp Fenneropenaeus chinensis during WSSV acute infection. PloS one8, e58627 (2013).
Shi, X. et al. Transcriptome analysis of ‘Huanghai No. 2′ Fenneropenaeus chinensis response to WSSV using RNA-seq. Fish & Shellfish Immunology75, 132–138, https://doi.org/10.1016/j.fsi.2018.01.045 (2018).
Baranski, M. et al. The development of a high density linkage map for black tiger shrimp (Penaeus monodon) based on cSNPs. PLoS One9, e85413 (2014).
Homola, E. & Chang, E. S. Methyl farnesoate: crustacean juvenile hormone in search of functions. Comparative Biochemistry and Physiology Part B: Biochemistry and Molecular Biology117, 347–356 (1997).
Michal, G. & Schomburg, D. Biochemical pathways: an atlas of biochemistry and molecular biology. (Wiley New York, 1999).
Jiménez, G., González-Reyes, A. & Casanova, J. Cell surface proteins Nasrat and Polehole stabilize the Torso-like extracellular determinant in Drosophila oogenesis. Genes & development16, 913–918 (2002).
Webster, S. G., Keller, R. & Dircksen, H. The CHH-superfamily of multifunctional peptide hormones controlling crustacean metabolism, osmoregulation, moulting, and reproduction. General and comparative endocrinology175, 217–233 (2012).
Morisalo, D. & Anderson, K. V. Signaling pathways that establish the dorsal-ventral pattern of the Drosophila embryo. Annual review of genetics29, 371–399 (1995).
Chew, G.-L. et al. Ribosome profiling reveals resemblance between long non-coding RNAs and 5′ leaders of coding RNAs. Development140, 2828–2834 (2013).
Derrien, T. et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome research22, 1775–1789 (2012).
Guttman, M., Russell, P., Ingolia, N. T., Weissman, J. S. & Lander, E. S. Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins. Cell154, 240–251 (2013).
Engreitz, J. M., Ollikainen, N. & Guttman, M. Long non-coding RNAs: spatial amplifiers that control nuclear structure and gene expression. Nature Reviews Molecular Cell Biology17, 756 (2016).
Brockdorff, N. et al. The product of the mouse Xist gene is a 15 kb inactive X-specific transcript containing no conserved ORF and located in the nucleus. Cell71, 515–526 (1992).
Koerner, M. V., Pauler, F. M., Huang, R. & Barlow, D. P. The function of non-coding RNAs in genomic imprinting. Development136, 1771–1783 (2009).
Leucci, E. et al. Melanoma addiction to the long non-coding RNA SAMMSON. Nature531, 518 (2016).
Prensner, J. R. & Chinnaiyan, A. M. The emergence of lncRNAs in cancer biology. Cancer discovery1, 391–407 (2011).
Wucher, V. et al. FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome. Nucleic acids research45, e57–e57 (2017).
Cowley, J. A., Dimmock, C. M., Spann, K. M. & Walker, P. J. Gill-associated virus of Penaeus monodon prawns: an invertebrate virus with ORF1a and ORF1b genes related to arteri-and coronaviruses. Journal of General Virology81, 1473–1484 (2000).
Cowley, J. A. et al. Tactical Research Fund: Aquatic Animal Health Subprogram: Viral presence, prevalence and disease management in wild populations of the Australian Black Tiger prawn (Penaeus monodon). (FRDC, 2015).
Mohr, P. G. et al. New yellow head virus genotype (YHV7) in giant tiger shrimp Penaeus monodon indigenous to northern Australia. Diseases of aquatic organisms115, 263–268 (2015).
Owens, L., La Fauce, K. & Claydon, K. The effect of Penaeus merguiensis densovirus on Penaeus merguiensis production in Queensland, Australia. Journal of fish diseases34, 509–515 (2011).
Li, C.-X. et al. Unprecedented genomic diversity of RNA viruses in arthropods reveals the ancestry of negative-sense RNA viruses. Elife4 (2015).
Shi, M. et al. Redefining the invertebrate RNA virosphere. Nature540, 539 (2016).
Shen, C.-H. & Steiner, L. A. Genome structure and thymic expression of an endogenous retrovirus in zebrafish. Journal of virology78, 899–911 (2004).
Sakaew, W., Pratoomthai, B., Pongtippatee, P., Flegel, T. W. & Withyachumnarnkul, B. Discovery and partial characterization of a non-LTR retrotransposon that may be associated with abdominal segment deformity disease (ASDD) in the whiteleg shrimp Penaeus (Litopenaeus) vannamei. BMC veterinary research9, 189 (2013).
Andrews, S. FastQC: a quality control tool for high throughput sequence data, http://www.bioinformatics.babraham.ac.uk/projects/fastqc (2010).
MacManes, M. D. Establishing evidenced-based best practice for the de novo assembly and evaluation of transcriptomes from non-model organisms. bioRxiv, 035642 (2016).
Song, L. & Florea, L. Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads. GigaScience4, 1 (2015).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature biotechnology29, 644–652 (2011).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, btu170 (2014).
Liu, J. et al. BinPacker: Packing-Based De Novo Transcriptome Assembly from RNA-seq Data. PLoS Comput Biol12, e1004772 (2016).
Peng, Y. et al. IDBA-tran: a more robust de novo de Bruijn graph assembler for transcriptomes with uneven expression levels. Bioinformatics29, i326–i334 (2013).
Chang, Z. et al. Bridger: a new framework for de novo transcriptome assembly using RNA-seq data. Genome biology16, 1 (2015).
Gilbert, D. EvidentialGene: tr2aacds, mRNA Transcript Assembly Software, http://arthropods.eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html (2013).
Smith-Unna, R., Boursnell, C., Patro, R., Hibberd, J. & Kelly, S. TransRate: reference free quality assessment of de novo transcriptome assemblies. Genome research, gr.196469, 196115 (2016).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, btv351 (2015).
Zdobnov, E. M. et al. OrthoDBv9. 1: cataloging evolutionary and functional annotations for animal, fungal, plant, archaeal, bacterial and viral orthologs. Nucleic acids research45, D744–D749 (2016).
Conesa, A. et al. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics21, 3674–3676 (2005).
Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic acids research31, 365–370 (2003).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature methods9, 357 (2012).
Davidson, N. M. & Oshlack, A. Corset: enabling differential gene expression analysis for de novo assembled transcriptomes. Genome biology15, 1 (2014).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq. 2. Genome biology15, 550 (2014).
Racine, J. S. RStudio: A Platform‐Independent IDE for R and Sweave. Journal of Applied Econometrics27, 167–172 (2012).
R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org.
Kanehisa, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res.45, D353–D361 (2017).
Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res.44, D457–D462 (2016).
Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res.28, 27–30 (2000).
We thank Andrew Foote, Gopala Krishna, Sarah Berry and Tansyn Noble for assistance in organizing and collecting tissue samples. Research Funding for this project was from the Australian Research Council Industrial Transformation Research Program IH130200013. This work was funded by the Australian Research Council (ARC) Industrial Transformation Research Hub scheme, awarded to James Cook University and in collaboration with the Commonwealth Scientific Industrial Research Organisation (CSIRO), the Australian Genome Research Facility (AGRF), the University of Sydney and Seafarms Group Pty Ltd.
The authors declare no competing interests.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this Article was revised: The original version of this Article contained errors in the Results and Discussion sections. Full information regarding the corrections made can be found in the correction for this Article.
About this article
Cite this article
Huerlimann, R., Wade, N.M., Gordon, L. et al. De novo assembly, characterization, functional annotation and expression patterns of the black tiger shrimp (Penaeus monodon) transcriptome. Sci Rep 8, 13553 (2018). https://doi.org/10.1038/s41598-018-31148-4
This article is cited by
De novo transcriptome reconstruction in aquacultured early life stages of the cephalopod Octopus vulgaris
Scientific Data (2022)
Reviews in Fish Biology and Fisheries (2022)
The Hedgehog pathway in penaeid shrimp: developmental expression and evolution of splice junctions in Pancrustacea
Fine-scale population structure and evidence for local adaptation in Australian giant black tiger shrimp (Penaeus monodon) using SNP analysis
BMC Genomics (2020)
Transcriptome analysis of air-breathing land slug, Incilaria fruhstorferi reveals functional insights into growth, immunity, and reproduction
BMC Genomics (2019)