We present a new transcriptome assembly of the Pacific whiteleg shrimp (Litopenaeus vannamei), the species most farmed for human consumption. Its functional annotation, a substantial improvement over previous ones, is provided freely. RNA-Seq with Illumina HiSeq technology was used to analyze samples extracted from shrimp abdominal muscle, hepatopancreas, gills and pleopods. We used the Trinity and Trinotate software suites for transcriptome assembly and annotation, respectively. The quality of this assembly and the affiliated targeted homology searches greatly enrich the curated transcripts currently available in public databases for this species. Comparison with the model arthropod Daphnia allows some insights into defining characteristics of decapod crustaceans. This large-scale gene discovery gives the broadest depth yet to the annotated transcriptome of this important species and should be of value to ongoing genomics and immunogenetic resistance studies in this shrimp of paramount global economic importance.
Litopenaeus vannamei is a prawn native to the eastern Pacific Ocean from Sonoran Mexico to northern Peru where it has long been caught by inshore fisherman and offshore trawlers. Commonly called both the whiteleg shrimp and Pacific white shrimp, this species was first aquacultured in Florida in 1973 using a mated individual from Panama. By the early 1980's it was being cultured in Hawaii, the contiguous United States, Central and South America, and its mariculture spread to Asia in the 2000's1. It is now the dominant crustacean species grown in aquaculture worldwide, its use having surpassed that of the giant tiger prawn Penaeus monodon2. The United Nations Food and Agriculture Organization lists global production of this species growing from 146,382 tons in 2000 to 2,721,929 tons (>$11 billion US) in 2010, making the production of this species one of the fastest growing global food crops.
L. vannamei's dominant role in shellfish aquaculture has generated demand for genetic tools in this species that can be applied to nutrition, reproduction, development, and resistance to infectious disease. There have been several RNA-Seq and transcriptome reports for L. vannamei, including next-generation sequencing datasets3, studies of transcriptomic response to pollutant exposure4, different tissues5, and infection6,7,8,9,10,11. However, this report is the most extensive gene discovery work in the species to date and the first to make extensive functional annotation publicly available to the scientific community.
In this study, we used Illumina Hiseq for RNA-Seq experiments in hepatopancreas, gill, pleopod and abdominal muscle of one male shrimp to produce a robust pooled transcriptome assembly for L. vannamei using the Trinity software package. The output contigs were then functionally annotated employing Trinotate. These RNA-Seq reads have been availed to the public via these BioSample accessions: SAMN02918336, SAMN02918337, SAMN02918338, SAMN02918339 at NCBI, and our results are available at: http://repository.tamu.edu/handle/1969.1/152151. We demonstrate the utility of this resource with verification of immune transcripts, identification of novel shrimp genes, and metabolic pathway analysis.
High quality de novo assembly of L. vannamei transcriptome
A total of 400,228,040 Illuimina HiSeq reads from hepatopancreas, gill, pleopod and abdominal muscle tissues were generated. After trimming the adapters, 399,056,712 (Table 1) reads were assembled with Trinity resulting in 110,474 contigs (Supplemental Data 1) with an N50 of 2,701 bases (Table 2). The HiSeq platform, the diversity of tissues sampled and the highly efficient Trinity pipeline allowed this significant improvement over recent transcriptomics efforts yielding only N50 values < 10003,4,6.
Functional annotation of shrimp transcriptome
Following the transcriptome assembly, we annotated the contigs by the Trinotate pipeline. The Trinotate software suite automates the functional annotation of the transcriptome.
Our annotation report from the Trinotate pipeline is presented as a document with the following column headers: 1) gene_id, 2) transcript_id, 3) Top_BLASTX_hit, 4) RNAMMER, 5) prot_id, 6) prot_coords, 7) Top_BLASTP_hit, 8) Pfam, 9) SignalP, 10) TmHMM, 11) eggNOG, 12) gene_ontology, and 13) prot_seq. The first two columns are: “gene_id” and “transcript_id”, representing predicted genes and their corresponding transcripts, respectively. The columns “Top_BLASTX_hit” and “Top_BLASTP_hit” show the top BlastX and BlastP hit results of homology searches against the NCBI database. BlastX is one of the latest additions to the Trinotate annotation pipeline and compares all six open reading frames (ORF) of the query sequences against the protein database. The RNAMMER column shows information about predicted ribosomal RNA genes discovered in the transcriptome assembly that were predicted by hidden Markov models (HMM). The prot_id, prot_coords and prot_seq columns provide the ID, location and translation of the longest ORFs, respectively. The Pfam column represents the HMMER/PFAM protein domain identification search results. HMMER is used to search databases for homologs of proteins, employing hidden Markov models. The SignalP column shows the presence and location of predicted signal peptides. Similarly, the TmHMM column presents the predicted transmembrane regions. The eggNOG (Evolutionary genealogy of genes: Non-supervised Orthologous Groups) column has the search result of the database of orthologous groups of genes, which are further annotated with functional description lines. Lastly, the gene_ontology column shows the relationship of these shrimp data to the Gene Ontology (GO) terms that aim to unify the representation of genes and gene products across all species.
Using the Trinotate pipeline, a total of 165,922 annotations were determined for our Trinity assembled contigs. We have designated the last eleven columns of Trinotate output detailed above (Top_BlastX_hit through prot-seq) as Annotation Holding Output Columns (AHOC). Considering that more than 165 K annotations for our contigs are too numerous for careful examination, we have created the following three filters: Filter A selects rows that have at least 10 AHOC, Filter B selects rows that have at least 9 AHOC, and Filter C selects rows that have at least 8 AHOC. Applying these filters to our data, Filter A results in selecting 590 gene IDs, Filter B 4,843 gene IDs, and finally Filter C 21,323 gene IDs. This method ensures that predicted genes/transcripts with maximum annotation information can be selected for targeted manual curating. Supplemental Data 2 represents the results of these filtrations graphically, and the filtered datasets are available in Supplemental Data 3.
In comparing these shrimp translated transcriptome contigs to those of other animals we were also curious as to know what proteins in NCBI's non-redundant database and from which species were most highly represented in BlastX hits of these assembled contigs (Supplemental Data 4). Drosophila myosin had the most (107) hits, and Drosophila proteins dominated these BlastX results. Many of the proteins with high hits could be considered “housekeeping” gene products, but some tissue-specific proteins were in the top twenty, such as Downs syndrome cell adhesion molecule (Dscam)12. Dscam receptors are diversified through mutually exclusive alternative exon splicing13, have roles in self-recognition in immunity and neural development14, and have been characterized in this species of shrimp15 as well as other crustaceans16.
Immune gene survey
To test the depth and accuracy of this annotated transcriptome, we searched in our dataset for the immune-related transcripts discovered by Sookruksawong et al.9. These transcripts were particularly interesting to us because they were representing differentially expressed immune-related genes between shrimp lines resistant and susceptible to Taura syndrome virus (TSV) and may also represent genes important in shrimp resistance to current scourges such as early mortality syndrome (EMS), also known as acute hepatopancreatic necrosis syndrome (AHPNS)17.
We used BlastX to find similarity between our 110,474 contigs as queries and the proteins identified in the TSV transcriptomic study9. There were 4,493 hits with an e-value < 1E-4 among our 110,474 contigs, and 3,088 hits with an e-value < 1E-10. In our subsequent analysis, we used the 3,088 hits resulting from this more restrictive (<1E-10) e-value filtering. Supplemental Data 5 shows the complete list of these proteins with additional information for each protein. In the first post-processing, we selected the top 50 BlastX hits with the more stringent e-value filtering and completed their annotation information by adding corresponding NCBI information to their records. Additionally, we selected the most frequently appearing (>40 hits) of these previously identified immune related proteins among our top BlastX hits and showed their representation as a pie chart. Figure 1 shows these proportions among our 3,088 BlastX results for all contigs. The highest three hits are zinc-finger 658b-like and serine/threonine phosphatase ankyrin repeat-like both from the purple sea urchin Strongylocentrus purpatus and zinc-finger BTB domain from the bee Bombus impatiens. This initial comparison survey will springboard studies of other immune gene families18,19.
Arthropod conservation and new decapod crustacean genes
We also categorized the BlastX results to understand the distribution and frequency of the species that appear in the homology search between our contigs and the NCBI database. Figure 2 depicts all of the hits that had at least twenty BlastX hits and their frequency of appearing. Drosophila melanogaster is greatly represented (20 of the 59 top BlastX hits) as the model arthropod that has received most intensive study for decades and we suspect has the best coverage in the bioinformatic databases of tissue and developmental stage specific expression of any arthropod. Well-studied mammals are interspersed with invertebrate hits (13 human, 8 mouse, 3 rat, 3 Xenopus, and 1 zebrafish in the top 59 that hit 20 or more times).
Furthermore, we used BLAST to search similarities between our Trinity assembled contigs and the Daphnia pulex genome. The Joint Genome Institute annotation of v1.0 “Frozen Gene Catalog” was employed, which has all manual curations, as well as automatically annotated models chosen from the FilteredModels v1.1 set. In the first search, we used the BlastN tool to search for sequence similarities between our contigs and the D. pulex transcripts and CDs. We found 5,668 contigs with blast hits that had e-values < 1E-4, and 2,610 with hits scores as e-value < 1E-10 (Table 3). Table 4 shows the BlastN search results of the contigs against the D. pulex transcripts and CDs, for the latter case. The results are sorted by the most abundant hits, representing the top twenty genes, which had annotation information available at wFleaDatabase. It provides the gene IDs assigned by wFleaDatabase, as well as NCBI IDs. The complete 960 hits are available as Supplemental Data 6.
In order to find the proteins corresponding to our contigs, we employed BlastX to examine all six frames. The D. pulex protein database was used. The BlastX search resulted in 30,534 hits with e-value < 1E-4 and 26,224 hits with e-value < 1E-10. The results are represented in Table 5 for the twenty most frequents BlastX hits, with e-value < 1E-10. For each entry, the corresponding UniProt link is provided for inquiring further information.
KEGG and GO pathway analyses
We also analyzed our contigs with the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database20. Figure 3 shows an example of this analysis denoting the transcripts identified by our study involved in DNA replication. The protein products of several of these transcripts are required for White Spot Syndrome Virus (WSSV) replication21 and offer potential targets for interference with the pathogen's replication cycle22. An additional 337 pathways with shrimp orthologs represented in our transcriptome highlighted are available from the authors.
BLAST2GO (Figure 4) identified D. pulex as the top-hit species to this L. vannamei transcriptome by Blast analysis, followed by Tribolium castaneum and in third place Pediculus humanus. Chen et al6 identified the latter species as the one with most Blast hits of the White Spot Syndrome response transcriptome. Our comparison seems more consistent with the evolutionary relationship between Daphnia and Penaeid crustaceans.
At the level of molecular function, binding molecules were found to be most abundant among the transcriptome (53%) followed by proteins with catalytic activity (28%). Antioxidant responses are invoked as a key component of the shrimp immune response, and in this reference transcriptome appeared in less than one percent of the transcripts. However, it is known that they are more abundant upon an immune challenge6.
Invertebrate shrimp do not have a classical adaptive immune system based upon antibodies, T cells and specific memory of antigen23,24. Although our definitions of immunity beyond the jawed vertebrates are shifting25, there is so far only a consensus of evidence of the more ancient innate system in shrimp26. Thus, we cannot vaccinate shrimp for protection against WSSV or EMS27, making understanding of the total expressed transcriptome, genetic variability in the innate immune genes and regulatory regions of this invertebrate species all the more important. This improved transcriptome for L. vannamei is a step towards that goal.
L. vannamei is the most farmed of many important decapod crustacean food species including other shrimp and prawns, crayfish, lobster and crabs. Intensive aquaculture methods continue to improve as the global demand for shellfish increases. Thus, the application of this work in the whiteleg shrimp transcriptome will be applied back into the production organism in the aquaculture facility, and will pave the way for similar immunogenomic enhancement in other species28. For example the congener Litopenaeus setiferus of the northwest Atlantic and Gulf of Mexico is sometimes still cultured and is also susceptible to white spot syndrome virus, findings in L. vannamei will likely be readily adaptable to this sister species29,30. On even broader aquaculture industry supply and biosphere “One Health” levels, the improvements to this decapod crustacean transcriptome (and eventually other –omics) could have relevance for studies of krill. These small shrimp-like crustaceans are increasingly harvested for fish-meal and are often cited as the keystone species of the polar ecosystems31. They are threatened by climate change, and krill decline due to diatom losses in 1998 were linked to plummeting shearwater populations and the lack of that year's salmon spawn32.
Also of note is the potential ability this robust transcriptome affords in addressing the question of what it is that makes decapod crustaceans unique. Our comparison of the L. vannamei transcriptome to D. pulex also reveals the decapod shrimp transcripts that do not match any transcript from the branchiopod Daphnia. Besides the importance of these animals to fisheries, the order Decapoda provides crucial scavenging of aquatic ecosystems. Comparative transcriptomics can now address the developmental genetics that have allowed these crustaceans to diversify and dominate this ecological niche. In addition, this shrimp transcriptome can in turn serve as a point of genetic contrast for other more derived groups such as Brachyuran crabs that may have undergone carcinisation several independent times during their natural history by tagmosis of segments, anterial-posterial compression and ventral fusion of the abdomen33.
Several lines of inquiry have suggested that arthropods are capable of mounting specific immune responses subsequent to immune priming34,35,36,37,38. The gene that has drawn the most attention for possibly conferring adaptive immune capabilities in these invertebrates is DSCAM. Shrimp DSCAM employs alternative splicing to greatly diversify the expressed DSCAM repertoire, using many more possible introns for the diverse portion of the third immunoglobulin domain (79 in L. vannamei and P. monodon, versus 48 in D. melanogaster, 24 in Daphnia magna and 39 in Eriocheir sinensis)39. DSCAM transcripts with diverse immunoglobulin domains, transmembrane regions, and cytoplasmic tails are represented in the transcriptome described here (e.g. comp873 and comp34255, Supplemental Data 1).
Germane to analyses of taxa specific genes and particularly immune related genes are the assembly of transcripts encoded by multigene families employing only short NGS data de novo without a genome. The lack of introns is problematic for the more difficult to resolve multigene families. For example, a newly birthed retrogene lacking an intron could be incorrectly assembled with its progenitor gene using our NGS dataset, a confusion that would be easily avoided with genomic data. This is both a cautionary warning against over-extrapolating these data and also mandates genomic tools in this species.
With a genome of 2.5 gigabases spread over 44 pairs of chromosomes40, genomic resources for L. vannamei are still lacking. This annotated transcriptome should help to accelerate functional genomics in this species and perhaps other commercially and environmentally important decapod crustaceans while full genome sequencing projects are in progress. As food produced from global aquaculture now exceeds beef globally, shrimp is primed to play a major role in the “blue revolution” if the genetics of nutrition, growth and immunity can quickly catch up to the decades of study species such as cattle and poultry have enjoyed41.
Total RNA was prepared from one male Pacific white leg shrimp (L. vannamei) collected near Bahia Kino, Sonora, Mexico. Four tissues were targeted for distinct RNA-Seq experiments: hepatopancreas, gills, abdominal (tail) muscle, and pleopods. These four tissues allow transcriptomic profiling of cells engaged in many physiological processes in this arthropod including production of digestive enzymes, absorption of food, respiration, long-sarcomere muscle contraction (unique to invertebrates), mucosal immunity and reproduction.
RNA isolation and Illumina sequencing
Total RNA was prepared from the four tissues using Trizol Reagent (Life Technologies) per the manufacturer's instructions. From total RNA, mRNA was purified using magnetic oligo(dT) beads, then fragmented using divalent cations under elevated temperature. cDNA was synthesized from the fragmented mRNA using Superscript II (Invitrogen), followed by second strand synthesis. cDNA fragment ends were repaired and phosphorylated using Klenow, T4 DNA Polymerase and T4 Polynucleotide Kinase. Next, an adenosine was added to the 3′ end of the blunted fragments, followed by ligation of Illumina adapters via T-A mediated ligation. The ligated products were size selected by AMPure XP Beads and then PCR amplified using Illumina primers. The library creation, quality determination using an Agilent Bioanalyzer, and analysis were performed by Ambry Genomics. The Illumina HiSeq2000 was employed for sequence analysis. The libraries were seeded onto the flowcell at 9pM per lane yielding approximately 745 K pass-filter clusters per mm2 tile area. The libraries were sequenced using 116 + 10 + 101 cycles of chemistry and imaging.
Initial data processing and base calling, including extraction of cluster intensities, was done using RTA 1.13.48 (HiSeq Control Software 1.5.15). Sequence quality filtering script was executed in the Illumina CASAVA software (version 1.8.2, Illumina, Hayward, CA). Data yield (in Mbases), percent pass-filter (%PF), number of reads, percent raw clusters per lane, and quality (%Q30) was examined in Demultiplex_Stats.htm files. Other quality metrics were assessed in All.htm and IVC.htm including IVC plots and visualizations of cluster intensity over the duration of the sequencing run. The percentage of Q30 bases that passed filtering was above 77 for all lanes of the Illumina HiSeq flowcell.
De novo assembly of the L. vannamei transcriptome was performed from the RNA-Seq data from the four pooled tissues with the trimming off adaptor sequences. The February 25, 2013 release of Trinity was used for assembly, Trinity Trans-decoder for coding sequence prediction, and Trinotate for functional annotation. The overall workflow is summarized graphically in Figure 5, and the transcriptome assembly in more detail in the next paragraph.
After initial quality control steps, any reads that had adapter sequences attached to them were discarded. For adapter trimming the Cutadapt software was used42. The adapter-free reads were used for transcriptome assembly using the Trinity software43. Trinity is a de novo algorithm developed specifically for reconstructing the transcriptome, using de Bruijn graphs. Transcriptome assembly is challenging mainly because RNA-Seq data coverage levels are not evenly distributed. Furthermore, alternative splicing complicates assembly from individual genes. The goal of the Trinity package is to deliver one graph per expressed gene. Trinity consists of three parts: 1) Inchworm, 2) Chrysalis, and 3) Butterfly. During these three steps, Trinity makes linear contigs from RNA-Seq reads, generates and expands de Bruijn graphs, and finally outputs the transcripts and isoforms. The process starts by decomposing the reads into small overlapping pieces called Kmers and extending them by coverage. Finding the common sections of the intermediate transcripts determines alternative splicing, and those transcripts are re-grouped. The de Bruijn graphs are generated by integration of isoforms that are similar except one base. And finally, by finding and expanding the common section of transcripts and representing the most compact path on the graph, Trinity delivers the fully assembled transcriptome. The outputs are saved in a FASTA format, which includes all the transcripts. These can be accessed here: http://repository.tamu.edu/handle/1969.1/152151.
KEGG pathway and gene ontology analysis
The KEGG Automatic Annotation Server (KAAS, http://www.genome.jp/kaas-bin/kaas_main?mode=est_b) was employed to map KEGG pathways of assigned shrimp orthologs20. KEGG orthology (KO) assignments were performed based on the bi-directional best hit (BHH) of BLAST44. Functional annotation of the gene ontology (GO) terms was done using the BLAST2GO program (http://www.BLAST2GO.org/)45.
This work was supported by grants from Texas A&M Agrilife and Texas Veterinary Medical Diagnostic Laboratory #124151 to CDJ, SVD, and MFC, Texas A&M and Mexican CONACyT to RRS, SVD and MFC, USDA Formula Animal Health to MFC and SVD, and Mexican National Institute of Fisheries (INAPESCA) to RRS, LGB, ERP, SVD and MFC. The generation of RNA-Seq raw data was outsourced to Ambry Genetics, Aliso Viejo CA. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575, to provide compute time on the Blacklight system at the Pittsburgh Supercomputing Center (PSC).