De novo transcriptome assembly and annotation for gene discovery in avocado, macadamia and mango

Avocado (Persea americana Mill.), macadamia (Macadamia integrifolia L.) and mango (Mangifera indica L.) are important subtropical tree species grown for their edible fruits and nuts. Despite their commercial and nutritional importance, genomic information for these species is largely lacking. Here we report the generation of avocado, macadamia and mango transcriptome assemblies from pooled leaf, stem, bud, root, floral and fruit/nut tissue. Using normalized cDNA libraries, we generated comprehensive RNA-Seq datasets from which we assembled 63420, 78871 and 82198 unigenes of avocado, macadamia and mango, respectively, using a combination of de novo transcriptome assembly and redundancy reduction. These unigenes were functionally annotated using the Basic Local Alignment Search Tool (BLAST) to query the Universal Protein Resource Knowledgebase (UniProtKB). A workflow encompassing RNA extraction, library preparation, transcriptome assembly, redundancy reduction, assembly validation and annotation is provided. This study provides avocado, macadamia and mango transcriptome and annotation data, which are valuable for gene discovery and gene expression profiling experiments as well as ongoing and future genome annotation and marker development applications.


Background & Summary
Fruits and nuts are an important source of vitamins and dietary fibre for consumers and a source of income for farmers. Avocado (Persea americana Mill.), macadamia (Macadamia integrifolia L.) and mango (Mangifera indica L.) are important commercial tree species grown in Australia and other tropical/sub-tropical regions. In 2013, the world production of avocado was about 4.7 million tonnes 1. Macadamia is grown commercially for its edible nuts in tropical and subtropical regions, including Australia, Hawaii, China, Thailand, southern and central Africa and Central and South America 2. Mangoes are produced commercially in at least 87 countries on an estimated area of 5 million hectares, with an annual production of over 35 million tonnes 3. Despite their commercial and nutritional importance, these tree crops are yet to benefit from the substantial research effort required to generate significant public bioinformatic resources. These resources are essential for functional genomics studies, marker-assisted breeding, cultivar development, and genome annotation efforts. Here, we report on the generation and public release of transcriptomic resources for avocado, macadamia and mango.
Currently, few genomic resources are available for avocado, mango and macadamia. Most of the publicly available de novo transcriptome assemblies of avocado and mango are limited to either leaf or fruit tissue 4-7. Only two studies have published open-access transcriptome assemblies from several tissues of avocado and mango, respectively 8-10. However, these assemblies were derived from RNA-Seq libraries that were not normalised and therefore lack some essential yet lowly expressed genes and near-universal single-copy genes (Supplementary Fig. 1). Additionally, the 'Keitt' mango transcriptome study 9 was designed for SNP discovery and did not produce a reference transcriptome for gene discovery purposes. A reference macadamia genome assembly with its accompanying reference gene set was recently published 11. However, this genome assembly comprises only 79% of the estimated macadamia genome size 11,12. A draft mango genome was published in 2016, although it is not yet publicly available 13. We believe that our de novo transcriptome assemblies derived from normalized RNA-Seq libraries are complementary to these resources as they accentuate rare/low abundance transcripts. In eukaryotes, the high abundance transcripts (several thousand mRNA copies per cell) from as few as 5-10 genes account for 20% of the cellular mRNA 14. Transcripts of intermediate abundance from 500-2000 genes constitute about 40-60% of the cellular mRNA. The remaining 20-40% of mRNA is represented by rare, low abundance (from one to several dozen mRNA copies per cell) transcripts 14. Such an enormous difference in transcript abundance compromises gene discovery, resulting in poor detection of genes transcribed at relatively low levels.
We therefore prepared comprehensive cDNA libraries from RNA pooled from a wide range of plant tissues (leaf, stem, axillary bud, root, flower and fruit/nut) to maximize the number of transcripts represented in each library. The essential part of the library preparation process was converting the pooled RNA into normalized cDNA using a duplex-specific nuclease (DSN) normalization protocol 15. This was done to avoid the dilution of transcripts from lowly expressed genes by those from highly expressed genes (Fig. 1) and therefore to improve gene discovery 16. The assemblies generated in this study can be utilized as reference gene sets for a variety of tree genomics studies requiring transcriptome information of Persea americana, Macadamia integrifolia, Mangifera indica and related species. For example, considering that Persea americana and Mangifera indica are both prone to alternate/biennial bearing 17,18, identification and subsequent manipulation of genes regulating floral induction may greatly contribute to solving this problem. Our transcriptome assemblies will also assist in mRNA-based genome annotation 19 for ongoing whole genome sequencing projects of macadamia and mango 11,13.

Methods
Sample collection. Tissue samples were collected from mature (7-15 year old) field-grown avocado cv. "Hass", mango cv. 1243, and macadamia cv. 751 trees in Queensland, Australia. Plant tissue sampled included young and mature leaves, dormant and bursting axillary and terminal buds, mature and elongating stems and roots, a mixture of floral tissues at different stages of development and a mixture of fruit tissue in the case of avocado and mango or nuts in the case of macadamia. Fresh material was flash frozen in liquid nitrogen or dry ice and stored at −80 °C before being homogenized using an automated tissue grinder (Geno/Grinder®, SPEX).
RNA extraction. RNA was extracted from the different samples using a CTAB/PVP/SDS method developed for these types of samples, as previously described 20. Briefly, frozen powder was lysed using a CTAB/PVP buffer supplemented with 1 mM DTT for 10-15 min. One percent SDS was then added to each sample before centrifugation for 15 min at 20,000 × g. The liquid phase containing the nucleic acids was collected and added to an equal volume of isopropanol before centrifugation (20,000 × g) for 45-60 min at 4 °C. The nucleic acid pellet was then washed with 70% ethanol and resuspended in water. DNase treatment was then applied for 25 min and RNA was precipitated in an equal volume of isopropanol to form a nucleic acid pellet. The pellets were washed in 70% ethanol and then resuspended in pure water. RNA concentration was measured using a NanoVue™ Plus Spectrophotometer (GE Healthcare Life Sciences, USA). RNA integrity was checked by agarose gel electrophoresis.
Normalised cDNA library preparation. One normalised cDNA library was prepared for each of avocado, macadamia and mango, from equal amounts of mRNA from the different tissue types mentioned above and as described in Fig. 1. Poly(A) RNA was isolated using oligo(dT) magnetic beads (Invitrogen™ Dynabeads™). Between 0.5 and 1 μg of the poly(A) RNA was converted into full-length-enriched double stranded cDNA using the Mint-2 cDNA synthesis kit, following the manufacturer's instructions (Evrogen, Moscow, Russia). The double stranded cDNA was then normalized using the DSN-based Trimmer-2 cDNA normalization kit, following the manufacturer's instructions (Evrogen, Moscow, Russia). The normalized cDNA libraries were then sheared into ~300 bp fragments with a sonicator (Bioruptor®, Diagenode) and indexed with adaptors using the NEBNext® DNA Library Prep Master Mix Set for Illumina®. Four technical replicates of each of the three normalized cDNA libraries were sequenced on the Illumina NextSeq 500 platform (Fig. 1) with the primary objective of enhancing de novo gene discovery.
De novo assembly and dataset annotation. High-quality RNA-Seq reads (sequences) were used in the subsequent de novo transcriptome assembly. Raw RNA-Seq reads were pre-processed by removing adapters and low-quality sequences (<Q30) using Trimmomatic (v. 0.35) with default parameters 21. Sequencing summary statistics showing the total number of reads before and after trimming and quality filtering are presented in Table 1. RNA-Seq read quality before and after trimming was assessed using FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and aggregated using MultiQC 22; read quality after trimming is presented in Fig. 2. De novo transcriptome assembly was done with Trinity (v. 2.7.0) using default settings 23,24. Coding regions of the assembled transcripts were predicted using TransDecoder (v. 5.5.0) with default settings 24. We selected the single best open reading frame (ORF) per transcript, requiring a predicted peptide longer than 100 amino acids. We then used the CD-HIT-EST program (v. 4.8.1) with default parameters (similarity 95%) to reduce transcript redundancy and produce unique genes ("unigenes") 25. We used the Basic Local Alignment Search Tool (BLAST) to assign functional annotations to the unigenes 26,27.
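The order of the steps described above can be sketched as a small Python helper that only builds the command lines and does not run them. This is an illustrative sketch, not the exact commands used in this study: paired-end input, the Trimmomatic clipping and sliding-window settings, the thread and memory values, and all file names are assumptions, and the function name build_pipeline_commands is hypothetical. Only flags documented by the respective tools are used.

```python
from pathlib import Path


def build_pipeline_commands(left, right, out_dir, threads=8,
                            max_memory="50G", adapters="adapters.fa"):
    """Build (but do not run) one argument list per pipeline step."""
    out = Path(out_dir)
    # Hypothetical file names chosen for illustration.
    trim_l, trim_r = out / "R1.trimmed.fq.gz", out / "R2.trimmed.fq.gz"
    unp_l, unp_r = out / "R1.unpaired.fq.gz", out / "R2.unpaired.fq.gz"
    trinity_dir = out / "trinity_out"
    # Assumed output names of Trinity and TransDecoder.Predict.
    transcripts = trinity_dir / "Trinity.fasta"
    cds = Path("Trinity.fasta.transdecoder.cds")
    unigenes = out / "unigenes.fasta"

    return {
        # Adapter and quality trimming; the Q30 window is an assumed setting.
        "trim": ["trimmomatic", "PE", "-threads", str(threads),
                 str(left), str(right),
                 str(trim_l), str(unp_l), str(trim_r), str(unp_r),
                 f"ILLUMINACLIP:{adapters}:2:30:10",
                 "SLIDINGWINDOW:4:30", "MINLEN:36"],
        # De novo assembly of the trimmed read pairs.
        "assemble": ["Trinity", "--seqType", "fq",
                     "--left", str(trim_l), "--right", str(trim_r),
                     "--CPU", str(threads), "--max_memory", max_memory,
                     "--output", str(trinity_dir)],
        # Candidate ORFs of at least 100 amino acids.
        "find_orfs": ["TransDecoder.LongOrfs", "-t", str(transcripts),
                      "-m", "100"],
        # Keep the single best ORF per transcript.
        "pick_orfs": ["TransDecoder.Predict", "-t", str(transcripts),
                      "--single_best_only"],
        # Collapse sequences at 95% identity (word size 10 suits 0.95-1.0).
        "cluster": ["cd-hit-est", "-i", str(cds), "-o", str(unigenes),
                    "-c", "0.95", "-n", "10", "-T", str(threads)],
    }


if __name__ == "__main__":
    for step, cmd in build_pipeline_commands("R1.fq.gz", "R2.fq.gz", "work").items():
        print(step, " ".join(cmd), sep="\t")
```

Each list can be passed to subprocess.run(cmd, check=True) once the tools are installed and the paths point to real files. Keeping command construction separate from execution makes the parameters easy to inspect and record alongside the assembly.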

Data Records
Nine datasets were generated in this study. The first datasets consist of RNA-Seq raw reads of Persea americana, Macadamia integrifolia and Mangifera indica, which were deposited in the NCBI Sequence Read Archive database under project identification number PRJNA533518 28. Datasets containing Persea americana, Macadamia integrifolia and Mangifera indica transcriptome assemblies were deposited in the NCBI Transcriptome Shotgun Assembly (TSA) database under TSA accession numbers GHOF0000000 29, GHOE00000000 30 and GHOG00000000 31. Datasets containing Persea americana, Macadamia integrifolia and Mangifera indica raw Trinity transcriptome assemblies, unigenes, and functional annotation files were deposited in Figshare 32-34.

Technical Validation
Read quality was assessed, and by extension the reads were validated, using FastQC; quality control (QC) plots were aggregated using MultiQC 22 and are presented in Fig. 2. We used HISAT2 35 to map avocado and macadamia RNA-Seq reads to their respective reference genome assemblies 10,11. Of the avocado and macadamia reads, 73.7% and 79.8%, respectively, mapped to their reference genome assemblies (Table 1). In total, 63420, 78871 and 82198 unigenes of avocado, macadamia and mango, respectively, were generated from the RNA-Seq data using a combination of de novo transcriptome assembly and redundancy reduction (Fig. 1; Table 2). We used BLASTn (e-value cut-off of 1e-5 and an identity cut-off of 70%) to compare our avocado and macadamia unigenes to the published reference gene sets 10,11. Of the reference avocado and macadamia genes, 22670 (92%) and 27322 (77%), respectively, were present in our assemblies (Table 1). The length distribution of "unigenes" was similar across the three species (Fig. 3a-c).
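The BLASTn comparison above can be illustrated with a minimal Python sketch that applies the two stated cut-offs (e-value of 1e-5 and identity of 70%) to BLAST tabular output and reports the share of reference genes recovered. It assumes the standard 12-column tabular format (-outfmt 6) and that the reference genes were used as the query, neither of which is stated in the text. The function names and the example rows are hypothetical and are not data from this study.

```python
def recovered_reference_genes(blast_lines, max_evalue=1e-5, min_identity=70.0):
    """Return the query IDs with at least one hit passing both cut-offs.

    Expects BLAST tabular output (-outfmt 6) with the default columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore
    """
    recovered = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]          # assumed to be the reference gene ID
        identity = float(fields[2])   # percent identity of the alignment
        evalue = float(fields[10])    # expectation value of the alignment
        if evalue <= max_evalue and identity >= min_identity:
            recovered.add(query_id)
    return recovered


def percent_recovered(recovered, total_reference_genes):
    """Percentage of reference genes with at least one passing hit."""
    return 100.0 * len(recovered) / total_reference_genes


if __name__ == "__main__":
    # Made-up rows for illustration only.
    example = [
        "geneA\tunigene1\t98.5\t900\t10\t1\t1\t900\t5\t904\t1e-50\t1500",
        "geneB\tunigene2\t65.0\t300\t90\t5\t1\t300\t1\t300\t1e-20\t200",
        "geneC\tunigene3\t88.0\t40\t4\t0\t1\t40\t1\t40\t0.01\t35",
    ]
    hits = recovered_reference_genes(example)
    print(sorted(hits), percent_recovered(hits, total_reference_genes=4))
```

In this toy input only geneA passes both filters: geneB falls below the identity cut-off and geneC exceeds the e-value cut-off. Counting distinct query IDs, rather than hit lines, prevents a gene with several matching unigenes from being counted more than once.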