Genome-wide gene expression profiling of the melon fly, Zeugodacus cucurbitae, during thirteen life stages

The melon fly, Zeugodacus cucurbitae (Coquillett), is an important destructive pest worldwide. Functional studies of the genes associated with development and reproduction during different life stages are limited in Z. cucurbitae. There have yet to be comprehensive transcriptomic resources for genetic and functional genomic studies to identify the molecular mechanisms related to its development and reproduction. In this study, we comprehensively sequenced the transcriptomes of four different developmental stages: egg, larva, pupa, and adults. Using the Illumina RNA-Seq technology, we constructed 52 libraries from 13 stages with four biological replicates in each and generated 435.61 Gb clean reads. We comprehensively characterized the transcriptomes with high-coverage mapping to the reference genome. A total of 13,760 genes were mapped to the reference genome, and another 4481 genes were characterized as new genes. Finally, 14,931 genes (81.85%) were functionally annotated against six annotation databases. This study provides the first comprehensive transcriptome data of all developmental stages of Z. cucurbitae, and will serve as a valuable resource for future genetic and functional studies.


Background & Summary
The melon fly, Zeugodacus cucurbitae (Coquillett) (Diptera, Tephritidae), is an important pest of cucurbit crops 1 . Although this species is thought to be native to India, it is now widely distributed in tropical, sub-tropical and temperate regions worldwide, including the Asia-Pacific and Africa 1,2 . Z. cucurbitae attacks more than 130 plant hosts from 39 families, among which the bitter gourd, snap melon and muskmelon are the preferred hosts 3 . Because of its powerful invasive ability, Z. cucurbitae has been defined as a category A fruit fly, with polyphagous, highly destructive characteristics 4 . Hence, although the Z. cucurbitae genome is available 5 , few functional genomic and genetic studies have been carried out comparing it to other species of tephritidae fruit flies. Functional studies of the key genes underlying mechanisms associated with Z. cucurbitae development, behavior, and reproduction are limited. Gene expression during development, which are useful for various analyses, have yet to be characterized.
Transcriptome sequencing is an efficient and low-cost method by which to explore gene expression patterns at multiple developmental stages 6,7 , tissues in insects 8,9 . Many insect transcriptomes have been obtained using next-generation sequencing techniques for functional genomic studies 10,11 . For example, in Ceratitis capitata, a conserved Maleness-on-the-Y gene was isolated to be involved in the male sex determination by RNA-Seq. 12 . Comprehensive studies of gene expression during various developmental stages in many important insect pests, e.g. the oriental fruit fly Bactrocera dorsalis 13,14 and the ladybird Henosepilachna vigintioctopunctata 6 , have been performed to explore the molecular mechanisms underlying metamorphic development. In C. capitata, the gene expression at the specific embryogenesis stage was also obtained by transcriptomic sequencing technology 15 . In addition, gene expression has been studied in the fat bodies, ovaries, and testes of B. dorsalis to identify critical genes involved in female and male gonad-specific roles 8,16,17 . Comparative transcriptomic analyses have also been used to characterize the differential expression of genes putatively associated with fecundity during the maturation of B. dorsalis and C. capitata adults 18,19 . Transcriptomes have also been sequenced to investigate the mechanisms of the insect response to environmental stressors, including insecticides 20 and microbial infection 21,22 . Thus, transcriptome sequencing allows us to determine the gene expression mechanisms underlying certain biological functions and to identify potential targets associated with pest control.
Many transcriptomes have already been published and published in the National Center of Biotechnology Information (NCBI) Sequence Read Archive (SRA) database, which is the most popular repository of raw RNA-Seq data. For example, more than 1398 samples of 47 projects have been published and open accessible for another insect pest species in the Tephritidae pest, B. dorsalis. However, for Z. cucurbitae, only 23 samples from five projects have been released (before August, 2019), and four of the five projects were sequenced for genome annotation and provided a fundamental reference regarding protein-coding genes of Z. cucurbitae 5,23 , but no more direct gene expression was exhibited, and no more biological replicates were performed. The remaining project, from 2016, was sequenced only from middle-aged adults and focused on proteins associated with chemosensory perception 24 , in which a small number (13) of odorant binding proteins (OBPs) were identified. In homologous B. dorsalis fruit fly, 49 OBPs were identified and characterized 25 . There is considerable gene expression information left to collect. The fruit fly is a holometabolous insect, and its life cycle has four developmental stages. In addition to the differences gene expression across these stages, complex biochemical processes take place within the larval development, pupal development, and sex maturation during the adult stage. For example, in larval development, the juvenile hormone and ecdysone are the most important effectors, and they cooperate with other genes and pathways to regulate the molting process 26,27 . Comparative transcriptomes were performed within the adult stage to explore the candidate genes involved in fecundity of B. dorsalis 18,19 . Complex biochemical processes also take place during the pupa stage in which the tissues were remodeled, e.g. SP95 involved in integument remodeling 28 . Gene expression during the specific time strictly in the process of the molting from old larva to the early pupa was also critical to successful molting, e.g. the wondering stage 14 . Although four transcriptomes were released, they are not a good resource because of the lack of biological replicates. The genome genetic sequence information is of great help for gene expression annotation, e.g. C. capitata and Z. cucurbitae 5,29 . A global gene expression profile for Z. cucurbitae has yet to be produced for assessment of gene expression throughout development are therefore still inaccessible.
In this study, we performed the first deep transcriptomic sequencing of multiple different developmental stages of Z. cucurbitae by RNA-seq, including 13 stages, and 52 libraries (four biological replicates in each stage) ( Fig. 1, Online-only Table 1). We generated a high-coverage dataset for genome-wide gene expression profiling. We successfully mapped 13,760 genes to the reference genome, and identified 4481 novel genes. In total, 14,931 genes (81.85%) were functionally annotated against the NCBI non-redundant (NR), Swiss-Prot, Pfam, Gene Ontology (GO), Cluster of Orthologous Group (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases (Table 1). Four biological replicates of each stage provide a reliable evaluation of the gene expression during all the life stages. These data can also be used to analyze gene expression during metamorphosis to elucidate the molecular network, as well as the gene expression during the larval stages, pupal stage, and during the sexual maturation of adults. In addition, the comparative analysis of gene expression with other tephritidae fruit flies can also be performed, e.g. B. dorsalis, C. capitata, in which such data are available 14,30 . Our work represents a valuable resource, and will be helpful for future studies of functional genes associated with metamorphic development and reproduction.
Total RNA isolation. Total RNA was extracted from the 13 stages with TRIZol reagent (Invitrogen, Carlsbad, CA, USA), following the manufacturer's instructions. The concentration and purity of total RNA samples were firstly quantified using a NanoDrop One (Thermo Fisher Scientific, Madison, WI, USA), and were re-determined using a Bioanalyser 2100 (Agilent Technologies, Palo Alto, CA, USA) before library construction.
Library construction and sequencing. The transcriptome libraries were constructed using the TruSeq RNA sample preparation kit (Illumina, San Diego, CA, USA), following the standard instructions. In brief, all the mRNA was isolated using the polyA selection method with oligo (dT) magnetic beads (Illumina), and then fragmented using fragmentation buffer. Double-stranded cDNA was synthesized using a SuperScript double-stranded cDNA synthesis kit (Invitrogen, Carlsbad, CA, USA) with buffer, dNTPs, random hexamer primers, DNA polymerase I, and RNase H. The synthesized cDNA was subjected to end-repair, phosphorylation, and ' A' base www.nature.com/scientificdata www.nature.com/scientificdata/ addition following the Illumina library construction protocol. Each library was size selected for cDNA target fragments (200-300 bp) on 2% low range ultra-agarose gels, then PCR amplified using Phusion DNA polymerase (NEB, Ipswich, MA, USA). Finally, paired-end RNA sequencing libraries were sequenced with the Illumina HiSeq Xten (2 × 150 bp read length). The raw data files obtained from high-throughput sequencing were converted to the original sequence.
preprocessing of sequencing data. We constructed 52 libraries and produced 3,013,315,500 clean reads (435,605,241,804 bp) (Online-only Table 1). The raw paired-end reads were trimmed and quality controlled using  www.nature.com/scientificdata www.nature.com/scientificdata/ SeqPrep (https://github.com/jstjohn/SeqPrep) and Sickle (https://github.com/najoshi/sickle) with default parameters of the sequencing analyzer 32 . The quality of each base was evaluated by FASTX (Fig. 2a,b). The error bases with N and low quality reads were removed (Fig. 2c-f). Clean reads were aligned to the reference genome using HISAT2 (http://ccb.jhu.edu/software/hisat2/index.shtml) 33 . The mapping criteria were as follows: sequencing reads should be uniquely matched to the genome, with up to two mismatches allowed but without insertions or deletions. The mapping ratios were calculated as an index of the sequence quality. The quality assessment of the transcriptomic sequencing was evaluated using RSeQC 2.3.6 package. The highest coverage paths were assembled and constructed as transcript isoforms. All clean reads those were not mapping to the transcriptional regions were assembled into new transcripts based on overlaps using StringTie (http://ccb.jhu.edu/software/stringtie) 34 .
Gene annotation. All genes, including the newly assembled fragments, were functional annotated against several databases (NR, Swiss-Prot, and evolutionary genealogy of genes: Non-supervised Orthologous Groups) using DIAMOND v0.8.37.99. For Pfam annotations, HMMER (version 3.1b2) was used with an E-value < 10 −5 . For Gene Ontology (GO) annotations, the Blast2GO pipeline was used, with an E-value < 10 −5 35 . Pathway analysis was performed against the KEGG database with KOBAS v2.1.1 (E-value < 10 −5 ). A total of 81.58% genes were annotated in six databases (Table 1).  www.nature.com/scientificdata www.nature.com/scientificdata/  , the gray circle represents the eggs, the green circle represents the larvae, the red circle represents pupae, the yellow circle represents adult females, and the blue circle represents the adult males, respectively. www.nature.com/scientificdata www.nature.com/scientificdata/ was shown as a line plot. We measured differential gene expression between developmental stages using DESeq. 2 v1.10.1 38 . Genes with an adjusted P-value < 0.05 and a log2 expression ratio > 2 between any two stages were identified as differentially expressed. We compared gene expression levels between stages to identify changes in dynamic gene expression among the four developmental stages (egg, larva, pupa, and adult). We also compared gene expression levels during the development among three instar larval stages; three pupal stages; and three adult stages. Gene expression was compared between the two sexes in adults.

Data Records
The raw transcriptome data were uploaded to the NCBI SRA database 39 . Gene information and database annotations (e.g. NR, Swiss-Prot, GO, and KEGG pathway) were uploaded to figshare 40 . The quantitative gene expression of all genes were evaluated by TPM reads and deposited in NCBI Gene Expression Omnibus (GEO) database with an accession number of GSE139488 37 .

Technical Validation
Quality control. The raw reads were quality controlled using SeqPrep and Sickle software for all samples (Fig. 2a,b). The distribution of the error rate of each sample was evaluated (Fig. 2c,d), then the bases content was calculated (Fig. 2e,f). After removing the low quality bases, we generated the clean reads. The clean reads accounted for up to 94.64% of the raw reads on average. The error rate of each reads was determined by the Phred score, and the mean error rate was 0.026% (range: 0.024-0.029%). The average Q20 (Phred score >20) was www.nature.com/scientificdata www.nature.com/scientificdata/ 97.54% (Online-only Table 1). The distribution of A/T/G/C was thereafter inspected to analyze the GC content, which showed a low proportion (Fig. 2e,f). These results indicated that sequencing quality was high. Clean reads were separately aligned to the reference genome using HISAT2 33 . Transcriptome quality assessment. The saturation of sequencing was also analyzed using RSeQC 2.3.6 package, and the data showed a high saturation of each sample sequencing (Fig. 3a). Most of the reads were mapped to the coding sequence region (Fig. 3b). The alignment of the superreads to the reference genome resulted in a total of 41,530 transcripts, and 34.03% were longer than 2700 bp, and 75.96% were longer than 900 bp (Fig. 3c). Analyses of the structures of the transcripts suggested that 11,506 transcripts were potential novel transcript isoforms (Fig. 3d), and 22,068 transcripts were completely matched to the exons.
Gene annotation. All the transcripts were annotated to 18,241 genes, including 13,760 known genes (22,159 transcripts), and 4481 novel genes (19,371 transcripts) ( Table 1). All genes were functionally annotated to six databases. The percent of the annotated new assembled genes was 45.28%, which was lower than the reference gene of 93.76%. Finally, we functionally annotated 14,931 genes (81.85%): 14,891 in NR; 10,216 in Swiss-Prot; 13,283 in COG; 11,298 in Pfam; 6911 in GO; and 6653 in KEGG (Table 1).

Gene expression analysis.
In this study, four biological replicates were sequenced for gene expression analysis. The correlation coefficient was calculated with Pearson correlation, and the results showed a good correlation between the four biological replicates of the different samples (Fig. 4a). The TPM distribution density of all samples were calculated, and the results also showed a good repeatability and dynamic changes among developmental stages (Fig. 4b). Gene expression levels across all samples were analyzed by PCA; the PCA diagnostic screen plot suggested that the first nine components explain more than 90% of the variance, with the first two explaining 52.13% (Fig. 4c). The graph of the first two principal components (PC1 and PC2) showed that the samples were consistently clustered by developmental stage, indicating good repeatability within the biological replicates and good separation among the developmental stages (Fig. 4d). The database annotations and expression profiles for each gene were performed 37,40 .
Differential gene expression between developmental stages was analyzed using DESeq. 2 38 , with an adjusted P-value < 0.05 and a log2 expression ratio > 2. The absolute TPM value > 0.1 was considered as a reliable expression. We analyzed gene expression levels during the four developmental stages, and found that 11,698 of the expressed genes were expressed during metamorphic stages, while 5458 genes were expressed in one and/or more stages specifically (Fig. 5a-1). In these stages, a total of 15017, 12952, 15400, 14417 and 15484 genes were expressed during each stage (Fig. 5a-2). During the larval stage, 18.75% (2785) genes were differentially expressed ( Fig. 5b-1). Similarly, 18.73% (3093) genes were differently expressed during pupal development (Fig. 5b-2). In the adult stages, 13.36% and 10.09% genes were differently expressed during female and male adult development, respectively ( Fig. 5b-3,b-4).
In addition, these transcriptomes might be useful for gene expression analysis among stages. That is, 93.83% of the genes were differentially expressed among stages: 920 genes were differentially expressed between egg and 7-d-old larva; 298 genes were differentially expressed between 7-d-old larva and 5-d-old pupa; 233 genes were differentially expressed between 5-d-old pupa and 5-d-old female; and 157 genes were differentially expressed between 5-d-old pupa and 5-d-old male. However, 630 genes were differentially expressed between all pairs of adjacent stages (Fig. 5c). These data provide a comprehensive comparison of global gene expression among developmental stages. Most importantly, all the gene expression during the 13 stages can be calculated by average the TPM value (Fig. 5d).