Background & Summary

The melon fly, Zeugodacus cucurbitae (Coquillett) (Diptera, Tephritidae), is an important pest of cucurbit crops1. Although this species is thought to be native to India, it is now widely distributed in tropical, sub-tropical and temperate regions worldwide, including the Asia-Pacific and Africa1,2. Z. cucurbitae attacks more than 130 plant hosts from 39 families, among which the bitter gourd, snap melon and muskmelon are the preferred hosts3. Because of its powerful invasive ability, Z. cucurbitae has been defined as a category A fruit fly, with polyphagous, highly destructive characteristics4. Hence, although the Z. cucurbitae genome is available5, few functional genomic and genetic studies have been carried out comparing it to other species of tephritidae fruit flies. Functional studies of the key genes underlying mechanisms associated with Z. cucurbitae development, behavior, and reproduction are limited. Gene expression during development, which are useful for various analyses, have yet to be characterized.

Transcriptome sequencing is an efficient and low-cost method by which to explore gene expression patterns at multiple developmental stages6,7, tissues in insects8,9. Many insect transcriptomes have been obtained using next-generation sequencing techniques for functional genomic studies10,11. For example, in Ceratitis capitata, a conserved Maleness-on-the-Y gene was isolated to be involved in the male sex determination by RNA-Seq.12. Comprehensive studies of gene expression during various developmental stages in many important insect pests, e.g. the oriental fruit fly Bactrocera dorsalis13,14 and the ladybird Henosepilachna vigintioctopunctata6, have been performed to explore the molecular mechanisms underlying metamorphic development. In C. capitata, the gene expression at the specific embryogenesis stage was also obtained by transcriptomic sequencing technology15. In addition, gene expression has been studied in the fat bodies, ovaries, and testes of B. dorsalis to identify critical genes involved in female and male gonad-specific roles8,16,17. Comparative transcriptomic analyses have also been used to characterize the differential expression of genes putatively associated with fecundity during the maturation of B. dorsalis and C. capitata adults18,19. Transcriptomes have also been sequenced to investigate the mechanisms of the insect response to environmental stressors, including insecticides20 and microbial infection21,22. Thus, transcriptome sequencing allows us to determine the gene expression mechanisms underlying certain biological functions and to identify potential targets associated with pest control.

Many transcriptomes have already been published and published in the National Center of Biotechnology Information (NCBI) Sequence Read Archive (SRA) database, which is the most popular repository of raw RNA-Seq data. For example, more than 1398 samples of 47 projects have been published and open accessible for another insect pest species in the Tephritidae pest, B. dorsalis. However, for Z. cucurbitae, only 23 samples from five projects have been released (before August, 2019), and four of the five projects were sequenced for genome annotation and provided a fundamental reference regarding protein-coding genes of Z. cucurbitae5,23, but no more direct gene expression was exhibited, and no more biological replicates were performed. The remaining project, from 2016, was sequenced only from middle-aged adults and focused on proteins associated with chemosensory perception24, in which a small number (13) of odorant binding proteins (OBPs) were identified. In homologous B. dorsalis fruit fly, 49 OBPs were identified and characterized25. There is considerable gene expression information left to collect. The fruit fly is a holometabolous insect, and its life cycle has four developmental stages. In addition to the differences gene expression across these stages, complex biochemical processes take place within the larval development, pupal development, and sex maturation during the adult stage. For example, in larval development, the juvenile hormone and ecdysone are the most important effectors, and they cooperate with other genes and pathways to regulate the molting process26,27. Comparative transcriptomes were performed within the adult stage to explore the candidate genes involved in fecundity of B. dorsalis18,19. Complex biochemical processes also take place during the pupa stage in which the tissues were remodeled, e.g. SP95 involved in integument remodeling28. Gene expression during the specific time strictly in the process of the molting from old larva to the early pupa was also critical to successful molting, e.g. the wondering stage14. Although four transcriptomes were released, they are not a good resource because of the lack of biological replicates. The genome genetic sequence information is of great help for gene expression annotation, e.g. C. capitata and Z. cucurbitae5,29. A global gene expression profile for Z. cucurbitae has yet to be produced for assessment of gene expression throughout development are therefore still inaccessible.

In this study, we performed the first deep transcriptomic sequencing of multiple different developmental stages of Z. cucurbitae by RNA-seq, including 13 stages, and 52 libraries (four biological replicates in each stage) (Fig. 1, Online-only Table 1). We generated a high-coverage dataset for genome-wide gene expression profiling. We successfully mapped 13,760 genes to the reference genome, and identified 4481 novel genes. In total, 14,931 genes (81.85%) were functionally annotated against the NCBI non- redundant (NR), Swiss-Prot, Pfam, Gene Ontology (GO), Cluster of Orthologous Group (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases (Table 1). Four biological replicates of each stage provide a reliable evaluation of the gene expression during all the life stages. These data can also be used to analyze gene expression during metamorphosis to elucidate the molecular network, as well as the gene expression during the larval stages, pupal stage, and during the sexual maturation of adults. In addition, the comparative analysis of gene expression with other tephritidae fruit flies can also be performed, e.g. B. dorsalis, C. capitata, in which such data are available14,30. Our work represents a valuable resource, and will be helpful for future studies of functional genes associated with metamorphic development and reproduction.

Fig. 1
figure 1

Schematic work flow of the transcriptomes from the sample preparation to sequencing. (a) Sample collection of Zeugodacus cucurbitae, and the 13 stages analyzed in this study are shown as photos. (b) Schematic overview of the sample collection, sequencing and analysis.

Table 1 Gene assembly and database annotations.

Methods

Insect culture, sample collection

Z. cucurbitae were collected as pupae from Haikou, Hainan Province, China, in 2016, and were reared at 26–27 °C with 70 ± 5% relative humidity and a photoperiod of 14:10 h light: dark. Similar to B. dorsalis artificial diet31, Z. cucurbitae larvae were reared on an artificial diet consists of bitter gourd, corn, yeast powder, wheat flour, and sucrose; adults were fed a mixture of sucrose, yeast powder, and honey. Samples were collected at several different developmental stages: egg (~100 individuals < 8 h post-laying), 1st instar larva (100 individuals at 1-d-old), 2nd instar larva (20 individuals at 3-d-old), 3rd instar larva (10 individuals at 7-d-old), pupa (three individuals at 1-, 5-, and 9-d-old, representing the early-, middle-, and late-stage of pupae respectively), virgin female and male adults (three individuals at 1-, 5-, and 9-d-old) (Fig. 1a). As preliminary experiments showed that adults were sexually mature at 9-day-old, the time points 1-, 5-, and 9-d-old represented the newly-emerged, vitellogenic, and sex-matured stages, respectively. Four biological replicates of each stage were prepared.

Total RNA isolation

Total RNA was extracted from the 13 stages with TRIZol reagent (Invitrogen, Carlsbad, CA, USA), following the manufacturer’s instructions. The concentration and purity of total RNA samples were firstly quantified using a NanoDrop One (Thermo Fisher Scientific, Madison, WI, USA), and were re-determined using a Bioanalyser 2100 (Agilent Technologies, Palo Alto, CA, USA) before library construction.

Library construction and sequencing

The transcriptome libraries were constructed using the TruSeq RNA sample preparation kit (Illumina, San Diego, CA, USA), following the standard instructions. In brief, all the mRNA was isolated using the polyA selection method with oligo (dT) magnetic beads (Illumina), and then fragmented using fragmentation buffer. Double-stranded cDNA was synthesized using a SuperScript double-stranded cDNA synthesis kit (Invitrogen, Carlsbad, CA, USA) with buffer, dNTPs, random hexamer primers, DNA polymerase I, and RNase H. The synthesized cDNA was subjected to end-repair, phosphorylation, and ‘A’ base addition following the Illumina library construction protocol. Each library was size selected for cDNA target fragments (200–300 bp) on 2% low range ultra-agarose gels, then PCR amplified using Phusion DNA polymerase (NEB, Ipswich, MA, USA). Finally, paired-end RNA sequencing libraries were sequenced with the Illumina HiSeq Xten (2 × 150 bp read length). The raw data files obtained from high-throughput sequencing were converted to the original sequence.

Preprocessing of sequencing data

We constructed 52 libraries and produced 3,013,315,500 clean reads (435,605,241,804 bp) (Online-only Table 1). The raw paired-end reads were trimmed and quality controlled using SeqPrep (https://github.com/jstjohn/SeqPrep) and Sickle (https://github.com/najoshi/sickle) with default parameters of the sequencing analyzer32. The quality of each base was evaluated by FASTX (Fig. 2a,b). The error bases with N and low quality reads were removed (Fig. 2c–f). Clean reads were aligned to the reference genome using HISAT2 (http://ccb.jhu.edu/software/hisat2/index.shtml)33. The mapping criteria were as follows: sequencing reads should be uniquely matched to the genome, with up to two mismatches allowed but without insertions or deletions. The mapping ratios were calculated as an index of the sequence quality. The quality assessment of the transcriptomic sequencing was evaluated using RSeQC 2.3.6 package. The highest coverage paths were assembled and constructed as transcript isoforms. All clean reads those were not mapping to the transcriptional regions were assembled into new transcripts based on overlaps using StringTie (http://ccb.jhu.edu/software/stringtie)34.

Fig. 2
figure 2

Quality assessment of the RNA-Seq. Example of the base quality of the raw reads (a) and clean reads (b). Error rate distribution of the raw reads (c) and clean reads (d). Base content of the raw reads (e) and clean reads (f).

Gene annotation

All genes, including the newly assembled fragments, were functional annotated against several databases (NR, Swiss-Prot, and evolutionary genealogy of genes: Non-supervised Orthologous Groups) using DIAMOND v0.8.37.99. For Pfam annotations, HMMER (version 3.1b2) was used with an E-value < 10−5. For Gene Ontology (GO) annotations, the Blast2GO pipeline was used, with an E-value < 10−535. Pathway analysis was performed against the KEGG database with KOBAS v2.1.1 (E-value < 10−5). A total of 81.58% genes were annotated in six databases (Table 1).

Gene expression analysis

Gene expression analysis was conducted based on the sequenced reads using RNA-Seq with the Expectation-Maximization (RSEM, v1.3.1) method36. Global gene expression was quantified as transcripts per million reads (TPM) in the four biological replicates. The average TPM value after normalization represented the corresponding quantitative gene expression level at each developmental stage37. The average TPM values for each stage were also used in a principal component analysis (PCA). A log10 transformation was performed to reduce the TPM values for gene expression. The PCA was carried out using an RSEM pipeline and was shown as a line plot. We measured differential gene expression between developmental stages using DESeq. 2 v1.10.138. Genes with an adjusted P-value < 0.05 and a log2 expression ratio > 2 between any two stages were identified as differentially expressed. We compared gene expression levels between stages to identify changes in dynamic gene expression among the four developmental stages (egg, larva, pupa, and adult). We also compared gene expression levels during the development among three instar larval stages; three pupal stages; and three adult stages. Gene expression was compared between the two sexes in adults.

Data Records

The raw transcriptome data were uploaded to the NCBI SRA database39. Gene information and database annotations (e.g. NR, Swiss-Prot, GO, and KEGG pathway) were uploaded to figshare40. The quantitative gene expression of all genes were evaluated by TPM reads and deposited in NCBI Gene Expression Omnibus (GEO) database with an accession number of GSE13948837.

Technical Validation

Quality control

The raw reads were quality controlled using SeqPrep and Sickle software for all samples  (Fig. 2a,b). The distribution of the error rate of each sample was evaluated (Fig. 2c,d), then the bases content was calculated (Fig. 2e,f). After removing the low quality bases, we generated the clean reads. The clean reads accounted for up to 94.64% of the raw reads on average. The error rate of each reads was determined by the Phred score, and the mean error rate was 0.026% (range: 0.024–0.029%). The average Q20 (Phred score >20) was 97.54% (Online-only Table 1). The distribution of A/T/G/C was thereafter inspected to analyze the GC content, which showed a low proportion (Fig. 2e,f). These results indicated that sequencing quality was high. Clean reads were separately aligned to the reference genome using HISAT233.

Transcriptome quality assessment

The saturation of sequencing was also analyzed using RSeQC 2.3.6 package, and the data showed a high saturation of each sample sequencing (Fig. 3a). Most of the reads were mapped to the coding sequence region (Fig. 3b). The alignment of the superreads to the reference genome resulted in a total of 41,530 transcripts, and 34.03% were longer than 2700 bp, and 75.96% were longer than 900 bp (Fig. 3c). Analyses of the structures of the transcripts suggested that 11,506 transcripts were potential novel transcript isoforms (Fig. 3d), and 22,068 transcripts were completely matched to the exons.

Fig. 3
figure 3

Sequencing results. An example of the sequencing saturation (a), and the mapping percentage of the reads against the genome (b). The length distribution of the assembled transcripts (c). The type of the assembled transcripts (d).

Gene annotation

All the transcripts were annotated to 18,241 genes, including 13,760 known genes (22,159 transcripts), and 4481 novel genes (19,371 transcripts) (Table 1). All genes were functionally annotated to six databases. The percent of the annotated new assembled genes was 45.28%, which was lower than the reference gene of 93.76%. Finally, we functionally annotated 14,931 genes (81.85%): 14,891 in NR; 10,216 in Swiss-Prot; 13,283 in COG; 11,298 in Pfam; 6911 in GO; and 6653 in KEGG (Table 1).

Gene expression analysis

In this study, four biological replicates were sequenced for gene expression analysis. The correlation coefficient was calculated with Pearson correlation, and the results showed a good correlation between the four biological replicates of the different samples (Fig. 4a). The TPM distribution density of all samples were calculated, and the results also showed a good repeatability and dynamic changes among developmental stages (Fig. 4b). Gene expression levels across all samples were analyzed by PCA; the PCA diagnostic screen plot suggested that the first nine components explain more than 90% of the variance, with the first two explaining 52.13% (Fig. 4c). The graph of the first two principal components (PC1 and PC2) showed that the samples were consistently clustered by developmental stage, indicating good repeatability within the biological replicates and good separation among the developmental stages (Fig. 4d). The database annotations and expression profiles for each gene were performed37,40.

Fig. 4
figure 4

Global gene expression patterns during the 13 life stages. (a) Biological replicate sample correlation analysis. (b) Gene expression boxplot of samples during different stages, plotted as log10-transformed transcripts per million reads (TPM) values (log10(TPM + 1)). (c) The line plot of the variance explained by the first 20 principal components. (d) The scatter plot of PC1 versus PC2 in the principal component analyses (PCA), the gray circle represents the eggs, the green circle represents the larvae, the red circle represents pupae, the yellow circle represents adult females, and the blue circle represents the adult males, respectively.

Differential gene expression between developmental stages was analyzed using DESeq. 238, with an adjusted P-value < 0.05 and a log2 expression ratio > 2. The absolute TPM value > 0.1 was considered as a reliable expression. We analyzed gene expression levels during the four developmental stages, and found that 11,698 of the expressed genes were expressed during metamorphic stages, while 5458 genes were expressed in one and/or more stages specifically (Fig. 5a-1). In these stages, a total of 15017, 12952, 15400, 14417 and 15484 genes were expressed during each stage (Fig. 5a-2). During the larval stage, 18.75% (2785) genes were differentially expressed (Fig. 5b-1). Similarly, 18.73% (3093) genes were differently expressed during pupal development (Fig. 5b-2). In the adult stages, 13.36% and 10.09% genes were differently expressed during female and male adult development, respectively (Fig. 5b-3,b-4).

Fig. 5
figure 5

Gene expression during the various developmental stages. (a) Veen plot of gene expression during the four metamorphic stages (egg, larva, pupa, female and male adult), and total genes in each stage. (b) Veen plot of gene expression within larva (b-1), pupa (b-2), female (b-3) and male adult (b-4) stage, respectively. (c) Venn plot of the differentially expressed genes in each of two developmental stages. Each ellipse circle represents the differentially expressed genes between two connection stages. (d) An example of gene100 expression during the whole 13 stages.

In addition, these transcriptomes might be useful for gene expression analysis among stages. That is, 93.83% of the genes were differentially expressed among stages: 920 genes were differentially expressed between egg and 7-d-old larva; 298 genes were differentially expressed between 7-d-old larva and 5-d-old pupa; 233 genes were differentially expressed between 5-d-old pupa and 5-d-old female; and 157 genes were differentially expressed between 5-d-old pupa and 5-d-old male. However, 630 genes were differentially expressed between all pairs of adjacent stages (Fig. 5c). These data provide a comprehensive comparison of global gene expression among developmental stages. Most importantly, all the gene expression during the 13 stages can be calculated by average the TPM value (Fig. 5d).